  • Originally posted by Brian Bushnell View Post
    If you know the sequence numbers, you can use getreads.sh:
    I have tried it, but I get this Java error:

    Exception in thread "main" java.lang.RuntimeException: GetReads terminated in an error state; the output may be corrupt.
    at jgi.GetReads.<init>(GetReads.java:321)
    at jgi.GetReads.main(GetReads.java:37)

    What could be the cause? The input file is a normal .fasta file with 22,000 reads from an Illumina MiSeq.



    • Can you please post the exact command you used, and the entire screen output?



      • Originally posted by Brian Bushnell View Post
        Can you please post the exact command you used, and the entire screen output?
        getreads.sh in=af_svi_0.05Perc_22633.fasta out=bla.fasta id=6272,19594

        java -ea -Xmx200m -cp /common/WORK/bbmap/current/ jgi.GetReads in=af_svi_0.05Perc_22633.fasta out=bla.fasta id=6272,19594
        Executing jgi.GetReads [in=af_svi_0.05Perc_22633.fasta, out=bla.fasta, id=6272,19594]

        Input is unpaired
        java.lang.AssertionError: Attempting to read from a closed file. Current header: MISEQ:70:000000000-A4FTJ:1:2111:9262:20597 1:N:0:
        at stream.FastaReadInputStream.nextBases(FastaReadInputStream.java:357)
        at stream.FastaReadInputStream.generateRead(FastaReadInputStream.java:237)
        at stream.FastaReadInputStream.fillList(FastaReadInputStream.java:210)
        at stream.FastaReadInputStream.nextList(FastaReadInputStream.java:121)
        at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
        at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
        Time: 0.167 seconds.
        Reads Processed: 19595 117.09k reads/sec
        Bases Processed: 4621k 27.62m bases/sec
        Exception in thread "main" java.lang.RuntimeException: GetReads terminated in an error state; the output may be corrupt.
        at jgi.GetReads.<init>(GetReads.java:321)
        at jgi.GetReads.main(GetReads.java:37)

        I get the output file and it seems normal, but I don't know why I get this error.
        Last edited by JeanLove; 05-29-2016, 05:08 AM. Reason: forgot to write something



        • BBMap

          1) I am interested in understanding how BBMap simultaneously aligns reads to multiple genomes. Is this different from using bwa to align to multiple genomes sequentially?

          2) I used BBSplit to split my fastq into hg19 reads and mouse reads. The output was only in SAM format (although I tried to name the output as _1.fastq.gz and _2.fastq.gz). Is it possible to directly get fastq output after aligning to multiple genomes,
          with something like this:
          out_hg19_1=SL118119_t3_hg19_gatkorder_R1.fastq, out_hg19_2=SL118119_t3_hg19_gatkorder_R2.fastq, out_mm10_1=SL118119_t3_mm10_gatkorder_R1.fastq, out_mm10_2=SL118119_t3_mm10_gatkorder_R2.fastq

          3) The SAM output headers for both the .mm10.sam and .hg19.sam files had both mouse and human chromosomes. I had to manually edit them for samtools post-processing. Is there a way around this?



          • 1) The difference is only in highly conserved regions. If you first map to mouse, then map the unmapped reads to human, human reads from regions with high identity to mouse will already have been removed, because they mapped to the mouse. When mapping to both genomes at once, the reads will go to the organism they actually came from because they (should) match that one better.

            2) Can you give your entire command line? I just tried it and verified that I got fastq output when I used this command:

            bbsplit.sh in=reads.fq basename=o_%.fq
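
            For paired input, a sketch along the same lines should give per-reference R1/R2 fastq files (untested; % becomes the reference name and # the mate number, and the file names here are just placeholders):

            bbsplit.sh in1=reads_R1.fq.gz in2=reads_R2.fq.gz ref=hg19.fa,mm10.fa basename=out_%_#.fq.gz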

            3) Sorry, that's kind of inevitable. Actually, I do not recommend using sam output from BBSplit, only fastq output. Sam output will have the ambiguously-mapped reads listed as mapped to only one organism if you are in ambig=all mode. For example, if a read maps to human and mouse equally well, it will be output in both sam files, but in each case it will be listed as mapped to the same location (which is either human or mouse).



            • Hi Brian,

              Thanks for responding.

              1) That is what I assumed; I wanted to hear it from someone else here.

              2) These are the commands I am trying with BBSplit:

              Trial 1:
              Code:
              java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz, in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz, out_hg19=SL118119_t4_hg19_gatkorder_R1.fastq, out_hg19=SL118119_t4_hg19_gatkorder_R2.fastq, out_mm10=SL118119_t4_mm10_gatkorder_R1.fastq, out_mm10=SL118119_t4_mm10_gatkorder_R2.fastq, ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa, ambiguous2=toss
              Trial 2:
              Code:
              java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz, in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz, out_hg19_1=SL118119_t3_hg19_gatkorder_R1.fastq, out_hg19_2=SL118119_t3_hg19_gatkorder_R2.fastq, out_mm10_1=SL118119_t3_mm10_gatkorder_R1.fastq, out_mm10_2=SL118119_t3_mm10_gatkorder_R2.fastq, ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa, ambiguous2=toss
              3) Thanks for clarifying the bam issue. I saw that fastq is the recommended output.

              Thanks,
              Deepthi



              • Oh, it looks like the problem is the commas. There should not be any commas in the command line. BBTools is interpreting "SL118119_S5_L001_R2_001.fastq.gz," as a filename, and since it does not know what format "fastq.gz," is, it defaults to sam. I'll change it to default to fastq for that program.



                • Hmmm... I pasted that from the stderr/stdout of the program, so maybe that is just how it was formatted there? This is how my command looks:
                  Code:
                  java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz interleaved=True basename=o5%_#.fq ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa refstats=refstats.txt ambiguous2=toss
                  The only comma is between the ref files. Does this look okay to you?

                  Also, thanks for fixing the default input.



                  • Hmmm, that command looks fine. Just to clarify, is that the specific command that produced sam output?



                    • Hi Brian,
                      The command below worked for me. It looks like naming the output files individually (e.g. out1_hg19=, out2_hg19=, out1_mm10=, etc.) does not work.
                      If I use the basename parameter, the fastqs look fine!

                      Code:
                      java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz interleaved=True basename=o5%_#.fq ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa refstats=refstats.txt ambiguous2=toss



                      • Originally posted by moistplus
                        Is it possible to compute the insert size (and make a histogram distribution) from a .bam file?
                        I think so (use an unsorted bam). If reads are spliced, you will get some odd results. @Brian may have a suggestion later on how to handle that.

                        Make sure samtools is available in your $PATH.

                        Code:
                        reformat.sh in=your_file.bam ihist=hist.txt



                        • As long as the file is unsorted or name-sorted, that command will work. Spliced reads shouldn't cause a problem, unless the splice junction is in the unsequenced area between the reads, of course.
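
                          If the bam happens to be coordinate-sorted, name-sorting it first (e.g. with samtools) and then running reformat.sh should work; a rough sketch, with placeholder file names:
                          Code:
                          samtools sort -n -o namesorted.bam coordsorted.bam
                          reformat.sh in=namesorted.bam ihist=ihist.txt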



                          • Originally posted by Brian Bushnell View Post
                            As long as the file is unsorted or name-sorted, that command will work. Spliced reads shouldn't cause a problem, unless the splice junction is in the unsequenced area between the reads, of course.
                            Hi Brian and Geno, I didn't want to start a new thread, so I figured I'd quote and post to get your attention.

                            I started using BBDeDupe and normalization in Geneious on merged paired reads. These reads are not yet assembled, but will be assembled de novo after dedupe/normalization. I'm doing this because I have about 50,000 reads covering a 10 kb PCR amplicon, and I think the depth is causing assembly issues.

                            I wasn't able to find much information on Dedupe. Does it remove all 100% matching reads that are encompassed in another read, thereby reducing my coverage to 1? If so, is there a way to increase my coverage to, say, 50-100 instead?

                            I'm getting a few areas with low coverage (less than 10), and I generally require a minimum coverage of 15 to call a base in the consensus sequence; otherwise I call a gap. I generally have >500-1000x coverage everywhere, but occasionally I get some places with <10x coverage. I don't think this is amplicon bias; rather, I think it is contaminating sequence. I use a core service for the Illumina sequencing, but nearly everyone who submits samples is working on HIV, hence my need for a coverage threshold.

                            Thoughts on this?

                            Thanks,
                            Jake



                            • Hi Jake,

                              By default, Dedupe will remove all but one copy of reads that are fully contained within other reads. This may not be a good idea prior to assembly of merged reads (since many will be contained by others due to the variable insert size); just using normalization alone with a target of 100x may work better. You can set Dedupe to only remove reads that are identical to others, though, by adding the flag "ac=f".
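
                              For example, a rough sketch of those two options (untested; file names are placeholders):
                              Code:
                              # option A: remove exact duplicates only, keeping contained reads
                              dedupe.sh in=merged.fq out=deduped.fq ac=f
                              # option B: normalization alone, to a target depth of 100x
                              bbnorm.sh in=merged.fq out=normalized.fq target=100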

                              If you suspect your data is contaminated, I suggest you try to decontaminate it prior to normalization/assembly. If you are working on non-HIV and it is contaminated with HIV, that's fairly straightforward; you can use BBSplit or Seal (if you have both references) or just BBMap (if you only have the HIV reference but nothing for your amplicon) to separate the reads by mapping. Alternately, if you are confident that all of the low-coverage stuff is contamination, you can assemble the low-coverage portion of your data using Tadpole, which allows coverage thresholds (e.g. "mincr=1 maxcr=15" will make it assemble only those contigs with coverage between 1x and 15x). Then use the resulting assembly to filter all of the reads via mapping, keeping only the unmapped reads.
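
                              A sketch of that Tadpole-based decontamination idea (untested; thresholds as above, file names are placeholders):
                              Code:
                              # assemble only the low-coverage (suspected contaminant) fraction
                              tadpole.sh in=merged.fq out=lowcov_contigs.fa mincr=1 maxcr=15
                              # map all reads against those contigs and keep only the unmapped reads
                              bbmap.sh ref=lowcov_contigs.fa in=merged.fq outu=clean.fq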

                              Incidentally, Tadpole tends to do a good job assembling very-high-coverage short sequences.



                              • Originally posted by Brian Bushnell View Post
                                Hi Jake,

                                By default, Dedupe will remove all but one copy of reads that are fully contained within other reads. This may not be a good idea prior to assembly of merged reads (since many will be contained by others due to the variable insert size); just using normalization alone with a target of 100x may work better. You can set Dedupe to only remove reads that are identical to others, though, by adding the flag "ac=f".

                                If you suspect your data is contaminated, I suggest you try to decontaminate it prior to normalization/assembly. If you are working on non-HIV and it is contaminated with HIV, that's fairly straightforward; you can use BBSplit or Seal (if you have both references) or just BBMap (if you only have the HIV reference but nothing for your amplicon) to separate the reads by mapping. Alternately, if you are confident that all of the low-coverage stuff is contamination, you can assemble the low-coverage portion of your data using Tadpole, which allows coverage thresholds (e.g. "mincr=1 maxcr=15" will make it assemble only those contigs with coverage between 1x and 15x). Then use the resulting assembly to filter all of the reads via mapping, keeping only the unmapped reads.

                                Incidentally, Tadpole tends to do a good job assembling very-high-coverage short sequences.
                                Thanks Brian,

                                I took the de novo approach to try to eliminate contaminating sequences. We're all working on HIV, and it's highly mutated, so it's hard to remove contaminants by DNA sequence alone. However, since I know the approximate size of my PCR amplicons (it's different in each case because of DNA deletions), I know what size my contig should be after de novo assembly. By using stringent assembly parameters that require ~30 bp overlap with maybe 1 mismatch, I can force contaminants to be assembled into a separate contig. I can then extract consensus sequences from each of these contigs and filter them so that only contigs greater than ~150 bp are used in a subsequent map to reference, which should remove any minor contaminants that were present. I can then extract the consensus from this alignment, which should represent the original PCR amplicon; any contaminants that might have made it into my contig should be lost, as they are outnumbered hundreds to one.
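
                                As an aside, that length filter could also be done on the command line with reformat.sh; a sketch, assuming the ~150 bp cutoff above and placeholder file names:
                                Code:
                                reformat.sh in=contigs.fa out=contigs_150plus.fa minlength=150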

                                Does this workflow make sense for my application? I'm working on a desktop Mac, so I have limited options. I've been told that SPAdes might be a better assembler for me, but I think I'd need to purchase a server. With the little coding experience I have, I'm a bit nervous about investing the money, lest we never get the thing to work.

                                Are you available for hire as a consultant?

                                Thanks,
                                Jake
                                Last edited by JVGen; 09-07-2016, 11:34 AM.

