  • #31
    Hi, Brian,

    Thanks for building this amazing assembler. I have been using it from within the Geneious software package for a few weeks now with great success. Quick question: is it possible to enter a command line under 'custom Tadpole options' that will cause the assembler to use only a specific percentage of the data? Also, is it possible to specify a maximum number of reads to use when performing the assembly (as opposed to a percentage)?

    I ask because I am trying to assemble a large number of plasmids from MiSeq FASTQ data, and I've found that the results are much better when using only 10% of the data. However, there is a bug wherein this option is missing from the workflow version of Tadpole, and another where it defaults to 100% when attempting to assemble each sequence list separately. They know about the first bug and are fixing it, but I thought there might be a quick workaround via the command line.

    Thanks!

    Rob



    • #32
      Hi Rob,

      Tadpole does not do subsampling; you'd have to first sample 10% of the reads with another tool (such as Reformat, with the "samplerate=0.1" flag). However, you CAN restrict the input to a certain number of reads. "reads=400k", for example, will only use the first 400,000 reads (or if the reads are paired, the first 400,000 pairs).
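
      For example, a rough sketch (file names here are just placeholders):

      Code:
      # Randomly sample ~10% of the reads with Reformat, then assemble the subsample
      reformat.sh in=reads.fq.gz out=sampled10.fq.gz samplerate=0.1
      tadpole.sh in=sampled10.fq.gz out=contigs.fa
      
      # Or skip subsampling and assemble from only the first 400,000 reads/pairs
      tadpole.sh in=reads.fq.gz out=contigs.fa reads=400k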

      Tadpole also has "mincr" and "maxcr" flags, which let you ignore kmers with counts outside of some band; this is sometimes useful for ignoring the low-depth junk that can occur when you have too much data:

      mincountretain=0 (mincr) Discard kmers with count below this.
      maxcountretain=INF (maxcr) Discard kmers with count above this.
      Also, I recommend you try normalization (targeting maybe 100x coverage) - there are many situations where normalization gives a better assembly than subsampling, particularly when the coverage of features is highly variable.
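
      As a rough sketch of both options (file names and the 5-500 band below are just placeholder values):

      Code:
      # Normalize to ~100x target coverage with BBNorm
      bbnorm.sh in=reads.fq.gz out=normalized100x.fq.gz target=100
      
      # Or assemble directly, ignoring kmers with counts outside a depth band
      tadpole.sh in=reads.fq.gz out=contigs.fa mincr=5 maxcr=500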

      Good luck!



      • #33
        Hi, Brian,

        Thanks very much for your quick feedback; that does help. I was able to specify reads=1000, which approximated the 10% for some test cases. I took a quick shot at normalizing, and I'm having some issues with it, but they may have more to do with Geneious than anything.

        Thanks again!

        Rob



        • #34
          Originally posted by Brian Bushnell View Post
          Don't feel like you have to use all aspects of Tadpole in order to use it effectively! I am currently using it for mitochondrial assembly also, because it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram (in fact, I wrote a script to do this automatically). But in that case I don't actually use the error-correction or extension capabilities, as they are not usually necessary: the coverage is already incredibly high and low-depth kmers are being ignored. I use those more for single-cell work, which has lots of very-low-depth regions.
          How do you identify the mitochondrial band? Could you share the script?



          • #35
            Sure:

            Code:
            #First link the reads as reads.fq.gz
            
            # Count kmers, write a raw kmer-frequency histogram, and call its peaks
            /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt
            
            # Pull the main (haploid) genome peak depth and set a high-pass cutoff at 3x that depth
            primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
            cutoff=$(( $primary * 3 ))
            
            # High-pass filter: discard reads below the depth cutoff (target is set huge so nothing is downsampled)
            bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999
            # Keep only reads with GC content of at most 0.45
            reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45
            
            # Recompute a smoothed kmer histogram at k=100 on the filtered reads and call peaks again
            kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2
            
            # Treat the main remaining peak as the mitochondrial depth
            mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`
            
            # Define a depth band and coverage/seed thresholds around the mito peak
            upper=$((mitopeak * 6 / 3))
            lower=$((mitopeak * 3 / 7))
            mcs=$((mitopeak * 3 / 4))
            mincov=$((mitopeak * 2 / 3))
            
            # Assemble within the depth band at three kmer lengths
            tadpole.sh in=highpass_gc.fq.gz out=contigs78.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=78
            tadpole.sh in=highpass_gc.fq.gz out=contigs100.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=100
            tadpole.sh in=highpass_gc.fq.gz out=contigs120.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=120
            
            # Recruit all high-pass reads that share kmers with the k=100 contigs
            bbduk.sh in=highpass.fq.gz ref=contigs100.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05
            
            # Reassemble from the recruited reads with a slightly wider depth band
            tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6
            
            # Final assembly
            ln -s contigs_bbd.fa contigs.fa
            This works well with error-corrected PacBio reads, which have a very unbiased coverage distribution. It was able to isolate and assemble the mitochondria in 4/4 fungi tested. I'm not sure how well it works on Illumina reads, though. Also, note that the default k=100 is designed for reads of at least 150bp.



            • #36
              Originally posted by Brian Bushnell View Post
              This works well with error-corrected PacBio reads, which have a very unbiased coverage distribution. It was able to isolate and assemble the mitochondria in 4/4 fungi tested. I'm not sure how well it works on Illumina reads, though. Also, note that the default k=100 is designed for reads of at least 150bp.
              OK, noted. I will test on Illumina reads and report back later. Many thanks.



              • #37
                Hi! Although this is a very helpful post, there is still a lot that is unclear about Tadpole.
                As input, I have a FASTA file with ~11 million reads, and I would like Tadpole to take 50% of the reads from this file at random.
                I see that there is a reads=-1 input parameter, where -1 (the default) stands for all reads, so I am curious what would stand for 50%? I cannot find any info on this parameter, so I would appreciate any help!



                • #38
                  Originally posted by JeanLove View Post
                  Hi! Although this is a very helpful post, there is still a lot that is unclear about Tadpole.
                  As input, I have a FASTA file with ~11 million reads, and I would like Tadpole to take 50% of the reads from this file at random.
                  I see that there is a reads=-1 input parameter, where -1 (the default) stands for all reads, so I am curious what would stand for 50%? I cannot find any info on this parameter, so I would appreciate any help!
                  Provide the number of reads you want to use, e.g. reads=5500000. I don't know whether that randomly subsamples, though; it may just take the first 5.5M reads. @Brian will have to clarify.

                  If you want to independently subsample the reads, you could also use reformat.sh or seqtk (sample) from Heng Li.
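
                  For example, a rough sketch (file names are placeholders; the seqtk seed is arbitrary):

                  Code:
                  # Randomly sample ~50% of the reads with Reformat
                  reformat.sh in=reads.fa out=subsampled.fa samplerate=0.5
                  
                  # Or with seqtk; -s sets the random seed
                  seqtk sample -s100 reads.fa 0.5 > subsampled.fa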



                  • #39
                    Hi, I am using tadpole from BBMap v36.11 and received some java error messages. The command I used is:

                    tadpole.sh -Xmx1000g in1=read1.corrected.fq.gz in2=read2.corrected.fq.gz out=contigs.fa.gz

                    Code:
                    java.io.EOFException: Unexpected end of ZLIB input stream
                            at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
                            at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
                            at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
                            at fileIO.ByteFile1.fillBuffer(ByteFile1.java:217)
                            at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
                            at fileIO.ByteFile2$BF1Thread.run(ByteFile2.java:264)
                    open=true
                    
                    java.io.EOFException: Unexpected end of ZLIB input stream
                            at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
                            at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
                            at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
                            at fileIO.ByteFile1.fillBuffer(ByteFile1.java:217)
                            at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
                            at fileIO.ByteFile2$BF1Thread.run(ByteFile2.java:264)
                    open=true
                    
                    Exception in thread "Thread-4" java.lang.AssertionError:
                    There appear to be different numbers of reads in the paired input files.
                    The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.
                            at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:480)
                            at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:345)
                            at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:189)
                            at java.lang.Thread.run(Thread.java:745)



                    • #40
                      From the error messages, your input read files are corrupted:

                      ...Unexpected end of ZLIB input stream
                      ...Unexpected end of ZLIB input stream
                      ...The pairing may have been corrupted by an upstream process.
                      Do you have any other program that can read the files to check for errors? At the very least, a simple count of lines would be useful:

                      Code:
                      wc -l read1.corrected.fq.gz; wc -l read2.corrected.fq.gz
                      [numbers should be identical, and this should produce no errors]



                      • #41
                        Originally posted by gringer View Post
                        From the error messages, your input read files are corrupted:



                        Do you have any other program that can read the files to check for errors? At the very least, a simple count of lines would be useful:

                        Code:
                        wc -l read1.corrected.fq.gz; wc -l read2.corrected.fq.gz
                        [numbers should be identical, and this should produce no errors]
                        Gringer is correct; the files appear to be corrupt. But just to add a small correction, it does not look like wc can handle gzipped files, so I'd suggest either:

                        Code:
                        zcat read1.corrected.fq.gz | wc -l; zcat read2.corrected.fq.gz | wc -l
                        ...I'm not sure how this would behave on a corrupt file, but it should produce the correct answer for an intact file. Or this:
                        Code:
                        gzip -t read1.corrected.fq.gz; gzip -t read2.corrected.fq.gz
                        ...which will test the integrity of the files. It seems to print nothing if the file is OK; if the file is truncated, it will print:
                        Code:
                        gzip: x.gz: unexpected end of file
                        Most likely, the process that generated the corrected files (probably Tadpole) was killed or timed out before it finished.



                        • #42
                          Hi gringer, thanks; I didn't realize the multiple FASTQ files weren't concatenated properly before inputting them to Tadpole. I'll fix it and try again.
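
                          For the record, this is roughly what I should have done (file names are placeholders); gzip streams can simply be concatenated:

                          Code:
                          # Keep R1 and R2 files in the same run order so pairing is preserved
                          cat runA_R1.fq.gz runB_R1.fq.gz > read1.fq.gz
                          cat runA_R2.fq.gz runB_R2.fq.gz > read2.fq.gz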



                          • #43
                            I just saw this comment from Brian Bushnell:

                            P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm is a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.

                            I recently used Tadpole for contig extension. After reading this comment, I am not sure Tadpole is a reliable tool for this. I have full-length 16S rRNA gene sequences from clone libraries. I took the variable region of the 16S rRNA gene sequences (the first 300 nucleotides) and mapped it onto metagenome contigs that are taxonomically classified as taxa of interest. These metagenome contigs were assembled from randomly-sheared metagenomic fragment libraries (sequencing platform: HiSeq). The 300-nucleotide 16S rRNA fragment mapped perfectly onto a contig. I then ran Tadpole to extend this contig using all paired-end reads in this metagenome, and I could extend the contig up to 6,500 nucleotides.

                            Is Tadpole reliable for such usages?



                            • #44
                              Since you are pulling the kmer information from randomly-sheared metagenomic reads, and using the variable region as a seed, it should be safe. However, there is a possibility of running into a tandem repeat region during extension when you give it a very high extension limit, so I recommend also adding the flag "ibb=f". It probably won't make any difference though.
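
                              For reference, the extension run would look roughly like this (a sketch only; file names, k, and the extension lengths are placeholders, so check tadpole.sh's help text for your version):

                              Code:
                              # Extend the seed contig using kmers from the metagenomic reads; ibb=f as recommended above
                              tadpole.sh in=seed_contig.fa extra=metagenome_reads.fq.gz out=extended.fa mode=extend el=10000 er=10000 k=93 ibb=f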



                              • #45
                                Thanks for noting that; I've corrected it. Originally, for error-correction, you had to say "oute" for the output but I changed that a while ago so the current syntax is "out1" and "out2".
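
                                So a paired error-correction run now looks roughly like this (file names are placeholders):

                                Code:
                                tadpole.sh mode=correct in1=read1.fq.gz in2=read2.fq.gz out1=corrected1.fq.gz out2=corrected2.fq.gz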

