SEQanswers

Old 07-22-2015, 06:27 PM   #1
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Introducing Tadpole: an assembler, error-corrector, and read-extender

Tadpole, a new BBTool, is an extremely fast kmer-based assembler. How fast is it? Around 250x faster than SPAdes with --careful (which is how we generally run it); it can assemble E. coli on my 4-core desktop in about 12 seconds, and scales near-linearly with CPU cores. It supports arbitrarily long kmer lengths. Usage is simple:
tadpole.sh in=reads.fq out=contigs.fa

Tadpole is very conservative and optimized for correctness rather than length; which is to say, it stops at every branch, and condenses every repeat. Also, it does not currently do scaffolding. So it will typically produce an L50 substantially lower than, say, SPAdes, but also a much lower misassembly rate. This is because while Tadpole is an assembler, my primary design goals were for read extension and error-correction; and specifically, to allow BBMerge to effectively merge and/or produce insert size histograms for non-overlapping libraries. As such, it is integrated into BBMerge in addition to being a standalone tool. Tadpole’s error-correction is substantially better than BBNorm’s error-correction, largely because it uses exact rather than approximate kmer counts.

To error-correct reads:
tadpole.sh in=reads.fq out=corrected.fq mode=correct

To extend reads by 50bp in each direction:
tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50

To error-correct and extend at the same time, using a kmer length of 62:
tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50 k=62 ecc=t

One of my goals with read extension is to allow the usage of longer kmer lengths in assembly (either with Tadpole or something else), as longer kmers require longer reads for a given level of coverage.
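A rough way to see why extension helps: a read of length L contributes L - k + 1 kmers, so kmer depth shrinks as k approaches the read length. The sketch below uses only that standard relationship. The function name and the example library are illustrative assumptions, not Tadpole output, and it assumes every read could actually be extended.

```python
def kmer_coverage(num_reads, read_len, genome_size, k):
    """Expected kmer depth, ignoring sequencing errors and coverage bias."""
    kmers_per_read = max(read_len - k + 1, 0)
    return num_reads * kmers_per_read / genome_size


# Hypothetical library: 1000 reads over a 10 kbp genome, assembled at k=62.
before = kmer_coverage(1000, 100, 10_000, 62)  # original 100bp reads
after = kmer_coverage(1000, 200, 10_000, 62)   # same reads extended by 50bp per side
print(before, after)
```

With the same number of reads, the longer reads yield several times the kmer depth at k=62, which is what makes the longer kmer usable.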

While fairly memory-efficient by default, Tadpole has various options for reducing memory consumption; unlike BBNorm, Tadpole's memory consumption increases with input size. “prealloc” uses fixed data structures rather than growable ones, which increases both speed and memory efficiency when near the maximum amount of memory (in other words, for assembling a tiny genome prealloc=f is faster, but for a big genome prealloc=t is faster). “prefilter=2” uses an additional pass with a count-min sketch to avoid storing kmers that occur at most 2 times, which are generally error kmers that waste space. “minprob=0.8” ignores kmers that according to quality scores have less than 80% chance of being error-free. “k”, of course, controls kmer length; shorter kmers are more memory-efficient (and faster). Specifically, k=1-31 uses about 20 bytes per kmer; k=32-62 uses about 30, etc.
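To illustrate the idea behind the prefilter, here is a minimal two-pass sketch using a count-min sketch. It is not Tadpole's implementation: the class, the CRC32-based hashing, and the table sizes are assumptions chosen for brevity, and it ignores reverse complements.

```python
import zlib
from collections import Counter


class CountMinSketch:
    """Approximate counter: may overestimate a count, never underestimates."""

    def __init__(self, width=1 << 16, depth=3):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One slot per row; the row index seeds CRC32 (a simple stand-in hash).
        data = kmer.encode()
        return [zlib.crc32(data, row) % self.width for row in range(self.depth)]

    def add(self, kmer):
        for row, slot in enumerate(self._slots(kmer)):
            self.rows[row][slot] += 1

    def estimate(self, kmer):
        return min(self.rows[row][slot] for row, slot in enumerate(self._slots(kmer)))


def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_with_prefilter(reads, k, prefilter=2):
    # Pass 1: approximate counts in a fixed amount of memory.
    sketch = CountMinSketch()
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    # Pass 2: keep exact counts only for kmers whose estimate exceeds the cutoff.
    exact = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            if sketch.estimate(kmer) > prefilter:
                exact[kmer] += 1
    return exact
```

Because the sketch never undercounts, a kmer that really occurs more than `prefilter` times is always kept; the only cost of a hash collision is that a rare kmer occasionally gets stored anyway.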

There are several options that determine aggressiveness of extension, like “branchmult1” and “mindepthextend”. These affect contig assembly and read error-correction/extension in the same way, as error-correction is implemented by assembling through an error and replacing the error with the assembled base.
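As a toy illustration of "assembling through an error", the sketch below walks along a read and, when a trusted kmer is followed by an untrusted one, replaces the offending base with the single base that the kmer counts support. The function names and the `min_count` threshold are assumptions. It only corrects substitutions to the right of the first trusted kmer and leaves the base alone at a branch, which is a simplification of whatever Tadpole actually does.

```python
from collections import Counter


def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_read(read, counts, k, min_count=2):
    bases = list(read)
    for i in range(k, len(bases)):
        prev = "".join(bases[i - k:i])          # kmer ending just before position i
        cur = "".join(bases[i - k + 1:i + 1])   # kmer ending at position i
        if counts[prev] < min_count or counts[cur] >= min_count:
            continue  # no trusted anchor to extend from, or nothing to fix
        # Extend the trusted kmer by one base and see which continuations are supported.
        options = [b for b in "ACGT" if counts[prev[1:] + b] >= min_count]
        if len(options) == 1:  # unambiguous: no branch, no dead end
            bases[i] = options[0]
    return "".join(bases)
```

A read carrying one substitution in a well-covered region is restored because only the true base yields a kmer seen at least `min_count` times.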

A standard BBMerge command looks like this:
bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt

Tadpole integration is handled with a few extra flags and by using the "bbmerge-auto.sh" script, which attempts to allocate all of the memory on the node (like Tadpole does):
bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct

This will try to merge each pair of reads via overlap. If they do not merge, error-correct them with Tadpole and try again (“ecct” flag; note that this is distinct from the “ecco” flag). If they still don’t merge, extend each read to the right by 20bp (stopping early if a branch is encountered) and try again; repeat at most 10 times. There is also an “extend” flag, which extends the reads BEFORE trying to merge them, and only happens once. If the reads still don’t merge, the extensions are rolled back and the original reads are sent to outu.
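The control flow described above can be sketched roughly as follows. The three callables (`try_merge`, `error_correct`, `extend_right`) are hypothetical placeholders for the real overlap, correction, and extension steps, and stopping early when neither read grows is an assumption of this sketch.

```python
def merge_pair(r1, r2, try_merge, error_correct, extend_right,
               extend2=20, iterations=10):
    """Return the merged read, or None if the pair should go to outu unchanged."""
    merged = try_merge(r1, r2)
    if merged is not None:
        return merged
    # "ecct": error-correct both reads, then retry the merge.
    a, b = error_correct(r1), error_correct(r2)
    merged = try_merge(a, b)
    if merged is not None:
        return merged
    # "extend2" / "iterations": extend a little, retry, and repeat.
    for _ in range(iterations):
        longer_a, longer_b = extend_right(a, extend2), extend_right(b, extend2)
        if (longer_a, longer_b) == (a, b):
            break  # neither read could grow (branch or dead end)
        a, b = longer_a, longer_b
        merged = try_merge(a, b)
        if merged is not None:
            return merged
    return None  # caller keeps the original r1 and r2, i.e. the rollback
```

Returning `None` instead of the modified reads is how this sketch models the rollback: nothing that was corrected or extended leaks into the unmerged output.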

Particularly with longer kmers and highly-amplified libraries (like single cell), Tadpole may generate lots of short, typically low-coverage degenerate contigs. You can get rid of these by, for example, setting "mincontig=250 mincov=3", which will throw away all contigs that are under 250bp or have average coverage below 3.

Because it’s so fast, Tadpole can be useful for generating genome size estimates simply to determine resource requirements for another assembler. For any normal fragment library of an isolate genome, I recommend using KmerCountExact’s “peaks” output for genome size estimation. However, that depends on fairly uniform coverage and will not work on long-mate libraries, metagenomes, amplified single cells, or contaminated samples. In those cases, a quick assembly with Tadpole at k=31 – ignoring the degenerate contigs – should give a fairly accurate genome size estimation.
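A minimal sketch of that estimate, assuming you have already parsed each contig's length and average coverage into pairs (the parsing step and the function name are not part of Tadpole). The default thresholds simply reuse the "mincontig=250 mincov=3" example above.

```python
def estimate_genome_size(contigs, mincontig=250, mincov=3.0):
    """contigs: iterable of (length_bp, average_coverage) pairs."""
    return sum(length for length, cov in contigs
               if length >= mincontig and cov >= mincov)


# Made-up contigs: two real ones, one too short, one too low in coverage.
example = [(12000, 35.2), (180, 40.0), (900, 1.5), (5000, 28.0)]
print(estimate_genome_size(example))
```

Since Tadpole condenses repeats, a sum like this would tend to undercount repeat-rich genomes, so treat it as a lower-end figure for sizing another assembler's resources.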

Please let me know if you have any interesting experiences with Tadpole, either positive or negative!

P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm is a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.

Last edited by Brian Bushnell; 10-14-2015 at 06:49 PM.
Old 07-23-2015, 06:35 AM   #2
vingomez
Member
 
Location: USA

Join Date: Sep 2014
Posts: 18

Hi Brian,


Thanks again for your time in developing these tools. Could you clarify this statement from a previous post (http://seqanswers.com/forums/showpos...ostcount=222):

Quote:
For extending paired reads so that they overlap, only “extendright” is needed, so “extendleft” should be set to zero.
Example for error correction and extending 100 nt for PE files:
Code:
java -Xmx24g -cp /path/to/bbmap/current assemble.Tadpole in=r1.fastq.gz extend=r1.fastq.gz oute=r1.fastq.gz mode=extend extendleft=0 extendright=100 ecc=t

java -Xmx24g -cp /path/to/bbmap/current assemble.Tadpole in=r2.fastq.gz extend=r2.fastq.gz oute=r2.fastq.gz mode=extend extendleft=0 extendright=100 ecc=t
Do you still recommend the previous approach, or is it better to interleave the paired-end files (r1/r2) and use the following command?

Code:
tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50 ecc

Thanks again
Vicente

Last edited by GenoMax; 07-23-2015 at 07:02 AM. Reason: Fixed CODE tag
Old 07-23-2015, 09:34 AM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Hi Vicente,

It's much better to interleave them, because that way you use all the kmers in both files.

For input and output in two files, though, you can set "in1" and "in2":

tadpole.sh in1=r1.fq in2=r2.fq oute1=ext1.fq oute2=ext2.fq mode=extend extendright=100 ecc=t


The "oute" and "out" flags are kind of synonymous, but kind of not (there is no out2); I'll rectify that in the next release and get rid of "oute" as it's confusing. "el" and "er" are short for "extendleft" and "extendright", and there's no reason to extend left if all you want is to make the reads overlap, but it is useful if you want longer reads so that you can assemble with a larger K, or use a string-graph assembler, or whatever.
Old 07-23-2015, 12:56 PM   #4
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423

How does Tadpole compare to a short read assembler such as Trinity? Is the output of Tadpole more like the results of the inchworm stage of the Trinity pipeline?

I tried Tadpole out on some PE-100 reads that failed to align to the mouse transcriptome and it assembled them ridiculously fast and created some sequences that in fact matched up with many mouse/human/rat sequences in the uniprot database (via blastx). So clearly it works...just curious about my question above.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 07-23-2015, 01:20 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

I'm not really sure about Trinity, as I've never used it; I would assume that Tadpole would assemble the individual exons of differentially-spliced genes if you ran it on RNA-seq data, or the full transcripts of genes with a single isoform. From looking at a brief description of Inchworm, that sounds about like what Tadpole should produce. It's also similar to the output of the "uucontig" phase of Meraculous.

Currently, I don't have much information about the relative performance of Tadpole vs other assemblers; I've only directly tested it against SPAdes. Tadpole yields lower continuity and a lower misassembly rate, but a similar genome completeness according to Quast.

It is only a contig-builder - it assembles kmers into contigs until it reaches a branch or dead-end, then truncates them. It does not generate the explicit de Bruijn graph and try to remove heterozygous bubbles, or find a perfect traversal, or anything like that, so it will stop at any repeat longer than K. I plan to add a scaffolding phase later which may implement some of these things.
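The "extend until a branch or dead end" behaviour can be illustrated with a toy rightward extension over a table of kmer counts. This is a sketch under simplifying assumptions, not Tadpole's code: the function name and `min_count` are made up, it only looks at forward branches, and it ignores reverse complements.

```python
from collections import Counter


def extend_contig_right(seed, counts, k, min_count=1):
    contig = seed
    seen = {seed}
    while True:
        suffix = contig[-(k - 1):]
        nexts = [suffix + b for b in "ACGT" if counts.get(suffix + b, 0) >= min_count]
        # Stop at a dead end (no continuation), a branch (several), or a loop.
        if len(nexts) != 1 or nexts[0] in seen:
            return contig
        seen.add(nexts[0])
        contig += nexts[0][-1]


# Two made-up sequences that share "ATTACA" and then diverge.
k = 4
counts = Counter()
for seq in ("GATTACAGG", "CATTACATC"):
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1

print(extend_contig_right("GATT", counts, k))
```

Starting from "GATT", the extension runs through the shared region and stops at "GATTACA", because both "ACAG" and "ACAT" are present and the walk refuses to guess between them.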
Old 07-28-2015, 10:29 AM   #6
vingomez
Member
 
Location: USA

Join Date: Sep 2014
Posts: 18

Hi Brian,


This is a general question about Tadpole (but it also applies to every tool in the BBMap package). Per our conversation, you mentioned that:

Quote:
It's much better to interleave them, because that way you use all the kmers in both files.
Is it better to interleave the PE read files before any downstream processing/analysis to obtain better results (i.e. interleave the PE files as step #1), or does this observation apply only to certain commands/analyses (e.g. ecct)?

Thanks again
Vicente
Old 07-28-2015, 10:51 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

BBTools generally don't care whether paired read input is interleaved or in 2 files, so you don't need to explicitly interleave them. For example, either of these:

tadpole.sh mode=correct in=reads.fq out=corrected.fq

tadpole.sh mode=correct in1=read1.fq in2=read2.fq out1=corrected1.fq out2=corrected2.fq

...will give identical results, but this:

tadpole.sh mode=correct in=read1.fq out=corrected1.fq ordered
tadpole.sh mode=correct in=read2.fq out=corrected2.fq ordered


...would give inferior results. Furthermore, corrected1 and corrected2 in that case would end up with reads in different orders if you forget to add the "ordered" flag.

Many programs - such as BBDuk, BBNorm, BBMap, Seal, Tadpole, Dedupe, CalcTrueQuality - will give superior output when processing paired reads together rather than separately, and some, like BBMerge, require them to be processed together. There are a few, like Reformat, that don't care, but generally I recommend processing pairs together whenever possible. Again, though, it doesn't matter if they are in 2 files or interleaved into 1 file. If you are reading compressed files, then dual files have a higher theoretical max speed, but I normally find using a single interleaved file more convenient.

Last edited by Brian Bushnell; 12-16-2016 at 08:42 AM.
Old 10-12-2015, 03:33 PM   #8
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Will Tadpole (or more generally, your other mapping programs) work on a circular genome?
Old 10-12-2015, 03:36 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Yes, it works fine on a circular genome. For error-correction or extension, it does not matter whether the genome is circular. For assembly, if it produced a single contig, the break would be at some random location and the ends would not overlap by more than K-1 bases (though in practice, it won't produce a single-contig assembly on anything much larger than a mitochondrial genome, for most data).
Old 10-12-2015, 04:31 PM   #10
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Thanks, that's good to know. I'm trying to assemble a 15-18kb virus (and possibly mitochondria in the future), so that should be fine

Last edited by gringer; 10-12-2015 at 04:34 PM.
Old 10-12-2015, 06:07 PM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Good - I've found it performs quite well on both. For mitochondria, it's quite handy in that you can assemble a kmer band (e.g. only the kmers with depth between 500x and 700x). And for a virus, I've had trouble with SPAdes assembling dozens of copies, each slightly different, presumably due to the presence of a highly variable area (even though these were supposed to be clonal isolates). Tadpole was able to assemble it to 1x coverage of the reference with no duplications, right at the correct size (38kbp), though it was in multiple contigs.
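Conceptually, a "kmer band" is just a depth window applied to the kmer counts before assembly. A tiny sketch, with a made-up function name and made-up counts (how Tadpole exposes this on the command line is not shown here):

```python
def kmer_band(counts, low, high):
    """Keep only kmers whose depth falls inside [low, high]."""
    return {kmer: depth for kmer, depth in counts.items() if low <= depth <= high}


# Hypothetical depths: nuclear kmers near 35x, organelle near 600x, a repeat at 1500x.
counts = {"ACGT": 620, "CGTA": 35, "GTAC": 580, "TACG": 1500}
print(kmer_band(counts, 500, 700))
```

Only the kmers in the 500x to 700x window survive, so anything assembled from them should come from the high-copy component rather than the main genome.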

For mitochondria, I usually used K=93 (with >=150bp reads). For the virus, I used K=50 and the flag "bm1=8", I think, to get the best assembly. The latter lowers the stringency of branch detection from the default, which is fairly conservative for a rapidly-mutating virus.

Let me know how the results are (good or bad); I've only assembled 1 virus with it and we still are not sure why other assemblers had so much trouble.
Old 10-12-2015, 09:45 PM   #12
fahmida
Member
 
Location: Australia

Join Date: Aug 2010
Posts: 54

Hi Brian,

It seems like a powerful addition to BBTools.
Is it possible to use Tadpole for PacBio data (with accompanying Illumina data)?

Regards.
Old 10-12-2015, 10:06 PM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

I have assembled mitochondria from error-corrected PacBio data with Tadpole. But, the only reason I did that was because I needed to specifically assemble the components at a much higher coverage than the main genome. Other than assembling organelles, I don't think Tadpole currently has much utility for PacBio data; you would certainly get a better assembly out of HGAP/Celera or Falcon, for the main genome. Tadpole currently only does error-correction of substitutions, not indels, so it's not useful with raw PacBio data. Possibly, if I add in support for correcting indels, it may become useful with PacBio plus Illumina, but it's not there yet.
Old 10-13-2015, 03:40 AM   #14
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Hmm... option "rinse" for removing bubbles. Very clever!
Old 10-13-2015, 04:38 AM   #15
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Quote:
Originally Posted by Brian Bushnell View Post
Let me know how the results are (good or bad); I've only assembled 1 virus with it and we still are not sure why other assemblers had so much trouble.
For a first-pass effort, I tried just assembling after only trimming (i.e. no host sequence filtering), working off MiSeq 250bp paired-end data:

Code:
tadpole.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz  out=extended.fq mode=extend el=50 er=50 k=31 ecc=t
Unfortunately there were no extended sequences >400bp, so it looks like I'll need to do a bit of work to get a sequence out of these data.
Old 10-14-2015, 06:17 PM   #16
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by gringer View Post
Unfortunately there were no extended sequences >400bp, so it looks like I'll need to do a bit of work to get a sequence out of these data.
That's fine, and expected - with "mode=extend el=50 er=50" reads will be extended at most 50bp in each direction, then stop. So for 2x250bp data, you could at best generate 350bp sequences. The point of this is not to generate contigs, but to lengthen the reads prior to merging them or feeding them into an assembler, so that a longer kmer can be used - thus reducing the disadvantage of long kmers, which is locally low coverage.
Old 10-14-2015, 06:23 PM   #17
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Oh, okay. I guess I missed the "merge" step of the assembly then. I just looked at the first sentence and didn't realise Tadpole was only an error corrector / extender:

Quote:
Tadpole, a new BBTool, is an extremely fast kmer-based assembler

Last edited by gringer; 10-14-2015 at 06:42 PM.
Old 10-14-2015, 06:46 PM   #18
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

You can merge reads like this:

bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct qtrim2=r trimq=12 strict

BBMerge will then attempt to merge each read pair. If unsuccessful, it will quality-trim the right end of each read to Q12, and try again (qtrim2=r trimq=12). If still unsuccessful, it will try to extend the reads by up to 20bp on the right end only, and try merging again, up to 10 times (extend2=20 iterations=10). This allows up to 200bp extension for each read, so that 2x250 reads can still merge even with an insert size approaching 900bp, near the limit of Illumina bridge-amplification. I recommend this over extending first then merging.
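The 900bp figure follows from simple arithmetic, sketched here. The function name and the `min_overlap` parameter are illustrative assumptions; a real merge needs some overlap, which is why the practical limit only approaches the ceiling.

```python
def max_mergeable_insert(read_len, extend2, iterations, min_overlap=0):
    """Largest insert size a pair could span after all extension rounds."""
    max_extension = extend2 * iterations           # per read, right end only
    return 2 * (read_len + max_extension) - min_overlap


print(max_mergeable_insert(250, 20, 10))  # 2x250 reads, extend2=20, iterations=10
```

For 2x250 reads with extend2=20 and iterations=10, each read can grow by at most 200bp, giving a ceiling of 2 x 450 = 900bp.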

Note: The only difference between bbmerge.sh and bbmerge-auto.sh is that bbmerge.sh will try to grab a fixed amount of memory (because it doesn't need much) while bbmerge-auto.sh will try to grab all of the memory on the computer (because Tadpole will need it for storing the kmers).
Old 10-14-2015, 06:53 PM   #19
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

Ruh roh. Looks like I can't do the merge with Java 1.6:

Code:
/media/disk2/bbtools/bbmap/bbmerge.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz out=merged_assembled.fq ihist=ihist.txt extend2=50 iterations=10 k=31 ecct extend
java -Djava.library.path=/media/disk2/bbtools/bbmap/jni/ -ea -Xmx1000m -cp /media/disk2/bbtools/bbmap/current/ jgi.BBMerge in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz out=merged_assembled.fq ihist=ihist.txt extend2=50 iterations=10 k=31 ecct extend
Executing jgi.BBMerge [in=trimmed_NZGL01795_both_1P.fq.gz, in2=trimmed_NZGL01795_both_2P.fq.gz, out=merged_assembled.fq, ihist=ihist.txt, extend2=50, iterations=10, k=31, ecct, extend]

BBMerge version 8.82
Executing assemble.Tadpole1 [in1=trimmed_NZGL01795_both_1P.fq.gz, in2=trimmed_NZGL01795_both_2P.fq.gz, branchlower=3, branchmult1=20.0, branchmult2=3.0, mincountseed=3, mincountextend=2, minprob=0.5, k=31, prealloc=false, prefilter=false]

Using 32 threads.
Executing kmer.KmerTableSet [in1=trimmed_NZGL01795_both_1P.fq.gz, in2=trimmed_NZGL01795_both_2P.fq.gz, branchlower=3, branchmult1=20.0, branchmult2=3.0, mincountseed=3, mincountextend=2, minprob=0.5, k=31, prealloc=false, prefilter=false]

Exception in thread "main" java.lang.NoSuchMethodError: java.lang.Character.isAlphabetic(I)Z
        at kmer.KmerTableSet.<init>(KmerTableSet.java:167)
        at assemble.Tadpole1.<init>(Tadpole1.java:78)
        at assemble.Tadpole.makeTadpole(Tadpole.java:76)
        at jgi.BBMerge.<init>(BBMerge.java:668)
        at jgi.BBMerge.main(BBMerge.java:45)
Old 10-14-2015, 07:11 PM   #20
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Thanks for reporting that... I didn't realize Tadpole required Java 1.7+. I'll look into it tomorrow - I may be able to switch to something supported in 1.6. Or, of course, just write the method myself
Tags
assembler, bbmap, bbmerge, bbnorm, bbtools, error correction, tadpole
