
Brian Bushnell 07-22-2015 05:27 PM

Introducing Tadpole: an assembler, error-corrector, and read-extender
Tadpole, a new BBTool, is an extremely fast kmer-based assembler. How fast is it? Around 250x faster than SPAdes with --careful (which is how we generally run it); it can assemble E. coli on my 4-core desktop in about 12 seconds, and scales near-linearly with CPU cores. It supports arbitrarily long kmer lengths. Usage is simple: in=reads.fq out=contigs.fa

Tadpole is very conservative and optimized for correctness rather than length; which is to say, it stops at every branch, and condenses every repeat. Also, it does not currently do scaffolding. So it will typically produce an L50 substantially lower than, say, SPAdes, but also a much lower misassembly rate. This is because while Tadpole is an assembler, my primary design goals were for read extension and error-correction; and specifically, to allow BBMerge to effectively merge and/or produce insert size histograms for non-overlapping libraries. As such, it is integrated into BBMerge in addition to being a standalone tool. Tadpole’s error-correction is substantially better than BBNorm’s error-correction, largely because it uses exact rather than approximate kmer counts.

To error-correct reads: in=reads.fq out=corrected.fq mode=correct

To extend reads by 50bp in each direction: in=reads.fq out=extended.fq mode=extend el=50 er=50

To error-correct and extend at the same time, using a kmer length of 62: in=reads.fq out=extended.fq mode=extend el=50 er=50 k=62 ecc=t

One of my goals with read extension is to allow the usage of longer kmer lengths in assembly (either with Tadpole or something else), as longer kmers require longer reads for a given level of coverage.

While fairly memory-efficient by default, Tadpole has various options for reducing memory consumption; unlike BBNorm, Tadpole's memory consumption increases with input size. “prealloc” uses fixed data structures rather than growable ones, which increases both speed and memory efficiency when near the maximum amount of memory (in other words, for assembling a tiny genome prealloc=f is faster, but for a big genome prealloc=t is faster). “prefilter=2” uses an additional pass with a count-min sketch to avoid storing kmers that occur at most 2 times, which are generally error kmers that waste space. “minprob=0.8” ignores kmers that according to quality scores have less than 80% chance of being error-free. “k”, of course, controls kmer length; shorter kmers are more memory-efficient (and faster). Specifically, k=1-31 uses about 20 bytes per kmer; k=32-62 uses about 30, etc.
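
To make the "minprob" idea concrete, here is a minimal Python sketch of how a kmer's chance of being error-free can be derived from quality scores. It assumes standard Phred scores (error probability 10^(-Q/10)) and assumes the per-base probabilities are simply multiplied; the function names are made up for illustration and this is not Tadpole's actual code.

```python
def kmer_error_free_probability(quals):
    """Probability that every base of a kmer is correct.

    quals: Phred quality scores, one per base of the kmer.
    Assumption: base errors are independent, so the per-base
    probabilities of being correct are multiplied together.
    """
    prob = 1.0
    for q in quals:
        prob *= 1.0 - 10.0 ** (-q / 10.0)
    return prob


def passes_minprob(quals, minprob=0.8):
    """True if the kmer would be kept under a minprob-style threshold."""
    return kmer_error_free_probability(quals) >= minprob


if __name__ == "__main__":
    k = 31
    # A 31-mer of uniform Q30 bases versus one of uniform Q20 bases.
    print(kmer_error_free_probability([30] * k), passes_minprob([30] * k))
    print(kmer_error_free_probability([20] * k), passes_minprob([20] * k))
```

Under these assumptions a 31-mer made entirely of Q30 bases clears the 0.8 threshold, while one made entirely of Q20 bases does not, which shows how quickly the threshold bites as k grows.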

There are several options that determine aggressiveness of extension, like “branchmult1” and “mindepthextend”. These affect contig assembly and read error-correction/extension in the same way, as error-correction is implemented by assembling through an error and replacing the error with the assembled base.

A standard BBMerge command looks like this: in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt

Tadpole integration is handled with a few extra flags, and using the "" script which attempts to allocate all of the memory on the node (like Tadpole does): in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct

This will try to merge each pair of reads via overlap. If they do not merge, error-correct them with Tadpole and try again (“ecct” flag; note that this is distinct from the “ecco” flag). If they still don’t merge, extend each read to the right by 20bp (stopping early if a branch is encountered) and try again; repeat at most 10 times. There is also an “extend” flag, which extends the reads BEFORE trying to merge them, and only happens once. If the reads don’t merge, the extensions are rolled back and the original reads are sent to outu.
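
The merge / correct / extend / retry sequence described above can be sketched in Python as follows. This is only an illustration of the control flow, not BBMerge's implementation: merge_with_rescue, make_toy_helpers and the three callables are hypothetical names, the toy helpers work on an error-free reference string, and extending the corrected (rather than the original) reads is an assumption.

```python
_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMP)[::-1]


def merge_with_rescue(r1, r2, try_merge, correct, extend_right,
                      extend2=20, iterations=10):
    """Return (merged, None) on success, or (None, (r1, r2)) on failure.

    On failure the ORIGINAL reads are returned, i.e. any correction or
    extension is rolled back.
    """
    merged = try_merge(r1, r2)                 # 1) plain overlap merge
    if merged is not None:
        return merged, None
    e1, e2 = correct(r1), correct(r2)          # 2) error-correct, retry
    merged = try_merge(e1, e2)
    if merged is not None:
        return merged, None
    for _ in range(iterations):                # 3) extend right, retry
        e1, e2 = extend_right(e1, extend2), extend_right(e2, extend2)
        merged = try_merge(e1, e2)
        if merged is not None:
            return merged, None
    return None, (r1, r2)                      # 4) roll back


def make_toy_helpers(genome, min_overlap=5):
    """Toy stand-ins that 'extend' by looking reads up in a known genome."""
    strands = (genome, revcomp(genome))

    def try_merge(a, b):
        b_fwd = revcomp(b)  # put read 2 on the same strand as read 1
        for olap in range(min(len(a), len(b_fwd)), min_overlap - 1, -1):
            if a[-olap:] == b_fwd[:olap]:
                return a + b_fwd[olap:]
        return None

    def correct(read):
        return read  # toy reads are already error-free

    def extend_right(read, n):
        for strand in strands:
            pos = strand.find(read)
            if pos >= 0:
                end = pos + len(read)
                return read + strand[end:end + n]
        return read  # not found: leave the read unchanged

    return try_merge, correct, extend_right
```

With a 60bp toy genome, two 15bp reads separated by a 20bp gap fail the plain merge but merge once each read has been extended far enough; with too few iterations the original reads come back untouched.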

Particularly with longer kmers and highly-amplified libraries (like single cell), Tadpole may generate lots of short, typically low-coverage degenerate contigs. You can get rid of these by, for example, setting "mincontig=250 mincov=3", which will throw away all contigs under 250bp and with average coverage below 3.

Because it’s so fast, Tadpole can be useful for generating genome size estimates simply to determine resource requirements for another assembler. For any normal fragment library of an isolate genome, I recommend using KmerCountExact’s “peaks” output for genome size estimation. However, that depends on fairly uniform coverage and will not work on long-mate libraries, metagenomes, amplified single cells, or contaminated samples. In those cases, a quick assembly with Tadpole at k=31 – ignoring the degenerate contigs – should give a fairly accurate genome size estimation.

Please let me know if you have any interesting experiences with Tadpole, either positive or negative!

P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm is a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.

vingomez 07-23-2015 05:35 AM

Hi Brian,

Thanks again for your time in developing these tools. Could you clarify this statement from a previous post:


For extending paired reads so that they overlap, only “extendright” is needed, so “extendleft” should be set to zero.
Example for error correction and extending 100 nt for PE files:

java -Xmx24g -cp /path/to/bbmap/current assemble.Tadpole in=r1.fastq.gz extend=r1.fastq.gz oute=r1.fastq.gz mode=extend extendleft=0 extendright=100 ecc=t

java -Xmx24g -cp /path/to/bbmap/current assemble.Tadpole in=r2.fastq.gz extend=r2.fastq.gz oute=r2.fastq.gz mode=extend extendleft=0 extendright=100 ecc=t

Do you still recommend the previous approach, or is it better to interleave the paired-end files (r1/r2) and use the following command?

Code: in=reads.fq out=extended.fq mode=extend el=50 er=50 ecc

Thanks again

Brian Bushnell 07-23-2015 08:34 AM

Hi Vicente,

It's much better to interleave them, because that way you use all the kmers in both files.

For input and output in two files, though, you can set "in1" and "in2": in1=r1.fq in2=r2.fq oute1=ext1.fq oute2=ext2.fq mode=extend extendright=100 ecc=t

The "oute" and "out" flags are kind of synonymous, but kind of not (there is no out2); I'll rectify that in the next release and get rid of "oute" as it's confusing. "el" and "er" are short for "extendleft" and "extendright", and there's no reason to extend left if all you want is to make the reads overlap, but it is useful if you want longer reads so that you can assemble with a larger K, or use a string-graph assembler, or whatever.

sdriscoll 07-23-2015 11:56 AM

How does Tadpole compare to a short read assembler such as Trinity? Is the output of Tadpole more like the results of the inchworm stage of the Trinity pipeline?

I tried Tadpole out on some PE-100 reads that failed to align to the mouse transcriptome and it assembled them ridiculously fast and created some sequences that in fact matched up with many mouse/human/rat sequences in the uniprot database (via blastx). So clearly it works...just curious about my question above.

Brian Bushnell 07-23-2015 12:20 PM

I'm not really sure about Trinity, as I've never used it; I would assume that Tadpole would assemble the individual exons of differentially-spliced genes if you ran it on RNA-seq data, or the full transcripts of genes with a single isoform. From looking at a brief description of Inchworm, that sounds about like what Tadpole should produce. It's also similar to the output of the "uucontig" phase of Meraculous.

Currently, I don't have much information about the relative performance of Tadpole vs other assemblers; I've only directly tested it against SPAdes. Tadpole yields lower continuity and a lower misassembly rate, but a similar genome completeness according to Quast.

It is only a contig-builder - it assembles kmers into contigs until it reaches a branch or dead-end, then truncates them. It does not generate the explicit DeBruijn graph and try to remove heterozygous bubbles, or find a perfect traversal, or anything like that, so it will stop at any repeat longer than K. I plan to add a scaffolding phase later which may implement some of these things.
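
As a rough illustration of "extend until a branch or dead-end", here is a toy Python sketch. It is a deliberate simplification and not Tadpole's algorithm: it only looks at the forward strand (no reverse-complement handling), treats any kmer seen at least min_count times as real, and has none of the depth-ratio branch rules such as branchmult1. All function names are hypothetical.

```python
from collections import Counter


def solid_kmers(reads, k, min_count=1):
    """Set of kmers seen at least min_count times (forward strand only)."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_count}


def successors(kmer, kmers):
    """Kmers in the set that overlap kmer by k-1 bases on its right."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def predecessors(kmer, kmers):
    """Kmers in the set that overlap kmer by k-1 bases on its left."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]


def extend_contig_right(seed, kmers):
    """Grow a contig rightward from seed, one base at a time.

    Stops at a dead end (no successor), at a branch (several successors,
    or a successor with several predecessors), or when a kmer repeats.
    """
    contig, current, seen = seed, seed, {seed}
    while True:
        nxt = successors(current, kmers)
        if len(nxt) != 1:
            break  # dead end or outgoing branch
        candidate = nxt[0]
        if candidate in seen or len(predecessors(candidate, kmers)) != 1:
            break  # loop or incoming branch
        contig += candidate[-1]
        seen.add(candidate)
        current = candidate
    return contig


if __name__ == "__main__":
    # Two sequences that share a prefix and then diverge.
    reads = ["TTGACCGATAG", "TTGACCGCTCA"]
    print(extend_contig_right("TTGA", solid_kmers(reads, k=4)))
```

In the example the two reads share the prefix TTGACCG and then diverge, so the extension stops right at the point of divergence rather than picking one path.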

vingomez 07-28-2015 09:29 AM

Hi Brian,

This is a general question about Tadpole (but it also applies to every tool in the BBMap package). In our conversation you mentioned that:


It's much better to interleave them, because that way you use all the kmers in both files.
Is it better to interleave the PE read files before any downstream processing/analysis to obtain better results (i.e. interleave the PE files as step #1), or does this observation apply only to certain commands/analyses (e.g. ecct)?

Thanks again

Brian Bushnell 07-28-2015 09:51 AM

BBTools generally don't care whether paired read input is interleaved or in 2 files, so you don't need to explicitly interleave them. For example, either of these:

mode=correct in=reads.fq out=corrected.fq
mode=correct in1=read1.fq in2=read2.fq out1=corrected1.fq out2=corrected2.fq

...will give identical results, but this:

mode=correct in=read1.fq out=corrected1.fq ordered
mode=correct in=read2.fq out=corrected2.fq ordered

...would give inferior results. Furthermore, corrected1 and corrected2 in that case would end up with reads in different orders if you forget to add the "ordered" flag.

Many programs - such as BBDuk, BBNorm, BBMap, Seal, Tadpole, Dedupe, CalcTrueQuality - will give superior output when processing paired reads together rather than separately, and some, like BBMerge, require them to be processed together. There are a few, like Reformat, that don't care, but generally I recommend processing pairs together whenever possible. Again, though, it doesn't matter if they are in 2 files or interleaved into 1 file. If you are reading compressed files, then dual files have a higher theoretical max speed, but I normally find using a single interleaved file more convenient.

gringer 10-12-2015 02:33 PM

Will Tadpole (or more generally, your other mapping programs) work on a circular genome?

Brian Bushnell 10-12-2015 02:36 PM

Yes, it works fine on a circular genome. For error-correction or extension, it does not matter whether the genome is circular. For assembly, if it produced a single contig, the break would be at some random location and the ends would not overlap by more than K-1 bases (though in practice, it won't produce a single-contig assembly on anything much larger than a mitochondrial genome, for most data).

gringer 10-12-2015 03:31 PM

Thanks, that's good to know. I'm trying to assemble a 15-18kb virus (and possibly mitochondria in the future), so that should be fine ;)

Brian Bushnell 10-12-2015 05:07 PM

Good - I've found it performs quite well on both. For mitochondria, it's quite handy in that you can assemble a kmer band (e.g. only the kmers with depth between 500x and 700x). And for a virus, I've had trouble with Spades assembling dozens of copies, each slightly different, presumably due to the presence of a highly variable area (even though these were supposed to be clonal isolates). Tadpole was able to assemble it to 1x coverage of the reference with no duplications, right at the correct size (38kbp), though it was in multiple contigs.
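
The "kmer band" idea (keeping only kmers whose depth falls inside a window) can be sketched in a few lines of Python. This is a toy version with exact counting on the forward strand only; the function name is hypothetical and it does not show which Tadpole flags select the band.

```python
from collections import Counter


def kmers_in_depth_band(reads, k, min_depth, max_depth):
    """Return the kmers whose count lies in [min_depth, max_depth]."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {
        kmer for kmer, depth in counts.items()
        if min_depth <= depth <= max_depth
    }


if __name__ == "__main__":
    # "AAAC" is seen 5 times (high-depth component), "GGGT" only twice.
    reads = ["AAAC"] * 5 + ["GGGT"] * 2
    print(kmers_in_depth_band(reads, k=4, min_depth=4, max_depth=6))
```

Assembling only from the kmers inside the band is what lets a high-copy component such as an organelle be separated from the lower-depth main genome.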

For mitochondria, I usually used K=93 (with >=150bp reads). For the virus, I used K=50 and the flag "bm1=8", I think, to get the best assembly. The latter lowers the stringency of branch detection from the default, which is fairly conservative for a rapidly-mutating virus.

Let me know how the results are (good or bad); I've only assembled 1 virus with it and we still are not sure why other assemblers had so much trouble.

fahmida 10-12-2015 08:45 PM

Hi Brian,

It seems like a powerful addition to BBTools.
Is it possible to use Tadpole for PacBio data (with accompanying Illumina data)?


Brian Bushnell 10-12-2015 09:06 PM

I have assembled mitochondria from error-corrected PacBio data with Tadpole. But, the only reason I did that was because I needed to specifically assemble the components at a much higher coverage than the main genome. Other than assembling organelles, I don't think Tadpole currently has much utility for PacBio data; you would certainly get a better assembly out of HGAP/Celera or Falcon, for the main genome. Tadpole currently only does error-correction of substitutions, not indels, so it's not useful with raw PacBio data. Possibly, if I add in support for correcting indels, it may become useful with PacBio plus Illumina, but it's not there yet.

gringer 10-13-2015 02:40 AM

Hmm... option "rinse" for removing bubbles. Very clever!

gringer 10-13-2015 03:38 AM


Originally Posted by Brian Bushnell (Post 182312)
Let me know how the results are (good or bad); I've only assembled 1 virus with it and we still are not sure why other assemblers had so much trouble.

For a first-pass effort, I tried just assembling after only trimming (i.e. no host sequence filtering), working off MiSeq 250bp paired-end data:

Code: in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz  out=extended.fq mode=extend el=50 er=50 k=31 ecc=t
Unfortunately there were no extended sequences >400bp, so it looks like I'll need to do a bit of work to get a sequence out of these data.

Brian Bushnell 10-14-2015 05:17 PM


Originally Posted by gringer (Post 182350)
Unfortunately there were no extended sequences >400bp, so it looks like I'll need to do a bit of work to get a sequence out of these data.

That's fine, and expected - with "mode=extend el=50 er=50" reads will be extended at most 50bp in each direction, then stop. So for 2x250bp data, you could at best generate 350bp sequences. The point of this is not to generate contigs, but to lengthen the reads prior to merging them or feeding them into an assembler, so that a longer kmer can be used - thus reducing the disadvantage of long kmers, which is locally low coverage.

gringer 10-14-2015 05:23 PM

Oh, okay. I guess I missed the "merge" step of the assembly then. I just looked at the first sentence and didn't realise Tadpole was only an error corrector / extender:


Tadpole, a new BBTool, is an extremely fast kmer-based assembler

Brian Bushnell 10-14-2015 05:46 PM

You can merge reads like this: in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct qtrim2=r trimq=12 strict

BBMerge will then attempt to merge each read pair. If unsuccessful, it will quality-trim the right end of each read to Q12, and try again (qtrim2=r trimq=12). If still unsuccessful, it will try to extend the reads by up to 20bp on the right end only, and try merging again, up to 10 times (extend2=20 iterations=10). This allows up to 200bp extension for each read, so that 2x250 reads can still merge even with an insert size approaching 900bp, near the limit of Illumina bridge-amplification. I recommend this over extending first then merging.
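
The arithmetic behind the 200bp and 900bp figures can be written out as a small Python helper. The function name and the min_overlap parameter are hypothetical; the result is only an upper bound, since a real merge still needs some overlap between the extended reads.

```python
def max_mergeable_insert(read_len, extend2, iterations, min_overlap=0):
    """Upper bound on the insert size that iterative extension can reach.

    Each read can grow by at most extend2 * iterations bases, and the two
    extended reads must still overlap by min_overlap bases to merge.
    """
    max_extension = extend2 * iterations            # e.g. 20 * 10 = 200
    extended_len = read_len + max_extension         # e.g. 250 + 200 = 450
    return 2 * extended_len - min_overlap           # e.g. 2 * 450 = 900


if __name__ == "__main__":
    print(max_mergeable_insert(250, extend2=20, iterations=10))
```

For 2x250 reads with extend2=20 and iterations=10 this gives 900, and any required overlap pulls the practical limit slightly below that, which matches "approaching 900bp".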

Note: The only difference between and is that will try to grab a fixed amount of memory (because it doesn't need much) while will try to grab all of the memory on the computer (because Tadpole will need it for storing the kmers).

gringer 10-14-2015 05:53 PM

Ruh roh. Looks like I can't do the merge with Java 1.6:


/media/disk2/bbtools/bbmap/ in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz out=merged_assembled.fq ihist=ihist.txt extend2=50 iterations=10 k=31 ecct extend
java -Djava.library.path=/media/disk2/bbtools/bbmap/jni/ -ea -Xmx1000m -cp /media/disk2/bbtools/bbmap/current/ jgi.BBMerge in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz out=merged_assembled.fq ihist=ihist.txt extend2=50 iterations=10 k=31 ecct extend
Executing jgi.BBMerge [in=trimmed_NZGL01795_both_1P.fq.gz, in2=trimmed_NZGL01795_both_2P.fq.gz, out=merged_assembled.fq, ihist=ihist.txt, extend2=50, iterations=10, k=31, ecct, extend]

BBMerge version 8.82
Executing assemble.Tadpole1 [in1=trimmed_NZGL01795_both_1P.fq.gz, in2=trimmed_NZGL01795_both_2P.fq.gz, branchlower=3, branchmult1=20.0, branchmult2=3.0, mincountseed=3, mincountextend=2, minprob=0.5, k=31, prealloc=false, prefilter=false]

Using 32 threads.
Executing kmer.KmerTableSet [in1=trimmed_NZGL01795_both_1P.fq.gz, in2=trimmed_NZGL01795_both_2P.fq.gz, branchlower=3, branchmult1=20.0, branchmult2=3.0, mincountseed=3, mincountextend=2, minprob=0.5, k=31, prealloc=false, prefilter=false]

Exception in thread "main" java.lang.NoSuchMethodError: java.lang.Character.isAlphabetic(I)Z
        at kmer.KmerTableSet.<init>(
        at assemble.Tadpole1.<init>(
        at assemble.Tadpole.makeTadpole(
        at jgi.BBMerge.<init>(
        at jgi.BBMerge.main(

Brian Bushnell 10-14-2015 06:11 PM

Thanks for reporting that... I didn't realize Tadpole required Java 1.7+. I'll look into it tomorrow - I may be able to switch to something supported in 1.6. Or, of course, just write the method myself :)

All times are GMT -8.
