**vingomez** · 07-23-2015, 05:35 AM

Hi Brian,

Thanks again for your time in developing these tools. Could you clarify this statement from a previous post (http://seqanswers.com/forums/showpos...ostcount=222):

For extending paired reads so that they overlap, only “extendright” is needed, so “extendleft” should be set to zero.

Example for error correction and extending 100 nt for PE files:

Code:

java -Xmx24g -cp /path/to/bbmap/current assemble.Tadpole in=r1.fastq.gz extend=r1.fastq.gz oute=r1.fastq.gz mode=extend extendleft=0 extendright=100 ecc=t

java -Xmx24g -cp /path/to/bbmap/current assemble.Tadpole in=r2.fastq.gz extend=r2.fastq.gz oute=r2.fastq.gz mode=extend extendleft=0 extendright=100 ecc=t

Do you still recommend the previous approach, or is it better to interleave the paired-end files (r1/r2) and use the following command?

Code:

tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50 ecc

Thanks again
Vicente

**Brian Bushnell** · 07-23-2015, 08:34 AM

Hi Vicente,

It's much better to interleave them, because that way you use all the kmers in both files.

For input and output in two files, though, you can set "in1" and "in2":

tadpole.sh in1=r1.fq in2=r2.fq oute1=ext1.fq oute2=ext2.fq mode=extend extendright=100 ecc=t

The "oute" and "out" flags are kind of synonymous, but kind of not (there is no out2); I'll rectify that in the next release and get rid of "oute" as it's confusing. "el" and "er" are short for "extendleft" and "extendright", and there's no reason to extend left if all you want is to make the reads overlap, but it is useful if you want longer reads so that you can assemble with a larger K, or use a string-graph assembler, or whatever.
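To illustrate the point above about using all the kmers in both files, here is a minimal Python sketch. It is not Tadpole's code: the function names and the toy 12 bp fragment are made up for illustration, and canonical (strand-independent) kmers are assumed. It shows that kmers lying beyond the end of read 1 exist in the count table only when the mate is counted too, which is what a right-extension of read 1 would need.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Yield each kmer of seq in canonical form (the lexicographically
    smaller of the kmer and its reverse complement)."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, reverse_complement(kmer))


def count_kmers(reads, k):
    """Count canonical kmers over an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        counts.update(canonical_kmers(read, k))
    return counts


# Hypothetical toy pair from a 12 bp fragment: r1 is the first 8 bases,
# r2 is the reverse complement of the last 8 bases.
fragment = "ACGTTGCATGGA"
r1 = fragment[:8]
r2 = reverse_complement(fragment[-8:])

k = 5
only_r1 = count_kmers([r1], k)
pooled = count_kmers([r1, r2], k)

# Kmers of the fragment that lie beyond the end of r1 are present
# only when both mates are counted together.
beyond_r1 = set(canonical_kmers(fragment, k)) - set(only_r1)
print(len(only_r1), len(pooled), sorted(beyond_r1))
```

With real data the effect is about depth rather than presence or absence, but the principle is the same: each file processed alone sees only part of the kmer evidence.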

**sdriscoll** · 07-23-2015, 11:56 AM

How does Tadpole compare to a short read assembler such as Trinity? Is the output of Tadpole more like the results of the inchworm stage of the Trinity pipeline?

I tried Tadpole out on some PE-100 reads that failed to align to the mouse transcriptome and it assembled them ridiculously fast and created some sequences that in fact matched up with many mouse/human/rat sequences in the UniProt database (via blastx). So clearly it works...just curious about my question above.

**Brian Bushnell** · 07-23-2015, 12:20 PM

I'm not really sure about Trinity, as I've never used it; I would assume that Tadpole would assemble the individual exons of differentially-spliced genes if you ran it on RNA-seq data, or the full transcripts of genes with a single isoform. From looking at a brief description of Inchworm, that sounds about like what Tadpole should produce. It's also similar to the output of the "uucontig" phase of Meraculous.

Currently, I don't have much information about the relative performance of Tadpole vs other assemblers; I've only directly tested it against SPAdes. Tadpole yields lower continuity and a lower misassembly rate, but a similar genome completeness according to Quast.

It is only a contig-builder: it assembles kmers into contigs until it reaches a branch or dead end, then truncates them. It does not generate an explicit de Bruijn graph and try to remove heterozygous bubbles, or find a perfect traversal, or anything like that, so it will stop at any repeat longer than K. I plan to add a scaffolding phase later which may implement some of these things.
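The "extend until a branch or dead end" behaviour described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Tadpole's implementation: it works on a single strand, ignores kmer counts and any branch-ratio thresholds, and all function names are hypothetical.

```python
def build_kmer_set(reads, k):
    """Collect every kmer (forward strand only) seen in the reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def successors(kmer, kmers):
    """Kmers in the set that overlap the last k-1 bases of kmer."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def predecessors(kmer, kmers):
    """Kmers in the set that overlap the first k-1 bases of kmer."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]


def extend_right(seed, kmers):
    """Grow a contig rightwards from seed until a dead end or a branch."""
    contig = seed
    current = seed
    seen = {seed}
    while True:
        nxt = successors(current, kmers)
        if len(nxt) != 1:
            break  # 0 successors = dead end, more than 1 = branch
        candidate = nxt[0]
        # Also stop at a kmer where another path merges in, or on a cycle.
        if candidate in seen or len(predecessors(candidate, kmers)) != 1:
            break
        contig += candidate[-1]
        seen.add(candidate)
        current = candidate
    return contig


# Two hypothetical reads that share a prefix and then diverge.
reads = ["ATGGCGTGCAAT", "ATGGCGTGCATC"]
kmers = build_kmer_set(reads, 4)
print(extend_right("ATGG", kmers))
```

In this toy case the contig stops at the last kmer before the two reads diverge, which mirrors why such a contig-builder halts at any repeat longer than K: the repeat's exits look like a branch.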

**vingomez** · 07-28-2015, 09:29 AM

Hi Brian,

This is a general question about Tadpole (but it also applies to every tool in the BBMap package). In our earlier conversation you mentioned that:

It's much better to interleave them, because that way you use all the kmers in both files.

Is it better to interleave the PE read files before any downstream processing/analysis to obtain better results (i.e. interleave the PE files as step #1), or does this observation apply only to certain commands/analyses (e.g. ecct)?

Thanks again
Vicente

**Brian Bushnell** · 07-28-2015, 09:51 AM

BBTools generally don't care whether paired read input is interleaved or in 2 files, so you don't need to explicitly interleave them. For example, either of these:

tadpole.sh mode=correct in=reads.fq out=corrected.fq

tadpole.sh mode=correct in1=read1.fq in2=read2.fq out1=corrected1.fq out2=corrected2.fq

...will give identical results, but this:

tadpole.sh mode=correct in=read1.fq out=corrected1.fq ordered
tadpole.sh mode=correct in=read2.fq out=corrected2.fq ordered

...would give inferior results. Furthermore, corrected1 and corrected2 in that case would end up with reads in different orders if you forget to add the "ordered" flag.

Many programs - such as BBDuk, BBNorm, BBMap, Seal, Tadpole, Dedupe, CalcTrueQuality - will give superior output when processing paired reads together rather than separately, and some, like BBMerge, require them to be processed together. There are a few, like Reformat, that don't care, but generally I recommend processing pairs together whenever possible. Again, though, it doesn't matter if they are in 2 files or interleaved into 1 file. If you are reading compressed files, then dual files have a higher theoretical max speed, but I normally find using a single interleaved file more convenient.
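For readers unsure what "interleaved" means in practice, the following Python sketch shows the layout: each read 1 record is immediately followed by its mate from read 2. It assumes unwrapped FASTQ (exactly four lines per record) and mates stored in the same order in both files; the function names and the toy records are made up for illustration, and this is not how BBTools does it internally.

```python
from itertools import islice, zip_longest


def fastq_records(lines):
    """Group an iterable of lines into 4-line FASTQ records.

    Assumes unwrapped FASTQ (exactly four lines per read)."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("malformed or truncated FASTQ record")
        yield record


def interleave(r1_lines, r2_lines):
    """Yield lines alternating one record from R1, then its mate from R2."""
    pairs = zip_longest(fastq_records(r1_lines), fastq_records(r2_lines))
    for rec1, rec2 in pairs:
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of reads")
        yield from rec1
        yield from rec2


# Hypothetical two-read example held in memory instead of files.
r1 = ["@a/1", "ACGT", "+", "IIII", "@b/1", "TTGA", "+", "IIII"]
r2 = ["@a/2", "TGCA", "+", "IIII", "@b/2", "CCAT", "+", "IIII"]
merged = list(interleave(r1, r2))
print("\n".join(merged))
```

The sketch pairs records purely by position, which is why the two inputs must be in the same order; that is the same reason the separately corrected files above need to keep their read order.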

**gringer** · 10-12-2015, 02:33 PM

Will Tadpole (or more generally, your other mapping programs) work on a circular genome?

**Brian Bushnell** · 10-12-2015, 02:36 PM

Yes, it works fine on a circular genome. For error-correction or extension, it does not matter whether the genome is circular. For assembly, if it produced a single contig, the break would be at some random location and the ends would not overlap by more than K-1 bases (though in practice, it won't produce a single-contig assembly on anything much larger than a mitochondrial genome, for most data).
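If a single contig does come out of a circular genome, the possible end overlap of up to K-1 bases mentioned above can be checked and trimmed with a small post-processing step. The Python below is a hypothetical sketch under that assumption (exact-match overlap only, made-up function names); it is not part of Tadpole.

```python
def terminal_overlap(contig, max_overlap):
    """Length of the longest prefix of contig (up to max_overlap bases)
    that is repeated exactly as its suffix, or 0 if there is none."""
    longest = min(max_overlap, len(contig) - 1)
    for n in range(longest, 0, -1):
        if contig[:n] == contig[-n:]:
            return n
    return 0


def trim_circular(contig, k):
    """Drop a repeated end of at most k-1 bases so the sequence
    represents one copy of the circle."""
    overlap = terminal_overlap(contig, k - 1)
    return contig[:len(contig) - overlap]


# Hypothetical 10 bp circle, linearised with its first 3 bases repeated.
genome = "ACGTTGCAAG"
contig = genome + genome[:3]
print(terminal_overlap(contig, 3), trim_circular(contig, 4))
```

A very short match (one or two bases) can occur by chance, so on real data it would be sensible to accept only overlaps close to K-1 or to confirm the junction by mapping reads across it.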

**gringer** · 10-12-2015, 03:31 PM

Thanks, that's good to know. I'm trying to assemble a 15-18kb virus (and possibly mitochondria in the future), so that should be fine.

**Brian Bushnell** · 10-12-2015, 05:07 PM

Good - I've found it performs quite well on both. For mitochondria, it's quite handy in that you can assemble a kmer band (e.g. only the kmers with depth between 500x and 700x). And for a virus, I've had trouble with SPAdes assembling dozens of copies, each slightly different, presumably due to the presence of a highly variable area (even though these were supposed to be clonal isolates). Tadpole was able to assemble it to 1x coverage of the reference with no duplications, right at the correct size (38kbp), though it was in multiple contigs.

For mitochondria, I usually used K=93 (with >=150bp reads). For the virus, I used K=50 and the flag "bm1=8", I think, to get the best assembly. That second flag lowers the stringency of branch detection from the default, which is fairly conservative for a rapidly-mutating virus.
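The "kmer band" idea mentioned above for mitochondria amounts to keeping only kmers whose depth falls inside a window, so that low-depth nuclear kmers are ignored. A minimal Python sketch of that selection follows; the function names and toy reads are hypothetical, only the forward strand is counted, and no Tadpole flags are implied.

```python
from collections import Counter


def count_forward_kmers(reads, k):
    """Count kmers on the forward strand only (a simplification)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def depth_band(counts, min_depth, max_depth):
    """Return the kmers whose depth lies within [min_depth, max_depth]."""
    return {kmer for kmer, depth in counts.items()
            if min_depth <= depth <= max_depth}


# Hypothetical data: a "nuclear" read seen 3 times and an
# "organelle" read seen 600 times.
reads = ["ACGTACGA"] * 3 + ["TTGCCATG"] * 600
counts = count_forward_kmers(reads, 5)
band = depth_band(counts, 500, 700)
print(sorted(band))
```

Only kmers from the high-depth read survive the 500x to 700x window, so an assembly seeded from that set would be restricted to the high-copy component.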

Let me know how the results are (good or bad); I've only assembled 1 virus with it and we still are not sure why other assemblers had so much trouble.

**fahmida** · 10-12-2015, 08:45 PM

Hi Brian,

It seems like a powerful addition to BBTools.
Is it possible to use Tadpole for PacBio data (with accompanying Illumina data)?

Regards.

**Brian Bushnell** · 10-12-2015, 09:06 PM

I have assembled mitochondria from error-corrected PacBio data with Tadpole. But, the only reason I did that was because I needed to specifically assemble the components at a much higher coverage than the main genome. Other than assembling organelles, I don't think Tadpole currently has much utility for PacBio data; you would certainly get a better assembly out of HGAP/Celera or Falcon, for the main genome. Tadpole currently only does error-correction of substitutions, not indels, so it's not useful with raw PacBio data. Possibly, if I add in support for correcting indels, it may become useful with PacBio plus Illumina, but it's not there yet.
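To make the substitution-versus-indel distinction above concrete, here is a generic kmer-based substitution corrector in Python. It is a textbook-style sketch, not Tadpole's algorithm: the set of trusted kmers is simply taken from a made-up reference string (in practice it would be kmers above some depth threshold), and all names are hypothetical.

```python
def all_kmers(seq, k):
    """Set of all kmers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def position_supported(bases, pos, trusted, k):
    """True if every kmer overlapping position pos is in the trusted set."""
    first = max(0, pos - k + 1)
    last = min(pos, len(bases) - k)
    return all("".join(bases[s:s + k]) in trusted
               for s in range(first, last + 1))


def correct_substitutions(read, trusted, k):
    """Replace a base only when exactly one alternative base makes all
    kmers covering that position trusted. Read length never changes."""
    bases = list(read)
    for pos, original in enumerate(bases):
        if position_supported(bases, pos, trusted, k):
            continue
        fixes = []
        for candidate in "ACGT":
            if candidate == original:
                continue
            bases[pos] = candidate
            if position_supported(bases, pos, trusted, k):
                fixes.append(candidate)
        # Keep the original base unless the fix is unambiguous.
        bases[pos] = fixes[0] if len(fixes) == 1 else original
    return "".join(bases)


# Hypothetical reference and a read with one substituted base (G -> A).
reference = "ACGTTGCATGGA"
trusted = all_kmers(reference, 4)
noisy = "ACGTTACATGGA"
print(correct_substitutions(noisy, trusted, 4))
```

Because the corrector only swaps one base for another, it cannot repair a read in which a base has been inserted or deleted: every kmer downstream of an indel is shifted, and no single-base swap restores them. That is the limitation with raw PacBio data described above.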

**gringer** · 10-13-2015, 02:40 AM

Hmm... option "rinse" for removing bubbles. Very clever!

**gringer** · 10-13-2015, 03:38 AM

Originally posted by Brian Bushnell:

Let me know how the results are (good or bad); I've only assembled 1 virus with it and we still are not sure why other assemblers had so much trouble.

For a first-pass effort, I tried just assembling after only trimming (i.e. no host sequence filtering), working off MiSeq 250bp paired-end data:

Code:

tadpole.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz out=extended.fq mode=extend el=50 er=50 k=31 ecc=t

Unfortunately there were no extended sequences >400bp, so it looks like I'll need to do a bit of work to get a sequence out of these data.

Introducing Tadpole: an assembler, error-corrector, and read-extender
