Old 06-06-2017, 05:59 AM   #1
Junior Member
Location: Germany

Join Date: Jun 2017
Posts: 2
Error correction in BBTools

Dear All,

Since this is my first post in this forum, I would like to start by greeting all of you!

Striving to improve my preprocessing pipeline for shotgun metagenomic data, I came across the read error correction functionality offered by the BBTools package. Specifically, BBMerge, Clumpify, and Tadpole have these options. Brian Bushnell suggested using all three of them in this post. Could anybody explain to me the differences between these three approaches and why it is desirable to combine them?

Many thanks in advance for your help!

Sven
Old 06-06-2017, 01:09 PM   #2
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Each one handles errors differently, and in my testing, I've found combining them works best for assembly with Tadpole or Spades (note that if you are using Megahit for assembly, I have not found any of these to be effective in increasing assembly quality). Specifically -

BBMerge will only address sequencing errors that are in one read but not the other, and only in those read pairs that overlap by a sufficient degree. This is very useful because it is completely immune to certain kinds of mis-corrections. For example, Tadpole and other kmer-based error-correction tools will typically mis-correct SNPs that occur in a rare (say, 1% abundance) organism in a metagenome into the allele of the dominant (99%) organism, since they use overall coverage. BBMerge will not do this since it examines each read pair individually.
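To make the overlap idea concrete, here is a minimal Python sketch. It is not BBMerge's actual algorithm: it assumes the overlap length is already known, handles substitutions only, and simply trusts the higher-quality base. The function names and the tie rule (leave both bases alone when qualities are equal) are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def correct_by_overlap(r1, q1, r2, q2, overlap):
    """Resolve mismatches inside the overlap of one read pair.

    Assumption: the last `overlap` bases of r1 cover the same positions as
    the first `overlap` bases of revcomp(r2). q1 and q2 are lists of
    integer quality scores, one per base.
    """
    r1 = list(r1)
    r2_rc = list(revcomp(r2))
    q2_rc = q2[::-1]
    start = len(r1) - overlap
    fixed = 0
    for i in range(overlap):
        a, b = r1[start + i], r2_rc[i]
        if a == b:
            continue
        if q1[start + i] > q2_rc[i]:
            r2_rc[i] = a          # read 1 is more trustworthy here
            fixed += 1
        elif q2_rc[i] > q1[start + i]:
            r1[start + i] = b     # read 2 is more trustworthy here
            fixed += 1
        # equal qualities: no evidence either way, leave both untouched
    return "".join(r1), revcomp("".join(r2_rc)), fixed
```

Only the two reads of the pair are consulted, so the coverage of other organisms in the sample cannot influence the result.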

Clumpify has a more advanced ability to resolve individual alleles than Tadpole, and makes better use of quality-score information. Once a clump is formed of all of the reads sharing a kmer, it then recursively splits the clump into sub-clumps when it detects what appear to be common alleles. It can do this in various ways; the best is when multiple reads in the clump co-segregate by shared SNPs. If you have a clump of 100 reads, and 3 of them have an A where the rest have a C... probably, the 3 As are errors. But if those 3 reads also have a T where the other 97 have a G... then, they are co-segregating and it becomes much more likely that neither is an error, but rather, there are two alleles present. So, they are split and the two alleles are processed independently, which reduces the chances of false-positive corrections. Kmer-based correctors can generally only deal with a single position in a single read at a time, which makes this kind of allele-splitting impossible.

Also, Clumpify can look at the average quality of a position. Say those 3 reads with A's had an average quality score of 5 at that position, while the consensus had an average quality score of 30; that makes an error look much more likely than a different genomic allele. Kmer-based error-correctors generally do not track average quality of a kmer, just the total count, so that analysis can only be done using the quality score of a single read.

The ability of Clumpify to form correct clumps, though, depends on the reads being mostly-correct already, which is why pre-correcting with BBMerge is helpful prior to Clumpify.
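The co-segregation test described above can be sketched in a few lines of Python. This is a toy under strong assumptions, not Clumpify's implementation: the reads in the clump are already aligned and of equal length, only exact matches of the variant pattern count as co-segregating, quality scores are ignored, and the thresholds `min_reads` and `min_shared` are invented for the example.

```python
from collections import Counter

def split_or_correct(clump, min_reads=3, min_shared=2):
    """Split off co-segregating minor alleles; correct everything else.

    Returns (major_reads, minor_allele_groups). Reads whose differences
    from the consensus are not supported by enough other reads are
    replaced by the consensus sequence.
    """
    # Per-column majority base across the clump.
    consensus = [Counter(col).most_common(1)[0][0] for col in zip(*clump)]

    # A read's signature is the set of positions where it differs.
    signatures = [
        tuple((i, b) for i, (b, c) in enumerate(zip(read, consensus)) if b != c)
        for read in clump
    ]
    support = Counter(signatures)

    major, minor = [], {}
    for read, sig in zip(clump, signatures):
        if len(sig) >= min_shared and support[sig] >= min_reads:
            # Several reads share several variants: treat as a real allele.
            minor.setdefault(sig, []).append(read)
        else:
            # Isolated differences: treat as errors and correct them.
            major.append("".join(consensus))
    return major, list(minor.values())
```

With 7 reads of one sequence, 3 reads that share two variants, and 1 read with a single isolated mismatch, the 3 reads are split off unchanged and the isolated mismatch is corrected to the consensus.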

Tadpole, then; what advantages does it have? Well - it's still the most thorough of the three. And it does an increasingly good job (more complete, fewer false positives) when the input data is as correct as possible. Which is why I like BBMerge (with ecco) -> Clumpify -> Tadpole. I did empirically test them in different orders, and that worked the best, as well.
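For reference, the order BBMerge (with ecco) -> Clumpify -> Tadpole could be scripted roughly as below. This is a sketch, not an official recipe: `ecco` and `mode=correct` appear in this thread, while the bare `ecc` flag for Clumpify, the intermediate file names, and the assumption that `reads.fq` is an interleaved paired-end file are mine and should be checked against each tool's usage message.

```python
import shlex

def correction_pipeline(reads, prefix="sample"):
    """Build the three commands in the recommended order.

    Each step reads the previous step's output. File names are
    hypothetical placeholders.
    """
    merged = f"{prefix}.ecco.fq"
    clumped = f"{prefix}.eccc.fq"
    final = f"{prefix}.ecct.fq"
    return [
        # 1. Overlap-based correction; pairs are corrected, not merged.
        ["bbmerge.sh", f"in={reads}", f"out={merged}", "ecco"],
        # 2. Clump-based correction (flag name assumed, see lead-in).
        ["clumpify.sh", f"in={merged}", f"out={clumped}", "ecc"],
        # 3. Kmer-based correction on the already-cleaned reads.
        ["tadpole.sh", f"in={clumped}", f"out={final}", "mode=correct"],
    ]

# Dry run: print the commands instead of executing them.
for cmd in correction_pipeline("reads.fq"):
    print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to `subprocess.run(cmd, check=True)` so that a failure stops the chain.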

Specifically for huge metagenomes, it's worth noting that the memory consumption of BBMerge is not affected at all by the data volume; Clumpify goes slower when the data can no longer fit in memory (but should otherwise be unaffected); and Tadpole needs more memory as the data volume increases. You can dramatically reduce Tadpole's memory consumption using the flag "prefilter=X" to ignore kmers occurring X or fewer times, but that also means it won't be able to error-correct portions of the metagenome with coverage of X or less, which is normally fine if you set X to 1 or 2.
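The trade-off of ignoring kmers that occur X or fewer times can be seen in a toy Python example. It says nothing about how Tadpole implements its prefilter; in particular, this sketch counts every kmer first and so saves no memory itself. It only shows which kmers would remain in the table and which would be lost. The function names are hypothetical.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every kmer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def apply_prefilter(counts, x):
    """Keep only kmers seen more than x times (the 'X or fewer' rule)."""
    return {kmer: n for kmer, n in counts.items() if n > x}

reads = ["ACGTACGT"] * 5 + ["ACGTACGA"]   # last read ends in a rare kmer
counts = kmer_counts(reads, k=4)
kept = apply_prefilter(counts, x=1)
print(len(counts), "distinct kmers before,", len(kept), "after")
```

The singleton kmer is dropped, which is exactly the kind of low-coverage sequence that can no longer be corrected once it is filtered out.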

So, for maximal error-correction, I recommend all 3 (and again, that gives the best metagenomic assembly results for Spades in my tests). For the most conservative error-correction, you could just use BBMerge, since it never makes false-positive corrections on the basis of coverage (and as such, might be useful in somatic variant-calling), but it will only correct a small fraction of the errors. I also wrote another tool called consect which can accept the output of multiple error-correction tools and only keep the corrections that they all agree on, which is even more conservative. But more conservative error correction generally does not improve the assembly quality as much.
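The "keep only the corrections that all tools agree on" idea can be illustrated with a short Python sketch. It is not how consect is implemented; it assumes substitution-only corrections, so every corrected version has the same length as the original read, and the function name is made up for the example.

```python
def consensus_correction(original, *corrected):
    """Accept a base change only if every corrected version agrees on it.

    `corrected` holds the same read as emitted by different correctors.
    Where they disagree, the original base is kept.
    """
    out = []
    for i, base in enumerate(original):
        proposals = {version[i] for version in corrected}
        if len(proposals) == 1:
            out.append(proposals.pop())   # unanimous (changed or not)
        else:
            out.append(base)              # disagreement: stay conservative
    return "".join(out)
```

For example, if two correctors both change position 2 but only one of them changes position 5, the result keeps the first change and reverts the second.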
Old 06-12-2017, 07:36 AM   #3
Junior Member
Location: Germany

Join Date: Jun 2017
Posts: 2

Thank you so much, Brian, for your comprehensive answer (and sorry for my late acknowledgement). I can now see the benefit of implementing all three error correction methods and will definitely include them in my pipeline in the order you suggested.
Sven
Old 06-12-2017, 10:54 AM   #4
Location: Louisiana

Join Date: Nov 2013
Posts: 29

Hi Brian,

Like Sven, I am also interested in using error correction, albeit for a different purpose, which may or may not be appropriate given the context.

So, I am analyzing SuperSAGE data generated on an Illumina NextSeq (50bp SE reads). To test mapping "accuracy", I simulated 1,000,000 28bp reads from a mouse mm10 mRNA reference where the mRNA had been cut in silico with the NlaIII restriction enzyme (which recognizes CATG) followed by 28bp (partially simulating the restriction enzyme EcoP15I, which cuts 25-27 bases). I then converted the simulated reads into a count table representing the 'actual' gene counts for mm10. Next, I experimented with error correction using Tadpole, with kmer lengths 15-28:
Code: tadpole.sh in=reads.fq out=k.ecc.fq mode=correct k=15-28
I then extracted all sequences with CATG followed by 21-23 bases, or 21-23 bases followed by CATG (because the read simulator also produces the reverse complement of the reference):
Code: grep -Po "CATG\w{21,23}|\w{21,23}CATG"
and converted these to FASTA format. I then mapped these reads to the reference that I used to simulate the reads using blastn:
Code: blastn -db mouse-sage-ref.fa -query k.ecc.fq -perc_identity 100 -qcov_hsp_perc 100 -outfmt 6 -dust no -word_size 25 -evalue 1000
and converted the BLAST hits to gene counts. Finally, I compared the gene counts from the 'actual' data with those from the reads mapped with BLAST.
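The tag-extraction and FASTA-conversion steps could also be done in one pass in Python. The sketch below uses the same pattern as the grep command, but differs in one respect that is my own choice: it looks only at the sequence line of each 4-line FASTQ record, so headers and quality strings can never match. The function names and the `tagN` record names are hypothetical.

```python
import re

# Same pattern as the grep call: CATG plus 21-23 bases, or the reverse.
TAG = re.compile(r"CATG\w{21,23}|\w{21,23}CATG")

def extract_tags(fastq_lines):
    """Return every tag found on the sequence lines of a 4-line FASTQ."""
    tags = []
    for n, line in enumerate(fastq_lines):
        if n % 4 == 1:  # second line of each record holds the bases
            tags.extend(m.group(0) for m in TAG.finditer(line.strip()))
    return tags

def to_fasta(tags):
    """Format tags as FASTA records with sequential placeholder names."""
    return "\n".join(f">tag{i}\n{tag}" for i, tag in enumerate(tags, 1))
```

Usage would look like `print(to_fasta(extract_tags(open("k.ecc.fq"))))`, with the output redirected to the file that is then given to blastn.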

What I am curious about is whether error correction is appropriate here and, if so, whether I should incorporate Clumpify and then Tadpole, and, if using Tadpole, what the appropriate K value would be, given that varying K influenced the number of "correct" gene counts.

When I say the number of correct gene counts, I mean the following: I took the sum of the actual gene counts (after getting rid of tags that mapped to multiple genes, so not 1,000,000), which was 974,552, and subtracted from it the sum of the gene counts for a particular K value.

K              	15	16	17	18	19	20	21	22	23	24	25	26	27	28
num differences	10486	8342	8323	8802	9387	9997	10657	11323	12012	12695	13432	14165	14983	15342
The other problem is what the "correct" value for K would be with actual 50bp SE reads from the NextSeq.
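The table above can be summarized with a few lines of Python. The numbers are copied from the post; the script only ranks the simulated results and expresses each difference relative to the 974,552 actual counts. It does not, by itself, say which K suits real NextSeq reads.

```python
# Differences in summed gene counts per kmer length, from the table above.
diffs_by_k = {
    15: 10486, 16: 8342, 17: 8323, 18: 8802, 19: 9387, 20: 9997,
    21: 10657, 22: 11323, 23: 12012, 24: 12695, 25: 13432, 26: 14165,
    27: 14983, 28: 15342,
}
TOTAL_ACTUAL = 974552  # sum of actual gene counts reported in the post

# K with the smallest difference from the actual counts.
best_k = min(diffs_by_k, key=diffs_by_k.get)

for k in sorted(diffs_by_k):
    d = diffs_by_k[k]
    marker = "  <- smallest" if k == best_k else ""
    print(f"K={k}: {d} ({100 * d / TOTAL_ACTUAL:.2f}%){marker}")
```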

Gopo
