
**Brian Bushnell** · 06-06-2017, 12:09 PM

Each one handles errors differently, and in my testing, I've found combining them works best for assembly with Tadpole or Spades (note that if you are using Megahit for assembly, I have not found any of these to be effective in increasing assembly quality). Specifically -

BBMerge will only address sequencing errors that are in one read but not the other, and only in those read pairs that overlap by a sufficient degree. This is very useful because it is completely immune to certain kinds of mis-corrections. For example, Tadpole and other kmer-based error-correction tools will typically mis-correct SNPs that occur in a rare (say, 1% abundance) organism in a metagenome into the allele of the dominant (99%) organism, since they use overall coverage. BBMerge will not do this since it examines each read pair individually.

Clumpify has a more advanced ability to resolve individual alleles than Tadpole, and makes better use of quality-score information. Once a clump is formed of all of the reads sharing a kmer, it then recursively splits the clump into sub-clumps when it detects what appear to be common alleles. It can do this in various ways; the best is when multiple reads in the clump co-segregate by shared SNPs. If you have a clump of 100 reads, and 3 of them have an A where the rest have a C... probably, the 3 As are errors. But if those 3 reads also have a T where the other 97 have a G... then, they are co-segregating and it becomes much more likely that neither are errors, but rather, there are two alleles present. So, they are split and the two alleles are processed independently, which reduces the chances of false-positive corrections. Kmer-based correctors can generally only deal with a single position in a single read at a time which makes this kind of allele-splitting impossible. Also, Clumpify can look at the average quality of a position. Say those 3 reads with A's had an average quality score of 5 at that position, while the consensus had an average quality score of 30; that makes an error look much more likely than a different genomic allele. Kmer-based error-correctors generally do not track average quality of a kmer, just the total count, so that analysis can only be done using the quality score of a single read. The ability of Clumpify to form correct clumps, though, depends on the reads being mostly-correct already, which is why pre-correcting with BBMerge is helpful prior to Clumpify.
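The co-segregation argument above can be checked by hand on a toy clump. The sketch below is not Clumpify's code; it only tabulates which base pairs occur together at two columns, assuming the reads are already aligned so that a given column is the same position in every read. The function names and the toy reads are made up for illustration.

```shell
#!/usr/bin/env bash
# Illustration only: the counting argument from the post, not Clumpify's algorithm.
# Input file: one read per line, pre-aligned (an assumption of this sketch).

# cosegregation FILE POS1 POS2
# Prints each base pair seen at the two 1-based columns, with its read count.
cosegregation() {
    awk -v a="$2" -v b="$3" '
        { n[substr($0, a, 1) substr($0, b, 1)]++ }
        END { for (pair in n) print pair, n[pair] }
    ' "$1" | sort
}

# Toy clump: 97 reads with C at column 3 and G at column 8,
# and 3 reads with A at column 3 and T at column 8.
make_toy_clump() {
    local i
    for i in $(seq 1 97); do echo "TTCAAAAGTT"; done
    for i in $(seq 1 3);  do echo "TTAAAAATTT"; done
}
```

Running `make_toy_clump > clump.txt; cosegregation clump.txt 3 8` prints `AT 3` and `CG 97`. Only two pair types appear, with no `AG` or `CT` reads, which is the pattern that suggests a second allele rather than three independent errors.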

Tadpole, then; what advantages does it have? Well - it's still the most thorough of the three. And it does an increasingly good job (more complete, fewer false positives) when the input data is as correct as possible. Which is why I like BBMerge (with ecco) -> Clumpify -> Tadpole. I did empirically test them in different orders, and that worked the best, as well.
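As a rough sketch, the BBMerge (with ecco) -> Clumpify -> Tadpole order could be scripted as below. The file names are placeholders, and every flag other than `ecco` (for example `mix`, `passes=4`, `mode=correct`) is an assumption that should be checked against each tool's help text. The helper only prints the commands so they can be reviewed before running.

```shell
#!/usr/bin/env bash
# ecc_pipeline_cmds INPUT
# Prints the three correction commands in the order recommended in the post.
# Output names (ecco/eccc/ecct) are placeholders chosen for this sketch.
ecc_pipeline_cmds() {
    local in="$1"
    # 1. Overlap-based correction without merging. "mix" is an assumed extra
    #    flag so that pairs that do not overlap stay in the same output file.
    echo "bbmerge.sh in=${in} out=ecco.fq.gz ecco mix"
    # 2. Clump-based correction; "ecc" and "passes=4" are assumed settings.
    echo "clumpify.sh in=ecco.fq.gz out=eccc.fq.gz ecc passes=4"
    # 3. Kmer-based correction last, on the most-corrected data.
    echo "tadpole.sh in=eccc.fq.gz out=ecct.fq.gz mode=correct"
}
```

Calling `ecc_pipeline_cmds reads.fq.gz` lists the three steps; each stage reads the previous stage's output, which is what lets the later tools benefit from the earlier corrections.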

Specifically for huge metagenomes, it's worth noting that the memory consumption of BBMerge is not affected at all by the data volume; Clumpify goes slower when the data can no longer fit in memory (but should otherwise be unaffected); and Tadpole needs more memory as the data volume increases. You can dramatically reduce Tadpole's memory consumption using the flag "prefilter=X" to ignore kmers occurring X or fewer times, but that also means it won't be able to error-correct portions of the metagenome with coverage of X or less, which is normally fine if you set X to 1 or 2.
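A minimal sketch of a Tadpole call with the prefilter flag follows. The file names and the `mode=correct` setting are assumptions for illustration; only `prefilter=X` comes from the post. The helper prints the command rather than running it.

```shell
#!/usr/bin/env bash
# tadpole_prefilter_cmd INPUT OUTPUT [X]
# Prints a Tadpole correction command that ignores kmers seen X or fewer times.
# X defaults to 2 here, following the "1 or 2" suggestion in the post.
tadpole_prefilter_cmd() {
    local in="$1" out="$2" x="${3:-2}"
    echo "tadpole.sh in=${in} out=${out} mode=correct prefilter=${x}"
}
```

For example, `tadpole_prefilter_cmd eccc.fq.gz ecct.fq.gz 1` prints a command with `prefilter=1`, which trades correction of regions with coverage 1 for lower memory use.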

So, for maximal error-correction, I recommend all 3 (and again, that gives the best metagenomic assembly results for Spades in my tests). For the most conservative error-correction, you could just use BBMerge since it never makes false-positive corrections on the basis of coverage, but it will only correct a small fraction of the errors (and as such, might be useful in somatic variant-calling). I also wrote another tool called consect which can accept the output of multiple error-correction tools and only keep the corrections that they all agree on, which is even more conservative. But more conservative error correction generally does not improve the assembly quality as much.

**Sven** · 06-12-2017, 06:36 AM

Thank you so much Brian for your comprehensive answer (and sorry for my late acknowledgement). I can now see the benefit of implementing all three error correction methods and will definitely include them in my pipeline in the order you suggested.

**Gopo** · 06-12-2017, 09:54 AM

Hi Brian,

Like Sven, I am also interested in using error correction, albeit for a different purpose, which may or may not be appropriate given the context.

So, I am analyzing SuperSage data generated from an Illumina NextSeq (50bp SE reads). To test mapping "accuracy", I simulated 1,000,000 28bp reads using randomreads.sh, from a mouse mm10 mRNA reference where the mRNA had been cut synthetically with the NlaIII restriction enzyme (recognizes CATG) followed by 28bp (partially simulating the restriction enzyme EcoP15I which cuts 25-27 bases). I then converted the simulated reads into a count table representing 'actual' gene counts to mm10. Next, I experimented with error correcting using tadpole.sh, using kmers 15-28:

Code:

tadpole.sh in=reads.fq out=k.ecc.fq mode=correct k=15-28

Then I extracted all sequences with CATG followed by 21-23 bases, or 21-23 bases followed by CATG (because randomreads.sh also produces reverse complements of the reference):

Code:

grep -Po "CATG\w{21,23}|\w{21,23}CATG"

and converted these to FASTA format. I then mapped these reads to the reference that I used to simulate the reads using blastn

Code:

blastn -db mouse-sage-ref.fa -query k.ecc.fq -perc_identity 100 -qcov_hsp_perc 100 -outfmt 6 -dust no -word_size 25 -evalue 1000

and converted the BLAST hits to gene counts. Finally, I compared the number of gene counts from the 'actual' data and the reads mapped with BLAST.

What I am curious about is whether error correction is appropriate here, and if so, should I incorporate clumpify then tadpole, and if using tadpole what to use as the appropriate K value given that varying K influenced the number of "correct" gene counts.
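For reference, the K sweep described above could be scripted as one Tadpole run per K value. This is a sketch that assumes "kmers 15-28" means a separate run for each K; the per-K output names are made up for illustration. The helper prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# k_sweep_cmds LOW HIGH
# Prints one Tadpole correction command per K in [LOW, HIGH].
# Assumption: each K gets its own run and its own output file (k<K>.ecc.fq).
k_sweep_cmds() {
    local lo="$1" hi="$2" k
    for k in $(seq "$lo" "$hi"); do
        echo "tadpole.sh in=reads.fq out=k${k}.ecc.fq mode=correct k=${k}"
    done
}
```

`k_sweep_cmds 15 28` lists 14 commands, one for each K from 15 to 28.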

By the number of correct gene counts, I mean the following: I took the sum of the actual gene counts (after discarding tags that mapped to multiple genes, so not the full 1,000,000), which was 974,552, and subtracted from it the sum of the gene counts for a particular K value.
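The counting step was not shown in the post, so the following is only one way it might be done. It assumes tabular BLAST output (`-outfmt 6`), where column 1 is the query (tag) ID and column 2 is the subject ID, and it further assumes the subject ID is the gene name. Tags that hit more than one gene are dropped, as described above.

```shell
#!/usr/bin/env bash
# unique_gene_counts HITS.tsv
# Prints "gene<TAB>count", counting only tags that hit exactly one gene.
# Assumes BLAST -outfmt 6 input and that column 2 can be used as the gene name.
unique_gene_counts() {
    awk -F'\t' '
        # Count each distinct tag-gene pair once and remember the gene.
        !seen[$1 SUBSEP $2]++ { genes[$1]++; last[$1] = $2 }
        END {
            for (tag in genes)
                if (genes[tag] == 1) count[last[tag]]++   # uniquely mapped tags only
            for (g in count) print g "\t" count[g]
        }
    ' "$1" | sort
}
```

Summing the second column of this output gives the per-K total that is compared against the sum of the 'actual' counts.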

Code:

num differences	10486	8342	8323	8802	9387	9997	10657	11323	12012	12695	13432	14165	14983	15342
K       =	15	16	17	18	19	20	21	22	23	24	25	26	27	28
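
To read the table above at a glance, a small helper can pick the K with the fewest differences. This is only a convenience sketch over the numbers already listed, using the same difference metric as the post; the function name is made up.

```shell
#!/usr/bin/env bash
# best_k: reads "K differences" pairs on stdin and prints the pair
# with the smallest number of differences.
best_k() {
    awk 'NR == 1 || $2 < min { min = $2; k = $1 } END { print k, min }'
}

# Values copied from the table above (K, then number of differences).
best_k <<'EOF'
15 10486
16 8342
17 8323
18 8802
19 9387
20 9997
21 10657
22 11323
23 12012
24 12695
25 13432
26 14165
27 14983
28 15342
EOF
```

On these numbers the helper prints `17 8323`, with K=16 close behind at 8342; the difference count rises steadily for larger K.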

The other problem is what the "correct" value for K would be with actual 50bp SE reads from the NextSeq.

Best,
Gopo


Error correction in BBTools
