  • Error correction in BBTools

    Dear All,

    Since this is my first post in this forum, I would like to start by greeting all of you!

    Striving to improve my preprocessing pipeline for shotgun metagenomic data, I came across the read error correction functionality offered by the BBTools package. Specifically, bbmerge.sh, clumpify.sh, and tadpole.sh have these options. Brian Bushnell suggested using all three of them in this post. Could anybody explain the differences between these three approaches and why it is desirable to combine them?

    Many thanks in advance for your help!

    Best
    Sven

  • #2
    Each one handles errors differently, and in my testing, I've found that combining them works best for assembly with Tadpole or SPAdes (note that if you are assembling with MEGAHIT, I have not found any of these to be effective in increasing assembly quality). Specifically -

    BBMerge will only address sequencing errors that are in one read but not the other, and only in those read pairs that overlap by a sufficient degree. This is very useful because it is completely immune to certain kinds of mis-corrections. For example, Tadpole and other kmer-based error-correction tools will typically mis-correct SNPs that occur in a rare (say, 1% abundance) organism in a metagenome into the allele of the dominant (99%) organism, since they use overall coverage. BBMerge will not do this since it examines each read pair individually.
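    A minimal invocation of this mode might look like the following (filenames are placeholders; ecco requests correction-by-overlap instead of merging, mix sends both reads of each pair to the single output, and vstrict makes overlap detection conservative):
    Code:
    bbmerge.sh in1=reads_R1.fq.gz in2=reads_R2.fq.gz out=ecco.fq.gz ecco mix vstrict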

    Clumpify has a more advanced ability to resolve individual alleles than Tadpole, and makes better use of quality-score information. Once a clump is formed of all the reads sharing a kmer, it recursively splits the clump into sub-clumps when it detects what appear to be common alleles. It can do this in various ways; the best is when multiple reads in the clump co-segregate by shared SNPs. If you have a clump of 100 reads, and 3 of them have an A where the rest have a C... probably, the 3 As are errors. But if those 3 reads also have a T where the other 97 have a G, then they are co-segregating, and it becomes much more likely that neither is an error but rather that two alleles are present. So they are split, and the two alleles are processed independently, which reduces the chance of false-positive corrections. Kmer-based correctors can generally only deal with a single position in a single read at a time, which makes this kind of allele-splitting impossible.

    Clumpify can also look at the average quality of a position. Say those 3 reads with As had an average quality score of 5 at that position, while the consensus had an average quality score of 30; that makes an error look much more likely than a different genomic allele. Kmer-based error-correctors generally do not track the average quality of a kmer, just the total count, so that analysis can only be done using the quality score of a single read. The ability of Clumpify to form correct clumps, though, depends on the reads being mostly correct already, which is why pre-correcting with BBMerge before Clumpify is helpful.
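    A sketch of the corresponding Clumpify step (ecc turns on error correction during clumping, and passes sets the number of clumping passes; filenames continue from the BBMerge sketch above):
    Code:
    clumpify.sh in=ecco.fq.gz out=eccc.fq.gz ecc passes=4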

    Tadpole, then; what advantages does it have? Well - it's still the most thorough of the three, and it does an increasingly good job (more complete, fewer false positives) the more correct the input data already is. Which is why I like BBMerge (with ecco) -> Clumpify -> Tadpole. I did empirically test them in different orders, and that order worked best as well.
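    The final Tadpole step of that chain, continuing with the filenames above, might look like this (a sketch; k=62 is just an illustrative kmer length for typical Illumina read lengths):
    Code:
    tadpole.sh in=eccc.fq.gz out=ecct.fq.gz mode=correct k=62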

    For huge metagenomes specifically, it's worth noting that the memory consumption of BBMerge is not affected at all by the data volume; Clumpify goes slower when the data can no longer fit in memory (but should otherwise be unaffected); and Tadpole needs more memory as the data volume increases. You can dramatically reduce Tadpole's memory consumption with the flag "prefilter=X", which makes it ignore kmers occurring X or fewer times, but that also means it won't be able to error-correct portions of the metagenome with coverage of X or less - which is normally fine if you set X to 1 or 2.
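    For example, a sketch with low-depth kmers excluded from the correction tables:
    Code:
    tadpole.sh in=eccc.fq.gz out=ecct.fq.gz mode=correct k=62 prefilter=2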

    So, for maximal error-correction, I recommend all 3 (and again, that gives the best metagenomic assembly results for Spades in my tests). For the most conservative error-correction, you could just use BBMerge since it never makes false-positive corrections on the basis of coverage, but it will only correct a small fraction of the errors (and as such, might be useful in somatic variant-calling). I also wrote another tool called consect which can accept the output of multiple error-correction tools and only keep the corrections that they all agree on, which is even more conservative. But more conservative error correction generally does not improve the assembly quality as much.

    • #3
      Thank you so much, Brian, for your comprehensive answer (and sorry for my late acknowledgement). I can now see the benefit of implementing all three error correction methods and will definitely include them in my pipeline in the order you suggested.

      • #4
        Hi Brian,

        Like Sven, I am also interested in using error correction, albeit for a different purpose, which may or may not be appropriate given the context.

        So, I am analyzing SuperSAGE data generated on an Illumina NextSeq (50bp SE reads). To test mapping "accuracy", I simulated 1,000,000 28bp reads using randomreads.sh (command sketched below, after the mapping steps) from a mouse mm10 mRNA reference in which the mRNA had been cut in silico with the NlaIII restriction enzyme (recognition site CATG), keeping the 28bp that follow (partially simulating the restriction enzyme EcoP15I, which cuts 25-27 bases downstream). I then converted the simulated reads into a count table representing 'actual' gene counts against mm10. Next, I experimented with error correction using tadpole.sh at kmer sizes 15 to 28:
        Code:
        for K in {15..28}; do tadpole.sh in=reads.fq out=k$K.ecc.fq mode=correct k=$K; done
        Then I extracted all sequences with CATG followed by 21-23 bases, or 21-23 bases followed by CATG (because randomreads.sh also generates reads from the reverse complement of the reference)
        Code:
        grep -Po "CATG\w{21,23}|\w{21,23}CATG" k$K.ecc.fq
        and converted the matches to FASTA format (k$K.ecc.tags.fa in the command below). I then mapped these tags back to the reference I used for the simulation, using blastn
        Code:
        blastn -db mouse-sage-ref.fa -query k$K.ecc.tags.fa -perc_identity 100 -qcov_hsp_perc 100 -outfmt 6 -dust no -word_size 25 -evalue 1000
        and converted the BLAST hits to gene counts. Finally, I compared the gene counts from the 'actual' data with the gene counts from the BLAST-mapped reads.
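        For completeness, the simulation step mentioned above was roughly the following (a sketch; I am assuming randomreads.sh's ref=, out=, reads=, and length= parameter names here):
        Code:
        randomreads.sh ref=mouse-sage-ref.fa out=reads.fq reads=1000000 length=28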

        What I am curious about is whether error correction is appropriate here; if so, should I incorporate Clumpify and then Tadpole; and if using Tadpole, what would be an appropriate k value, given that varying k influenced the number of "correct" gene counts?

        When I say the number of correct gene counts: I took the sum of the actual gene counts after discarding tags that mapped to multiple genes (974,552 rather than 1,000,000) and subtracted the sum of the gene counts for a particular k value.

        Code:
        K                15     16     17     18     19     20     21     22     23     24     25     26     27     28
        num differences  10486  8342   8323   8802   9387   9997   10657  11323  12012  12695  13432  14165  14983  15342
        The other question is what the "correct" value for k would be with actual 50bp SE reads from the NextSeq.

        Best,
        Gopo
