  • GATK Base Quality Recalibration

    Hi all, I haven't seen anyone else mention this, so I was wondering if someone could clear something up for me. The GATK base quality recalibration process basically works as follows: the CountCovariates walker goes through all your reads, looking for bases that mismatch the reference at sites that are not listed as polymorphic in dbSNP. These mismatches are presumed to be mostly sequencing errors.

    A table is built up containing, for every possible combination of [dinucleotide, base position in read, quality score], a count of the total number of times that combination was encountered, and the number of those times that the base mismatched the reference at a non-polymorphic site. This ratio (mismatches/total observations) is taken to represent the empirical error rate of those base calls, and the quality score for that set of bases is overwritten with the shiny new empirical quality score.
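To make the table described above concrete, here is a rough Python sketch. It is not GATK's code: the input tuple layout, the function names, and the +1/+2 smoothing (used only to avoid taking the log of zero) are all assumptions made for illustration. It also assumes that bases at known dbSNP sites are skipped entirely, so they count toward neither the total nor the mismatches.

```python
import math
from collections import defaultdict


def build_covariate_table(observations):
    """Count observations and mismatches per covariate combination.

    `observations` is a hypothetical iterable of tuples:
    (dinucleotide, position_in_read, reported_quality, is_mismatch, is_known_site)
    """
    # key -> [total observations, mismatches]
    table = defaultdict(lambda: [0, 0])
    for dinuc, position, quality, is_mismatch, is_known_site in observations:
        if is_known_site:
            # Assumption: known polymorphic sites are ignored completely.
            continue
        counts = table[(dinuc, position, quality)]
        counts[0] += 1
        if is_mismatch:
            counts[1] += 1
    return table


def empirical_quality(total, mismatches):
    """Phred-scaled empirical quality, -10 * log10(mismatch rate).

    The +1/+2 smoothing is an illustrative assumption so that a bin
    with zero mismatches does not produce log10(0).
    """
    rate = (mismatches + 1) / (total + 2)
    return -10 * math.log10(rate)


if __name__ == "__main__":
    obs = [
        ("CG", 10, 30, True, False),
        ("CG", 10, 30, False, False),
        ("CG", 10, 30, True, True),   # known dbSNP site, skipped
        ("AT", 10, 30, False, False),
    ]
    for key, (total, mismatches) in build_covariate_table(obs).items():
        print(key, total, mismatches, round(empirical_quality(total, mismatches), 2))
```

Every base that falls in the same (dinucleotide, position, quality) bin would then have its reported quality replaced by that bin's empirical value.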

    I understand completely why this is helpful in dealing with systematic biases in the assignment of quality scores by the sequencer, but doesn't it possibly introduce a bias of its own? For example, certain nucleotides or dinucleotides are more likely to mutate spontaneously (e.g. methylcytosine -> thymine). Wouldn't this create a bias in the empirical quality scores for that nucleotide or dinucleotide? It would be reported as being less accurate, despite the fact that it was being correctly sequenced.

    Is this actually likely to be a significant problem, or would the number of reads with mutations like this be tiny compared to the number with mismatches due to sequencing errors?

  • #2
    Hi Rocketknight,

    A little late on the reply, but I'm contemplating some of these issues myself. The mutation bias issue you mentioned seems like it might be a problem only if this class of mutations is under-represented in dbSNP (and therefore over-represented in the CountCovariates table). For human data, I can't think why this would be expected, but I'm curious whether you have seen evidence of it.

    • #3
      Since CG is extremely likely to be methylated at the cytosine and is prone to mutation to thymine once methylated, the four dinucleotides we'd expect to be affected if this were occurring would be CG/GC and TG/AC. In the Broad's sample data for Base Quality Score Recalibration, CG, GC and AC were the three dinucleotides with the lowest empirical scores. (TG actually had quite a good empirical quality score, which suggests my theory here might not be perfect.)
      See: http://www.broadinstitute.org/gsa/wi..._recalibration

      As for the dbSNP issue, I don't think CpG mutations will be particularly underrepresented in dbSNP. In fact, I expect they will be very common there, simply because they're so likely to occur. The problem is that using dbSNP only eliminates the known sites from consideration, leaving you with the set of sequencing errors, de novo mutations and very rare mutations not found in dbSNP. If CG -> TG mutations are more likely to occur, then they will be overrepresented in this set regardless of how well-represented they are in dbSNP, leading to a reduced empirical quality score (which is what we see in the above Broad data).
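The size of this effect can be sketched with a little arithmetic. In the Python below, a base in a bin is counted as a mismatch either because it carries a real variant that is absent from dbSNP or because it is a sequencing error. The function names and all the rates in the example are hypothetical and chosen only to show the direction of the bias; they are not measured values, and the model ignores details such as heterozygosity and errors at variant sites.

```python
import math


def phred(p):
    """Convert a probability into a Phred-scaled quality."""
    return -10 * math.log10(p)


def apparent_quality(seq_error_rate, unlisted_variant_rate):
    """Empirical quality a bin would get if real unlisted variants
    are counted together with sequencing errors.

    unlisted_variant_rate: assumed fraction of observed bases that carry
    a true non-reference allele not present in dbSNP.
    """
    mismatch_rate = (unlisted_variant_rate
                     + (1 - unlisted_variant_rate) * seq_error_rate)
    return phred(mismatch_rate)


if __name__ == "__main__":
    true_error = 1e-3  # illustrative sequencing error rate (Q30)
    # Illustrative, made-up residual variant rates per dinucleotide class.
    for label, variant_rate in [("typical context", 1e-5), ("CpG context", 1e-4)]:
        q = apparent_quality(true_error, variant_rate)
        print(f"{label}: apparent Q = {q:.2f}")
```

Under these assumptions the bias only matters when the rate of unlisted real variants in a bin is not negligible next to the sequencing error rate, so it would be most visible for high-quality bases.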

      Of course, I could be totally wrong about this. Is there a known issue with Illumina sequencing of these dinucleotides?

      • #4
        Rocketknight,

        After thinking about it further, I think you're right. It seems like the empirical mismatch rate would be higher for any class of mutations (possibly including the class associated with cytosine deamination) that is at very low frequency in humans (and therefore less likely to be found in dbSNP).

        Ultimately, it seems to me that the question comes down to population genetics and the frequency spectrum of mutations in the population. If certain classes of mutation are more often found at very low population frequencies than other classes, this could introduce a bias.

        The question of the allele frequency spectrum for each class of mutations is an empirical question.
