  • Base Quality Score Recalibration

    Hi,
    I work on Staphylococcus aureus genomics and am trying to generate a list of high-quality SNPs/indels for our sample using BWA-GATK. I am relatively new to bioinformatics and slowly catching up with the field. I have read that one should always recalibrate base quality scores using GATK's CountCovariates and TableRecalibration walkers before calling SNPs/indels. The recalibration step also requires a dbSNP rod file of known SNPs.
    There is no dbSNP data available for Staph aureus. So what are the options for recalibrating base quality scores when there is no dbSNP rod file? Can I use the "--run_without_dbsnp_potentially_ruining_quality" argument? Or should I skip the recalibration step? Please suggest!

    Thanks

  • #2
    I honestly don't think you lose much by skipping it. It's not something I would worry about.

    Comment


    • #3
      I have used recalibration with human data, but haven't done any formal comparisons of its effects. I do know someone who did a small comparison and, like Heisman, found that recalibration had little effect on the accuracy of the quality scores (when Illumina genotypes were compared against SNP chip genotypes).

      In BQSR, AFAIK, 'true' variants are defined as those found in the SNP rod, so I don't think you can run the algorithm without that file. Your best bet is to look at the format of the human SNP rod in the GATK resource bundle online and create your own SNP rod from the known variants of Staphylococcus aureus.

      What seems to me way more useful than BQSR is variant quality score recalibration (also in GATK).

      Comment


      • #4
        You should definitely not run base quality score recalibration without a dbSNP reference. BQSR works as follows:

        1) Run through all mapped reads looking for bases that mismatch the reference genome at positions not listed in dbSNP. (GATK assumes that mismatches occurring at dbSNP positions are real variation being sequenced correctly, and that mismatches not in dbSNP are sequencing errors. This is a decent approximation for statistics like these.)

        2) Compute statistics on where these mismatches occur (i.e. do they occur near the ends of reads, do they occur in certain dinucleotide pairs, do they occur on bases with low quality scores, etc.)

        3) Using the statistics gathered above, the quality scores for all bases in your reads are rewritten with new empirical quality scores. To give an example, take the set of all CT dinucleotides with quality 20 at read position 10. A quality score of 20 indicates a 1/100 chance of error. But let's say these dinucleotides actually only mismatch the reference at a rate of 1/500. This would mean their 'empirical' score is a higher value than the quality score reported by the machine, and so the quality score for those bases is overwritten with a higher score (27, in this example).

        Basically, this process serves to eliminate systematic biases in quality score assignment from sequencers. It's quite helpful, but it's absolutely dependent on having an accurate database of polymorphisms for your organism. If you don't have that database, GATK can't tell the difference between polymorphisms and sequencing errors, so it will assume all mismatches are sequencing errors, which will cause it to assign incredibly low quality scores to all your bases. This will ruin all downstream analysis. Don't do it!
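
        To make the arithmetic concrete, here is a minimal Python sketch of the same idea (this is not GATK's actual code, and every name in it is made up for illustration): mismatches outside dbSNP are tallied per covariate bin, and each bin's counts are converted back into a Phred-scaled empirical quality.

        Code:
        import math
        from collections import defaultdict

        # bin = (reported quality, read position, preceding dinucleotide)
        counts = defaultdict(lambda: [0, 0])  # bin -> [observations, mismatches]

        def tally(read_bases, ref_bases, quals, positions, dbsnp_positions):
            """Steps 1-2: count observations and mismatches per bin, skipping dbSNP sites."""
            for i, (base, ref, q, pos) in enumerate(zip(read_bases, ref_bases, quals, positions)):
                if pos in dbsnp_positions:
                    continue  # known polymorphisms are treated as real variation, not errors
                dinuc = (read_bases[i - 1] if i > 0 else "N") + base
                counts[(q, i, dinuc)][0] += 1
                if base != ref:
                    counts[(q, i, dinuc)][1] += 1

        def empirical_quality(observations, mismatches):
            """Step 3: Phred-scale the observed error rate, Q = -10 * log10(rate)."""
            if mismatches == 0:
                mismatches = 1  # avoid log(0); GATK uses proper smoothing instead
            return round(-10 * math.log10(mismatches / observations))

        # The example above: quality-20 bases that actually mismatch 1 time in 500
        print(empirical_quality(500, 1))  # -> 27, so those bases get a higher score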

        Comment


        • #5
          With more thought, here's a simple way both to test its efficacy and to do it correctly:

          Do a standard analysis of your sample without base quality recalibration first. Then create an artificial "dbSNP" file using high-quality SNPs from that first pass (assuming good overall coverage: homozygous SNPs at 15x coverage or greater and heterozygous SNPs at 20x coverage or greater, in both cases with no large degree of strand or base-quality bias). Then restart the analysis from the beginning and do the base quality recalibration using this file. It should at least let you know whether it has any real impact on SNP calling.
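
          As a rough illustration of building such an artificial "dbSNP" file, here is a small Python sketch. It assumes a single-sample VCF with GT and DP in the FORMAT column, the file names are placeholders, and the strand/base-quality bias filters mentioned above are left out for brevity.

          Code:
          # Keep only high-confidence SNPs from a first-pass call set so they can
          # serve as the "known sites" file for recalibration.
          def keep_record(fields):
              fmt = fields[8].split(":")
              sample = fields[9].split(":")
              info = dict(zip(fmt, sample))
              gt = info.get("GT", "./.").replace("|", "/")
              dp_str = info.get("DP", "0")
              dp = int(dp_str) if dp_str.isdigit() else 0
              if len(fields[3]) != 1 or len(fields[4]) != 1:
                  return False            # SNPs only; skip indels and multi-allelic sites
              if gt == "1/1":
                  return dp >= 15         # homozygous: 15x coverage or greater
              if gt in ("0/1", "1/0"):
                  return dp >= 20         # heterozygous: 20x coverage or greater
              return False

          with open("first_pass.vcf") as vcf, open("artificial_dbsnp.vcf", "w") as out:
              for line in vcf:
                  if line.startswith("#"):
                      out.write(line)     # keep the VCF header
                  else:
                      fields = line.rstrip("\n").split("\t")
                      if keep_record(fields):
                          out.write(line)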

          Comment


          • #6
            Thank you very much for your suggestions! The idea of creating my own dbSNP file is really great, I am going to do that. Thanks!

            Comment


            • #7
              There is a suggested method for non-model organisms (without a snp db) on the GATK wiki:



              The proposed method is to start with what Heisman suggested, that is
              (1) genotype
              (2) obtain highest quality snp set
              (3) CountCovariates/TableRecalibration
              (4) genotype again using recalibrated BAM

              Then, repeat many times until the empirical PHRED scores and the reported scores in the input BAM converge. We are attempting this on our data.
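
              One simple way to monitor that convergence between rounds (a sketch only: it assumes pysam is installed, the file names are placeholders, and the 0.1 threshold is arbitrary) is to summarise the base quality distribution of each round's BAM and stop once it barely changes:

              Code:
              import pysam
              from collections import Counter

              def quality_histogram(bam_path):
                  """Count reported base qualities across a whole BAM file."""
                  hist = Counter()
                  with pysam.AlignmentFile(bam_path, "rb") as bam:
                      for read in bam.fetch(until_eof=True):
                          if read.query_qualities is not None:
                              hist.update(read.query_qualities)
                  return hist

              def mean_quality(hist):
                  total = sum(hist.values())
                  return sum(q * n for q, n in hist.items()) / total if total else 0.0

              # Between bootstrap rounds, e.g.:
              # prev, curr = quality_histogram("round1.bam"), quality_histogram("round2.bam")
              # converged = abs(mean_quality(prev) - mean_quality(curr)) < 0.1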

              To evaluate performance independently of the convergence criterion, we are using the Ts/Tv ratio. We are hoping (fingers crossed) that the drop in Ts/Tv we observe at low-coverage sites (due to a high false-positive rate) will be at least partially corrected in the recalibrated dataset.

              That said, for a sobering experience, plot the Ts/Tv of your SNPs by coverage depth with unrecalibrated base quality scores. We see a very low ratio at depths below ~20x, which is what motivated us to attempt recalibration without a SNP db.

              Comment


              • #8
                Hello,

                I'm completely new to the alignment process. I have Illumina paired-end data that I intend to align with BWA. I read that recalibration of the base calls is good practice. I have two questions about this step:
                1. There are several tools available to do it (GATK and Novoalign, at least). Is there a "best"/consensus tool?
                2. Is recalibration good practice in all kinds of studies?


                I mean, when looking for new mutations, isn't there a risk of failing to detect them?

                Originally posted by Rocketknight View Post
                1) Run through all mapped reads looking for reads mismatching the reference genome at a position not listed in dbSNP. (GATK assumes that mismatches that occur in dbSNP are real variations that are being sequenced correctly, and mismatches that aren't in dbSNP are sequencing errors. This is a decent approximation for statistics like these.)
                From what Rocketknight said, I understand that any SNP not present in the dbSNP file is considered a sequencing error...

                Thank you,
                Jane

                Comment


                • #9
                  Don't worry, you can do Base Quality Score Recalibration when you're searching for new mutations.

                  BQSR only counts mutations not in dbSNP as 'sequencing errors' for statistical purposes. It does not mark those mutations as being errors in your data, and you will still be able to find new mutations in data that you have run BQSR on.

                  I would guess that the most widely-used tools for what you want are BWA for alignment followed by GATK for realignment/recalibration/variant calling, with Picard used for some steps like MarkDuplicates.

                  A good guide to get you started is at http://www.broadinstitute.org/gatk/g...pic?name=intro (click "Best Practice Variant Detection")

                  Comment


                  • #10
                    Thank you for your answer. I checked the link you mentioned; it will help me run the recalibration step.
                    Nevertheless, there is little detail about the recalibration itself. I would like to be sure that using it does not decrease my ability to detect rare unknown mutations. I do not understand why that would not be the case, since mismatches present in dbSNP get a good score and the others do not... Is there clear documentation on the recalibration procedure?

                    I intend to use BWA for alignment, samtools for de-duplication, GATK for realignment/recalibration, and VarScan2 for variant calling. I already have alignments for my data (not done by myself) from Eland v2, so I tried VarScan2 and it seems well suited to my needs.

                    Jane

                    Comment


                    • #11
                      Recalibration is applied to the entire BAM equally. Here is an example.

                      GATK counts all examples of the sequence 'CG' in your BAM file with quality score 20 at read position 5, and it finds 1000 of them. It then counts how many of them mismatch the reference sequence at positions that are not in dbSNP. It finds only 4 that mismatch. This means the empirical error rate is 4/1000, which converts to a quality score of about 24, and all of these bases have their old quality score replaced with the new one.

                      GATK then repeats this process for all dinucleotides, all quality scores and all read positions.

                      So if you have a rare mutation that is not in dbSNP, the recalibration will count it as an error. But this will not affect your ability to find that mutation later - it will only change the recalibrated quality scores very slightly.

                      Hope that explains everything!

                      Comment


                      • #12
                        Let me play devil's advocate and argue this step is completely unnecessary for bacterial genomes, where high (>30x) genome coverage is easy to obtain and you are interested in calling 100%-frequency variants (the equivalent of a homozygous allele in diploid genomes). Recalibrating base quality scores should make very little difference to your ability to confidently call SNPs.

                        If, however, you are interested in calling low-frequency variants from a population then I could maybe see the point.

                        Comment


                        • #13
                          I am calling low and high-frequency variants from human exomes where the mean coverage is between 80 and 100. I am looking for somatic mutations and LOH in leukemia.

                          Comment


                          • #14
                            I decided to perform the local realignment and quality score recalibration steps in my analysis.
                            In the first step of recalibration (CountCovariates), we have to provide a dbSNP file.
                            In my opinion, it is best to provide the most up-to-date release of dbSNP, because it contains more "reliable" variants to drive the recalibration. Am I right?

                            In contrast, when I annotate the variants after the variant calling step, I use an older dbSNP version, because the newer releases (from version 130 onwards) also contain real mutations.

                            Comment


                            • #15
                              Hey,

                              Do you know why GATK does it on dinucleotides rather than mono- or tri-nucleotides? Are the reasons computational, or down to the reagent chemistry, etc.?

                              thanks!

                              Originally posted by Rocketknight View Post
                              Recalibration is applied to the entire BAM equally. Here is an example.

                              GATK counts all examples of the sequence 'CG' in your BAM file with quality score 20 at read position 5, and it finds 1000 of them. It then counts how many of them mismatch the reference sequence at positions that are not in dbSNP. It finds only 4 that mismatch. This means the empirical error rate is 4/1000, which converts to a quality score of about 24, and all of these bases have their old quality score replaced with the new one.

                              GATK then repeats this process for all dinucleotides, all quality scores and all read positions.

                              So if you have a rare mutation that is not in dbSNP, the recalibration will count it as an error. But this will not affect your ability to find that mutation later - it will only change the recalibrated quality scores very slightly.

                              Hope that explains everything!
                              --
                              bioinfosm

                              Comment
