SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
GATK Base Quality Recalibration Rocketknight Bioinformatics 3 05-18-2012 07:20 AM
GATK Base Quality Recalibration ashuchawla Bioinformatics 5 02-09-2012 10:12 AM
GATK: Base quality score recalibration fabrice Bioinformatics 2 11-05-2011 03:54 AM
GATK base quality recalibration suppose to keep old and new quality scores? Heisman Bioinformatics 2 10-21-2011 07:40 AM
Base quality recalibration christophpale Bioinformatics 0 06-24-2010 01:05 PM

Reply
 
Thread Tools
Old 04-09-2012, 11:44 AM   #1
baika
Member
 
Location: USA

Join Date: Apr 2012
Posts: 12
Default Base Quality Score Recalibration

Hi,
I work on Staphylococcus aureus genomics, and trying to generate a list of high-quality SNPs/InDels for our sample using BWA-GATK. I am relatively new in Bioinformatics and slowly catching up with the field. I have read that one should always re-calibarte base-quality scores using GATK's CountCovariates and TableCalibration walkers before calling SNPs/InDels. Re-calibration step also requires dbsnp rod files of the known SNPs.
There is no dbsnp data available for Staph aureus. So, what are the options for re-calibrating base-quality scores when there is no dbsnp rod file? Can I use "--run_without_dbsnp_potentially_ruining_quality" argument. Or, should I skip re-calibration step? Please suggest!

Thanks
baika is offline   Reply With Quote
Old 04-09-2012, 07:53 PM   #2
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535
Default

I honestly don't think you lose much by skipping it. It's not something I would worry about.
Heisman is offline   Reply With Quote
Old 04-10-2012, 03:27 AM   #3
Yilong Li
Member
 
Location: WTSI

Join Date: Dec 2010
Posts: 41
Default

I have used recalibration with human data, but haven't done any formal comparisons on its effects. I do know someone who did little bit of comparisons where it did seem that recalibration had little effect on the accuracy of the quality scores like Heisman (when Illumina GTs where compared against SNP chip GTs).

In BQSR afaik 'true' variants are defined as those found in the SNP rod, so I don't think you can run the algorithm without the file - your best bet is to look at the SNP format of the human SNP rod found in the GATK file bundle site online, and create your own SNP rod from the known variants of Staphylococcus aureus.

What seems to me way more useful than BQSR is variant quality score recalibration (also in GATK).
Yilong Li is offline   Reply With Quote
Old 04-10-2012, 04:18 AM   #4
Rocketknight
Member
 
Location: Ireland

Join Date: Sep 2011
Posts: 86
Default

You should definitely not run base quality score recalibration without a dbSNP reference. BQSR works as follows:

1) Run through all mapped reads looking for reads mismatching the reference genome at a position not listed in dbSNP. (GATK assumes that mismatches that occur in dbSNP are real variations that are being sequenced correctly, and mismatches that aren't in dbSNP are sequencing errors. This is a decent approximation for statistics like these.)

2) Compute statistics on where these mismatches occur (i.e. do they occur near the ends of reads, do they occur in certain dinucleotide pairs, do they occur on bases with low quality scores, etc.)

3) Using the statistics gathered above, the quality scores for all bases in your reads are rewritten with new empirical quality scores. To give an example, take the set of all CT dinucleotides with quality 20 at read position 10. A quality score of 20 indicates a 1/100 chance of error. But let's say these dinucleotides actually only mismatch the reference at a rate of 1/500. This would mean their 'empirical' score is a higher value than the quality score reported by the machine, and so the quality score for those bases is overwritten with a higher score (27, in this example).

Basically, this process serves to eliminate systemic biases in quality score assignment from sequencers. It's quite helpful, but it's absolutely dependent on having an accurate database of polymorphisms for your organism. If you don't have that database, GATK can't tell the difference between polymorphisms and sequencing errors, and so it'll assume all mismatches are sequencing errors, which will cause it to assign incredibly low quality scores to all your bases. This will ruin all downstream analysis. Don't do it!
Rocketknight is offline   Reply With Quote
Old 04-10-2012, 04:59 AM   #5
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535
Default

With more thought, here's a simple way to test both the efficacy of it and to do it correctly:

Do a standard analysis of your sample without base quality recalibration first.Then, create an artificial "dbSNP" file using highly quality snps from that sample (ie, 15x coverage or greater for homozygous SNPs with no large degree of strand/base quality bias, 20x coverage or greater for heterozygous SNPs with no large degree of strand/base quality bias, assuming good coverage). Then, restart the analysis from the beginning and do the base quality recalibration using this file. It should at least let you know if it has any real impact on SNP calling.
Heisman is offline   Reply With Quote
Old 04-11-2012, 12:29 PM   #6
baika
Member
 
Location: USA

Join Date: Apr 2012
Posts: 12
Default

Thank you very much for your suggestions! The idea of creating my own dbSNP file is really great, I am going to do that. Thanks!
baika is offline   Reply With Quote
Old 05-17-2012, 11:58 AM   #7
jflowers
Member
 
Location: New York, NY

Join Date: Oct 2011
Posts: 41
Default

There is a suggested method for non-model organisms (without a snp db) on the GATK wiki:

http://www.broadinstitute.org/gsa/wi..._recalibration

The proposed method is to start with what Heisman suggested, that is
(1) genotype
(2) obtain highest quality snp set
(3) CountCovariates/TableRecalibration
(4) genotype again using recalibrated BAM

Then, repeat many times until the empirical PHRED scores and and the called scores in the input BAM converge. We are attempting this on our data.

To evaluate performance independent of the convergence criterion, we are using the ts/tv ratio. We are expecting (fingers-crossed) that observed reductions in the ts/tv at low coverage sites (due to high FPR) will be at least partially alleviated in the recalibrated dataset.

That said, for a sobering experience, plot the ts/tv of your SNPs by coverage depth with uncalibrated base quality scores. We see a very low ratio at depths < ~20X which motivated us to attempt recalibration without a snp db.
jflowers is offline   Reply With Quote
Old 08-22-2012, 02:58 AM   #8
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Hello,

I'm completely new in the alignment process. I have Illumina paired-end data that I intend to align with BWA. I read that recalibration of the basecalls is a good practice. I have 2 questions about this step:
  1. First, there are several tools available to do it: GATK and Novoalign at least. Is there a "best/consensus" tool?
  2. Is recalibration a good practice in all kind of studies?

I mean, when looking for new mutations, isn't there a risk not to detect it?

Quote:
Originally Posted by Rocketknight View Post
1) Run through all mapped reads looking for reads mismatching the reference genome at a position not listed in dbSNP. (GATK assumes that mismatches that occur in dbSNP are real variations that are being sequenced correctly, and mismatches that aren't in dbSNP are sequencing errors. This is a decent approximation for statistics like these.)
From what Rocketknight said, I understand that all SNP not present in dbSNP file is considered as sequencing errors ...

Thank you,
Jane
Jane M is offline   Reply With Quote
Old 08-22-2012, 03:44 AM   #9
Rocketknight
Member
 
Location: Ireland

Join Date: Sep 2011
Posts: 86
Default

Don't worry, you can do Base Quality Score Recalibration when you're searching for new mutations.

BQSR only counts mutations not in dbSNP as 'sequencing errors' for statistical purposes. It does not mark those mutations as being errors in your data, and you will still be able to find new mutations in data that you have run BQSR on.

I would guess that the most widely-used tools for what you want are BWA for alignment followed by GATK for realignment/recalibration/variant calling, with Picard used for some steps like MarkDuplicates.

A good guide to get you started is at http://www.broadinstitute.org/gatk/g...pic?name=intro (click "Best Practice Variant Detection")
Rocketknight is offline   Reply With Quote
Old 08-22-2012, 04:27 AM   #10
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Thank you for your answer. I checked the link you mentioned, it will help me to run the recalibration step.
Nevertheless, there is no detail about recalibration. I would like to be sure that when using it, I do not decrease my ability to detect rare unknown mutations. I do not understand why it is not the case since mismatches present in dbSNP get a good score and the other not... Is there a clear documentation regarding the procedure of calibration?

I intend to use BWA for alignment, samtools for de-duplicate, GATK for realignment/recalibration and VarScan2 for variant calling. I've alignments for my data done (not by myself) via Elandv2. So I tried VarScan2 and it seems to be well adapted to my needs.

Jane
Jane M is offline   Reply With Quote
Old 08-22-2012, 07:17 AM   #11
Rocketknight
Member
 
Location: Ireland

Join Date: Sep 2011
Posts: 86
Default

Recalibration is applied to the entire BAM equally. Here is an example.

GATK counts all examples of the sequence 'CG' in your BAM file with quality score 20 at read position 5, and it finds 1000 of them. It then counts how many of them mismatch the reference sequence at positions that are not in dbSNP. It finds only 4 of them that mismatch. This means the empirical error rate is 4/1000. This is converted to a quality score and all of these reads have their old quality score replaced with the new quality score.

GATK then repeats this process for all dinucleotides, all quality score and all read positions.

So if you have a rare mutation that is not in dbSNP, the recalibration will count it as an error. But this will not affect your ability to find that mutation later - it will only change the new quality score very very slightly.

Hope that explains everything!
Rocketknight is offline   Reply With Quote
Old 08-22-2012, 12:52 PM   #12
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Let me play devil's advocate and argue this step is completely unnecessary for bacterial genomes where high (>30x) genome coverage is easy to obtain and you are interested in calling 100% frequency variants (equivalent of a homozygous allele in diploid genomes). The effect of getting base quality scores recalibrated should make very little difference to your ability to confidently call SNPs.

If, however, you are interested in calling low-frequency variants from a population then I could maybe see the point.
nickloman is offline   Reply With Quote
Old 08-23-2012, 12:33 AM   #13
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

I am calling low and high-frequency variants from human exomes where the mean coverage is between 80 and 100. I am looking for somatic mutations and LOH in leukemia.
Jane M is offline   Reply With Quote
Old 08-31-2012, 01:22 AM   #14
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

I decided to perform these local realignment and quality score recalibration steps in my analysis.
In the first step (Count covariates) of recalibration, we have to provide a dbSNP file.
In my opinion, the best is to provide the most up-to-date release of dbSNP, because there are more "reliable" variants to perform the recalibration. Am I right?

In contradiction, when I annotate the variant after the variant calling step, I use the oldest dbSNP version because the newest (from version 130) contain real mutations.
Jane M is offline   Reply With Quote
Old 02-07-2013, 09:42 AM   #15
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

Hey,

Do you know why GATK does it on dinucleotides and not mono- or tri- nucleotides.. are the reasons computation or the reagent chemistry, etc?

thanks!

Quote:
Originally Posted by Rocketknight View Post
Recalibration is applied to the entire BAM equally. Here is an example.

GATK counts all examples of the sequence 'CG' in your BAM file with quality score 20 at read position 5, and it finds 1000 of them. It then counts how many of them mismatch the reference sequence at positions that are not in dbSNP. It finds only 4 of them that mismatch. This means the empirical error rate is 4/1000. This is converted to a quality score and all of these reads have their old quality score replaced with the new quality score.

GATK then repeats this process for all dinucleotides, all quality score and all read positions.

So if you have a rare mutation that is not in dbSNP, the recalibration will count it as an error. But this will not affect your ability to find that mutation later - it will only change the new quality score very very slightly.

Hope that explains everything!
__________________
--
bioinfosm
bioinfosm is offline   Reply With Quote
Old 05-02-2013, 07:19 PM   #16
Magas8282
Junior Member
 
Location: Gainesville, FL

Join Date: Nov 2012
Posts: 2
Default

Hello all,

Just wanted to see about an update for the original question. I too am working with S. aureus and am looking for best practices to reduce false positive SNP calls. Were you able to successfully create a dbSNP file based on the methods described? And if so, how are you filtering your variant calls after creating your VCF?

Thanks!
Magas8282 is offline   Reply With Quote
Reply

Tags
bioinformatics

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:38 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO