**P-Richmond** · 08-30-2012, 08:43 AM

I don't believe that any "standard" VCF considers the quality to be phred scored. (Samtools isn't, GATK UnifiedGenotyper isn't, Freebayes isn't, VARiD isn't). In my experience the SNP/Indel quality seems to be not the best thing to filter SNPs on. Granted, a very high quality SNP is usually so due to other important factors: Supporting reads in both directions, the SNP lying in the middle of reads, supported by a certain read depth, avg QV score of the bases at the position, etc. All these things are described in the INFO/FORMAT column so I would start there instead of with the QV score.

Cheers,
Phil

**chrishah** · 08-30-2012, 09:24 AM

Hi Phil,

Thanks for your reply! According to this the QUAL is in phred scale (http://www.1000genomes.org/wiki/Anal...t-version-41):

QUAL phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong). If ALT is ”.” (no variant) then this is -10log_10 p(variant), and if ALT is not ”.” this is -10log_10 p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. If unknown, the missing value should be specified. (Numeric)
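As a rough illustration of the definition quoted above, a Phred-scaled value Q corresponds to a probability of 10^(-Q/10). The helper names below are made up for this sketch and do not come from any variant caller:

```python
import math


def phred_to_prob(qual):
    """Probability implied by a Phred-scaled quality: p = 10^(-Q/10)."""
    return 10 ** (-qual / 10)


def prob_to_phred(p):
    """Phred-scaled quality for a probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print the implied probability for a few integer and floating-point scores.
for q in (10, 20, 30, 17.5):
    print(q, phred_to_prob(q))
```

Because QUAL is allowed to be a floating-point number, the same conversion applies to non-integer scores.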

I also think SAMtools is using it. I don't have experience with other SNP callers. I am sure there are several other things that one could base the filtering on; it just seemed natural to me to use QUAL as a first step. So I wanted to understand what it means.
Maybe I can take the opportunity to extend my question a little bit:
What do you use for filtering your SNPs?
Looking at the freebayes INFO/FORMAT columns, MQM (##INFO=<ID=MQM,Number=A,Type=Float,Description="Mean mapping quality of observed alternate alleles">) seems to be a good candidate.

cheers

**P-Richmond** · 08-30-2012, 09:41 AM

So if GATK claims that their QV scores are PHRED then they make some interesting assertions on the % chance that something is wrong, since usually you see phred scores on a reasonable scale (0-40, 0-60, even 0-100), where a phred quality score of 100 is a 99.99999999% chance of being correct. So then how can GATK UnifiedGenotyper report QV scores such as (present in my data): 2004.24, 1005.62, 1111.51, etc.?

But the calculation of those QV scores is beside the point. To filter variants I usually look for:
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=SRP,Number=1,Type=Float,Description="Strand balance probability for the reference allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SRF and SRR given E(SRF/SRR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=SAP,Number=A,Type=Float,Description="Strand balance probability for the alternate allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SAF and SAR given E(SAF/SAR) ~ 0.5, derived using Hoeffding's inequality">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
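A minimal sketch of filtering on two of these fields, DP and SAP, might look like the following. The thresholds, function names, and example records are invented for illustration and are not recommended values. SAP is read here as Phred-scaled, so a larger value means the observed strand imbalance is less likely under balanced sampling:

```python
def parse_info(info):
    """Split a VCF INFO string such as 'DP=35;SAP=4.2' into a dict of strings."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value  # flag-style keys without '=' map to an empty string
    return fields


def passes(vcf_line, min_dp=10, max_dp=100, max_sap=20.0):
    """Keep a record only if depth is in range and no alternate allele shows strong strand bias."""
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])  # INFO is the eighth VCF column
    if "DP" not in info or "SAP" not in info:
        return False  # assumption: records lacking either field are dropped
    dp = int(info["DP"])
    # SAP is Number=A: one comma-separated value per alternate allele.
    saps = [float(x) for x in info["SAP"].split(",")]
    return min_dp <= dp <= max_dp and all(s <= max_sap for s in saps)


# Made-up records: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
GOOD = "chr1\t1000\t.\tA\tG\t50.0\t.\tDP=35;SAP=3.5"
STRAND_BIASED = "chr1\t2000\t.\tC\tT\t900.0\t.\tDP=35;SAP=80.2"
TOO_DEEP = "chr1\t3000\t.\tG\tA\t60000.0\t.\tDP=400;SAP=1.0"

for record in (GOOD, STRAND_BIASED, TOO_DEEP):
    print(record.split("\t")[1], passes(record))
```

Note that neither rejected record is rejected because of its QUAL value; the filter never looks at that column.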

Although, in the end, it is also good to bring them up and look at them manually. I also recommend comparing across different alignments and different SNP callers.

Cheers,
Phil

**aeonsim** · 08-30-2012, 10:29 PM

Based on the SNP validation work we've done, the QUAL score is pretty meaningless as a filter for valid SNPs. We've had SNPs with a QUAL of 17 turn out to be real while SNPs with a QUAL of 60000 fail.

There is also very little correlation between QUAL scores from different variant callers: for 700 SNPs we've validated, the R^2 when plotting QUALs for the same SNPs from different pipelines was around 0.5 for unfiltered variants.

Also as I understand it QUAL is the probability that there is something at the site, not the probability/confidence that the Genotype calls made for the samples is correct.

From our work on filtering variant calls, P-Richmond's list of filters is a good place to start. I'd also recommend checking whether your genotypes follow Mendelian rules (assuming you have pedigree/family info). Using GATK's VQSR tool on your VCFs can also provide a decent filter; you'll need to reannotate the VCF with GATK's VariantAnnotator first, and it doesn't seem to work quite as well on non-GATK VCFs, but it still provides a decent filter. Finally, running multiple algorithms/variant callers (e.g. freebayes, GATK, Samtools, Real Time Genomics) and taking the intersection of them (i.e. variants called by multiple tools) can help filter out false variants fairly well.
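The intersection idea can be sketched as below, assuming each call set is available as a list of VCF text lines. The function names and example records are hypothetical, and the sketch matches variants on exact (chromosome, position, REF, ALT), so indels that different callers represent differently would need normalizing first:

```python
from collections import Counter


def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF lines, one key per alternate allele."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def consensus(*callsets, min_callers=2):
    """Return the variants reported by at least min_callers of the given call sets."""
    counts = Counter()
    for lines in callsets:
        counts.update(variant_keys(lines))
    return {key for key, n in counts.items() if n >= min_callers}


# Made-up call sets from three hypothetical pipelines.
CALLER_A = ["#CHROM\tPOS\tID\tREF\tALT", "chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT"]
CALLER_B = ["chr1\t100\t.\tA\tG", "chr1\t300\t.\tG\tA"]
CALLER_C = ["chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT,G"]

print(sorted(consensus(CALLER_A, CALLER_B, CALLER_C)))
print(sorted(consensus(CALLER_A, CALLER_B, CALLER_C, min_callers=3)))
```

Setting min_callers to the number of call sets gives the strict intersection; a lower value keeps variants that a majority of callers agree on.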

**P-Richmond** · 08-31-2012, 10:52 AM

I wholeheartedly agree with aeonsim. Another important factor in determining true variants is using a few different mapping tools. You'd be amazed at how much the set of variants differs when the same variant caller is run on two different read alignments.

**ekg** · 12-20-2012, 01:37 AM

As aeonsim says, the QUAL represents the probability of polymorphism at the site, not the probability that the genotype call which is described is correct. There are various ways to attempt to get at this, but in general terms it's ideal to work with likelihoods rather than called genotypes, as this incorporates uncertainty into downstream analysis.
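One way to picture working with likelihoods instead of hard calls is the sketch below. It assumes Phred-scaled genotype likelihoods (a PL-style field ordered 0/0, 0/1, 1/1 for a biallelic site) and a flat prior over genotypes; the function names and example values are made up:

```python
def pl_to_probs(pls):
    """Convert Phred-scaled genotype likelihoods to normalized probabilities (flat prior assumed)."""
    raw = [10 ** (-pl / 10) for pl in pls]
    total = sum(raw)
    return [r / total for r in raw]


def expected_alt_dosage(pls):
    """Expected number of alternate alleles at a biallelic site: P(0/1) + 2 * P(1/1)."""
    p_ref, p_het, p_alt = pl_to_probs(pls)
    return p_het + 2 * p_alt


# A confident heterozygote versus a site where 0/1 and 1/1 are nearly tied.
print(expected_alt_dosage([60, 0, 60]))
print(expected_alt_dosage([40, 0, 1]))
```

A hard call would report 0/1 for both sites, while the expected dosage of the second site lands between one and two alternate alleles, which carries the uncertainty forward.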

The QUAL can only incorporate elements in the model. In the case that the reference is not correct, or there is systematic bias such as in strand direction or misalignment, the QUAL value may not reflect the underlying sources of error, and thus will not be well-calibrated.

If these can be incorporated into the base quality estimates in the alignments, using recalibration methods like BAQ or the GATK's base quality recalibration module, then you can do better.

Post-hoc, you can use variant quality recalibration methods to incorporate other features into the QUAL.

**rob123king** · 05-07-2013, 04:48 AM

I'm running samtools mpileup for SNP calling, then using varFilter -D100 to filter on coverage. How would you do this (command examples, please)? Also, would you filter on quality score as well as coverage?

Example of code used:
/home/rob/Downloads/samtools-0.1.19/samtools mpileup -ugf S_lycopersicum.fa IB2975_LDI2549.bam | /home/rob/Downloads/samtools-0.1.19/bcftools/bcftools view -bvcg - > IB2975_LDI2549var.raw.bcf

/home/rob/Downloads/samtools-0.1.19/bcftools/bcftools view IB2975_LDI2549var.raw.bcf | /home/rob/Downloads/samtools-0.1.19/bcftools/vcfutils.pl varFilter -D100 > IB2975_LDI2549var.flt.vcf

I'm also testing FreeBayes. How would I filter its output? By -D100 again, and/or by quality score?

freebayes SNP QUAL

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News