  • freebayes SNP QUAL

    Hi there,

    I am exploring freebayes for SNP calling. It works nicely and the next step will be to filter the SNPs based on quality, etc., but I am a little confused about how to interpret the QUAL value in the freebayes *.vcf.
    In standard VCF format the QUAL is phred-scaled, right? So a QUAL value of 30 means a probability of 0.001 that the SNP is called incorrectly.
    The freebayes QUAL values range from a few hundred to several thousand. Can anyone tell me how to interpret that?

    much obliged!!!

  • #2
    I don't believe that any "standard" VCF considers the quality to be phred-scaled (Samtools doesn't, GATK UnifiedGenotyper doesn't, Freebayes doesn't, VARiD doesn't). In my experience the SNP/indel quality is not the best thing to filter SNPs on. Granted, a very high-quality SNP is usually so because of other important factors: supporting reads in both directions, the SNP lying in the middle of reads, sufficient read depth, the average QV score of the bases at the position, etc. All of these are described in the INFO/FORMAT columns, so I would start there instead of with the QV score.

    Cheers,
    Phil


    • #3
      Hi Phil,

      Thanks for your reply! According to this the QUAL is in phred scale (http://www.1000genomes.org/wiki/Anal...t-version-41):
      QUAL phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong). If ALT is ”.” (no variant) then this is -10log_10 p(variant), and if ALT is not ”.” this is -10log_10 p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. If unknown, the missing value should be specified. (Numeric)
      I also think SAMtools is using it. I don't have experience with other SNP callers. I am sure there are several other things that one could base the filtering on; it just seemed natural to me to use QUAL as a first step, so I wanted to understand what it means.
      Maybe I can take the opportunity to extend my question a little bit:
      What do you use for filtering your SNPs?
      Looking at the freebayes INFO/FORMAT column MQM (##INFO=<ID=MQM,Number=A,Type=Float,Description="Mean mapping quality of observed alternate alleles">) seems to be a good candidate.

      cheers


      • #4
        So if GATK claims that their QV scores are PHRED then they make some interesting assertions on the % chance that something is wrong, since usually you see phred scores on a reasonable scale (0-40, 0-60, even 0-100), where a phred quality score of 100 is a 99.99999999% chance of being correct. So then how can GATK UnifiedGenotyper report QV scores such as (present in my data): 2004.24, 1005.62, 1111.51, etc.?
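
For reference, the phred arithmetic being discussed can be sketched in a few lines of Python. This is only an illustration of the -10*log10(p) definition quoted from the VCF spec; the function names are made up for this example, and it says nothing about how well calibrated any caller's QUAL actually is.

```python
import math


def phred_to_error_prob(qual: float) -> float:
    """Probability that the call is wrong, under the definition QUAL = -10*log10(p)."""
    return 10 ** (-qual / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse conversion: error probability back to a phred-scaled score."""
    return -10 * math.log10(p)


def log10_error_prob(qual: float) -> float:
    """Same quantity in log10 space, which stays usable for very large QUAL values."""
    return -qual / 10


if __name__ == "__main__":
    # 30 and 100 are the textbook values; 2004.24 is one of the QUALs quoted in this post.
    for qual in (30, 100, 2004.24):
        print(qual, phred_to_error_prob(qual), log10_error_prob(qual))
```

Taken literally, a QUAL of 2004.24 corresponds to an error probability of about 10^-200, which is why values on this scale are better read as "the model is extremely confident" than as a realistic error rate.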

        But the calculation of those QV scores is beside the point. To filter variants I usually look at:
        ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
        ##INFO=<ID=SRP,Number=1,Type=Float,Description="Strand balance probability for the reference allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SRF and SRR given E(SRF/SRR) ~ 0.5, derived using Hoeffding's inequality">
        ##INFO=<ID=SAP,Number=A,Type=Float,Description="Strand balance probability for the alternate allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SAF and SAR given E(SAF/SAR) ~ 0.5, derived using Hoeffding's inequality">
        ##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
        ##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">

        Although, in the end, it is also good to pull the variants up and inspect them manually. I also recommend comparing across different alignments and different SNP callers.
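
A filter on the DP and SAP fields listed above could be sketched roughly like this in plain Python. The thresholds (`min_depth`, `max_depth`, `max_sap`) and the function names are placeholders chosen for illustration, not recommended values; sensible cutoffs depend on the coverage and the data set.

```python
def parse_info(info_field: str) -> dict:
    """Split a VCF INFO column such as 'DP=35;SAP=3.1' into a dict of strings."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value  # flag-type entries (no '=') get an empty string
    return info


def passes_filters(vcf_line: str, min_depth: int = 10, max_depth: int = 100,
                   max_sap: float = 20.0) -> bool:
    """Check one VCF data line against hypothetical depth and strand-balance cutoffs."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # INFO is the 8th VCF column
    depth = int(info["DP"])
    # SAP is Number=A (one value per alternate allele); require every allele to pass.
    sap_values = [float(v) for v in info["SAP"].split(",")]
    return min_depth <= depth <= max_depth and all(s <= max_sap for s in sap_values)


def filter_vcf(lines):
    """Yield header lines unchanged and only those data lines that pass the filters."""
    for line in lines:
        if line.startswith("#") or passes_filters(line):
            yield line
```

Because SAP is phred-scaled, larger values mean stronger evidence of strand imbalance, so the sketch keeps sites at or below the cutoff.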

        Cheers,
        Phil


        • #5
          Based on the SNP validation work we've done, the QUAL score is pretty meaningless as a filter for valid SNPs. We've had SNPs with a QUAL of 17 turn out to be real, while SNPs with a QUAL of 60000 failed.

          There is also very little correlation between the QUAL scores from different variant callers. For 700 SNPs we've validated, the R^2 when plotting QUALs for the same SNPs from different pipelines was around 0.5 for unfiltered variants.
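
For anyone wanting to repeat this kind of comparison, the R^2 between two callers' QUAL values for the same sites can be computed as the squared Pearson correlation. The sketch below uses made-up numbers purely to show the calculation; it is not the validation data described here.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return (cov * cov) / (var_x * var_y)


if __name__ == "__main__":
    # Illustrative QUAL values for the same five sites from two hypothetical callers.
    caller_a = [17.0, 250.0, 1200.0, 3400.0, 60000.0]
    caller_b = [45.0, 90.0, 2100.0, 800.0, 5000.0]
    print(round(r_squared(caller_a, caller_b), 3))
```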

          Also, as I understand it, QUAL reflects the probability that there is something (a variant) at the site, not the confidence that the genotype calls made for the samples are correct.

          From our work on filtering variant calls, P-Richmond's list of filters is a good place to start. I'd also recommend checking whether your genotypes follow Mendelian rules (assuming you have pedigree/family info). Running GATK's VQSR tool on your VCFs can also provide a decent filter; you'll need to reannotate the VCF with GATK Annotator first, and it doesn't seem to work quite as well on non-GATK VCFs, but it still helps. Finally, running multiple algorithms/variant callers (e.g. freebayes, GATK, Samtools, Real Time Genomics) and taking the intersection (i.e. variants called by multiple tools) can help filter out false variants fairly well.
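
The intersection idea can be sketched in plain Python as below. It keys each variant on (chromosome, position, REF, ALT) and keeps those reported by at least `min_callers` call sets; the function names and the `min_callers` parameter are made up for this example. One assumption to flag: it matches records exactly as written, so indels that different callers represent differently would need to be normalized first.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF data lines, one key per ALT allele."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def shared_variants(*callsets, min_callers=2):
    """Return the variant keys that appear in at least min_callers of the call sets."""
    counts = {}
    for callset in callsets:
        for key in variant_keys(callset):
            counts[key] = counts.get(key, 0) + 1
    return {key for key, count in counts.items() if count >= min_callers}
```

Setting `min_callers` to the number of call sets gives the strict intersection; a lower value keeps variants supported by a majority of tools.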


          • #6
            I wholeheartedly agree with aeonsim. Another important factor in determining true variants is to use a few different mapping tools. You'd be amazed at how much the variants you get from the same variant caller differ between two different read alignments.


            • #7
              As aeonsim says, the QUAL represents the probability of polymorphism at the site, not the probability that the reported genotype call is correct. There are various ways to attempt to get at the latter, but in general terms it's ideal to work with likelihoods rather than called genotypes, as this carries the uncertainty into downstream analysis.
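
As a rough illustration of working with likelihoods instead of hard calls, the sketch below turns the log10-scaled genotype likelihoods of one sample (the GL field in the VCF spec) into normalized probabilities and an expected alternate-allele dosage. It assumes a flat prior over genotypes and a diploid, biallelic site with genotypes ordered 0/0, 0/1, 1/1; both are simplifying assumptions made for this example, and the function names are made up.

```python
def genotype_probabilities(log10_likelihoods):
    """Normalize log10-scaled genotype likelihoods to probabilities (flat prior assumed)."""
    max_ll = max(log10_likelihoods)
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    scaled = [10 ** (ll - max_ll) for ll in log10_likelihoods]
    total = sum(scaled)
    return [s / total for s in scaled]


def expected_alt_dosage(log10_likelihoods):
    """Expected number of ALT alleles for a diploid biallelic site (order 0/0, 0/1, 1/1)."""
    probs = genotype_probabilities(log10_likelihoods)
    return sum(n_alt * p for n_alt, p in enumerate(probs))


if __name__ == "__main__":
    gl = [-10.0, 0.0, -3.0]  # illustrative values, not taken from real data
    print(genotype_probabilities(gl))
    print(expected_alt_dosage(gl))
```

A downstream analysis can then use the dosage (or the full probability vector) rather than the single most likely genotype.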

              The QUAL can only incorporate elements in the model. In the case that the reference is not correct, or there is systematic bias such as in strand direction or misalignment, the QUAL value may not reflect the underlying sources of error, and thus will not be well-calibrated.

              If these can be incorporated into the base quality estimates in the alignments, using recalibration methods like BAQ or the GATK's base quality recalibration module, then you can do better.

              Post-hoc, you can use variant quality recalibration methods to incorporate other features into the QUAL.


              • #8
                I'm running samtools mpileup for SNP calling, then using vcfutils.pl varFilter -D100 to filter on coverage. How would you do this (command examples please)? Also, would you filter on quality score as well as coverage?

                Example of code used:
                /home/rob/Downloads/samtools-0.1.19/samtools mpileup -ugf S_lycopersicum.fa IB2975_LDI2549.bam | /home/rob/Downloads/samtools-0.1.19/bcftools/bcftools view -bvcg - > IB2975_LDI2549var.raw.bcf

                /home/rob/Downloads/samtools-0.1.19/bcftools/bcftools view IB2975_LDI2549var.raw.bcf | /home/rob/Downloads/samtools-0.1.19/bcftools/vcfutils.pl varFilter -D100 > IB2975_LDI2549var.flt.vcf

                Also, I'm testing FreeBayes. How would I filter its output? By -D100 again, and/or by quality score?
