Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GATK undefined variable problem

    I used GATK UnifiedGenotyper (latest build) to call variants.
    Now I have a vcf-file which I want to filter. Although the corresponding fields are correctly defined in the header:

    ##fileformat=VCFv4.1
    ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
    ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">

    I always get an "undefined variable" error when I filter for MQRankSum or ReadPosRankSum .

    Any suggestions are very welcome!

  • #2
    Okay, I've come a bit further. The error is due to some lines in the vcf where the annotation for MQRankSum or ReadPosRankSum is missing. When I delete those, it is running fine.
    Anyone an idea why UnifiedGenotyper makes these 'mistakes' ?
    Is there a way of correcting this ?
    Letting the VariantFilter run although displaying this error gives me a filtered file. Does it mean that these specific annotations are skipped for all positions or only for those which are missing in it ?

    Comment


    • #3
      ReadPosRankSum isn't generated for every variant (at least, I see the same behaviour as you). About 43% of my variants get annotated with ReadPosRankSum and MQRankSum (indeed the same variants are annotated with both, they do not appear independently of each other in my data).

      Consequently what I get when filtering variants is :

      WARN 23:03:05,494 Interpreter - org.broadinstitute.sting.utils.variantcontext.VariantContextUtils.initializeMatchExps@310![0,14]: 'ReadPosRankSum < -8.0;' undefined variable ReadPosRankSum

      This seems to be expected behaviour, if you're using 'complex' JEXL expressions as discussed in http://www.broadinstitute.org/gsa/wi...XL_expressions

      I was a bit concerned about this as well, but it's because we have split filtering into separate expressions rather than using the old "one filter || another filter || a third filter".

      From that page

      "It is very important to note that JEXL cannot correctly handle missing annotations and throws an exception when it encounters this scenario. The default behavior of the GATK is to handle this by having the entire expression evaluate to FALSE in such cases"

      Comment


      • #4
        Just to be sure I correctly understand the recommendations....

        If I use the "--missingValuesInExpressionsShouldEvaluateAsFailing" option should solve the problem ??

        Comment


        • #5
          Yes, that will almost certainly exclude the variants missing ReadPosRankSum and MQRankSum entries, but I think you need to be clear if this is desirable or not. What is being spat as warnings are just that - warnings, not errors.

          If I look at a VCF file from our pipeline, I can pick out any number of SNPs that have failed the ReadPosRankSum or MQRankSum JEXL expressions, but *passed* one of the other expressions, that are sufficiently well covered and sufficiently different to the reference for me to be happy to include them in downstream analysis.

          This is all IMHO, but I'd rather report variants and deal with working out if they're pathogenic later rather than throwing away 40% of my dataset at this stage. It's your call though.

          I do agree that the wording of this sentence:

          "When evaluating the JEXL expressions, missing values should be considered failing the expression. By default, if JEXL cannot evaluate your expression for a particular record because one of the annotations is not present, the whole expression evaluates as PASSing. Use this argument to have it evaluate as failing filters instead for these cases."

          Makes it sound like missing values should be failed, not passed based on other expressions. I think I'm going to have a closer look at this today and see how it works in our context.
          Last edited by Bukowski; 01-12-2012, 02:48 AM.

          Comment


          • #6
            Hm - good point !
            So to the get the filtering done correctly like suggested here and here , one would have to set one filter for every expression ? Like:

            --filterExpression "QD <2.0"
            --filterName "QDfilter"
            --filterExpression "MQ <40.0"
            --filterName "MQfilter"

            rather than combining them with || ?
            because including "ReadPosRankSum" would omit the whole entry although the other criteria are fine ??

            This would mean that different combinations of filter criteria like for SNPs and INDELs would have to run separately ?

            Comment


            • #7
              Originally posted by doc.ramses View Post
              Hm - good point !
              So to the get the filtering done correctly like suggested here and here , one would have to set one filter for every expression ? Like:

              --filterExpression "QD <2.0"
              --filterName "QDfilter"
              --filterExpression "MQ <40.0"
              --filterName "MQfilter"

              rather than combining them with || ?
              because including "ReadPosRankSum" would omit the whole entry although the other criteria are fine ??

              This would mean that different combinations of filter criteria like for SNPs and INDELs would have to run separately ?
              That's effectively what we do. And yes we filter SNPs and indels with different criteria, we believe we are matching the GATKv3 recommendations in this way (for non-VQSR filtering anyway).

              Comment


              • #8
                Originally posted by Bukowski View Post
                That's effectively what we do. And yes we filter SNPs and indels with different criteria, we believe we are matching the GATKv3 recommendations in this way (for non-VQSR filtering anyway).
                Thanks for your contribution I will give that a try!

                Comment


                • #9
                  I also have the problem that for some variants the ReadPosRankSum values are missing.

                  An explanation why this occurs only in some lines, can be found here:
                  If you are unable to find something or have a question about our new website, please email [email protected]. For other inquiries related to the Broad Institute, the necessary contact information can be found here.

                  Note that the read position rank sum test can not be calculated for homozygous sites.
                  The ReadPosRankSum (as well as the MQRankSum) value is only available for heterozygous sites.

                  If the JEXL expressions support lazy evaluation, something like this could be helpful:
                  Code:
                  --filterExpression "isHet == 1 && ReadPosRankSum < -20.0"
                  If the variant is not heterozygous, the expression is not evaluated further since it can not be true. Unfortunately, the above does not work and I don't know a way to check if the variant is heterozygous.

                  Comment


                  • #10
                    You can select calls MQRankSum and ReadPosRankSum by

                    java GenomeAnalysisTK.jar -l INFO -R ref.fasta -T SelectVariants --variant input.vcf -o with_mq_rp.vcf -select 'vc.hasAttribute("MQRankSum") && vc.hasAttribute("ReadPosRankSum")'

                    and its complementer by

                    -select '!vc.hasAttribute("MQRankSum") || !vc.hasAttribute("ReadPosRankSum")' to filter out calls with the needed attributes.

                    Still suboptimal, but can help.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      Yesterday, 07:01 AM
                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    55 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    52 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    45 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    55 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X