Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Alex Renwick
    Member
    • Jul 2011
    • 44

    gatk indel calling problem

    I'm trying to get variant calling to work using a small set of simulated reads of the phi X genome. The problem I'm seeing is that indels are assigned a quality of zero and their vcf entries are incorrect.

    I used wgsim from samtools to generate simulated paired-end fastq read files of varying depths of coverage, then used bwa to align the simulated reads to the phi X reference. I followed the GATK recommendations to realign around possible indels, then to recalibrate, then I ran the UnifiedGenotyper using the -glm INDEL option along with --output-mode EMIT_ALL_SITES.

    The analysis info shows something like this:

    INFO 12:20:08,764 UnifiedGenotyper - Visited bases 5416
    INFO 12:20:08,765 UnifiedGenotyper - Callable bases 7
    INFO 12:20:08,765 UnifiedGenotyper - Confidently called bases 0
    INFO 12:20:08,765 UnifiedGenotyper - % callable bases of all loci 0.129
    INFO 12:20:08,765 UnifiedGenotyper - % confidently called bases of all loci 0.000
    INFO 12:20:08,766 UnifiedGenotyper - % confidently called bases of callable loci 0.000
    INFO 12:20:08,766 UnifiedGenotyper - Actual calls made 7
    INFO 12:20:08,766 TraversalEngine - Total runtime 1.08 secs, 0.02 min, 0.00 hours

    The corresponding vcf file shows (I added a space after the "GT:" to keep from happening):

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT somesample
    phix.fa 869 . T . 0 LowQual DP=183;MQ=48.18;MQ0=0;QD=0.00 GT: DP:PL ./.:183:0,0,0
    phix.fa 875 . GCT . 0 LowQual DP=190;MQ=49.29;MQ0=0;QD=0.00 GT: DP:PL ./.:190:0,0,0
    phix.fa 1408 . T . 0 LowQual DP=209;MQ=52.38;MQ0=0;QD=0.00 GT: DP:PL ./.:209:0,0,0
    phix.fa 1409 . A . 0 LowQual DP=213;MQ=52.17;MQ0=0;QD=0.00 GT: DP:PL ./.:213:0,0,0
    phix.fa 1949 . G . 0 LowQual DP=205;MQ=57.70;MQ0=0;QD=0.00 GT: DP:PL ./.:205:0,0,0
    phix.fa 1950 . A . 0 LowQual DP=205;MQ=57.82;MQ0=0;QD=0.00 GT: DP:PL ./.:205:0,0,0
    phix.fa 4225 . AC . 0 LowQual DP=179;MQ=57.51;MQ0=0;QD=0.00 GT: DP:PL ./.:179:0,0,0
    The vcf entries correspond to actual indels in the simulated data. For example, at position 870 there is an insertion of CC, and 876 there is a deletion of CT. The problem is the the "ALT" column shows a "." where it should have the actual alternate allele (TCC and G in those two cases).

    What can I do to 1) make the quality not zero; and 2) get the vcf file to correctly report the alternative allele?
  • NGSfan
    Senior Member
    • Apr 2009
    • 181

    #2
    Originally posted by Alex Renwick View Post
    I'm trying to get variant calling to work using a small set of simulated reads of the phi X genome. The problem I'm seeing is that indels are assigned a quality of zero and their vcf entries are incorrect.

    I used wgsim from samtools to generate simulated paired-end fastq read files of varying depths of coverage, then used bwa to align the simulated reads to the phi X reference. I followed the GATK recommendations to realign around possible indels, then to recalibrate, then I ran the UnifiedGenotyper using the -glm INDEL option along with --output-mode EMIT_ALL_SITES.

    The analysis info shows something like this:




    The corresponding vcf file shows (I added a space after the "GT:" to keep from happening):



    The vcf entries correspond to actual indels in the simulated data. For example, at position 870 there is an insertion of CC, and 876 there is a deletion of CT. The problem is the the "ALT" column shows a "." where it should have the actual alternate allele (TCC and G in those two cases).

    What can I do to 1) make the quality not zero; and 2) get the vcf file to correctly report the alternative allele?
    Hmmm.. you want to call all variants regardless of quality? try removing EMIT ALL SITES and lower the stand_emit_conf to 1

    Comment

    • Alex Renwick
      Member
      • Jul 2011
      • 44

      #3
      Originally posted by NGSfan View Post
      Hmmm.. you want to call all variants regardless of quality? try removing EMIT ALL SITES and lower the stand_emit_conf to 1
      I tried that--lowering the standard to zero, in fact--and without EMIT_ALL_SITES there is no output. It's like the indels are treated as nonvariant sites.

      Comment

      • oiiio
        Senior Member
        • Jan 2011
        • 105

        #4
        Hi guys,
        I have this exact problem. And I have tried -mmq 0 -mbq 0 -stand_emit_conf 0 -stand_call_conf 0 without seeing anything. I welcome any verification of my (non) findings with these parameters.

        I also tried using real base qualities from real data without seeing any change... still actively working on this, and I would like to hear any solution or suggestion people have.
        Last edited by oiiio; 11-18-2011, 03:33 PM.

        Comment

        • Alex Renwick
          Member
          • Jul 2011
          • 44

          #5
          I've done some experimenting and discovered that the problem comes from the amount of sequencing error in the input data.

          The data I am using is simulated using wgsim (distributed with samtools). The default error rate is .02. With that rate, UnifiedGenotyper makes no indel calls, irrespective of setting call / emit confidences to zero and minimum base quality to zero. Snp calls work just fine, though.

          UnifiedGenotyper is happy to make indel calls when the error rate is set to zero, and it works about half the time with an error rate of 0.01.

          I haven't found any parameters that allow the user to specify sensitivity to sequencing errors. Anybody else have ideas?

          Comment

          • oiiio
            Senior Member
            • Jan 2011
            • 105

            #6
            Nice find Alex,

            When you say it works half the time, do you mean that it is calling about half the indels in the sample? Or that half the samples you run UnifiedGenotyper on are making calls at all?

            Comment

            • Alex Renwick
              Member
              • Jul 2011
              • 44

              #7
              I've done around a dozen trials, and it seems to be all or nothing. Either all candidate sites are called, or none of them are.

              edit: An error rate of 0.005 seems ok, too.
              Last edited by Alex Renwick; 12-02-2011, 12:28 PM.

              Comment

              Latest Articles

              Collapse

              • SEQadmin2
                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                by SEQadmin2


                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                Here are nine questions we think about, in roughly the order they matter, before...
                06-18-2026, 07:11 AM
              • SEQadmin2
                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                by SEQadmin2


                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                ...
                06-02-2026, 10:05 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, 06-17-2026, 06:09 AM
              0 responses
              36 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-09-2026, 11:58 AM
              0 responses
              100 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-05-2026, 10:09 AM
              0 responses
              121 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-04-2026, 08:59 AM
              0 responses
              113 views
              0 reactions
              Last Post SEQadmin2  
              Working...