Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • MurielGB
    Member
    • Oct 2013
    • 51

    Calling invariant sites with GATK

    Hello,

    I would like to obtain a vcf with variant AND INVARIANT sites using GATK.
    I am not sure about the options to use in order to obtain invariant sites.
    I tried a few but still I have only variants.

    Anyone can help ?

    Thanks a lot !

    Muriel
  • vdauwera
    Member
    • Apr 2012
    • 43

    #2
    Hi Muriel,

    What you want is to run the GATK's HaplotypeCaller in GVCF mode, with the arguments --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 added to your command line. Have a look at the documentation here: http://www.broadinstitute.org/gatk/g...-discovery-ovw and here: http://www.broadinstitute.org/gatk/g...rticle?id=3893

    These documents cover joint analysis of cohorts, but the basic principle is the same for single samples, you just skip the joint genotyping step.

    Comment

    • MurielGB
      Member
      • Oct 2013
      • 51

      #3
      Hi,
      Thank you vdauwera for your answer.
      The thing is that when you do that, you can't have a bam containing multiple samples.
      Muriel

      Comment

      • cariboudoug
        Member
        • Oct 2013
        • 17

        #4
        If you want to use GATK's UnifiedGenotyper you just need to add: --output EMIT_ALL_SITES

        Comment

        • MurielGB
          Member
          • Oct 2013
          • 51

          #5
          Thank you cariboudoug !
          Indeed, I tried this already but I was thinking that it's probably better to use HaplotypeCaller since this is the new version of UnifiedGenotyper...

          Comment

          • vdauwera
            Member
            • Apr 2012
            • 43

            #6
            Originally posted by MurielGB View Post
            Hi,
            Thank you vdauwera for your answer.
            The thing is that when you do that, you can't have a bam containing multiple samples.
            Muriel
            Oh, if you have multiple samples, you can just call each in turn using the GVCF mode, then you do the joint genotyping to get a single multi-sample VCF with the results.

            If your problem is that you have all the samples in one bam, the simplest is to split it into per-sample bams, or you can pass the bam with a read filter to specify a single sample to run on in each run.

            Comment

            • MurielGB
              Member
              • Oct 2013
              • 51

              #7
              Hi !

              I had a look on this new pipeline.

              I launched variant calling on a bam file containing only one sample with the options mentioned before :
              java -Xmx300g -jar /GenomeAnalysisTK/3.0.0/GenomeAnalysisTK.jar \
              -T HaplotypeCaller -R ref.fasta \
              -I ${INPUT}.bam \
              --genotyping_mode DISCOVERY \
              -stand_emit_conf 0 \
              -stand_call_conf 0 \
              -o ${INPUT}\_VC2.vcf \
              --emitRefConfidence GVCF \
              --variant_index_type LINEAR \
              --variant_index_parameter 128000 \
              -nct 16
              I still don't have all positions in the output vcd as you can see :
              KE340296.1 822 . C <NON_REF> . . END=823GTP:GQ:MIN_DP:PL 0/0:7:18:7:0,18,270
              KE340296.1 824 . C <NON_REF> . . END=824GTP:GQ:MIN_DP:PL 0/0:7:0:7:0,0,219
              KE340296.1 825 . C <NON_REF> . . END=827GTP:GQ:MIN_DP:PL 0/0:7:18:7:0,18,270
              KE340296.1 828 . G <NON_REF> . . END=829GTP:GQ:MIN_DP:PL 0/0:7:0:7:0,0,199
              KE340296.1 830 . T <NON_REF> . . END=854GTP:GQ:MIN_DP:PL 0/0:6:18:6:0,12,180
              KE340296.1 855 . C <NON_REF> . . END=855GTP:GQ:MIN_DP:PL 0/0:7:1:7:0,1,233
              KE340296.1 856 . T <NON_REF> . . END=890GTP:GQ:MIN_DP:PL 0/0:4:6:3:0,6,90
              KE340296.1 891 . G <NON_REF> . . END=905GTP:GQ:MIN_DP:PL 0/0:4:0:0:0,0,0
              KE340296.1 906 . T <NON_REF> . . END=944GTP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,37
              KE340296.1 945 . C <NON_REF> . . END=104GTP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
              Do you know why ?

              Muriel

              Comment

              • vdauwera
                Member
                • Apr 2012
                • 43

                #8
                Sure, if you want a call for every single position (instead of averaged blocks, which are used to compress the information in order to minimize file size) you just change -ERC GVCF to -ERC BP_RESOLUTION.

                Comment

                • MurielGB
                  Member
                  • Oct 2013
                  • 51

                  #9
                  Oh great !!
                  I was wondering what was the difference between these two options !
                  Thanks a lot for your help vdauwera !

                  Comment

                  • MurielGB
                    Member
                    • Oct 2013
                    • 51

                    #10
                    Well... I am wondering if I won't have another problem using -ERC BP_RESOLUTION.
                    Because after that I obtain a vcf file per sample containing both variant and invariant sites, I want to run the joint analysis to have a vcf combining all my samples.

                    In the documentation regarding the GenotypeGVCF option that perform the joint analysis (here : https://www.broadinstitute.org/gatk/...typeGVCFs.html), it is written that : GenotypeGVCFs merges gVCF records that were produced as part of the "single sample discovery" pipeline using the '-ERC GVCF' mode of the Haplotype Caller.

                    So I am wondering if this is gonna work since I use -ERC BP_RESOLUTION rather than GVCF ?

                    Thanks a lot !

                    Muriel

                    Comment

                    • vdauwera
                      Member
                      • Apr 2012
                      • 43

                      #11
                      That's completely fine actually, the BP_RESOLUTION gvcfs are also a valid input for the joint genotyping step. I'll update the documentation to be more accurate and inclusive. Feel free to post any further questions or comments about GATK tools on our support forum

                      Comment

                      • MurielGB
                        Member
                        • Oct 2013
                        • 51

                        #12
                        So cooool !!!
                        Thanks a lot, I am very happy of this new tool, it's gonna allow population geneticists to do so much !

                        Comment

                        • MurielGB
                          Member
                          • Oct 2013
                          • 51

                          #13
                          I have another problem, sorry vdauwera

                          I did variant calling on two samples individually using the option described and it works well, I have both variant and invariant sites.

                          However, the QUAL and INFO fields are missing in most cases :
                          KE332545.1 64 . G <NON_REF> . . . GT:ADP:GQ:PL 0/0:17,0:17:51:0,51,660
                          KE332545.1 65 . G <NON_REF> . . . GT:ADP:GQ:PL 0/0:17,0:17:51:0,51,655
                          KE332545.1 66 . G T,<NON_REF> 1.14 . BaseQRankSum=-2.602;ClippingRankSum=-0.903;DP=24;MLEAC=1,0;MLEAF=0.500,0.00;MQ=38.57;MQ0=0;MQRankSum=1.115;ReadPosRankSum=0.372 GT:ADP:GQ:PL:SB 0/1:14,4,0:18:23:23,0,449,64,460,524:10,4,3,1
                          KE332545.1 67 . T <NON_REF> . . . GT:ADP:GQ:PL 0/0:18,0:18:53:0,53,583
                          KE332545.1 68 . T <NON_REF> . . . GT:ADP:GQ:PL 0/0:18,0:18:54:0,54,659
                          KE332545.1 69 . T <NON_REF> . . . GT:ADP:GQ:PL 0/0:18,0:18:54:0,54,699
                          It seems that these two fields are present only in case of an alternative allele existing right ?
                          Why ? How I can I use my filtering pipeline for the invariant sites !

                          Then I performed the joint analysis and this is the same.
                          This is a shame because I really hoped I will also have these informations for invariant sites...
                          Is there something that I missed ?

                          Thank you,

                          Muriel

                          Comment

                          • rach16
                            Junior Member
                            • Oct 2014
                            • 2

                            #14
                            Hi Muriel

                            Were you able to find an answer to your question?

                            Thanks

                            Comment

                            • gshieh2
                              Junior Member
                              • Jun 2017
                              • 2

                              #15
                              We need the "QUAL" of invariant sites (homogeneous reference sites) from GATK. Can we just plug "0" into the formula -10 log_10 Prob (0 variant) since ALT is ". "?

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                                Here are nine questions we think about, in roughly the order they matter, before...
                                06-18-2026, 07:11 AM
                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-17-2026, 06:09 AM
                              0 responses
                              40 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              102 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              123 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              114 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...