Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calling invariant sites with GATK

    Hello,

    I would like to obtain a vcf with variant AND INVARIANT sites using GATK.
    I am not sure about the options to use in order to obtain invariant sites.
    I tried a few but still I have only variants.

    Anyone can help ?

    Thanks a lot !

    Muriel

  • #2
    Hi Muriel,

    What you want is to run the GATK's HaplotypeCaller in GVCF mode, with the arguments --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 added to your command line. Have a look at the documentation here: http://www.broadinstitute.org/gatk/g...-discovery-ovw and here: http://www.broadinstitute.org/gatk/g...rticle?id=3893

    These documents cover joint analysis of cohorts, but the basic principle is the same for single samples, you just skip the joint genotyping step.

    Comment


    • #3
      Hi,
      Thank you vdauwera for your answer.
      The thing is that when you do that, you can't have a bam containing multiple samples.
      Muriel

      Comment


      • #4
        If you want to use GATK's UnifiedGenotyper you just need to add: --output EMIT_ALL_SITES

        Comment


        • #5
          Thank you cariboudoug !
          Indeed, I tried this already but I was thinking that it's probably better to use HaplotypeCaller since this is the new version of UnifiedGenotyper...

          Comment


          • #6
            Originally posted by MurielGB View Post
            Hi,
            Thank you vdauwera for your answer.
            The thing is that when you do that, you can't have a bam containing multiple samples.
            Muriel
            Oh, if you have multiple samples, you can just call each in turn using the GVCF mode, then you do the joint genotyping to get a single multi-sample VCF with the results.

            If your problem is that you have all the samples in one bam, the simplest is to split it into per-sample bams, or you can pass the bam with a read filter to specify a single sample to run on in each run.

            Comment


            • #7
              Hi !

              I had a look on this new pipeline.

              I launched variant calling on a bam file containing only one sample with the options mentioned before :
              java -Xmx300g -jar /GenomeAnalysisTK/3.0.0/GenomeAnalysisTK.jar \
              -T HaplotypeCaller -R ref.fasta \
              -I ${INPUT}.bam \
              --genotyping_mode DISCOVERY \
              -stand_emit_conf 0 \
              -stand_call_conf 0 \
              -o ${INPUT}\_VC2.vcf \
              --emitRefConfidence GVCF \
              --variant_index_type LINEAR \
              --variant_index_parameter 128000 \
              -nct 16
              I still don't have all positions in the output vcd as you can see :
              KE340296.1 822 . C <NON_REF> . . END=823GTP:GQ:MIN_DP:PL 0/0:7:18:7:0,18,270
              KE340296.1 824 . C <NON_REF> . . END=824GTP:GQ:MIN_DP:PL 0/0:7:0:7:0,0,219
              KE340296.1 825 . C <NON_REF> . . END=827GTP:GQ:MIN_DP:PL 0/0:7:18:7:0,18,270
              KE340296.1 828 . G <NON_REF> . . END=829GTP:GQ:MIN_DP:PL 0/0:7:0:7:0,0,199
              KE340296.1 830 . T <NON_REF> . . END=854GTP:GQ:MIN_DP:PL 0/0:6:18:6:0,12,180
              KE340296.1 855 . C <NON_REF> . . END=855GTP:GQ:MIN_DP:PL 0/0:7:1:7:0,1,233
              KE340296.1 856 . T <NON_REF> . . END=890GTP:GQ:MIN_DP:PL 0/0:4:6:3:0,6,90
              KE340296.1 891 . G <NON_REF> . . END=905GTP:GQ:MIN_DP:PL 0/0:4:0:0:0,0,0
              KE340296.1 906 . T <NON_REF> . . END=944GTP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,37
              KE340296.1 945 . C <NON_REF> . . END=104GTP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
              Do you know why ?

              Muriel

              Comment


              • #8
                Sure, if you want a call for every single position (instead of averaged blocks, which are used to compress the information in order to minimize file size) you just change -ERC GVCF to -ERC BP_RESOLUTION.

                Comment


                • #9
                  Oh great !!
                  I was wondering what was the difference between these two options !
                  Thanks a lot for your help vdauwera !

                  Comment


                  • #10
                    Well... I am wondering if I won't have another problem using -ERC BP_RESOLUTION.
                    Because after that I obtain a vcf file per sample containing both variant and invariant sites, I want to run the joint analysis to have a vcf combining all my samples.

                    In the documentation regarding the GenotypeGVCF option that perform the joint analysis (here : https://www.broadinstitute.org/gatk/...typeGVCFs.html), it is written that : GenotypeGVCFs merges gVCF records that were produced as part of the "single sample discovery" pipeline using the '-ERC GVCF' mode of the Haplotype Caller.

                    So I am wondering if this is gonna work since I use -ERC BP_RESOLUTION rather than GVCF ?

                    Thanks a lot !

                    Muriel

                    Comment


                    • #11
                      That's completely fine actually, the BP_RESOLUTION gvcfs are also a valid input for the joint genotyping step. I'll update the documentation to be more accurate and inclusive. Feel free to post any further questions or comments about GATK tools on our support forum

                      Comment


                      • #12
                        So cooool !!!
                        Thanks a lot, I am very happy of this new tool, it's gonna allow population geneticists to do so much !

                        Comment


                        • #13
                          I have another problem, sorry vdauwera

                          I did variant calling on two samples individually using the option described and it works well, I have both variant and invariant sites.

                          However, the QUAL and INFO fields are missing in most cases :
                          KE332545.1 64 . G <NON_REF> . . . GT:ADP:GQ:PL 0/0:17,0:17:51:0,51,660
                          KE332545.1 65 . G <NON_REF> . . . GT:ADP:GQ:PL 0/0:17,0:17:51:0,51,655
                          KE332545.1 66 . G T,<NON_REF> 1.14 . BaseQRankSum=-2.602;ClippingRankSum=-0.903;DP=24;MLEAC=1,0;MLEAF=0.500,0.00;MQ=38.57;MQ0=0;MQRankSum=1.115;ReadPosRankSum=0.372 GT:ADP:GQ:PL:SB 0/1:14,4,0:18:23:23,0,449,64,460,524:10,4,3,1
                          KE332545.1 67 . T <NON_REF> . . . GT:ADP:GQ:PL 0/0:18,0:18:53:0,53,583
                          KE332545.1 68 . T <NON_REF> . . . GT:ADP:GQ:PL 0/0:18,0:18:54:0,54,659
                          KE332545.1 69 . T <NON_REF> . . . GT:ADP:GQ:PL 0/0:18,0:18:54:0,54,699
                          It seems that these two fields are present only in case of an alternative allele existing right ?
                          Why ? How I can I use my filtering pipeline for the invariant sites !

                          Then I performed the joint analysis and this is the same.
                          This is a shame because I really hoped I will also have these informations for invariant sites...
                          Is there something that I missed ?

                          Thank you,

                          Muriel

                          Comment


                          • #14
                            Hi Muriel

                            Were you able to find an answer to your question?

                            Thanks

                            Comment


                            • #15
                              We need the "QUAL" of invariant sites (homogeneous reference sites) from GATK. Can we just plug "0" into the formula -10 log_10 Prob (0 variant) since ALT is ". "?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              69 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X