
  • GATK variant recalibrator input files

    Dear all,

    I am new to NGS and have been trying to run through the GATK variant calling pipeline on exome sequencing data. I'm currently having an issue with the variant quality score recalibrator, which gives the following error message.

    ERROR MESSAGE: Bad input: Values for HaplotypeScore annotation not detected for ANY training variant in the input callset.

    I've tried running VariantAnnotator on my UnifiedGenotyper VCF file, but that does not seem to correct the problem. I am also unsure whether it is my UnifiedGenotyper VCF file or my hapmap, dbSNP and omni1000g resource files that are missing the annotations. Any help/advice on this issue would be much appreciated.

    Thanks,
    Elliott

  • #2
    Also, do people usually obtain the resource files from the broad resource bundle, and if so I guess these should be annotated appropriately?



    • #3
      Hey reeso123,

      Could you give us the command used in GATK to produce the given error?

      It might be that HaplotypeScore is not among the default annotations, in which case you need to tell UnifiedGenotyper to add that annotation.
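One way to test this is to count how many records in the callset actually carry the annotation. The sketch below assumes a tab-delimited VCF with INFO in column 8 and that the annotation is stored under the key HaplotypeScore (the name passed to -an in this thread). count_info_key is a hypothetical helper name, not a GATK tool.

```shell
# count_info_key KEY FILE.vcf
# Prints "<records carrying KEY> <total records>" for the data lines of a VCF.
count_info_key() {
    awk -F'\t' -v key="$1" '
        /^#/ { next }                       # skip header lines
        {
            total++
            n = split($8, parts, ";")       # column 8 is the INFO field
            for (i = 1; i <= n; i++) {
                # accept "KEY=value" as well as a bare flag "KEY"
                if (parts[i] == key || index(parts[i], key "=") == 1) {
                    hits++
                    break
                }
            }
        }
        END { printf "%d %d\n", hits, total }
    ' "$2"
}
```

For example, count_info_key HaplotypeScore ./test_trio/SNP/chr22_snps.vcf prints two numbers; if the first is 0, no record in the callset has the annotation.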



      • #4
        Hi Boel,

        The commands I used for the UnifiedGenotyper function were

        java -jar GenomeAnalysisTK-1.2-64-gf62af02/GenomeAnalysisTK.jar \
        -glm BOTH \
        -R reference_genome/HGC/Homo_sapiens_GRCh37_53.fasta \
        -T UnifiedGenotyper \
        -I ./test_trio/reads.10462.recal.bam \
        -D DBsnp/b37/dbsnp_132_b37_sanger.vcf \
        -o ./test_trio/SNP/chr22_snps.vcf \
        -metrics ./test_trio/SNP/chr22metrics.metrics \
        -stand_call_conf 50.0 \
        -stand_emit_conf 10.0 \
        -L ./test_trio/Target_Intervals/chr22_target_interval.bed

        and the commands for variant recalibration were

        java -jar GenomeAnalysisTK-1.2-64-gf62af02/GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R reference_genome/HGC/Homo_sapiens_GRCh37_53.fasta \
        -input ./test_trio/SNP/chr22_snps.vcf \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ./hapMap/hapmap_3.3.b37.sites_sanger.vcf \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 ./omni/1000G_omni2.5.b37.sites_sanger.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 ./DBsnp/b37/dbsnp_132_b37_sanger.vcf \
        -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
        -recalFile ./test_trio/SNP/output.recal \
        -tranchesFile ./test_trio/SNP/output.tranches \
        -rscriptFile ./test_trio/SNP/output.plots.R


        As far as I'm aware, the VCF file created by the UnifiedGenotyper contains the annotations requested in the variant recalibrator command. I've also used GATK VariantAnnotator to try to add them in case they were not present!

        I've attached a subset of the VCF file used, in case this helps to identify the problem.

        Your help is much appreciated,
        Elliott


        • #5
          Originally posted by reeso123
          Also, do people usually obtain the resource files from the broad resource bundle, and if so I guess these should be annotated appropriately?
          I would recommend doing this. It will relieve a lot of stress. If your own annotation files are the slightest bit incorrect, GATK will likely throw errors.



          • #6
            Hi Elliott,

            ERROR MESSAGE: Bad input: Values for HaplotypeScore annotation not detected for ANY training variant in the input callset.
            I am not sure what is going on, but the error might indicate that none of the known variants (hapmap, 1000g or dbsnp) are present in your VCF file. Could that be the case?
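This can be checked directly by counting how many callset sites also occur in a resource file. The following is a minimal sketch that matches on CHROM and POS only (columns 1 and 2) and ignores alleles. shared_sites is a hypothetical helper name, and it holds every resource site in memory, so it may be slow on a full dbSNP file.

```shell
# shared_sites CALLSET.vcf RESOURCE.vcf
# Prints the number of callset records whose CHROM and POS also occur in the resource.
shared_sites() {
    awk -F'\t' '
        /^#/ { next }                               # skip header lines in both files
        NR == FNR { seen[$1 ":" $2] = 1; next }     # first file read: the resource
        { site = $1 ":" $2; if (site in seen) n++ } # second file read: the callset
        END { print n + 0 }
    ' "$2" "$1"
}
```

A result of 0 for every training resource would fit the situation described above, where none of the known variants are found in the callset.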



            • #7
              Hi all,

              Thanks so much for your input. I think I may have corrected the problem: my hapmap, 1000g and dbSNP files were wrong, in that a SNP that should have been located at chr22 was instead at chr2chr2! This was my own error, caused by a bug in a Perl script I wrote to match the BAM contig names with the contig names of the SNPs in the resource files. It generally seems to be a bit of a nightmare obtaining the appropriate reference, hapmap, 1000g etc. to match the BAM when the data I have received has already been processed elsewhere.

              Elliott
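A naming mismatch like this can be caught by comparing the contig names in column 1 of the callset with those in each resource file. This is a rough sketch using standard tools only; contigs and unmatched_contigs are hypothetical helper names.

```shell
# contigs FILE.vcf
# Prints the distinct CHROM values of the data lines, sorted.
contigs() {
    awk -F'\t' '!/^#/ { print $1 }' "$1" | sort -u
}

# unmatched_contigs CALLSET.vcf RESOURCE.vcf
# Prints contig names used in the callset that never appear in the resource.
unmatched_contigs() {
    callset_names=$(mktemp)
    resource_names=$(mktemp)
    contigs "$1" > "$callset_names"
    contigs "$2" > "$resource_names"
    comm -23 "$callset_names" "$resource_names"   # lines unique to the callset
    rm -f "$callset_names" "$resource_names"
}
```

Any name this prints is a contig on which the callset cannot overlap the resource at all, which would be the case for a callset contig when the resource names it chr2chr2.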



              • #8
                Glad you solved it!
                And to answer an earlier question: I also use much of the data from the Broad resource bundle.



                • #9
                  Hi all,

                  I am trying to use GATK as well, but I receive the following error message when I start the VariantRecalibrator: "Argument with name '--cluster_file' (-clusterFile) is missing."

                  My command is similar to the previously mentioned ones:
                  java -Xmx4g -jar GenomeAnalysisTK.jar \
                  -T VariantRecalibrator \
                  -R hg19.fasta \
                  -mode SNP \
                  --maxGaussians 6 \
                  -B:input,VCF snps.raw.vcf \
                  -B:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
                  -B:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.vcf \
                  -B:dbsnp,known=true,training=false,truth=false,prior=8.0 dbsnp.vcf \
                  -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
                  -recalFile out.recal \
                  -tranchesFile out.tranches \
                  -rscriptFile out.plots.R


                  Can anyone see the mistake? Does anyone else need to use the clusterFile argument? What is it exactly? I would be really happy for any help or recommendations.



                  • #10
                    Hi all,

                    I'm also getting a similar error, in my case:
                    Code:
                    MESSAGE: Bad input: Values for FisherStrand annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.
                    I'm using the resource files from the Broad GATK bundle. My VCF file to be recalibrated does have these annotations, which I added with the VariantAnnotator tool. Do I have to add them to the bundle files as well? I can see they don't have them.

                    hapmap_3.3.hg19.sites.vcf:
                    Code:
                    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
                    chr1	566875	rs2185539	C	T	.	PASS	AC=66;AF=0.02369;AN=2786;set=MKK-YRI
                    chr1	567753	rs11510103	A	G	.	PASS	AC=11;AF=0.00404;AN=2724;set=TSI-GIH-CHD-CEU-JPT
                    chr1	728951	rs11240767	C	T	.	PASS	AC=139;AF=0.05044;AN=2756;set=MKK-YRI-LWK-MEX-ASW
                    chr1	752721	rs3131972	A	G	.	PASS	AC=1660;AF=0.59456;AN=2792;set=Intersection
                    1000G_omni2.5.hg19.sites.vcf:
                    Code:
                    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
                    chr1	534247	SNP1-524110	C	T	.	PASS	CR=99.93414;GentrainScore=0.7423;HW=1.0
                    chr1	565286	SNP1-555149	C	T	.	PASS	CR=98.8266;GentrainScore=0.7029;HW=1.0
                    chr1	569624	SNP1-559487	T	C	.	PASS	CR=97.8022;GentrainScore=0.8070;HW=1.0
                    chr1	689186	rs4000335	G	A	.	NOT_POLY_IN_1000G	CR=99.86885;GentrainScore=0.7934;HW=1.0
                    dbsnp_132.hg19.vcf:
                    Code:
                    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	
                    chrM	64	rs3883917	C	T	.	PASS	ASP;RSPOS=64;SAO=0;SCS=0;SLO;SSR=0;VC=SNP;VP=050100000005000000000100;WGT=1;dbSNPBuildID=108
                    chrM	146	rs72619361	T	C	.	PASS	ASP;G5;G5A;GNO;RSPOS=146;SAO=0;SCS=0;SSR=0;VC=SNP;VP=050000000005030100000100;WGT=1;dbSNPBuildID=130
                    chrM	152	rs117135796	T	C	.	PASS	ASP;GNO;RSPOS=152;SAO=0;SCS=0;SSR=0;VC=SNP;VP=050000000005000100000100;WGT=1;dbSNPBuildID=132
                    Thanks,
                    Carlos



                    • #11
                      Problem in running VariantRecalibrator

                      Hello Everyone,

                      I am trying to run the GATK VariantRecalibrator but am getting an error message.
                      The command I am using to run it is:


                      java -jar GenomeAnalysisTK.jar -R results/test_human.fasta -T VariantRecalibrator -input results/exome_snp.vcf -resource:hapmap, known=false,training=true,truth=true,prior=15.0 results/hapmap_3.3.hg19.vcf -resource:omni, known=false,training=true,truth=false,prior=12.0 results/1000G_omni2.5.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 results/00-All.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -recalFile results/exome_variantscore.recal -tranchesFile exomeoutput.tranches -rscriptFile exomeoutput.plots.R

                      error message is ERROR MESSAGE: Invalid argument value 'results/hapmap_3.3.hg19.vcf' at position 8.
                      ##### ERROR Invalid argument value 'results/1000G_omni2.5.hg19.sites.vcf' at position 11.

                      I have downloaded both hapmap and 1000 genomes vcf file from GATK resource bundle.

                      Any help would be appreciated.

                      Thanks in advance
                      Neha
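One thing that stands out in the command above is the space after the comma in the hapmap and omni -resource arguments. A shell splits arguments on spaces, so the tag list becomes an argument of its own and the file path moves one slot along. The sketch below only prints what the shell hands over. That GATK counts positions from zero starting after GenomeAnalysisTK.jar is an assumption, although it matches both positions in the error (8 and 11). show_positions is a hypothetical helper name.

```shell
# show_positions ARG...
# Prints each argument with its zero-based position, exactly as the shell splits them.
show_positions() {
    i=0
    for arg in "$@"; do
        printf '%d\t%s\n' "$i" "$arg"
        i=$((i + 1))
    done
}
```

Calling show_positions with everything that follows GenomeAnalysisTK.jar in the command above lists results/hapmap_3.3.hg19.vcf at position 8 and results/1000G_omni2.5.hg19.sites.vcf at position 11. The dbsnp argument, which has no space after the comma, keeps its tags and stays a single argument.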



                      • #12
                        Can you post the first 20 lines of your VCF files?

                        results/1000G_omni2.5.hg19.sites.vcf

                        results/hapmap_3.3.hg19.vcf



                        • #13
                          Originally posted by raonyguimaraes
                          Can you post the first 20 lines of your VCF files?

                          results/1000G_omni2.5.hg19.sites.vcf

                          results/hapmap_3.3.hg19.vcf
                          I am attaching a doc file with the start of the hapmap 3.3 and 1000 Genomes files. Looking at the hapmap 3.3 file, I suspect something is wrong with it, or maybe I am just confused and the file is meant to look like this.

                          Neha


                          • #14
                            Originally posted by neha
                            I am attaching a doc file with the start of the hapmap 3.3 and 1000 Genomes files. Looking at the hapmap 3.3 file, I suspect something is wrong with it, or maybe I am just confused and the file is meant to look like this.

                            Neha
                            Hey, did you get a chance to look at the files?

                            Any help would be appreciated.

                            Neha



                            • #15
                              Hello Everyone,

                              When using VariantRecalibrator walker of GATK I am facing a small problem.

                              I am using the following command

                              java -jar ./../GenomeAnalysisTK.jar -T VariantRecalibrator -R test_human.fasta -input exome_snp.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 00-All.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -recalFile exome_variantscore.recal -tranchesFile exomeoutput.tranches -rscriptFile exomeoutput.plots

                              With this I get the following warning message:

                              Rscript not found in environment path. exomeoutput.plots will be generated but PDF plots will not.

                              Can anyone please tell me how to add Rscript to the path? I am getting a bit confused about it.

                              Thanks in advance.
                              Neha
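The warning says that no executable named Rscript was found in any directory listed in the PATH environment variable. A minimal sketch for checking this, and for adding a directory to PATH for the current shell session, is below. have_rscript and add_to_path are hypothetical helper names, and the directory that actually contains Rscript depends on how R was installed.

```shell
# have_rscript
# Succeeds if an Rscript executable can be found through PATH.
have_rscript() {
    command -v Rscript >/dev/null 2>&1
}

# add_to_path DIR
# Prepends DIR to PATH for the current shell session only.
add_to_path() {
    PATH="$1:$PATH"
    export PATH
}
```

For example, have_rscript || add_to_path /path/to/R/bin, where /path/to/R/bin is a placeholder for the directory holding Rscript on your system. The change lasts only for the current session unless it is also added to a shell startup file.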
