I have five human exome data sets that I am putting through the GATK "Best Practices" pipeline. After running things through UnifiedGenotyper, I separated the SNPs from the indels and then ran VariantRecalibrator on the SNPs:
java -Xmx36g -jar /home/efoss/sequencing/GenomeAnalysisTK-1.1-31-gdc8398e/GenomeAnalysisTK.jar -T VariantRecalibrator -R /mnt/fred/home/efoss/gene_databases/human_genome/human_g1k_v37/human_g1k_v37.fasta --percentBadVariants 0.05 --maxGaussians 6 -B:input,VCF /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.raw.vcf -B:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 /mnt/fred/home/efoss/gene_databases/GATK_resource_files/hapmap_3.3.b37.sites.vcf -Bmni,VCF,known=false,training=true,truth=false,prior=12.0 /mnt/fred/home/efoss/gene_databases/GATK_resource_files/1000G_omni2.5.b37.sites.vcf -B:dbsnp,VCF,known=true,training=false,truth=false,prior=8.0 /mnt/fred/home/efoss/gene_databases/GATK_resource_files/dbsnp_132.b37.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -mode SNP -recalFile /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.recal -tranchesFile /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.tranches -rscriptFile /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.plots.R
I then ran ApplyRecalibration:
java -Xmx36g -jar /home/efoss/sequencing/GenomeAnalysisTK-1.1-31-gdc8398e/GenomeAnalysisTK.jar -T ApplyRecalibration -R /mnt/fred/home/efoss/gene_databases/human_genome/human_g1k_v37/human_g1k_v37.fasta -B:input,VCF /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.raw.vcf --ts_filter_level 99.0 -tranchesFile /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.tranches -recalFile /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.recal -o /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.recalibrated.raw.vcf
My problem is that there are plenty of bad calls getting through ApplyRecalibration. I pick up quite a few mutations that appear in 5 out of 5 (unrelated) individuals, so I can be confident that they are sequencing artifacts. However, these mutations are often listed as passing the filters. Here are two examples:
1 801943 rs7516866 C T 44026.78 PASS AC=10;AF=1.00;AN=10;BaseQRankSum=4.380;DB;DP=1449;DS;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=16.2975;MQ=47.88;MQ0=66;MQRankSum=7.651;QD=30.38;ReadPosRankSum=5.739;SB=-18779.46;VQSLOD=-4.0745;culprit=ReadPosRankSum GT:ADP:GQ:PL 1/1:10,268:278:99:7953,572,0 1/1:6,293:299:99:9124,682,0 1/1:5,268:273:99:7936,586,0 1/1:4,295:299:99:9809,761,0 1/1:10,290:300:99:8784,610,0
1 802191 rs71507475 G A 1686.49 PASS AC=5;AF=0.50;AN=10;BaseQRankSum=2.704;DB;DP=1499;DS;Dels=0.00;FS=5.855;HRun=0;HaplotypeScore=44.5371;MQ=22.55;MQ0=460;MQRankSum=-3.161;QD=1.13;ReadPosRankSum=-0.860;SB=-867.92;VQSLOD=-15469854734583906.0000;culprit=HaplotypeScore GT:ADP:GQ:PL 0/1:224,69:300:99:286,0,2161 0/1:225,71:299:99:264,0,2035 0/1:237,58:300:99:230,0,2248 0/1:217,75:300:99:525,0,1676 0/1:213,77:300:99:426,0,1708
I gathered up all of these 5/5 sequencing artifacts and checked the results of the exome analysis published in Nature Genetics and not a single one appeared in those results, so clearly I'm either running the analysis incorrectly or I'm interpreting the results i ncorrectly (or both).
I would appreciate any suggestions.
Thank you very much for the help.
Eric
java -Xmx36g -jar /home/efoss/sequencing/GenomeAnalysisTK-1.1-31-gdc8398e/GenomeAnalysisTK.jar -T VariantRecalibrator -R /mnt/fred/home/efoss/gene_databases/human_genome/human_g1k_v37/human_g1k_v37.fasta --percentBadVariants 0.05 --maxGaussians 6 -B:input,VCF /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.raw.vcf -B:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 /mnt/fred/home/efoss/gene_databases/GATK_resource_files/hapmap_3.3.b37.sites.vcf -Bmni,VCF,known=false,training=true,truth=false,prior=12.0 /mnt/fred/home/efoss/gene_databases/GATK_resource_files/1000G_omni2.5.b37.sites.vcf -B:dbsnp,VCF,known=true,training=false,truth=false,prior=8.0 /mnt/fred/home/efoss/gene_databases/GATK_resource_files/dbsnp_132.b37.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -mode SNP -recalFile /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.recal -tranchesFile /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.tranches -rscriptFile /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.plots.R
I then ran ApplyRecalibration:
java -Xmx36g -jar /home/efoss/sequencing/GenomeAnalysisTK-1.1-31-gdc8398e/GenomeAnalysisTK.jar -T ApplyRecalibration -R /mnt/fred/home/efoss/gene_databases/human_genome/human_g1k_v37/human_g1k_v37.fasta -B:input,VCF /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.raw.vcf --ts_filter_level 99.0 -tranchesFile /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.tranches -recalFile /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.recal -o /mnt/fred/home/efoss/sequencing/MDS/all_five_patients_UnifiedGenotyper_output_SNPs_091711_1.recalibrated.raw.vcf
My problem is that there are plenty of bad calls getting through ApplyRecalibration. I pick up quite a few mutations that appear in 5 out of 5 (unrelated) individuals, so I can be confident that they are sequencing artifacts. However, these mutations are often listed as passing the filters. Here are two examples:
1 801943 rs7516866 C T 44026.78 PASS AC=10;AF=1.00;AN=10;BaseQRankSum=4.380;DB;DP=1449;DS;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=16.2975;MQ=47.88;MQ0=66;MQRankSum=7.651;QD=30.38;ReadPosRankSum=5.739;SB=-18779.46;VQSLOD=-4.0745;culprit=ReadPosRankSum GT:ADP:GQ:PL 1/1:10,268:278:99:7953,572,0 1/1:6,293:299:99:9124,682,0 1/1:5,268:273:99:7936,586,0 1/1:4,295:299:99:9809,761,0 1/1:10,290:300:99:8784,610,0
1 802191 rs71507475 G A 1686.49 PASS AC=5;AF=0.50;AN=10;BaseQRankSum=2.704;DB;DP=1499;DS;Dels=0.00;FS=5.855;HRun=0;HaplotypeScore=44.5371;MQ=22.55;MQ0=460;MQRankSum=-3.161;QD=1.13;ReadPosRankSum=-0.860;SB=-867.92;VQSLOD=-15469854734583906.0000;culprit=HaplotypeScore GT:ADP:GQ:PL 0/1:224,69:300:99:286,0,2161 0/1:225,71:299:99:264,0,2035 0/1:237,58:300:99:230,0,2248 0/1:217,75:300:99:525,0,1676 0/1:213,77:300:99:426,0,1708
I gathered up all of these 5/5 sequencing artifacts and checked the results of the exome analysis published in Nature Genetics and not a single one appeared in those results, so clearly I'm either running the analysis incorrectly or I'm interpreting the results i ncorrectly (or both).
I would appreciate any suggestions.
Thank you very much for the help.
Eric
Comment