Hi all,
I have just started to try to work with VCF files, and have read through the specifications here (https://github.com/samtools/hts-specs) and also on the 1000 genomes website. However, I am still a bit confused about the difference between Genotype Quality (GQ) and Genotype Likelihoods (PL) in the SAMPLE columns. Specifically, I am using the mgp.v3.snps.rsIDdbSNPv137.vcf file downloaded from Sanger Mouse Genomes (ftp-mouse.sanger.ac.uk/current_snps/mgp.v3.snps.rsIDdbSNPv137.vcf.gz), and am confused by the following line:
Code:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 129P2 129S1 129S5 AJ AKRJ BALBcJ C3HHeJ C57BL6NJ CASTEiJ ....etc
1 185273864 . T G,A 208.22 StrandBias AC1=1;AC=12,2;AF1=0.5336;AN=36;DP4=251,455,132,218;DP=1083;MDV=99;MQ=31;MSD=27;PV0=0.054;PV1=1;PV2=1;PV3=1;PV4=0.054,1,1,1;QD=0.8148;SB=0.8512;VDB=0.0308 GT:GQ:DP:SP:PL:FI 1/1:99:27:13:198,0,9,.,.,.:1 1/1:99:91:40:255,0,57,.,.,.:0 1/1:99:19:19:129,0,56,.,.,.:1 0/0:.:43:3:0,.,.,.,.,.:1 1/1:99:53:44:255,0,163,.,.,.:0 0/0:.:69:17:0,.,.,.,.,.:1 0/0:.:61:7:0,.,.,.,.,.:1 0/0:.:61:0:0,.,.,.,.,.:1 1/1:99:59:38:255,0,107,.,.,.:1 ......etc
It is the field for CASTEiJ that confuses me:
FORMAT = GT:GQ:DP:SP:PL:FI
CASTEiJ = 1/1:99:59:38:255,0,107,.,.,.:1
GT indicates the genotype is Alt1/Alt1, and the GQ field (99) suggests this is very unlikely to be wrong. However, the PL field (255,0,107,.,.,.) suggests that the 0/1 (Ref/Alt1, i.e. heterozygous) genotype is the most likely, since the second value, which I take to correspond to 0/1, is 0.
So I do not understand this apparent discrepancy between the GT and PL fields.
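To make the reading of the PL field explicit, here is a minimal Python sketch. It assumes the PL values follow the genotype ordering given in the VCF specification (for alleles a <= b, the genotype a/b sits at index b*(b+1)/2 + a) and that the lowest PL marks the most likely genotype. The function names are hypothetical, and whether this particular file follows that ordering is exactly what is in question.

```python
def genotype_order(n_alleles):
    """Diploid genotypes in VCF specification order.

    For alleles a <= b, genotype a/b has index b*(b+1)//2 + a.
    """
    return [f"{a}/{b}" for b in range(n_alleles) for a in range(b + 1)]


def parse_pl(pl_field):
    """Split a PL string on commas; '.' (missing) becomes None."""
    return [None if v == "." else int(v) for v in pl_field.split(",")]


def most_likely_genotype(pl_field, n_alleles):
    """Return the genotype with the lowest non-missing PL value."""
    scored = [
        (pl, gt)
        for pl, gt in zip(parse_pl(pl_field), genotype_order(n_alleles))
        if pl is not None
    ]
    return min(scored)[1]


# The record has REF=T and ALT=G,A, i.e. three alleles and six genotypes.
print(genotype_order(3))
print(most_likely_genotype("255,0,107,.,.,.", 3))
```

Under these assumptions the six PL positions map to 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, and the CASTEiJ value 255,0,107,.,.,. points to 0/1 rather than the 1/1 reported in GT.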
More generally, I do not completely understand what the difference is between a Genotype Quality (GQ), and the Genotype Likelihoods (PL). If anyone is able to explain these to me in simple (not too statistical!) terms I would be incredibly grateful!
Ultimately, I would like to apply a simple filter to extract all CASTEiJ SNPs that are "high quality" (very unlikely to be wrong). I thought a good way to do this would be to take all entries with the CASTEiJ field like this: 1/1:xx:xx:xx:xx:1 (i.e. Alt/Alt, that passes filter), with either a "good" GQ score, or a "good" PL score for the Alt/Alt. Does this sound sensible?
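The filter described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names and the `min_gq=99` threshold are hypothetical choices, the file is assumed to be tab-separated with the FORMAT column in position 9, and only the GQ variant of the test is shown (not the PL one).

```python
def sample_passes(format_field, sample_field, min_gq=99):
    """True if the sample is 1/1, has FI=1, and GQ >= min_gq.

    min_gq is an assumed threshold, not a recommended value.
    """
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    gq = values.get("GQ", ".")
    return (
        values.get("GT") == "1/1"
        and values.get("FI") == "1"
        and gq != "."
        and float(gq) >= min_gq
    )


def filter_sample(lines, sample="CASTEiJ", min_gq=99):
    """Yield the VCF records in which the named sample passes the test."""
    col = None
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            col = fields.index(sample)  # locate the sample column by name
            continue
        if col is not None and sample_passes(fields[8], fields[col], min_gq):
            yield line


# Toy input with two samples, using values from the record above.
header = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
     "AJ", "CASTEiJ"]
)
record = "\t".join(
    ["1", "185273864", ".", "T", "G,A", "208.22", "StrandBias", "DP=1083",
     "GT:GQ:DP:SP:PL:FI", "0/0:.:43:3:0,.,.,.,.,.:1",
     "1/1:99:59:38:255,0,107,.,.,.:1"]
)
kept = list(filter_sample([header, record]))
print(len(kept))
```

For the real file, the lines could be read with `gzip.open(path, "rt")`. Note that this test only matches homozygotes for the first ALT allele (1/1); a 2/2 genotype would need an extra condition.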
Many thanks, Alex