**rururara** · 04-19-2011, 06:56 PM

Hi all,

I am trying to validate my SNPs in VCF format using vcf-validator,
but it prints the following:

Expected GT as the first genotype field at Chr12:15841735
Expected GT as the first genotype field at Chr12:17651041
Expected GT as the first genotype field at Chr12:17804331
Expected GT as the first genotype field at Chr12:18935754
Expected GT as the first genotype field at Chr12:19270259
Expected GT as the first genotype field at Chr12:19878395
Expected GT as the first genotype field at Chr12:19951137

Can somebody explain what this means? I have tried to find the information, but it is still not clear to me.
Is this an error?

Thanks!

**Michael.James.Clark** · 04-22-2011, 01:03 PM

Hi rururara

Read the VCF4 specs here:

http://www.1000genomes.org/node/101

"If genotype information is present, then the same types of data must be present for all samples. First a FORMAT field is given specifying the data types and order. This is followed by one field per sample, with the colon-separated data in this field corresponding to the types specified in the format. The first sub-field must always be the genotype (GT)."

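The rule quoted above can be checked line by line. This is only a minimal sketch, assuming tab-separated VCF data lines; the function name `format_starts_with_gt` and the two example records are made up for illustration (only the position Chr12:15841735 comes from the validator output above).

```python
def format_starts_with_gt(vcf_line: str) -> bool:
    """Return True if the line has no FORMAT column or FORMAT starts with GT."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        # Only the 8 fixed columns are present, so there is nothing to check.
        return True
    # Column 9 (index 8) is FORMAT; its sub-fields are colon-separated.
    return fields[8].split(":")[0] == "GT"


# Hypothetical records: the second puts PL before GT, which triggers the message.
good = "Chr12\t15841735\t.\tA\tG\t50\t.\tDP=10\tGT:PL:GQ\t0/1:44,0,248:47"
bad = "Chr12\t15841735\t.\tA\tG\t50\t.\tDP=10\tPL:GT:GQ\t44,0,248:0/1:47"

print(format_starts_with_gt(good), format_starts_with_gt(bad))
```

If a check like this fails on your file, the FORMAT column (and every sample column with it) most likely lists another tag before GT.
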
**blackgore** · 05-03-2011, 10:54 AM

Hi,

At the risk of appearing quite dumb (this is admittedly my first day using the VCF format and vcftools), I'd be very grateful if anyone can help! For reference, I'm using Illumina reads, samtools v0.1.13, and vcftools 0.1.15. The BAM alignment was created using BWA.

The VCF file was generated with the following:

samtools view -b [bamFile] [regions of interest] | samtools mpileup -uf [reference genome] - | bcftools view -vcgAN - > variants.raw.bcf

and validated using

vcf-validator variants.raw.bcf

My questions (so far) are:

1) How can I get AF (allele frequency) values into the VCF file? Whatever I've tried, they do not appear in the output.

##fileformat=VCFv4.1
##samtoolsVersion=0.1.13 (r926:134)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability that sample chromosomes are not all the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">
##INFO=<ID=CI95,Number=2,Type=Float,Description="Equal-tail Bayesian credible interval of the site allele frequency at the 95% level">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT -
chr2.fa 102741718 . A G,C,T 14.2 . DP=5511;AF1=0.5;CI95=0.5,0.5;DP4=2024,2411,432,528;MQ=58;FQ=17.1;PV4=0.75,3.5e-06,0.0085,1 GT:PL:GQ 0/1:44,0,248,255,255,255,255,255,255,255:47
chr2.fa 102742168 . T C,G,A 143 . DP=5872;AF1=0.5;CI95=0.5,0.5;DP4=1633,2355,778,917;MQ=59;FQ=146;PV4=0.0006,1,2.2e-21,1 GT:PL:GQ 0/1:173,0,255,255,255,255,255,255,255,255:99
chr2.fa 102745722 . AAC AACNAC 217 . INDEL;DP=2851;AF1=0.5;CI95=0.5,0.5;DP4=357,8,604,10;MQ=59;FQ=217;PV4=0.62,1,0.0028,0.12 GT:PL:GQ 0/1:255,0,255:99
chr2.fa 102748084 . G A,X 5.46 . DP=22;AF1=0.4999;CI95=0.5,0.5;DP4=10,6,2,3;MQ=59;FQ=7.8;PV4=0.61,0.00013,0.036,1 GT:PL:GQ 0/1:34,0,255,82,255,255:37
chr2.fa 102748094 . A G,X 21 . DP=23;AF1=0.5;CI95=0.5,0.5;DP4=13,5,2,3;MQ=59;FQ=24;PV4=0.3,0.00017,0.028,1 GT:PL:GQ 0/1:51,0,255,105,255,255:54
chr2.fa 102750361 . CAAAAAAAAAAAAA CAAAAAAAAAAA,CAAAAAAAAAAAAAA 29 . INDEL;DP=38;AF1=1;CI95=0.5,1;DP4=10,0,20,4;MQ=54;FQ=-46.5;PV4=0.3,1,1,1 GT:PL:GQ 1/1:69,12,0,83,39,73:72
chr2.fa 102750722 . GTG GTGTTCTCTG,GTTCTCTG 217 . INDEL;DP=3246;AF1=0.5;CI95=0.5,0.5;DP4=374,410,913,997;MQ=42;FQ=217;PV4=0.97,1,0,1 GT:PL:GQ 0/1:255,0,255,255,255,255:99
chr2.fa 102751604 . GTTTTTTTT GTTTTTTT,GTTTTTTTTT 56.5 . INDEL;DP=5535;AF1=1;CI95=1,1;DP4=456,504,1961,2049;MQ=59;FQ=-246;PV4=0.45,1,5.9e-10,0.24 GT:PL:GQ 1/1:97,211,0,244,255,123:99

2) The validator script's output (snippet below) gives me a number of issues, which I'd like to sort out. How do I add the missing header information? Without knowing enough about the vcf format just yet, I'm shying away from manually opening the file and adding the information. Also, does the 'X' allele signify a problem, or is it just an "unknown" allele (but why would it be?)?

The header tag 'reference' not present. (Not required but highly recommended.)
The header tag 'contig' not present for CHROM=chr2.fa. (Not required but highly recommended.)
chr2.fa:102740231 .. Could not parse the allele(s) [X]
chr2.fa:102740459 .. Could not parse the allele(s) [X]
chr2.fa:102748084 .. Could not parse the allele(s) [X]
chr2.fa:102748094 .. Could not parse the allele(s) [X]

**ketan_bnf** · 05-03-2011, 08:58 PM

Hi blackgore,

For your first question about allele frequency: in the VCF file, AF1 gives the allele frequency for the corresponding ALT base.

Look at the header comments at the start of the VCF file:

##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">

chr2.fa 102741718 . A G,C,T 14.2 . DP=5511;AF1=0.5;CI95=0.5,0.5;DP4=2024,2411,432,528;MQ=58;FQ=17.1;PV4=0.75,3.5e-06,0.0085,1 GT:PL:GQ 0/1:44,0,248,255,255,255,255,255,255,255:47

For your second question, I am not sure.

**blackgore** · 05-04-2011, 01:32 AM

Hi ketan,

thanks for your fast response! Unfortunately the AF1 tag is not very informative to me, as it's only for the first ALT allele, and pretty much always 1.0 or 0.5. The AF tag, as I understand it, gives proportional representation of all the called alleles, which is what I'm really after.

**marcela** · 05-09-2011, 01:53 AM

Hi!

I have the following PLs:

REF=A
ALT=C,G

PL=159,39,137,0,6,137

P(D|AA)=10^{-15.9}
P(D|AC)=10^{-3.9}
P(D|CC)=10^{-13.7}
P(D|AG)=1
P(D|CG)=10^{-0.6}
P(D|GG)=10^{-13.7}

From this I assumed the genotype would be AG. However, looking at the alignment:
A CCCgCcCCcCCCCCCCccccc

I would think it is AC instead. Is the order of the genotypes calculated in a different way?
How do I assign the order for:
REF=G
ALT=T,C,A
PL=236,157,228,235,0,131,138,225,224,232

Thanks!

**AKilleen** · 05-09-2011, 06:20 AM

Bovine snps in vcf format

Hi Ketan/everyone,

I'm wondering whether anybody could point me in the direction of known bovine SNPs in VCF format?

**ashrafi_h** · 08-23-2011, 09:28 AM

VCF file Allele composition

Hi,

In the old output of the pileup command we could calculate, or at least see, the allele composition of the reads at each position. For instance, if the ref base is A and the reads are ......,,,,,,.T...., that meant 18 A and one T in the reads.

How can we get the same information from a VCF file? Having the depth is of little use without knowing which alleles make it up.

**swbarnes2** · 08-23-2011, 09:47 AM

The DP4 value tells you how many high-quality reads, across all samples in the VCF:
1) match reference, in the forward direction
2) match reference, in the reverse direction
3) match alternate, in the forward direction
4) match alternate, in the reverse direction

The DP includes all the reads, and the DP4 filters poor quality ones, so the sum of the DP4 can be less than the DP value.
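
To turn DP4 into a rough reference/alternate breakdown, something like the following could work. It is a sketch only: the helper names `parse_dp4` and `alt_fraction` are invented here, and the INFO string is copied from one of the records posted earlier in this thread. Note that DP4 carries a single combined alternate count, so it cannot separate several ALT alleles at a multi-allelic site.

```python
def parse_dp4(info: str):
    """Return (ref_fwd, ref_rev, alt_fwd, alt_rev) from an INFO string, or None."""
    for item in info.split(";"):
        if item.startswith("DP4="):
            ref_fwd, ref_rev, alt_fwd, alt_rev = map(int, item[4:].split(","))
            return ref_fwd, ref_rev, alt_fwd, alt_rev
    return None  # this record carries no DP4 tag


def alt_fraction(info: str):
    """Fraction of high-quality reads that support a non-reference allele."""
    dp4 = parse_dp4(info)
    if dp4 is None:
        return None
    ref = dp4[0] + dp4[1]
    alt = dp4[2] + dp4[3]
    total = ref + alt
    return alt / total if total else None


# INFO column of the chr2.fa:102741718 record shown earlier in the thread.
info = ("DP=5511;AF1=0.5;CI95=0.5,0.5;DP4=2024,2411,432,528;"
        "MQ=58;FQ=17.1;PV4=0.75,3.5e-06,0.0085,1")

print(parse_dp4(info), alt_fraction(info))
```

For that record the counts are 4435 reference and 960 alternate reads, so the alternate fraction is about 0.18, even though the site is called 0/1.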

**curious_mapper** · 10-26-2011, 02:36 PM

Hi ashrafi_h

Did you find an answer to your question? I stumbled upon this post while trying to understand the VCF file in detail, and I am looking for exactly the same thing: how to get the allele composition and frequency information from the VCF file.

**marcela** · 10-26-2011, 10:54 PM

Hi there!

I guess you can get that info from the BaseCounts or AD annotations:

chr1 724189 . G A 52.24 .

AB=0.500
AC=1
AF=0.50
AN=2
BaseCounts=3,0,3,0
BaseQRankSum=-1.537
DB
DP=6
QD=8.71 . . .

GT:AD:DP:GQ:PL 0/1:3,3:6:82.23:82,0,105

If you don't have this info, you could annotate your SNVs with GATK
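
If the AD tag is present, the per-allele read counts can be pulled out of the sample column roughly like this. This is a sketch under assumptions: the function names `allele_depths` and `allele_fractions` are made up, and it expects AD to list the reference count first, followed by one count per ALT allele.

```python
def allele_depths(format_field: str, sample_field: str):
    """Return the AD values of one sample as a list of ints, or None if missing."""
    record = dict(zip(format_field.split(":"), sample_field.split(":")))
    ad = record.get("AD", ".")
    if ad in (".", ""):
        return None  # no AD tag, or AD is marked as missing
    return [int(x) for x in ad.split(",")]


def allele_fractions(depths):
    """Convert per-allele depths into fractions of the total depth."""
    total = sum(depths)
    return [d / total for d in depths] if total else None


# FORMAT and sample columns from the record above.
depths = allele_depths("GT:AD:DP:GQ:PL", "0/1:3,3:6:82.23:82,0,105")
print(depths, allele_fractions(depths))
```

Here the sample has 3 reads for G and 3 for A, matching AB=0.500 in the INFO column.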

**curious_mapper** · 10-27-2011, 08:13 AM

Thanks marcela, but my vcf file doesn't seem to have the AD tag information. I called the SNPs using samtools mpileup on the CLC generated alignments. Is that information suppressed somewhere while generating the SNPs?

Here is an example SNP from the vcf file:

BACT_1513|gi|293366021|ref|NZ_GG749271.1| 97966 . C A,G,T 66 . DP=35;VDB=0.0042;AF1=1;AC1=2;DP4=0,0,7,25;MQ=31;FQ=-82 GT:PL:GQ 1/1:182,138,83,107,0,82,125,29,14,107:99

**curious_mapper** · 11-15-2011, 11:41 AM

Hi marcela,

I don't know if you were able to figure this out, but I thought I'd write down the order as an exercise. For REF=G and ALT=T,C,A it is:

GG,GT,TT,GC,TC,CC,GA,TA,CA,AA

Karthik
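
The same ordering can be generated for any number of alleles. The sketch below assumes the file follows the VCF specification, where the genotype j/k (with j <= k) sits at index k*(k+1)/2 + j; the function name `pl_genotype_order` is invented for illustration, and a caller that orders its PL values differently would need a different loop.

```python
def pl_genotype_order(ref: str, alts: list) -> list:
    """List diploid genotypes in VCF order: index of j/k is k*(k+1)/2 + j."""
    alleles = [ref] + list(alts)
    order = []
    for k in range(len(alleles)):
        for j in range(k + 1):
            order.append(alleles[j] + alleles[k])
    return order


# Second example from marcela's post: REF=G, ALT=T,C,A.
order = pl_genotype_order("G", ["T", "C", "A"])
pl = [236, 157, 228, 235, 0, 131, 138, 225, 224, 232]

# The most likely genotype is the one whose PL is 0 (the minimum).
best = order[pl.index(min(pl))]
print(order)
print(best)
```

Under that assumption the PL of 0 in this example falls on TC, and for REF=A, ALT=C,G with PL=159,39,137,0,6,137 it falls on AG.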

**wanguan2000** · 12-29-2011, 11:24 PM

GQ The Genotype Quality calculation

Originally posted by ketan_bnf

chr1 10740313 . A G 188.30 PASS AC=2;AF=1.00;AN=2;DP=11;Dels=0.00;HRun=1;HaplotypeScore=6.9635;MQ=26.82;MQ0=0;QD=17.12;SB=-72.04;sumGLbyD=20.12 GT:AD:DP:GQ:PL 1/1:1,10:7:21.05:221,21,0

Here PL is 221,21,0

according to samtools mpileup page

PL means SAMtools/BCFtools writes genotype likelihoods in the PL format which is a comma delimited list of phred-scaled data likelihoods of each possible genotype.

P(D|AA) = 10^(-2.21) = 0.006
P(D|AG) = 10^(-0.21) = 0.617
P(D|GG) = 10^(0) = 1

So does this mean the genotype is GG for this SNP?

And thanks for AD and DP, now I understand them.

GQ:21.05
PL:221,21,0
You made a calculation error: each PL value has to be divided by 10 before it is used as the exponent.

P(D|AA) = 10^(-22.1) = 7.943282e-23
P(D|AG) = 10^(-2.1) = 0.007943282
P(D|GG) = 10^(0) = 1
1 - 1/(1+7.943282e-23+0.007943282) = 0.007880684
GQ = -10*log(0.007880684,10) = 21.03436
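
The same arithmetic can be written as a small function. This is a sketch that only reproduces the calculation above (likelihoods normalised with a flat prior); the name `gq_from_pl` is made up, and since the GQ reported in the record is 21.05, the caller's own calculation presumably differs slightly.

```python
import math


def gq_from_pl(pl):
    """Phred-scaled probability that the most likely genotype is wrong."""
    # PL is -10*log10 of the scaled likelihood, so divide by 10 in the exponent.
    likelihoods = [10 ** (-p / 10) for p in pl]
    total = sum(likelihoods)
    # Probability mass on every genotype other than the best one.
    p_wrong = (total - max(likelihoods)) / total
    if p_wrong <= 0:
        return math.inf
    return -10 * math.log10(p_wrong)


print(round(gq_from_pl([221, 21, 0]), 2))  # 21.03
```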

**aforntacc** · 07-12-2012, 12:38 AM

Hello everyone, I need some help, please. I am struggling to understand what to do in my analysis. I have variant calls in VCF format:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 110506_SN132_A_s_1_seq 110506_SN132_A_s_2_seq_ 110506_SN132_A_s_3_seq 110506_SN132_A_s_4_seq_ 110616_SN365_A_s_5_seq_ 110616_SN365_A_s_6_seq_
chr1 11433 . T C 11.4 AltSup AC1=12;AF1=1;DP4=0,0,1,1;DP=66;FQ=-26.9;MQ=39;MfGt=0/1;MinDP=0;NeqMfGt=2 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:29,3,0:1:0:5 1/1:15,3,0:1:0:5 0/1:0,0,0:0:0:3 0/1:0,0,0:0:0:3 0/1:0,0,0:0:0:3
I have six genotype columns (1-3 are wild-type libraries and 4-6 are mutant libraries). I have read the VCF documentation but am still struggling to understand my data, because I want to compare WT and MT.
Thanks
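
One possible starting point is to split each record into per-sample fields and compare the two groups. The sketch below assumes tab-separated columns with the first three samples being wild-type and the last three mutant, as described above; the function names are invented, and skipping samples with DP=0 as uninformative is an assumption of this sketch rather than a rule of the format.

```python
def sample_calls(header_line: str, data_line: str) -> dict:
    """Map each sample name to a dict of its FORMAT keys and values."""
    names = header_line.lstrip("#").rstrip("\n").split("\t")[9:]
    fields = data_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    return {name: dict(zip(keys, value.split(":")))
            for name, value in zip(names, fields[9:])}


def covered_genotypes(calls: dict, names: list) -> list:
    """Genotypes of the named samples, skipping samples with no read depth."""
    result = []
    for name in names:
        depth = calls[name].get("DP", "0")
        if depth.isdigit() and int(depth) > 0:
            result.append(calls[name]["GT"])
    return result


samples = ["110506_SN132_A_s_1_seq", "110506_SN132_A_s_2_seq_",
           "110506_SN132_A_s_3_seq", "110506_SN132_A_s_4_seq_",
           "110616_SN365_A_s_5_seq_", "110616_SN365_A_s_6_seq_"]
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT"] + samples)
record = "\t".join([
    "chr1", "11433", ".", "T", "C", "11.4", "AltSup",
    "AC1=12;AF1=1;DP4=0,0,1,1;DP=66;FQ=-26.9;MQ=39;MfGt=0/1;MinDP=0;NeqMfGt=2",
    "GT:PL:DP:SP:GQ",
    "0/1:0,0,0:0:0:3", "1/1:29,3,0:1:0:5", "1/1:15,3,0:1:0:5",
    "0/1:0,0,0:0:0:3", "0/1:0,0,0:0:0:3", "0/1:0,0,0:0:0:3",
])

calls = sample_calls(header, record)
wild_type = covered_genotypes(calls, samples[:3])
mutant = covered_genotypes(calls, samples[3:])
print(wild_type, mutant)
```

For this record, only two wild-type samples have any reads (one each), and none of the mutant samples do, so the site cannot really distinguish WT from MT.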
