 05-09-2011, 01:53 AM #21 marcela Junior Member   Location: Sweden Join Date: Feb 2011 Posts: 7
Hi! I have the following PLs:

REF=A ALT=C,G PL=159,39,137,0,6,137

P(D|AA)=10^{-15.9}
P(D|AC)=10^{-3.9}
P(D|CC)=10^{-13.7}
P(D|AG)=1
P(D|CG)=10^{-0.6}
P(D|GG)=10^{-13.7}

From this I assumed the genotype would be AG. However, looking at the alignment:

A CCCgCcCCcCCCCCCCccccc

I would think it is AC instead. Is the order of the genotypes calculated in a different way? How do I assign the order for:

REF=G ALT=T,C,A PL:236,157,228,235,0,131,138,225,224,232

Thanks!
Last edited by marcela; 05-10-2011 at 12:26 AM.
 05-09-2011, 06:20 AM #22 AKilleen Guest   Posts: n/a
Bovine SNPs in VCF format
Hi Ketan/everyone, I'm just wondering whether anybody could point me in the direction of known bovine SNPs in VCF format?
 08-23-2011, 09:28 AM #23 ashrafi_h Junior Member   Location: Davis Join Date: Jan 2010 Posts: 7
VCF file allele composition
Hi, in the old output of the pileup command we could calculate, or at least see, the allele composition of the reads at each position. For instance, if the ref base is A and the reads are ......,,,,,,.T.... that meant 17 A and one T in the reads. How can we get the same information from a VCF file? Having the depth is of little use without knowing which alleles make it up.
The DP4 value tells you how many high-quality reads, summed across all samples in the VCF:
1) match reference, in the forward direction
2) match reference, in the reverse direction
3) match alternate, in the forward direction
4) match alternate, in the reverse direction

The DP includes all the reads, and the DP4 filters poor quality ones, so the sum of the DP4 can be less than the DP value.
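
As a rough illustration of the DP4 description above, here is a small Python sketch that pulls DP and DP4 out of an INFO string and sums the reference and alternate counts. The helper name `parse_dp4` is made up for this example, and the INFO string is taken from the samtools mpileup record posted later in this thread.

```python
def parse_dp4(info):
    """Return (DP, DP4 counts as a dict) from a semicolon-separated VCF INFO string."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value  # flag-style entries without '=' get an empty value
    names = ("ref_fwd", "ref_rev", "alt_fwd", "alt_rev")
    dp4 = dict(zip(names, (int(x) for x in fields["DP4"].split(","))))
    return int(fields["DP"]), dp4


info = "DP=35;VDB=0.0042;AF1=1;AC1=2;DP4=0,0,7,25;MQ=31;FQ=-82"
dp, dp4 = parse_dp4(info)
ref_reads = dp4["ref_fwd"] + dp4["ref_rev"]  # high-quality reads matching REF
alt_reads = dp4["alt_fwd"] + dp4["alt_rev"]  # high-quality reads matching any ALT
print(dp, ref_reads, alt_reads)
```

For this record the four DP4 counts add up to 32, which is below DP=35, consistent with DP4 counting only the reads that pass the quality filter. Note that DP4 lumps all alternate alleles together, so it does not say which ALT base each read carries.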

Hi ashrafi_h

Did you find an answer to your question? I stumbled upon this post while trying to understand the VCF file in detail, and I am looking for exactly this: how to get the allele composition (frequency) information from the VCF file.

 10-26-2011, 10:54 PM #26 marcela Junior Member   Location: Sweden Join Date: Feb 2011 Posts: 7
Hi there! I guess you can get that info from the BaseCounts or AD fields:

chr1 724189 . G A 52.24 . AB=0.500 AC=1 AF=0.50 AN=2 BaseCounts=3,0,3,0 BaseQRankSum=-1.537 DB DP=6 QD=8.71 . . . GT:AD:DP:GQ:PL 0/1:3,3:6:82.23:82,0,105

If you don't have this info, you could annotate your SNVs with GATK.
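
To make the AD suggestion concrete, here is a minimal Python sketch that pairs the FORMAT keys with one sample column and reads the per-allele depths. The helper name `sample_fields` is hypothetical; the FORMAT string (GT:AD:DP:GQ:PL), the sample values, and the REF/ALT bases are those of the record in the post above.

```python
def sample_fields(format_col, sample_col):
    """Pair each FORMAT key with the matching value in one sample column."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


call = sample_fields("GT:AD:DP:GQ:PL", "0/1:3,3:6:82.23:82,0,105")
ad = [int(x) for x in call["AD"].split(",")]  # one count per allele: REF first, then each ALT
alleles = ["G", "A"]                          # REF and ALT of the example record
counts = dict(zip(alleles, ad))
print(counts)
```

This only works when the caller actually wrote an AD field; a file without AD in its FORMAT column would raise a `KeyError` here.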
 10-27-2011, 08:13 AM #27 curious_mapper Junior Member   Location: St Louis Join Date: Feb 2010 Posts: 4
Thanks marcela, but my VCF file doesn't seem to have the AD tag. I called the SNPs using samtools mpileup on the CLC-generated alignments. Is that information suppressed somewhere while generating the SNPs? Here is an example SNP from the VCF file:

BACT_1513|gi|293366021|ref|NZ_GG749271.1| 97966 . C A,G,T 66 . DP=35;VDB=0.0042;AF1=1;AC1=2;DP4=0,0,7,25;MQ=31;FQ=-82 GT:PL:GQ 1/1:182,138,83,107,0,82,125,29,14,107:99
Hi marcela,

I don't know if you were able to figure this out, but I thought I'd write down the order as an exercise.

GG,GT,TT,GC,TC,CC,GA,TA,CA,AA

Karthik
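
The order written out above (for REF=G, ALT=T,C,A) matches the rule in the VCF specification, where the diploid genotype j/k with j <= k sits at index k*(k+1)/2 + j. A short Python sketch of that rule follows; the function name `genotype_order` is just for illustration, and it assumes the file really follows the specification's ordering, which may not hold for every tool version.

```python
def genotype_order(ref, alts):
    """List diploid genotypes in VCF PL/GL order: loop over k, then over j <= k."""
    alleles = [ref] + list(alts)
    order = []
    for k in range(len(alleles)):
        for j in range(k + 1):
            # genotype j/k lands at index k*(k+1)//2 + j
            order.append(alleles[j] + alleles[k])
    return order


order = genotype_order("G", ["T", "C", "A"])
print(",".join(order))

# Pair the PL values from the question with the genotypes and pick the smallest PL
pls = [236, 157, 228, 235, 0, 131, 138, 225, 224, 232]
best = order[pls.index(min(pls))]
print(best)
```

Under that ordering, the PL of 0 in the second example falls on the fifth genotype, TC.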

GQ The Genotype Quality calculation

 chr1 10740313 . A G 188.30 PASS AC=2;AF=1.00;AN=2;DP=11;Dels=0.00;HRun=1;HaplotypeScore=6.9635;MQ=26.82;MQ0=0;QD=17.12;SB=-72.04;sumGLbyD=20.12 GT:AD:DP:GQ:PL 1/1:1,10:7:21.05:221,21,0

Here PL is 221,21,0. According to the samtools mpileup page, SAMtools/BCFtools writes genotype likelihoods in the PL format, which is a comma-delimited list of phred-scaled data likelihoods of each possible genotype.

P(D|AA) = 10^(-2.21) = 0.006
P(D|AG) = 10^(-0.21) = 0.617
P(D|GG) = 10^(0) = 1

So does this mean the genotype is GG for this SNP? And thanks for AD and DP, now I understand them.

GQ:21.05
PL:221,21,0

P(D|AA) = 10^(-22.1) = 7.943282e-23
P(D|AG) = 10^(-2.1) = 0.007943282
P(D|GG) = 10^(0) = 1
1 - 1/(1+7.943282e-23+0.007943282) = 0.007880684
GQ = -10*log(0.007880684,10) = 21.03436
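
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation as worked through in this post (PLs converted to likelihoods with 10^(-PL/10), a flat prior over genotypes, then the phred-scaled probability that the best genotype is wrong); the function name `gq_from_pl` is made up, and a real caller may apply priors or caps that are not modelled here.

```python
import math


def gq_from_pl(pls):
    """Phred-scaled probability that the most likely genotype is wrong (flat prior)."""
    likelihoods = [10 ** (-pl / 10) for pl in pls]
    total = sum(likelihoods)
    p_wrong = (total - max(likelihoods)) / total  # same as 1 - best/total
    if p_wrong <= 0:
        return float("inf")  # other genotypes underflowed to zero
    return -10 * math.log10(p_wrong)


gq = gq_from_pl([221, 21, 0])
print(round(gq, 2))
```

The result is about 21.03, close to the GQ of 21.05 reported in the record.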

 07-12-2012, 12:38 AM #30 aforntacc Member   Location: italy Join Date: Jun 2011 Posts: 48
Hello everyone, I need help. I am struggling to understand what to do in my analysis. I have variant calls in VCF format:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 110506_SN132_A_s_1_seq 110506_SN132_A_s_2_seq_ 110506_SN132_A_s_3_seq 110506_SN132_A_s_4_seq_ 110616_SN365_A_s_5_seq_ 110616_SN365_A_s_6_seq_
chr1 11433 . T C 11.4 AltSup AC1=12;AF1=1;DP4=0,0,1,1;DP=66;FQ=-26.9;MQ=39;MfGt=0/1;MinDP=0;NeqMfGt=2 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:29,3,0:1:0:5 1/1:15,3,0:1:0:5 0/1:0,0,0:0:0:3 0/1:0,0,0:0:0:3 0/1:0,0,0:0:0:3

I have 6 genotype columns, corresponding to the wildtype libraries (1-3) and the mutant libraries (4-6). I have read the VCF documentation but am still struggling to understand my data, because I want to compare the difference between WT and MT. Thanks.
 07-12-2012, 12:44 AM #31 laura Senior Member   Location: Cambridge UK Join Date: Sep 2008 Posts: 151
What aspects of the format are you struggling with? The genotypes are shown as index values into the Ref/Alt columns, so in your case T is 0 and C is 1. This gives you your six genotypes for this site as T/C, C/C, C/C, T/C, T/C, T/C.
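
The index-to-base translation described above can be sketched in Python as follows. The function name `decode_gt` is hypothetical, and the genotype strings are the GT values from the example record (REF=T, ALT=C).

```python
def decode_gt(gt, ref, alts):
    """Translate an index genotype such as '0/1' into bases such as 'T/C'."""
    alleles = [ref] + alts.split(",")  # index 0 is REF, 1.. are the ALT alleles
    sep = "|" if "|" in gt else "/"    # phased genotypes use '|'
    # '.' marks a missing allele and is passed through unchanged
    return sep.join(a if a == "." else alleles[int(a)] for a in gt.split(sep))


samples = ["0/1", "1/1", "1/1", "0/1", "0/1", "0/1"]
decoded = [decode_gt(gt, "T", "C") for gt in samples]
print(decoded)
```

The same function handles multi-allelic sites, since ALT is split on commas before indexing.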
Ok, thank you very much. At first I was reluctant to analyse this part of the data, but when I saw the previous threads on this website I was encouraged, so thanks once again.

What use are these fields (AltSup AC1=12;AF1=1;DP4=0,0,1,1;DP=66;FQ=-26.9;MQ=39;MfGt=0/1;MinDP=0;NeqMfGt=2) for my analysis, since I am only interested in the SNPs and indels that pass the filtering criteria and in their differences among my libraries, not relative to the reference?
Secondly, is the way you interpreted the GT index true for all sites that pass the quality criteria?
Thank you.

 07-12-2012, 01:40 AM #33 laura Senior Member   Location: Cambridge UK Join Date: Sep 2008 Posts: 151
Those fields will be determined by whatever analysis package you used to generate your VCF file. Some of them might be standard fields, which are all explained in the VCF documentation: http://www.1000genomes.org/wiki/Anal...mat-version-41
Thanks a lot, Laura.
I am only interested in the differences between the WT and MT, from which I will select candidate regions. I would be more than happy if you could point me in the right direction; this is my very first time handling this kind of data.
Thanks.

 07-12-2012, 04:14 AM #35 laura Senior Member   Location: Cambridge UK Join Date: Sep 2008 Posts: 151
If you want to know the difference between your WT and MT individuals, you need to compare their genotypes.
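
One possible way to compare genotypes between the two groups, sketched in Python. Everything here is an assumption made for illustration: the first three sample columns are taken to be WT and the last three MT (as described earlier in the thread), calls with DP=0 are treated as uninformative and skipped, and a site counts as "different" only when the two groups share no genotype. The helper name `informative_gts` is made up.

```python
def informative_gts(format_col, sample_cols):
    """Collect GT values from samples that have at least one read (DP > 0)."""
    keys = format_col.split(":")
    gts = set()
    for col in sample_cols:
        call = dict(zip(keys, col.split(":")))
        if int(call["DP"]) > 0:
            gts.add(call["GT"])
    return gts


fmt = "GT:PL:DP:SP:GQ"
cols = ["0/1:0,0,0:0:0:3", "1/1:29,3,0:1:0:5", "1/1:15,3,0:1:0:5",
        "0/1:0,0,0:0:0:3", "0/1:0,0,0:0:0:3", "0/1:0,0,0:0:0:3"]
wt = informative_gts(fmt, cols[:3])
mt = informative_gts(fmt, cols[3:])
# A site is a candidate only if both groups have data and share no genotype
differs = bool(wt) and bool(mt) and wt.isdisjoint(mt)
print(wt, mt, differs)
```

At the example site (chr1 11433) none of the three mutant libraries has any reads, so under these assumptions the site cannot separate WT from MT even though the GT column shows 0/1 for them.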
 07-17-2012, 01:26 AM #36 aforntacc Member   Location: italy Join Date: Jun 2011 Posts: 48
Thank you Laura, gradually I am making progress. I want to ask, though I don't know if this is a stupid question: if I want to decode the GT index for all SNPs that pass the filter criteria, how can I do that? Specifically, do I have to do this with the VCF tools (decode genotype) using the PERL5LIB environment, or something else? I am a bit confused. Thanks a lot.
 07-17-2012, 01:29 AM #37 laura Senior Member   Location: Cambridge UK Join Date: Sep 2008 Posts: 151
Unfortunately that is a bit of a "how long is a piece of string" question, as it very much depends on what tools/programming language you wish to use to do it. If you want a VCF file with just PASS SNPs in it, you can use the vcftools binary and its --remove-filtered-geno-all option, but if you want other info than that then it depends.
 07-31-2012, 01:53 AM #38 aforntacc Member   Location: italy Join Date: Jun 2011 Posts: 48
Hi Laura, I am still progressing little by little, but I got this error when I tried to output the VCF file with the passed SNPs (--remove-filtered-geno-all):

bilbo@ubuntu:~/vcftools_0.1.4a$ ./cpp/vcftools --vcf /media/My\ Passport/other\ analysis\ by\ fasteris/2012-02-21_GQJ-1-6_VitisVinifera_variants.vcf --remove-filtered-geno-all --out /media/My\ Passport/other\ analysis\ by\ fasteris/lagolas.vcf

VCFtools - v0.1.4
(C) Adam Auton 2009

Parameters as interpreted:
--out /media/My Passport/other analysis by fasteris/lagolas.vcf
--remove-filtered-geno-all
--vcf /media/My Passport/other analysis by fasteris/2012-02-21_GQJ-1-6_VitisVinifera_variants.vcf

Scanning /media/My Passport/other analysis by fasteris/2012-02-21_GQJ-1-6_VitisVinifera_variants.vcf ...
Error:VCF version must be v4.0:
You are using version VCFv4.1

Now I am stuck. What should I do? Thanks.
 07-31-2012, 01:58 AM #39 laura Senior Member   Location: Cambridge UK Join Date: Sep 2008 Posts: 151
It looks like you either need to investigate whether your problem can be solved with the vcftools perl scripts, or maybe change your header from VCF4.1 to VCF4.0 and see what the vcftools binary does. These questions are now most appropriate for the vcftools-help list, which you can find at http://vcftools.sourceforge.net/
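
If you want to try the header change suggested above, a minimal Python sketch is given below. It only swaps the version string on the `##fileformat` line and leaves every other line untouched; whether vcftools 0.1.4 then reads the rest of the file correctly is not something this sketch can guarantee. The function name `relabel_vcf_version` and the demo lines are made up for the example.

```python
def relabel_vcf_version(lines, old="VCFv4.1", new="VCFv4.0"):
    """Yield VCF lines with the version on the ##fileformat header line swapped."""
    for line in lines:
        if line.startswith("##fileformat="):
            line = line.replace(old, new)
        yield line


demo = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t8686\t.\tT\tC\n",
]
patched = list(relabel_vcf_version(demo))
print(patched[0].strip())
```

For a real file you would pass an open file handle instead of `demo` and write the yielded lines to a new file, keeping the original as a backup.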
 08-02-2012, 08:22 AM #40 aforntacc Member   Location: italy Join Date: Jun 2011 Posts: 48
Hi all, I don't have SNP IDs in my data; the ID column is all dots (.). Why is this? I am able to filter out the indels but not the SNPs. How can I do this? Thanks.

#CHROM POS ID REF ALT
chr1 8686 . T C
chr1 10802 . T C
chr1 10815 . A G
chr1 10836 . C A
chr1 11355 . C A
chr1 11433 . T C
chr1 11669 . ATTTT ATTTTT

