
  • #16
    Hi all,

    I am trying to validate my SNPs in VCF format using vcf-validator,
    but it prints the following:


    Expected GT as the first genotype field at Chr12:15841735
    Expected GT as the first genotype field at Chr12:17651041
    Expected GT as the first genotype field at Chr12:17804331
    Expected GT as the first genotype field at Chr12:18935754
    Expected GT as the first genotype field at Chr12:19270259
    Expected GT as the first genotype field at Chr12:19878395
    Expected GT as the first genotype field at Chr12:19951137


    Can somebody explain what this means? I tried to find information about it, but it is still not clear to me.
    Is it an error?

    Thanks!

    Comment


    • #17
      Hi rururara

      Read the VCF4 specs at 1000genomes.org:


      "If genotype information is present, then the same types of data must be present for all samples. First a FORMAT field is given specifying the data types and order. This is followed by one field per sample, with the colon-separated data in this field corresponding to the types specified in the format. The first sub-field must always be the genotype (GT)."
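The rule quoted above can be checked per line with a short script. This is only a sketch: the two example records are made up for illustration (they are not from the poster's file), and the function name is hypothetical.

```python
def format_starts_with_gt(vcf_line):
    """Return True if the FORMAT column (9th field) lists GT first."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        # No FORMAT/sample columns, so the rule does not apply.
        return True
    return fields[8].split(":")[0] == "GT"


# Hypothetical records: the first follows the rule, the second does not.
good = "Chr12\t15841735\t.\tA\tG\t50\t.\tDP=10\tGT:PL:GQ\t0/1:44,0,248:47"
bad = "Chr12\t15841735\t.\tA\tG\t50\t.\tDP=10\tPL:GT:GQ\t44,0,248:0/1:47"

print(format_starts_with_gt(good))  # True
print(format_starts_with_gt(bad))   # False
```

A record like the second one is the kind of line that would be expected to trigger "Expected GT as the first genotype field".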

      Comment


      • #18
        Hi,

        At the risk of appearing quite dumb, if anyone can help with (admittedly my first day using) the VCF format and vcftools, I'd be very grateful! For reference, I'm using Illumina reads, samtools v0.1.13, and vcftools 0.1.15. The BAM alignment was created using BWA.

        VCF file was generated by the following:

        samtools view -b [bamFile] [regions of interest] | samtools mpileup -uf [reference genome] - | bcftools view -vcgAN - > variants.raw.bcf

        and validated using

        vcf-validator variants.raw.bcf

        My questions (so far) are:

        1) How can I get AF (allele frequency) values into the VCF file? Whatever I try, the tag does not appear in the output.


        ##fileformat=VCFv4.1
        ##samtoolsVersion=0.1.13 (r926:134)
        ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
        ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
        ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
        ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability that sample chromosomes are not all the same">
        ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">
        ##INFO=<ID=CI95,Number=2,Type=Float,Description="Equal-tail Bayesian credible interval of the site allele frequency at the 95% level">
        ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
        ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
        ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
        ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
        ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
        ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
        ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
        ##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">
        #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT -
        chr2.fa 102741718 . A G,C,T 14.2 . DP=5511;AF1=0.5;CI95=0.5,0.5;DP4=2024,2411,432,528;MQ=58;FQ=17.1;PV4=0.75,3.5e-06,0.0085,1 GT:PL:GQ 0/1:44,0,248,255,255,255,255,255,255,255:47
        chr2.fa 102742168 . T C,G,A 143 . DP=5872;AF1=0.5;CI95=0.5,0.5;DP4=1633,2355,778,917;MQ=59;FQ=146;PV4=0.0006,1,2.2e-21,1 GT:PL:GQ 0/1:173,0,255,255,255,255,255,255,255,255:99
        chr2.fa 102745722 . AAC AACNAC 217 . INDEL;DP=2851;AF1=0.5;CI95=0.5,0.5;DP4=357,8,604,10;MQ=59;FQ=217;PV4=0.62,1,0.0028,0.12 GT:PL:GQ 0/1:255,0,255:99
        chr2.fa 102748084 . G A,X 5.46 . DP=22;AF1=0.4999;CI95=0.5,0.5;DP4=10,6,2,3;MQ=59;FQ=7.8;PV4=0.61,0.00013,0.036,1 GT:PL:GQ 0/1:34,0,255,82,255,255:37
        chr2.fa 102748094 . A G,X 21 . DP=23;AF1=0.5;CI95=0.5,0.5;DP4=13,5,2,3;MQ=59;FQ=24;PV4=0.3,0.00017,0.028,1 GT:PL:GQ 0/1:51,0,255,105,255,255:54
        chr2.fa 102750361 . CAAAAAAAAAAAAA CAAAAAAAAAAA,CAAAAAAAAAAAAAA 29 . INDEL;DP=38;AF1=1;CI95=0.5,1;DP4=10,0,20,4;MQ=54;FQ=-46.5;PV4=0.3,1,1,1 GT:PL:GQ 1/1:69,12,0,83,39,73:72
        chr2.fa 102750722 . GTG GTGTTCTCTG,GTTCTCTG 217 . INDEL;DP=3246;AF1=0.5;CI95=0.5,0.5;DP4=374,410,913,997;MQ=42;FQ=217;PV4=0.97,1,0,1 GT:PL:GQ 0/1:255,0,255,255,255,255:99
        chr2.fa 102751604 . GTTTTTTTT GTTTTTTT,GTTTTTTTTT 56.5 . INDEL;DP=5535;AF1=1;CI95=1,1;DP4=456,504,1961,2049;MQ=59;FQ=-246;PV4=0.45,1,5.9e-10,0.24 GT:PL:GQ 1/1:97,211,0,244,255,123:99


        2) The validator script's output (snippet below) gives me a number of issues, which I'd like to sort out. How do I add the missing header information? Without knowing enough about the vcf format just yet, I'm shying away from manually opening the file and adding the information. Also, does the 'X' allele signify a problem, or is it just an "unknown" allele (but why would it be?)?

        The header tag 'reference' not present. (Not required but highly recommended.)
        The header tag 'contig' not present for CHROM=chr2.fa. (Not required but highly recommended.)
        chr2.fa:102740231 .. Could not parse the allele(s) [X]
        chr2.fa:102740459 .. Could not parse the allele(s) [X]
        chr2.fa:102748084 .. Could not parse the allele(s) [X]
        chr2.fa:102748094 .. Could not parse the allele(s) [X]
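For the two "header tag not present" warnings, one option is to add the meta lines programmatically rather than by hand. The sketch below inserts extra `##` lines just before the `#CHROM` line. The reference path and contig length are placeholders, not values from the poster's data, and the function name is hypothetical.

```python
def add_header_lines(vcf_lines, extra_meta):
    """Insert extra '##' meta lines just before the '#CHROM' column header."""
    out = []
    inserted = False
    for line in vcf_lines:
        if line.startswith("#CHROM") and not inserted:
            out.extend(extra_meta)
            inserted = True
        out.append(line)
    return out


vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t-",
    "chr2.fa\t102748084\t.\tG\tA,X\t5.46\t.\tDP=22\tGT:PL:GQ\t0/1:34,0,255,82,255,255:37",
]
extra = [
    "##reference=file:///path/to/reference.fa",  # placeholder path
    "##contig=<ID=chr2.fa,length=1000000>",      # placeholder length
]

for line in add_header_lines(vcf, extra):
    print(line)
```

This only addresses the header warnings; it does nothing about the `X` allele messages.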

        Comment


        • #19
          Hi blackgore,

          Regarding your first question about allele frequency: in the VCF file, AF1 gives the allele frequency for the corresponding ALT base.

          See the header comments at the start of the VCF file:

          ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">

          chr2.fa 102741718 . A G,C,T 14.2 . DP=5511;AF1=0.5;CI95=0.5,0.5;DP4=2024,2411,432,528;MQ=58;FQ=17.1;PV4=0.75,3.5e-06,0.0085,1 GT:PL:GQ 0/1:44,0,248,255,255,255,255,255,255,255:47

          For your second question, I am not sure.

          Comment


          • #20
            Hi ketan,

            thanks for your fast response! Unfortunately the AF1 tag is not very informative to me, as it's only for the first ALT allele, and pretty much always 1.0 or 0.5. The AF tag, as I understand it, gives proportional representation of all the called alleles, which is what I'm really after.

            Comment


            • #21
              Hi!

              I have the following PL's

              REF=A
              ALT=C,G

              PL=159,39,137,0,6,137

              P(D|AA)=10^{-15.9}
              P(D|AC)=10^{-3.9}
              P(D|CC)=10^{-13.7}
              P(D|AG)=1
              P(D|CG)=10^{-0.6}
              P(D|GG)=10^{-13.7}

              From this I assumed the genotype would be AG. However, looking at the alignment:
              A CCCgCcCCcCCCCCCCccccc

              I would think it is AC instead. Is the order of the genotypes calculated in a different way?
              How do I assign the order for:
              REF=G
              ALT=T,C,A
              PL:236,157,228,235,0,131,138,225,224,232

              Thanks!
              Last edited by marcela; 05-10-2011, 12:26 AM.

              Comment


              • #22
                Bovine snps in vcf format

                Hi Ketan/everyone,

                I'm just wondering whether anybody could point me in the direction of known bovine SNPs in VCF format?

                Comment


                • #23
                  VCF file Allele composition

                  Hi,

                  In the old pileup format produced by the pileup command, we could calculate, or at least see, the allele composition of the reads at each position. For instance, if the ref base is A and the reads are ......,,,,,,.T...., that meant 18 A and one T in the reads.

                  How can we get the same information from a VCF file? It is useless to have the depth without knowing what is what.

                  Comment


                  • #24
                    Originally posted by ashrafi_h
                    Hi,

                    In the old pileup format produced by the pileup command, we could calculate, or at least see, the allele composition of the reads at each position. For instance, if the ref base is A and the reads are ......,,,,,,.T...., that meant 18 A and one T in the reads.

                    How can we get the same information from a VCF file? It is useless to have the depth without knowing what is what.

                    The DP4 value tells you how many high quality reads, across all samples in the vcf
                    1) match reference, in the forward direction
                    2) match reference, in the reverse direction
                    3) match alternate, in the forward direction
                    4) match alternate, in the reverse direction

                    The DP includes all the reads, and the DP4 filters poor quality ones, so the sum of the DP4 can be less than the DP value.
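As a sketch of how DP4 can be turned into allele counts, the snippet below sums the forward and reverse counts for the reference and alternate alleles. The INFO string is taken from one of the records posted earlier in this thread; the function name is hypothetical.

```python
def parse_dp4(info):
    """Return (ref_count, alt_count) from the DP4 tag of an INFO string."""
    for entry in info.split(";"):
        if entry.startswith("DP4="):
            ref_fwd, ref_rev, alt_fwd, alt_rev = map(int, entry[4:].split(","))
            return ref_fwd + ref_rev, alt_fwd + alt_rev
    return None  # no DP4 tag in this record


info = "DP=22;AF1=0.4999;CI95=0.5,0.5;DP4=10,6,2,3;MQ=59;FQ=7.8;PV4=0.61,0.00013,0.036,1"
ref, alt = parse_dp4(info)
print(ref, alt)                     # 16 5
print(round(alt / (ref + alt), 3))  # 0.238
```

Note that DP4 does not split the alternate count by allele, so at sites with several ALT alleles it gives only a combined non-reference count.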

                    Comment


                    • #25
                      Originally posted by ashrafi_h
                      Hi,

                      In the old pileup format produced by the pileup command, we could calculate, or at least see, the allele composition of the reads at each position. For instance, if the ref base is A and the reads are ......,,,,,,.T...., that meant 18 A and one T in the reads.

                      How can we get the same information from a VCF file? It is useless to have the depth without knowing what is what.

                      Hi ashrafi_h

                      Did you find an answer to your question? I stumbled upon this post while trying to understand the VCF file in detail, and I am looking for exactly this: how to get the allele composition (frequency) information from the VCF file.

                      Comment


                      • #26
                        Hi there!

                        I guess you can have that info from the BaseCounts or AD:

                        chr1 724189 . G A 52.24 .

                        AB=0.500
                        AC=1
                        AF=0.50
                        AN=2
                        BaseCounts=3,0,3,0
                        BaseQRankSum=-1.537
                        DB
                        DP=6
                        QD=8.71 . . .

                        GT:AD:DP:GQ:PL 0/1:3,3:6:82.23:82,0,105

                        If you don't have this info, you could annotate your SNVs with GATK
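When the AD tag is present, the per-allele read counts can be pulled out of the sample column by pairing it with the FORMAT column. A minimal sketch using the example record above, assuming AD holds plain integers (the function name is hypothetical):

```python
def allele_depths(format_field, sample_field):
    """Return the AD values as a list of ints, or None if AD is absent."""
    sample = dict(zip(format_field.split(":"), sample_field.split(":")))
    if "AD" not in sample:
        return None
    return [int(x) for x in sample["AD"].split(",")]


depths = allele_depths("GT:AD:DP:GQ:PL", "0/1:3,3:6:82.23:82,0,105")
print(depths)                       # [3, 3]
total = sum(depths)
print([d / total for d in depths])  # [0.5, 0.5]
```

The first value is the reference allele depth and the remaining values follow the order of the ALT column.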

                        Comment


                        • #27
                          Thanks marcela, but my vcf file doesn't seem to have the AD tag information. I called the SNPs using samtools mpileup on the CLC generated alignments. Is that information suppressed somewhere while generating the SNPs?

                          Here is an example SNP from the vcf file:

                          BACT_1513|gi|293366021|ref|NZ_GG749271.1| 97966 . C A,G,T 66 . DP=35;VDB=0.0042;AF1=1;AC1=2;DP4=0,0,7,25;MQ=31;FQ=-82 GT:PL:GQ 1/1:182,138,83,107,0,82,125,29,14,107:99

                          Comment


                          • #28
                            Originally posted by marcela
                            Hi!

                            I have the following PL's

                            REF=A
                            ALT=C,G

                            PL=159,39,137,0,6,137

                            P(D|AA)=10^{-15.9}
                            P(D|AC)=10^{-3.9}
                            P(D|CC)=10^{-13.7}
                            P(D|AG)=1
                            P(D|CG)=10^{-0.6}
                            P(D|GG)=10^{-13.7}

                            From this I assumed the genotype would be AG. However, looking at the alignment:
                            A CCCgCcCCcCCCCCCCccccc

                            I would think it is AC instead. Is the order of the genotypes calculated in a different way?
                            How do I assign the order for:
                            REF=G
                            ALT=T,C,A
                            PL:236,157,228,235,0,131,138,225,224,232

                            Thanks!

                            Hi marcela,

                            I don't know if you were able to figure this out, but I thought I'd write down the order as an exercise.

                            GG,GT,TT,GC,TC,CC,GA,TA,CA,AA

                            Karthik
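The ordering written out above can also be generated. The sketch below assumes the ordering rule from the VCF specification, where genotype j/k (with j <= k) sits at index k*(k+1)/2 + j. Whether the samtools version used in this thread followed that rule is not confirmed here. The function name is hypothetical.

```python
def pl_genotype_order(ref, alts):
    """List genotypes in the order their likelihoods appear in PL."""
    alleles = [ref] + list(alts)
    return [
        alleles[j] + alleles[k]
        for k in range(len(alleles))
        for j in range(k + 1)
    ]


print(pl_genotype_order("G", ["T", "C", "A"]))
# ['GG', 'GT', 'TT', 'GC', 'TC', 'CC', 'GA', 'TA', 'CA', 'AA']

order = pl_genotype_order("A", ["C", "G"])
pl = [159, 39, 137, 0, 6, 137]
print(order[pl.index(min(pl))])  # AG
```

Under this ordering, the PL values from the quoted post do point to AG as the most likely genotype, so the mismatch with the alignment is not explained by the ordering alone.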

                            Comment


                            • #29
                              GQ The Genotype Quality calculation

                              Originally posted by ketan_bnf
                              chr1 10740313 . A G 188.30 PASS AC=2;AF=1.00;AN=2;DP=11;Dels=0.00;HRun=1;HaplotypeScore=6.9635;MQ=26.82;MQ0=0;QD=17.12;SB=-72.04;sumGLbyD=20.12 GT:AD:DP:GQ:PL 1/1:1,10:7:21.05:221,21,0

                              Here PL is 221,21,0

                              According to the samtools mpileup page, SAMtools/BCFtools writes genotype likelihoods in the PL format, which is a comma-delimited list of Phred-scaled data likelihoods of each possible genotype.

                              P(D|AA) = 10^(-2.21) = 0.006
                              P(D|AG) = 10^(-0.21) = 0.617
                              P(D|GG) = 10^(0) = 1

                              So does this mean the genotype is GG for this SNP?

                              And thanks for AD and DP, now I understand them.

                              GQ:21.05
                              PL:221,21,0
                              You made a calculation error: PL values are Phred-scaled, so each likelihood is 10^(-PL/10).

                              P(D|AA) = 10^(-22.1) = 7.943282e-23
                              P(D|AG) = 10^(-2.1) = 0.007943282
                              P(D|GG) = 10^(0) = 1
                              1 - 1/(1+7.943282e-23+0.007943282) = 0.007880684
                              GQ = -10*log(0.007880684,10) = 21.03436
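The arithmetic in this post can be reproduced in a few lines. This is a sketch of the calculation as written here (likelihoods normalised with equal weight, then the probability that the best genotype is wrong converted to a Phred score). It is not necessarily how any particular caller computes GQ, and the function name is hypothetical.

```python
import math


def gq_from_pl(pl):
    """Phred-scaled probability that the most likely genotype is wrong."""
    likelihoods = [10 ** (-p / 10) for p in pl]
    p_wrong = 1 - max(likelihoods) / sum(likelihoods)
    if p_wrong <= 0:
        # The alternatives underflowed to zero: quality is effectively unbounded.
        return float("inf")
    return -10 * math.log10(p_wrong)


print(round(gq_from_pl([221, 21, 0]), 2))  # 21.03
```

The result is close to, but not identical to, the GQ of 21.05 stored in the record.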

                              Comment


                              • #30
                                Hello everyone, I need help, please. I am struggling to understand what to do in my analysis. I have variant-call data in VCF format:
                                #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 110506_SN132_A_s_1_seq 110506_SN132_A_s_2_seq_ 110506_SN132_A_s_3_seq 110506_SN132_A_s_4_seq_ 110616_SN365_A_s_5_seq_ 110616_SN365_A_s_6_seq_
                                chr1 11433 . T C 11.4 AltSup AC1=12;AF1=1;DP4=0,0,1,1;DP=66;FQ=-26.9;MQ=39;MfGt=0/1;MinDP=0;NeqMfGt=2 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:29,3,0:1:0:5 1/1:15,3,0:1:0:5 0/1:0,0,0:0:0:3 0/1:0,0,0:0:0:3 0/1:0,0,0:0:0:3
                                I have six genotype columns, corresponding to libraries 1-3 (wild type) and 4-6 (mutant). I have read the VCF documentation but am still struggling to understand my data, because I want to compare the differences between WT and MT.
                                Thanks
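One possible starting point for the WT versus MT comparison is to read the genotype of each sample column and ignore samples with no reads, since a GT with DP=0 carries no information. The sketch below is an illustration only: the function names are hypothetical, the columns are assumed to be tab-separated as in a real VCF, and the "differ" rule (no genotype shared between the two groups) is an assumption, not an established method.

```python
def covered_genotypes(vcf_line):
    """Map sample index -> GT, skipping samples whose DP is 0 or missing."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    calls = {}
    for i, sample in enumerate(fields[9:]):
        values = dict(zip(keys, sample.split(":")))
        if values.get("DP") in ("0", "."):
            continue  # no reads, so the genotype is not informative
        calls[i] = values["GT"]
    return calls


def groups_differ(calls, wt_idx, mt_idx):
    """True if the groups share no genotype, None if a group has no data."""
    wt = {calls[i] for i in wt_idx if i in calls}
    mt = {calls[i] for i in mt_idx if i in calls}
    if not wt or not mt:
        return None
    return wt.isdisjoint(mt)


# The record from the post, rebuilt with tab separators.
row = "\t".join([
    "chr1", "11433", ".", "T", "C", "11.4", "AltSup",
    "AC1=12;AF1=1;DP4=0,0,1,1;DP=66;FQ=-26.9;MQ=39;MfGt=0/1;MinDP=0;NeqMfGt=2",
    "GT:PL:DP:SP:GQ",
    "0/1:0,0,0:0:0:3", "1/1:29,3,0:1:0:5", "1/1:15,3,0:1:0:5",
    "0/1:0,0,0:0:0:3", "0/1:0,0,0:0:0:3", "0/1:0,0,0:0:0:3",
])

calls = covered_genotypes(row)
print(calls)                                        # {1: '1/1', 2: '1/1'}
print(groups_differ(calls, [0, 1, 2], [3, 4, 5]))   # None
```

For this particular site, none of the three mutant libraries has any coverage, so no WT versus MT comparison is possible there.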

                                Comment
