Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to generate allele depth at a specific position

    Hi all,

    I have two questions, first one is regarding allele depth which exactly is as follows,
    How can we find, the number of the reads at each variant site have a particular allele, i.e. if you have ref = C and ALT = A,G, how many of the reads have A, G and C? Is there a tool to generate this?

    The second question is,
    How can we infer a SNP to be homozygous or heterozygous from the mpileup output, vcf file?

    Could someone help me with this!!!

  • #2
    With mpileup if there are two alternate alleles I don't believe there is a way to tell how many reads go to each (from the mpileup output). You can open up the bam file in IGV and look that way, I think.

    From the mpileup output, assuming you have run it on a single individual, you can look at the "AC1=" part of the info field. If it's 1, it's heterozygous. If it's 2, it's homozygous. You can also look at the genotype field. If it's "0/1" it's heterozygous, if it's "1/1" it's homozygous.

    Comment


    • #3
      Thanks for your answer.

      I have a bam file which is pooled with 10 individuals, so in this case can we still look at the value 'AC' and decide its zygosity status or is GT the only field from where we can infer the zygosity status.

      Comment


      • #4
        I've yet to analyze pooled samples so I'm not sure. Could you paste the header lines and a couple of example lines of the called variants?

        Comment


        • #5
          Originally posted by Heisman View Post
          With mpileup if there are two alternate alleles I don't believe there is a way to tell how many reads go to each (from the mpileup output). ...
          Actually, you can get this information from mpileup. It's in the fifth column of the pileup format output. This column indicates the state of all reads at a given position. You can tell how many reads have the reference allele, how many have the alternate, and how many have something else. It's a little ugly to parse, but the information is there.

          Comment


          • #6
            Originally posted by Alex Renwick View Post
            Actually, you can get this information from mpileup. It's in the fifth column of the pileup format output. This column indicates the state of all reads at a given position. You can tell how many reads have the reference allele, how many have the alternate, and how many have something else. It's a little ugly to parse, but the information is there.
            When there are two alternate alleles, I've noticed that with DP4 reads for both alleles are lumped together, ie, there are still 4 columns instead of 6.

            Comment


            • #7
              Originally posted by Heisman View Post
              When there are two alternate alleles, I've noticed that with DP4 reads for both alleles are lumped together, ie, there are still 4 columns instead of 6.
              Hmm...I think we must be talking about different things. The default pileup output looks like (from samtools):

              Code:
              seq1 272 T 24  ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
              seq1 273 T 23  ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
              seq1 274 T 23  ,.$....,,.,.,...,,,.,...    7<7;<;<<<<<<<<<=<;<;<<6
              seq1 275 A 23  ,$....,,.,.,...,,,.,...^l.  <+;9*<<<<<<<<<=<<:;<<<<
              seq1 276 G 22  ...T,,.,.,...,,,.,....  33;+<<7=7<<7<&<<1;<<6<
              seq1 277 T 22  ....,,.,.,.C.,,,.,..G.  +7<;<<<<<<<&<=<<:;<<&<
              seq1 278 G 23  ....,,.,.,...,,,.,....^k.   %38*<<;<7<<7<=<<<;<<<<<
              seq1 279 C 23  A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
              This shows, for example, that position 279 of seq1 has reference allele C, and that 21 reads agree with the reference while one read has A and one read has T.

              Comment


              • #8
                I don't believe there is any way to get pileup outuput when using the newer version of mpileup, correct? Or am I mistaken? That's an interesting idea though to use pileup for that purpose (assuming I'm not mistaken).

                Comment


                • #9
                  VCF version 4 contains this information if you call your variants with mpileup-bcftools or the GATK UnifiedGenotyper. Have a look at the docs but note that both programs do not produce the exact same information by default. I find the GATK VCF more informative for counting allele frequency and depth.

                  Comment


                  • #10
                    Originally posted by Heisman View Post
                    I don't believe there is any way to get pileup outuput when using the newer version of mpileup, correct? Or am I mistaken? That's an interesting idea though to use pileup for that purpose (assuming I'm not mistaken).
                    Call mpileup without computing genotype (-g or -u option) and it produces pileup format. E.g., "samtools mpileup -f ref.fasta sample.bam > sample.pileup".

                    Comment


                    • #11
                      That's true, even if there are two alternate alleles it gives only four values in DP4. So we cannot decide the heterozygosity in this case. Is there a way to find this? will be very interesting for me if there is a solution!!

                      Comment


                      • #12
                        Hi, here are the few lines of the output;


                        ##fileformat=VCFv4.1
                        ##samtoolsVersion=0.1.16 (r963:234)
                        ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
                        ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
                        ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
                        ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
                        ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">
                        ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
                        ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
                        ##INFO=<ID=CI95,Number=2,Type=Float,Description="Equal-tail Bayesian credible interval of the site allele frequency at the 95% level">
                        ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
                        ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
                        ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
                        ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
                        ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
                        ##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
                        ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
                        ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
                        ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
                        ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
                        ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
                        ##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">
                        #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT alignment-sorted.bam
                        chr1 3002525 . C A 34 . DP=3;AF1=1;CI95=0.5,1;DP4=0,0,2,1;MQ=27;FQ=-36 GT:PL:GQ
                        1/1:66,9,0:16
                        chr1 3009311 . A G 63 . DP=3;AF1=1;CI95=0.5,1;DP4=0,0,2,1;MQ=42;FQ=-36 GT:PL:GQ 1/1:95,9,0:16
                        chr1 3010971 . G A 67 . DP=7;AF1=0.5;CI95=0.5,0.5;DP4=1,1,2,3;MQ=33;FQ=28;PV4=1,0.18,0.1,1 GT:PL:GQ 0/1:97,0,55:58
                        chr1 3011005 . T C 10.4 . DP=5;AF1=0.5;CI95=0.5,0.5;DP4=2,1,0,2;MQ=33;FQ=13.2;PV4=0.4,0.14,0.42,1 GT:PL:GQ 0/1:40,0,77:42
                        chr1 3011166 . T G 11.3 . DP=4;AF1=0.5001;CI95=0.5,0.5;DP4=1,1,2,0;MQ=26;FQ=6.48;PV4=1,0.21,1,1 GT:PL:GQ 0/1:41,0,33:36
                        chr1 3011231 . T A 37 . DP=3;AF1=1;CI95=0.5,1;DP4=0,0,1,2;MQ=28;FQ=-36 GT:PL:GQ 1/1:69,9,0:16
                        chr1 3030035 . T C 39 . DP=3;AF1=1;CI95=0.5,1;DP4=0,0,2,1;MQ=42;FQ=-36 GT:PL:GQ 1/1:71,9,0:16
                        chr1 3031485 . G T 48 . DP=3;AF1=1;CI95=0.5,1;DP4=0,0,3,0;MQ=42;FQ=-36 GT:PL:GQ 1/1:80,9,0:16
                        chr1 3035555 . A G 70 . DP=3;AF1=1;CI95=0.5,1;DP4=0,0,2,1;MQ=58;FQ=-36 GT:PL:GQ 1/1:102,9,0:16
                        chr1 3044119 . G C 25.3 . DP=5;AF1=0.5336;CI95=0.5,1;DP4=2,0,2,1;MQ=22;FQ=-18.1;PV4=1,1,1,0.26 GT:PL:GQ 0/1:55,0,9:12
                        chr1 3044145 . A G 75.8 . DP=6;AF1=1;CI95=0.5,1;DP4=1,0,2,3;MQ=25;FQ=-37;PV4=1,1,1,1 GT:PL:GQ 1/1:108,10,0:17
                        chr1 3044148 . A G 81.3 . DP=5;AF1=1;CI95=0.5,1;DP4=0,0,2,3;MQ=28;FQ=-42 GT:PL:GQ 1/1:114,15,0:27
                        chr1 3047201 . C T 11.1 . DP=2;AF1=1;CI95=0.5,1;DP4=0,0,2,0;MQ=33;FQ=-33 GT:PL:GQ 1/1:42,6,0:9

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM
                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        24 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        25 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        21 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        52 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X