  • Extracting the base/nucleotide across all reads at particular position/s

    Hello everyone. I will try to explain what I need with an example.

    We have a VCF file that gives the positions of some SNPs.
    Code:
    #CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
    20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 
    20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 
    20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 
    20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ
    We also have a BAM file with the alignments to the reference. The approximate depth is 20x and all reads are 454 reads. From these two files, I want to extract the base that each read carries at those particular positions. I need output something like this:

    Read Name  Position1  Position2  Position3  Position4
               14370      17330      1110696    1230237
    Read1      G          T          A          T
    Read2      A          A          G          .
    Read3      A          A          G          .
    Read4      G          A          A          T
    Read5      A          T          T          .
    Read6      A          T          T          .
    Read7      G          A          A          T
    I would like to build such a table using all reads at the specific positions. Is this possible with samtools mpileup or VCFtools, or does anyone have a script written to solve such a problem? From this table I will extract the haplotype information for specific genotypes.
    Last edited by empyrean; 05-21-2012, 09:46 AM.

  • #2
    Any suggestions / comments?


    • #3
      You can generate the consensus at each of those positions with mpileup and then look at the DP4 values to see how many bases match the reference and how many bases match a variant allele. Is that sufficient?
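For reference, a minimal sketch of reading DP4 in Python. It assumes the INFO column of the mpileup-derived VCF carries a DP4 tag (counts of ref-forward, ref-reverse, alt-forward and alt-reverse bases). The function name `dp4_counts` and the INFO string in the example are made up for illustration; the VCF shown in the first post does not contain this tag.

```python
def dp4_counts(info):
    """Return (ref_count, alt_count) from a VCF INFO string.

    Returns None when the INFO string has no DP4 tag.
    """
    for field in info.split(";"):
        if field.startswith("DP4="):
            # DP4 order: ref-forward, ref-reverse, alt-forward, alt-reverse
            ref_fwd, ref_rev, alt_fwd, alt_rev = (
                int(x) for x in field[len("DP4="):].split(",")
            )
            return ref_fwd + ref_rev, alt_fwd + alt_rev
    return None


# Hypothetical INFO value, not taken from the VCF in this thread.
example_info = "DP=14;DP4=4,3,5,2;MQ=60"
print(dp4_counts(example_info))
```

This only gives per-position totals. It does not say which read carries which allele.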


      • #4
        Thank you for the reply. Actually no, because I need the exact base from each read to build such a haplotype graph.


        • #5
          Oh, I see. In your example the positions are so far apart that that wouldn't be possible, so I didn't think it was what you actually wanted to do. I'm not sure how to do this, so maybe somebody else will chime in.


          • #6
            Thanks again. Hoping for some help here!


            • #7
              Hi empyrean,

              the GATK toolkit has a powerful Java API for extracting information from BAM files at all sorts of levels, including individual bases from individual reads - there may be some mileage in there for you:



              cheers

              Micha


              • #8
                Originally posted by mbayer
                Hi empyrean,

                the GATK toolkit has a powerful Java API for extracting information from BAM files at all sorts of levels, including individual bases from individual reads - there may be some mileage in there for you:



                cheers

                Micha
                I agree with Micha -- you can get the GATK to do that for you. You would need to modify a walker to walk the reference, get the pileup at each position you want (which you can specify in an input file) and return the results formatted as a table.
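The same idea (visit each position of interest, look at every read covering it, and print a table) can also be sketched without GATK. The following is not a GATK walker. It is a plain-Python sketch that reads SAM text lines, such as those produced by `samtools view`, and walks each read's CIGAR string to find the base aligned to a given reference position. The function names, the use of "." for "read does not cover this position" and "-" for a deletion, and the demo reads are all assumptions for illustration.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def base_at(pos, start, cigar, seq):
    """Return the read base aligned to 1-based reference position `pos`.

    Returns '-' if the read has a deletion/skip there, or None if the
    read does not cover the position at all.
    """
    ref = start  # 1-based reference coordinate of the next aligned base
    idx = 0      # 0-based index into the read sequence
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":            # consumes reference and read
            if ref <= pos < ref + length:
                return seq[idx + (pos - ref)]
            ref += length
            idx += length
        elif op in "DN":           # consumes reference only
            if ref <= pos < ref + length:
                return "-"
            ref += length
        elif op in "IS":           # consumes read only
            idx += length
        # H and P consume neither reference nor read
    return None


def build_table(sam_lines, chrom, positions):
    """Map read name -> list of bases, one per requested position."""
    table = {}
    for line in sam_lines:
        if line.startswith("@"):   # skip SAM header lines
            continue
        f = line.rstrip("\n").split("\t")
        name, flag, rname, start = f[0], int(f[1]), f[2], int(f[3])
        cigar, seq = f[5], f[9]
        if flag & 4 or rname != chrom or cigar == "*" or seq == "*":
            continue               # unmapped, other chromosome, or no data
        row = table.setdefault(name, ["."] * len(positions))
        for i, pos in enumerate(positions):
            base = base_at(pos, start, cigar, seq)
            if base is not None:
                row[i] = base
    return table


def sam(name, start, cigar, seq):
    """Build a minimal 11-column SAM line for the demo (made-up reads)."""
    return "\t".join([name, "0", "20", str(start), "60", cigar,
                      "*", "0", "0", seq, "*"])


demo_sam = [
    sam("Read1", 100, "10M", "ACGTACGTAC"),
    sam("Read2", 102, "2M2D4M", "GGTTTT"),
    sam("Read3", 105, "2S3M1I2M", "NNACGTCA"),
]
demo_positions = [103, 104, 108]
table = build_table(demo_sam, "20", demo_positions)

print("Read Name\t" + "\t".join(str(p) for p in demo_positions))
for read_name, bases in table.items():
    print(read_name + "\t" + "\t".join(bases))
```

In practice the SAM lines would come from the BAM file restricted to the regions around the VCF positions, and the positions from the POS column of the VCF. Records sharing a read name are merged into one row here, which is a simplification.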


                • #9
                  Any progress on this?

                  I'm actually trying to do something similar. Did you have any progress with GATK on this, or did you find another method? Many thanks!
