Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extract genotype consensus from BAM files at specific locations

    Hi all,

    I'm very new to this forum, and haven't looked around much to see if this has been answered already.

    But is there an easy way to extract the genotype (consensus) for a sample (across all reads) at a very specific location (not a region), given its BAM file? I know this could be very simple & possible with samtools, but my search/research hasn't been great.

    Example: The base at chr1 pos 45800 for that sample from its BAM file?
    And also based on certain coverage depth & quality (Like say similar to Q30 & DP3 in VCF language).

    To give you the context, the problem is we have analyzed SNPs for a large set of samples, using bowtie/samtools/vcftools, reporting the SNP calls in VCF format. Obviously, this captures only variants (wrt reference) individually. But, when comparing these samples across the population to define haplotype groups, using the VCF files... looking at the overlapping SNP loci, we do not know if a missing value for a sample means the genotype is same as reference or there hasn't been any coverage at all.

    Hence now, I retroactively want to fetch these missing bases for each of the samples at all the SNP loci called, for more accurate haplotype groupings.

    I only have the sorted BAM file (with index) and VCF file for each sample & reference of course. Hope someone can help.

    Thanks so much in advance

  • #2
    You can force samtools mpileup to give you an entry for every single point in your reference.

    Also, you know that if you run all your .bams together in mpileup, it will take every point where one sample has a discrepancy, and tell you what all the samples look like at that point.

    Comment


    • #3
      Originally posted by swbarnes2 View Post
      You can force samtools mpileup to give you an entry for every single point in your reference.

      Also, you know that if you run all your .bams together in mpileup, it will take every point where one sample has a discrepancy, and tell you what all the samples look like at that point.
      Thanks for your reply swbarnes2! That sounds like exactly what I would need. The calls for a set of loci across all samples & I have the files setup exactly that way. But can you delve into a bit more detail? How can I accomplish that?

      I've been looking up a lot all morning into samtools mpileup, but still couldn't coax it for my liking, although I surely know there is a solution out there! Can anyone help?

      So, I've used the -r option, and can extract a specific location precisely, although it points to REFERENCE BASE, while I need the consensus from the SAMPLE READS...

      samtools mpileup -I -Bf ref.fa -r chr1:45600-45600 -Q 30 -d 100 in.bam
      [mpileup] 1 samples in 1 input files
      <mpileup> Set max per-sample depth to 8000
      chr1 45600 T 19 CcCCccCccCC$cCcccccc IIIIIIIIIIIIIIIIIII
      Although I'm not sure if the filters work either. Thanks so much in advance guys.

      And yes, I have about 100 BAM files & I'll looking up about a million loci
      Last edited by gr8guns; 03-20-2014, 11:42 AM.

      Comment


      • #4
        Here is a previous thread on the actual commands: http://seqanswers.com/forums/showthread.php?t=28281

        Original reference for the mpileup: http://samtools.sourceforge.net/mpileup.shtml

        Comment


        • #5
          Hi folks,

          I really appreciate you bothering to reply & pointing to the commands, but my situation is a bit different & they don't work. Let me explain.

          I need to retrieve just ONE base from the sample BAM (variant or not!) at a particular location. Not a long consensus sequence (using vcfutils.pl vcf2fq), nor the SNPs only (using vcfutils.pl varFilter -D100 > var.flt.vcf)

          Extracting the consensus seq & then taking out just those loci could be an option, but would be time-consuming I fear.

          For example, I have a list of loci in a file... "SNP_loci.txt"

          Chr1 100017386
          Chr1 100019654
          Chr1 100019657
          Chr1 100019756
          Chr1 100020740
          Chr1 100022994
          Chr1 100023027
          Chr1 100030383
          Chr1 100032933
          Chr1 100033225
          ...

          And I have a bunch of samples-specific BAM files (along with their indexed files)...
          s1.bam
          s2.bam
          s3.bam
          s4.bam
          s5.bam
          s6.bam
          ...

          And a reference file
          ref.fa

          So, I need to extract the calls at those locations from each of the bam files.

          Alternately, using something like this..
          samtools mpileup -I -uf ref.fa s1.bam | bcftools view -bvcg - > s1.raw.bcf
          bcftools view s1.raw.bcf | vcfutils.pl varFilter -D100 > s1.flt.vcf
          Would give me VARIANTS only in VCF file i.e., when REF != ALT, correct?? But I basically want to know the ALT, even if its same as REF or even absent (based on quality Q>=30 & DP>=3)... hence checking back retroactively this way... for a particular set of locations.

          I'm thinking on these lines, but can't put it together...
          Are there other tools possibly too??

          samtools mpileup -I -Bf ref.fa -l SNP_loci.txt -Q 30 -d 100 -D -S s1.bam | bcftools view -cg - | vcfutils.pl
          Thanks again for your interest & time!

          Comment


          • #6
            bump... any suggestions guys?

            Comment


            • #7
              Originally posted by gr8guns View Post
              So, I need to extract the calls at those locations from each of the bam files.
              If that is all you need then why not use the view option from samtools. This post explains how to do that: http://www.biostars.org/p/47945/

              Comment


              • #8
                Originally posted by GenoMax View Post
                If that is all you need then why not use the view option from samtools. This post explains how to do that: http://www.biostars.org/p/47945/
                Again thanks for your response genoMax! But that wouldn't work. I need the ONE consensus genotype (major allele) across all reads for that sample... something that mpileup would help with? Not just extracting the base for all reads.

                Example:
                At position Chr1:45680, what is the consensus genotype for sample-1, given its BAM file?
                It could be same as reference, a variant, a het or no consensus at all.

                Comment


                • #9
                  Hi gr8guns,

                  Did you figure out how to use samtools to properly address your question? If so, it would be great if you could post your solution.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Essential Discoveries and Tools in Epitranscriptomics
                    by seqadmin




                    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                    04-22-2024, 07:01 AM
                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Today, 08:47 AM
                  0 responses
                  9 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  60 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  57 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  53 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X