Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Automate nucleotide-specific coverage retrieval from BAMs?

    I've read previous threads on computing coverage from BAMs, and I can't seem to find an answer to my specific quandary. I am aware that coverage can be quite easily computed from a BAM file over intervals/genes/exons/etc. provided in a BED or GFF file using Samtools Pileup, Depth, or Bedtools Coverage, or GTK, but I would also like to be able to pull out average nucleotide-specific coverages within each of the intervals specified.

    Specifically, I have a bunch of aligned, indexed BAM files from whole-exome sequencing data, and I have a BED file containing the locations of coding regions of all RefSeq genes. For each BAM, I want to be able to pull out average coverage per coding region of each specific nucleotide, that is, the average coverage of A's, T's, G's and C's. Furthermore, I would like to be able query the coverage of nucleotides within specific contexts for each coding region, such as, for example, what is the average coverage of all G's in CG contexts? Similarly, I would also like to get nucleotide specific lengths for each region in the BED file.

    Does anyone know if this is doable using the basic tools provided in Samtools or Bedtools? If not, how difficult would it be to pipe some output (like a pileup) into a custom python script to get this done without making my vastly under-powered machine explode?

    Thanks in advance for your help!

  • #2
    CoverageBed in BEDtools will do most of what you want. You may want a bit of scripting to massage its output to exactly what you want.

    Comment


    • #3
      The reference base and read depth at each position could be obtained in an easily-parseable VCF format with
      samtools mpileup -uf reference.fasta alignment.bam | bcftools view -cg -

      I use the BEDOPS tools for other interval slicing/dicing such as you describe, they generally use minimal memory by using only sorted inputs, and the new "--max-mem" option in sort-bed may help make anything doable on your machine.

      bedtools has the advantage of working with that VCF output directly though.

      Comment


      • #4
        Thanks for the responses!

        I'm still a little concerned about the amount of memory I will need to obtain read depth at each position. I'm not trying to get coverage of just a list of SNPs or sequences, but literally every base in every exon in the genome. I was hoping there might be some way to obtain averages of base-specific coverage across intervals without having to output the depth at every base. I think (though I could be wrong) Bedtools genomeCoverageBed -d option can output the number of reads covering each and every base in each chromosome . . . for the entire genome this would equate to 3 billion rows of text. Is it possible to narrow the genomeCoverageBed command to cover only coding regions? Or is there a similar function (like CoverageBed) that can take a BED file of exon locations (as opposed to just chromosomes) and compute their per base coverage? In this case, maybe the number of rows would be narrowed down to 0.75 billion or so . . .

        Am I approaching this problem correctly, or is there a more efficient solution?

        Ultimately, the results I want would be something like this:

        Gene_____Avg.Coverage.A_____Avg.Coverage.T_____Avg.Coverage.G.in.CG
        a___________51________________73________________67
        b___________39________________89________________100
        Last edited by Alphred; 02-28-2013, 06:32 PM.

        Comment


        • #5
          The number of rows will not be a memory problem if you only process one line at a time, and the final program in the piped command-line (e.g. your python script or maybe "groupby") just saves the counts per gene.

          That bcftools command gives every base, not just variants. The only reason I suggested it instead of "bedtools coverage" is that it also gives you the reference base, which is not in the BAM file.

          Comment


          • #6
            Originally posted by EricHaugen View Post
            bedtools has the advantage of working with that VCF output directly though.
            The vcf2bed script that is part of BEDOPS can be used to pipe BED data as standard input into other BEDOPS tools, like sort-bed and bedops, e.g.:

            $ vcf2bed < my.vcf \
            | sort-bed --max-mem 2G - \
            | bedops --element-of -1 - foo.bed \
            > answer.bed


            The hyphen character denotes the use of standard input, in place of a regular file. (Of course, standard output from vcf2bed can be piped in bedtools utilities or any other app that takes in standard input.)

            Another option is to use named pipes, if there is more than one input, two or more of which are of non-BED formats, e.g.:

            $ mkfifo my_vcf_pipe
            $ vcf2bed < my.vcf | sort-bed - > my_vcf_pipe
            $ mkfifo my_gff_pipe
            $ gff2bed < my.gff | sort-bed - > my_gff_pipe
            $ bedops --intersect my_vcf_pipe my_gff_pipe > answer.bed
            $ rm my_vcf_pipe my_gff_pipe
            Last edited by AlexReynolds; 02-28-2013, 11:16 PM.

            Comment


            • #7
              Cool, thanks guys!

              Just to make sure I understand correctly, based on the comments, I could use:

              samtools mpileup -u -l exons.bed -f ref.fasta my.bam | bcftools view | “some tool to annotate gene names to positions” | “simple script to sum rows by gene and nucleotide” > output.vcf

              Should this accomplish what I’m after? If so, what would be the simplest (no frills) method to annotate gene names to Chrom + Pos?

              Comment


              • #8
                If you have a .bed file with gene positions, you can use that to assign each line in the .bam to a gene position. You might just want to use that, and then use something like cut | sort | uniq -c to count how many reads hit every gene. With read counts + gene lengths, you can calculate average coverage.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM
                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                31 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                32 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                53 views
                0 likes
                Last Post seqadmin  
                Working...
                X