Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to extract base coverage from bam/bai-files

    Hi Everybody,

    I'm still quite new to the world of next-generation sequencing. I started to work with the Ion Torrent PGM several months ago.
    I've got a question related to the analysis of the coverage depth of different nucleotides/dels/insertions on the base positions of my reads.

    I normally use the bam and corresponding bai-files generated by the Torrent Suite Software and load them into the IGV-Genome Browser.
    There, one can get information about single base position coverage by mouseover (see screenshot).

    One can see the total base count, count of A/T/G/Cs (+ and - reads), and count of dels and insertions.
    I would like to have this data in a table, to use it in an Excel spreadsheet.

    Can anybody tell me, how I can extract this data from my bam/bai-files for a defined region of interest? I'd appreciate any hints.
    The most preferred output would be a data table like this:
    e.g.
    chrom position total count count of:A+ A- T+ T- C+ C- G+ G- del+ del- ins+ ins-
    chr13 33211005 50 20 21 2 3 0 0 0 0 2 0 1 1

    I've searched the forums for related threads, but I only found some bioinformatic information that I couldn't understand. I'm just a biochemist, and I have no big skills in bioinformatics. E.g., I don't know about Linux OS and how to use software or commands there.
    Any help is much appreciated.

    Thank you very much in advance.
    Greetings
    Coru

  • #2
    How big is your region of interest, and do you want one row for every position? If you do it could be a huge spreadsheet that Excel might struggle with.

    A VCF file, (for example as generated by samtools), has similar information for sites with SNPs & indels, which you can then filter to help distinguish true variants from sequencing errors.

    Comment


    • #3
      I would use bedtools (which can use BAM files as input). Use genomecov to get the coverage all across the genome:



      Or just coverage if you have a specific gff/bed file with the region that you want to determine the coverage for:



      edit: sorry, that's just for the base-pair coverage (independent of the base). I suspect that per-nucleotide counts in a particular region is straying into "needs a custom script to work it out from the samtools mpileup output" territory.
      Last edited by gringer; 10-29-2013, 12:53 PM.

      Comment


      • #4
        Hello all,
        I have a similar related question to extract chromosome positions and depth per region. So I used bedtools to get bam to bed conversion:
        I got :
        Chr1 11419 11425 HWUSI-# 1 + 1
        Chr1 13877 13892 HWUSI-# 0 - 2
        Chr1 14714 14721 HWUSI-# 1 + 3
        Chr1 16155 16163 HWUSI-# 1 - 4
        Chr1 18239 18249 HWUSI-# 1 + 5
        Now I looked up the manual for bedtools and it indicates that col5 as scores 1 to 1000 . I tried looking more into it, but cannot understand what those 1 and 0s stand for specifically.I know they are scores and that's it. I also have downstream in this column 32, 42 etc and not jut 1 or 0.I have cut and pasted just the top 5 outputs.
        Also the last column looks like it is assigned a number for each cluster. Is this correct?
        Please can some one clarify this for me?
        Also if I need to get just the first 4 columns as is and also get the seq itself in the next column and counts of tags in the next columns, can I use bedtools to do it? If so , how? or do I need to write a script so that I have:
        Chr# start end name seq # of tags pos/neg strand cluster#. from my bam files?
        Please any help will do
        Thanks in advance
        geneart.

        Comment


        • #5
          Hi All,

          Thank you very much for your comments and hints.
          I found a computer scientist who wrote a small program that can do the job.

          Greetings
          Rayk

          Comment


          • #6
            Corus S
            Would you be able to post the code as I also would be interested in doing this search??

            Cheers

            Gray

            Comment


            • #7
              Well, I don't have the source code, just the final files of the program.

              But I can ask the guy who wrote it, if he is willing to share the code.

              I'll let you know...

              Comment


              • #8
                Hi Gray,

                I contacted him, but unfortunately he doesn't want to reveal the code or the final files in the public at the moment, since the program is still beta status.

                When he has extended the program he will make it publicly available in the final version.

                Greetings

                Comment


                • #9
                  I contacted him, but unfortunately he doesn't want to reveal the code or the final files in the public at the moment, since the program is still beta status.
                  Okay, well I've realised you can do this just using samtools and common Linux command-line programs. Here are base counts:
                  Code:
                  #forward
                  samtools view -F 0x10 sampled_lib19_extended.bam YAL005C_mRNA-51+51bp:500-3000 | awk -F '\t' '{print $10}' | fold -w 1 | sort | uniq -c
                  #reverse
                  samtools view -f 0x10 sampled_lib19_extended.bam YAL005C_mRNA-51+51bp:500-3000 | awk -F '\t' '{print $10}' | fold -w 1 | sort | uniq -c
                  #generic
                  samtools view [filter] input.bam sequence:bpRange | awk -F '\t' '{print $10}' | fold -w 1 | sort | uniq -c
                  [These one-liners can be easily modified to put sequence names at the start of the lines if necessary]

                  For a quick INDEL count, you can use the CIGAR column (#6):

                  Code:
                  #generic (use -f 0x10 or -F 0x10 for forward/reverse filter as necessary)
                  samtools view [filter] input.bam sequence:bpRange | awk -F '\t' '{print $6}' | fold -w 1 | grep '[ID]' | sort | uniq -c

                  Comment


                  • #10
                    Igvtools (which is a command line tools associated with IGV) count can do this task easily. It produces a text file in wig format. However it may not easy to open it in Excel as desired.

                    Comment


                    • #11
                      Originally posted by yl01 View Post
                      Igvtools (which is a command line tools associated with IGV) count can do this task easily. It produces a text file in wig format. However it may not easy to open it in Excel as desired.
                      Could you possibly share the igvtools command line to do this? I have been trying to figure out how to get the base coverage values at each position from my BAM files. I'd want to know which base is at a position and how many reads produce that base so I can create some custom scripts using that information.
                      I tried:

                      igv count --bases input_sorted.bam output_basecounts.wig

                      But my .wig files are empty. Should I be saving the output differently??
                      Last edited by Katherine_B; 01-03-2019, 04:53 PM. Reason: Forgot to include the command I used

                      Comment


                      • #12
                        Use BEDtools 'genomecov' command with '-d' flag to obtain per-base depth of coverage.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:37 PM
                        0 responses
                        8 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 06:07 PM
                        0 responses
                        8 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        49 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        67 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X