Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Genome coverage

    How do you calculate the genome coverage for a sequenced genome?
    I have the BAM file for my genome, and I have calculated the read depth at each position using samtools depth command.
    Is there a way I can calculate genome coverage using this?
    Last edited by bioinf newbie; 02-15-2012, 01:01 PM.

  • #2
    If I'm not mistaken, the output of 'samtools depth' provides the index of the reference (second column) and the depth of that index (third column). So every reference index that has a depth greater than 0 reads (or whatever your cutoff value is) you can say that the base is covered. A simple perl script could do the trick or you could pipe it through the command line.

    Comment


    • #3
      calculate coverage from bam file

      samtools depth whole.bam | awk '{sum+=$3;cnt++}END{print sum/cnt" "sum}'

      It outputs 2 numbers: average coverage for a coveraged base (sum/cnt) AND total base coverage (sum).

      Note, some bases may not be covered.

      Divide (total base coverage) by (genome size).

      example : /samtools depth TCGA-AG-4008-01A-01D-1115-02_IlluminaHiSeq-DNASeq_whole.bam | awk '{sum+=$3;cnt++}END{print sum/cnt" "sum}
      3.86809 10573174269

      10573174269/3100000000 = 3.41070137709677419354 (note 3.1 billion is aproxmate size of human genome)

      so 3.4 coverage
      ____________
      Sanity check ....
      A way to check this, and an faster way to calculate it is ... run samtools flagstat and get the mapped reads. Multiply this by read size then divide by genome size.

      Example:
      Get number of mapped reads ...
      samtools flagstat TCGA-AG-4008-01A-01D-1115-02_IlluminaHiSeq-DNASeq_whole.bam | grep mapped
      213309151 + 0 mapped (87.77%:nan%)
      208629507 + 0 with itself and mate mapped
      3785276 + 0 with mate mapped to a different chr

      213309151 is what we're after = number of mapped reads

      Find out read length
      samtools view TCGA-AG-4008-01A-01D-1115-02_IlluminaHiSeq-DNASeq_whole.bam | awk '{print length($10)}' | head -1
      50
      (read lentgth is 50, beware there can be mixed read lengths)

      So ...
      (213309151*50)/3100000000 = 3.44047017741935483870

      Again, 3.1 billion is the genome size of the fearsome creature "homo sapiens".

      check. close enough. we good.

      Comment


      • #4
        Thank you very much Richard and @twaddlac.

        Comment


        • #5
          Hi,

          I have a bedgraph file, which goes like this

          chr start end coverage

          1 213 214 890
          2 214 215 900
          2 340 345 900
          4 34090 34809 34

          Now I would like to calculate the average coverage, is there any better way to do it using awk or any other tools?

          Comment


          • #6
            Hey:
            I have a question. All these applies only for genome assemblies?
            Can you calculate the coverage for an "assembled" Transcriptome with the same command line that Richard Finney posted here?

            I'm trying to assess transcriptome assemblies, looking for the best algorithm, at least, for my data. I used Trinity and MIRA. MIRA gives an output with a lot of stats, but Trinity doesn´t. So, in order to compare them I need to retrieve stats from Trinity outputs. Someone knows how to do that?

            Also, I don't understand why Samtools needs a mapping information. Can you assess Quality or stats of the indexed reference with samtools? I mean, with the *.fasta.fai file.


            Thanks!

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM
            • seqadmin
              Techniques and Challenges in Conservation Genomics
              by seqadmin



              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

              Avian Conservation
              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
              03-08-2024, 10:41 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 06:37 PM
            0 responses
            8 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, Yesterday, 06:07 PM
            0 responses
            8 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-22-2024, 10:03 AM
            0 responses
            49 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-21-2024, 07:32 AM
            0 responses
            66 views
            0 likes
            Last Post seqadmin  
            Working...
            X