Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to obtain coverage % of the genome?

    For our paired-end exome sequence data, I have aligned the reads to HG18 reference genome with BWA, and called variants with SAMtools mpileup command. Could anyone teach me how to obtain the mapping coverage %, error rate, etc.? Thanks a lot.

  • #2
    I'd use BEDTools to get coverage stats. The normal Illumina pipeline will work out error stats, so if you have Illumina data, ask about that sumamry.

    Comment


    • #3
      Dear swbarnes2, thank you very much. I will try BEDTools.

      Comment


      • #4
        You should also look into using GATK DepthOfCoverage.

        Comment


        • #5
          As with Heisman I prefer DepthOfCoverage especially for exomes because of the percent reads at X depth feature.

          For you interval list it can tell you which percentage of your bases are higher than 5x, 10x, 30x,50x, etc (configurable). This helps greatly to see if you have sufficient overall base coverage >20x to make confident calls.

          Comment


          • #6
            Thanks to Heisman and lletourn, DepthOfCoverage is great.

            Comment


            • #7
              Isn't depth of coverage just the ratio of total number of reads sequenced to the number of bases in your genome?

              Cov = Total reads sequenced/ no: of bases in genome (in fasta)?

              Or is that X fold coverage? I'm confused now!
              Last edited by cedance; 10-04-2011, 12:54 PM.

              Comment


              • #8
                Originally posted by cedance View Post
                Isn't depth of coverage just the ratio of total number of reads sequenced to the number of bases in your genome?

                Cov = Total reads sequenced/ no: of bases in genome (in fasta)?
                Yes but it's a gross approximation. You lose the reads that are trimmed/clipped. the ones that don't align, the ones not on target the bases with too low quality, the reads that have a too-low mapping quality...you get the picture.

                When a tool compute average or median of an interval_list it gives you the "true", usable coverage.

                Comment


                • #9
                  When the alignment is written in Goby format (which you can produce with this version of BWA, or with GobyWeb), you can estimate coverage statistics for exome as described in this tutorial:
                  estimating-coverage-statistics-for-targeted-resequencing-experiments

                  Comment


                  • #10
                    Originally posted by lletourn View Post
                    Yes but it's a gross approximation. You lose the reads that are trimmed/clipped. the ones that don't align, the ones not on target the bases with too low quality, the reads that have a too-low mapping quality...you get the picture.

                    When a tool compute average or median of an interval_list it gives you the "true", usable coverage.
                    hi lletourn, Thanks for your reply. got the picture. (sorry for too many questions, I have just thought about looking at the quality of my "mapped" data. I had overlooked it until now.)

                    I tried DepthOfCoverage on my data but I got an error related to "RG" (read group). I work on alternative splicing and I merged the data set (by modifying the header so that I can get back the files). I guess I should just add a tag with RG as optional field. Do you know about this?

                    Also, could you explain a bit more detailed as to what sort of coverage can we get and the command you normally use (on 1 or multiple bam files)? I would assume: 1) % of bases covered with 10x, 20x... etc... (helps to visualize the quality of your data I guess?) 2) coverage of gene / desired location?
                    If 2) is possible, then I would also like to know how I can get coverage for a desired location.

                    And how different is this from mpileup?

                    Comment


                    • #11
                      Yes, GATK is pretty Finicky about the RG tag and karyotypic order of chromosomes (GATK doesn't like chr1, chr10, chr11, it likes chr1, chr2, chr3, etc).

                      You could use Picards AddOrReplaceReadGroup to add your missing RG tag.


                      I run DepthOfCoverage twice:
                      Once with CCDS to check exon coverage
                      A second time for the genome coverage

                      For Exomes, I just do it the one time.

                      Commands are like this
                      Code:
                      #Exome
                      java -Xmx4G -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R human_hg19.fasta -I my.sorted.dup.bam --omitDepthOutputAtEachBase --logging_level ERROR -geneList refGene.sorted.txt --summaryCoverageThreshold 10 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 40 --summaryCoverageThreshold 50 --summaryCoverageThreshold 80
                      --summaryCoverageThreshold 90 --summaryCoverageThreshold 100
                      --summaryCoverageThreshold 150 --minBaseQuality 20 --minMappingQuality 30 --start 1 --stop 500 --nBins 499 -dt NONE -L CCDS.ccdsGenes.bed -o my.sorted.dup.CCDS.coverage
                      
                      #Genome
                      java -Xmx4G -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R human_hg19.fasta -I my.sorted.dup.bam --omitDepthOutputAtEachBase --logging_level ERROR --summaryCoverageThreshold 10 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 40 --summaryCoverageThreshold 50 --summaryCoverageThreshold 80 --summaryCoverageThreshold 90 --summaryCoverageThreshold 100 --summaryCoverageThreshold 150 --minBaseQuality 20 --minMappingQuality 30 --start 1 --stop 1000 --nBins 999 -dt NONE -o my.sorted.dup.coverage
                      The start, stop and nBins are just there as examples if your coverage is > 500x
                      Last edited by lletourn; 10-05-2011, 06:25 PM.

                      Comment


                      • #12
                        Hi lletourn,
                        I ran for genome coverage. Looking at the output I realized it doesn't make much sense because I am working on transcriptome data from RNA-Seq. I should just look for exons with the -L option and exon coordinates as .bed file, I suppose. If I am doing something wrong, please let me know. I'll get back to you once I get some results with that. I am also crosschecking with samtools mpileup.

                        Thanks again.

                        Comment


                        • #13
                          You're right, look only at the exomes+UTR.

                          Or give it any interval and also give it a refGene formatted file (the geneList option only work if you also pass an interval. I think that's a bug).

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM
                          • seqadmin
                            Techniques and Challenges in Conservation Genomics
                            by seqadmin



                            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                            Avian Conservation
                            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                            03-08-2024, 10:41 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Yesterday, 06:37 PM
                          0 responses
                          10 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, Yesterday, 06:07 PM
                          0 responses
                          9 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-22-2024, 10:03 AM
                          0 responses
                          49 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-21-2024, 07:32 AM
                          0 responses
                          67 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X