Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to obtain coverage % of the genome?

    For our paired-end exome sequence data, I have aligned the reads to HG18 reference genome with BWA, and called variants with SAMtools mpileup command. Could anyone teach me how to obtain the mapping coverage %, error rate, etc.? Thanks a lot.

  • #2
    I'd use BEDTools to get coverage stats. The normal Illumina pipeline will work out error stats, so if you have Illumina data, ask about that sumamry.

    Comment


    • #3
      Dear swbarnes2, thank you very much. I will try BEDTools.

      Comment


      • #4
        You should also look into using GATK DepthOfCoverage.

        Comment


        • #5
          As with Heisman I prefer DepthOfCoverage especially for exomes because of the percent reads at X depth feature.

          For you interval list it can tell you which percentage of your bases are higher than 5x, 10x, 30x,50x, etc (configurable). This helps greatly to see if you have sufficient overall base coverage >20x to make confident calls.

          Comment


          • #6
            Thanks to Heisman and lletourn, DepthOfCoverage is great.

            Comment


            • #7
              Isn't depth of coverage just the ratio of total number of reads sequenced to the number of bases in your genome?

              Cov = Total reads sequenced/ no: of bases in genome (in fasta)?

              Or is that X fold coverage? I'm confused now!
              Last edited by cedance; 10-04-2011, 12:54 PM.

              Comment


              • #8
                Originally posted by cedance View Post
                Isn't depth of coverage just the ratio of total number of reads sequenced to the number of bases in your genome?

                Cov = Total reads sequenced/ no: of bases in genome (in fasta)?
                Yes but it's a gross approximation. You lose the reads that are trimmed/clipped. the ones that don't align, the ones not on target the bases with too low quality, the reads that have a too-low mapping quality...you get the picture.

                When a tool compute average or median of an interval_list it gives you the "true", usable coverage.

                Comment


                • #9
                  When the alignment is written in Goby format (which you can produce with this version of BWA, or with GobyWeb), you can estimate coverage statistics for exome as described in this tutorial:
                  estimating-coverage-statistics-for-targeted-resequencing-experiments

                  Comment


                  • #10
                    Originally posted by lletourn View Post
                    Yes but it's a gross approximation. You lose the reads that are trimmed/clipped. the ones that don't align, the ones not on target the bases with too low quality, the reads that have a too-low mapping quality...you get the picture.

                    When a tool compute average or median of an interval_list it gives you the "true", usable coverage.
                    hi lletourn, Thanks for your reply. got the picture. (sorry for too many questions, I have just thought about looking at the quality of my "mapped" data. I had overlooked it until now.)

                    I tried DepthOfCoverage on my data but I got an error related to "RG" (read group). I work on alternative splicing and I merged the data set (by modifying the header so that I can get back the files). I guess I should just add a tag with RG as optional field. Do you know about this?

                    Also, could you explain a bit more detailed as to what sort of coverage can we get and the command you normally use (on 1 or multiple bam files)? I would assume: 1) % of bases covered with 10x, 20x... etc... (helps to visualize the quality of your data I guess?) 2) coverage of gene / desired location?
                    If 2) is possible, then I would also like to know how I can get coverage for a desired location.

                    And how different is this from mpileup?

                    Comment


                    • #11
                      Yes, GATK is pretty Finicky about the RG tag and karyotypic order of chromosomes (GATK doesn't like chr1, chr10, chr11, it likes chr1, chr2, chr3, etc).

                      You could use Picards AddOrReplaceReadGroup to add your missing RG tag.


                      I run DepthOfCoverage twice:
                      Once with CCDS to check exon coverage
                      A second time for the genome coverage

                      For Exomes, I just do it the one time.

                      Commands are like this
                      Code:
                      #Exome
                      java -Xmx4G -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R human_hg19.fasta -I my.sorted.dup.bam --omitDepthOutputAtEachBase --logging_level ERROR -geneList refGene.sorted.txt --summaryCoverageThreshold 10 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 40 --summaryCoverageThreshold 50 --summaryCoverageThreshold 80
                      --summaryCoverageThreshold 90 --summaryCoverageThreshold 100
                      --summaryCoverageThreshold 150 --minBaseQuality 20 --minMappingQuality 30 --start 1 --stop 500 --nBins 499 -dt NONE -L CCDS.ccdsGenes.bed -o my.sorted.dup.CCDS.coverage
                      
                      #Genome
                      java -Xmx4G -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R human_hg19.fasta -I my.sorted.dup.bam --omitDepthOutputAtEachBase --logging_level ERROR --summaryCoverageThreshold 10 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 40 --summaryCoverageThreshold 50 --summaryCoverageThreshold 80 --summaryCoverageThreshold 90 --summaryCoverageThreshold 100 --summaryCoverageThreshold 150 --minBaseQuality 20 --minMappingQuality 30 --start 1 --stop 1000 --nBins 999 -dt NONE -o my.sorted.dup.coverage
                      The start, stop and nBins are just there as examples if your coverage is > 500x
                      Last edited by lletourn; 10-05-2011, 06:25 PM.

                      Comment


                      • #12
                        Hi lletourn,
                        I ran for genome coverage. Looking at the output I realized it doesn't make much sense because I am working on transcriptome data from RNA-Seq. I should just look for exons with the -L option and exon coordinates as .bed file, I suppose. If I am doing something wrong, please let me know. I'll get back to you once I get some results with that. I am also crosschecking with samtools mpileup.

                        Thanks again.

                        Comment


                        • #13
                          You're right, look only at the exomes+UTR.

                          Or give it any interval and also give it a refGene formatted file (the geneList option only work if you also pass an interval. I think that's a bug).

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Advancing Precision Medicine for Rare Diseases in Children
                            by seqadmin




                            Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                            12-16-2024, 07:57 AM
                          • seqadmin
                            Recent Advances in Sequencing Technologies
                            by seqadmin



                            Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                            Long-Read Sequencing
                            Long-read sequencing has seen remarkable advancements,...
                            12-02-2024, 01:49 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 12-17-2024, 10:28 AM
                          0 responses
                          33 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-13-2024, 08:24 AM
                          0 responses
                          49 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-12-2024, 07:41 AM
                          0 responses
                          34 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-11-2024, 07:45 AM
                          0 responses
                          46 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X