Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Percentage of exons covered

    Hi all,

    I have a bed file with the start and end positions of all the exons. How can we find out, what percentage of these exons are covered in the sequenced sample(bam file)?

    I have tried using,

    samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed

    and a few lines from the output looks like,

    chr1 60817388 60817507 0 0 119 0.0000000
    chr1 91226064 91226183 0 0 119 0.0000000
    chr1 101711776 101711895 0 0 119 0.0000000
    chr1 108003221 108003340 0 0 119 0.0000000
    ....

    Also tried,
    samtools view -b | coverageBed -abam stdin -b exons.bed -hist

    Few lines of output:

    chr1 60817388 60817507 32 1 119 0.0084034

    chr1 60817388 60817507 33 2 119 0.0168067

    chr1 60817388 60817507 34 3 119 0.0252101

    Which output should be used to get the percentage of exons covered?

    Or Any other method to do so, would be highly appreciated.

    Thanks
    Last edited by meher; 06-25-2013, 08:27 AM.

  • #2
    The output one (w/o -hist) only gives you the fraction of exons bases has non-zero coverage by your reads. There is no information of the depth of the coverage.

    The output two gives you the fraction covered of each depth of ceverage.

    At the end of the output two, you can see a histogram summarizing the coverage among all exons.

    Comment


    • #3
      Originally posted by fengqi View Post
      The output one (w/o -hist) only gives you the fraction of exons bases has non-zero coverage by your reads. There is no information of the depth of the coverage.

      The output two gives you the fraction covered of each depth of ceverage.

      At the end of the output two, you can see a histogram summarizing the coverage among all exons.
      Can we get a single value like x% of exons are covered from the output?

      Comment


      • #4
        Originally posted by meher View Post
        Can we get a single value like x% of exons are covered from the output?
        default output will report 4 columns for each interaval in B (that is your exon info file)

        The first column is the nubmer of reads in your bam file that overlappled the exon interval.

        You can easily pick up those exons with zero value. Then you will know how many exons covered, then you can get x%.

        Hope I understood correctly what you want.

        Comment


        • #5
          Originally posted by fengqi View Post
          default output will report 4 columns for each interaval in B (that is your exon info file)

          The first column is the nubmer of reads in your bam file that overlappled the exon interval.

          You can easily pick up those exons with zero value. Then you will know how many exons covered, then you can get x%.

          Hope I understood correctly what you want.
          Sorry that i couldn't understand, what do you mean by default output? Is it from

          samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed

          or

          samtools view -b | coverageBed -abam stdin -b exons.bed -hist

          Either of these outputs have more than 4 columns. So, which one are you referring?

          Comment


          • #6
            Originally posted by meher View Post
            Sorry that i couldn't understand, what do you mean by default output? Is it from

            samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed

            or

            samtools view -b | coverageBed -abam stdin -b exons.bed -hist

            Either of these outputs have more than 4 columns. So, which one are you referring?
            The first one, w/o -hist

            Comment


            • #7
              Originally posted by fengqi View Post
              The first one, w/o -hist
              Okay. But there are seven columns in the output. First three gives the chromosome number, start and end position. The next two columns are zero's and sixth is the length of the exon and the last is again zero.

              So, how can we get the x% out of this?

              Comment


              • #8
                it sounds like you first want to make the coverage of each exon into a binary call (ie either covered or not covered). Then you can just count the number that are covered verses the total.

                The last column of the coverageBed output is the percent coverage of the exon feature in that row. Let's say you want to call 95% coverage "covered" and anything below that as "not covered". Also consider that if you're working with a typical alternatively spliced species your list of exons may have redundant entries for exons that are shared between isoforms. So to put it together this might work:

                first count all of the unique lines in the exons file:
                Code:
                sort -k1,1 -k2,2m -k3,3n exons.bed | uniq > exons_unique.bed
                wc -l exons_unique.bed
                Now you have a total count of exons. Now run the coverageBed with the unique list and then count the number of rows that have a value in the last column greater than your "covered" threshold:

                Code:
                coverageBed -abam <bam_file> -b exons_unique.bed > exons_coverage.bed
                awk '{if($7 > 0.95) print $0}' exons_coverage.bed | wc -l
                that should do it. realize, of course, that if you are working with an alternatively spliced transcriptome this result is a little weird because instead of knowing which isoforms the coverage actually belongs to we're assigning to any that the coverage intersects with.
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #9
                  Originally posted by sdriscoll View Post
                  it sounds like you first want to make the coverage of each exon into a binary call (ie either covered or not covered). Then you can just count the number that are covered verses the total.

                  The last column of the coverageBed output is the percent coverage of the exon feature in that row. Let's say you want to call 95% coverage "covered" and anything below that as "not covered". Also consider that if you're working with a typical alternatively spliced species your list of exons may have redundant entries for exons that are shared between isoforms. So to put it together this might work:

                  first count all of the unique lines in the exons file:
                  Code:
                  sort -k1,1 -k2,2m -k3,3n exons.bed | uniq > exons_unique.bed
                  wc -l exons_unique.bed
                  Now you have a total count of exons. Now run the coverageBed with the unique list and then count the number of rows that have a value in the last column greater than your "covered" threshold:

                  Code:
                  coverageBed -abam <bam_file> -b exons_unique.bed > exons_coverage.bed
                  awk '{if($7 > 0.95) print $0}' exons_coverage.bed | wc -l
                  that should do it. realize, of course, that if you are working with an alternatively spliced transcriptome this result is a little weird because instead of knowing which isoforms the coverage actually belongs to we're assigning to any that the coverage intersects with.
                  Thanks it worked. But could it be possible to find, how much percentage of the exons are covered by atleast 1X/5X/10X reads out of the coverageBed output?

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  29 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  32 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  28 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  52 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X