Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Percentage of exons covered

    Hi all,

    I have a bed file with the start and end positions of all the exons. How can we find out, what percentage of these exons are covered in the sequenced sample(bam file)?

    I have tried using,

    samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed

    and a few lines from the output looks like,

    chr1 60817388 60817507 0 0 119 0.0000000
    chr1 91226064 91226183 0 0 119 0.0000000
    chr1 101711776 101711895 0 0 119 0.0000000
    chr1 108003221 108003340 0 0 119 0.0000000
    ....

    Also tried,
    samtools view -b | coverageBed -abam stdin -b exons.bed -hist

    Few lines of output:

    chr1 60817388 60817507 32 1 119 0.0084034

    chr1 60817388 60817507 33 2 119 0.0168067

    chr1 60817388 60817507 34 3 119 0.0252101

    Which output should be used to get the percentage of exons covered?

    Or Any other method to do so, would be highly appreciated.

    Thanks
    Last edited by meher; 06-25-2013, 08:27 AM.

  • #2
    The output one (w/o -hist) only gives you the fraction of exons bases has non-zero coverage by your reads. There is no information of the depth of the coverage.

    The output two gives you the fraction covered of each depth of ceverage.

    At the end of the output two, you can see a histogram summarizing the coverage among all exons.

    Comment


    • #3
      Originally posted by fengqi View Post
      The output one (w/o -hist) only gives you the fraction of exons bases has non-zero coverage by your reads. There is no information of the depth of the coverage.

      The output two gives you the fraction covered of each depth of ceverage.

      At the end of the output two, you can see a histogram summarizing the coverage among all exons.
      Can we get a single value like x% of exons are covered from the output?

      Comment


      • #4
        Originally posted by meher View Post
        Can we get a single value like x% of exons are covered from the output?
        default output will report 4 columns for each interaval in B (that is your exon info file)

        The first column is the nubmer of reads in your bam file that overlappled the exon interval.

        You can easily pick up those exons with zero value. Then you will know how many exons covered, then you can get x%.

        Hope I understood correctly what you want.

        Comment


        • #5
          Originally posted by fengqi View Post
          default output will report 4 columns for each interaval in B (that is your exon info file)

          The first column is the nubmer of reads in your bam file that overlappled the exon interval.

          You can easily pick up those exons with zero value. Then you will know how many exons covered, then you can get x%.

          Hope I understood correctly what you want.
          Sorry that i couldn't understand, what do you mean by default output? Is it from

          samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed

          or

          samtools view -b | coverageBed -abam stdin -b exons.bed -hist

          Either of these outputs have more than 4 columns. So, which one are you referring?

          Comment


          • #6
            Originally posted by meher View Post
            Sorry that i couldn't understand, what do you mean by default output? Is it from

            samtools view -b <BAM> | coverageBed -abam stdin -b exons.bed

            or

            samtools view -b | coverageBed -abam stdin -b exons.bed -hist

            Either of these outputs have more than 4 columns. So, which one are you referring?
            The first one, w/o -hist

            Comment


            • #7
              Originally posted by fengqi View Post
              The first one, w/o -hist
              Okay. But there are seven columns in the output. First three gives the chromosome number, start and end position. The next two columns are zero's and sixth is the length of the exon and the last is again zero.

              So, how can we get the x% out of this?

              Comment


              • #8
                it sounds like you first want to make the coverage of each exon into a binary call (ie either covered or not covered). Then you can just count the number that are covered verses the total.

                The last column of the coverageBed output is the percent coverage of the exon feature in that row. Let's say you want to call 95% coverage "covered" and anything below that as "not covered". Also consider that if you're working with a typical alternatively spliced species your list of exons may have redundant entries for exons that are shared between isoforms. So to put it together this might work:

                first count all of the unique lines in the exons file:
                Code:
                sort -k1,1 -k2,2m -k3,3n exons.bed | uniq > exons_unique.bed
                wc -l exons_unique.bed
                Now you have a total count of exons. Now run the coverageBed with the unique list and then count the number of rows that have a value in the last column greater than your "covered" threshold:

                Code:
                coverageBed -abam <bam_file> -b exons_unique.bed > exons_coverage.bed
                awk '{if($7 > 0.95) print $0}' exons_coverage.bed | wc -l
                that should do it. realize, of course, that if you are working with an alternatively spliced transcriptome this result is a little weird because instead of knowing which isoforms the coverage actually belongs to we're assigning to any that the coverage intersects with.
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #9
                  Originally posted by sdriscoll View Post
                  it sounds like you first want to make the coverage of each exon into a binary call (ie either covered or not covered). Then you can just count the number that are covered verses the total.

                  The last column of the coverageBed output is the percent coverage of the exon feature in that row. Let's say you want to call 95% coverage "covered" and anything below that as "not covered". Also consider that if you're working with a typical alternatively spliced species your list of exons may have redundant entries for exons that are shared between isoforms. So to put it together this might work:

                  first count all of the unique lines in the exons file:
                  Code:
                  sort -k1,1 -k2,2m -k3,3n exons.bed | uniq > exons_unique.bed
                  wc -l exons_unique.bed
                  Now you have a total count of exons. Now run the coverageBed with the unique list and then count the number of rows that have a value in the last column greater than your "covered" threshold:

                  Code:
                  coverageBed -abam <bam_file> -b exons_unique.bed > exons_coverage.bed
                  awk '{if($7 > 0.95) print $0}' exons_coverage.bed | wc -l
                  that should do it. realize, of course, that if you are working with an alternatively spliced transcriptome this result is a little weird because instead of knowing which isoforms the coverage actually belongs to we're assigning to any that the coverage intersects with.
                  Thanks it worked. But could it be possible to find, how much percentage of the exons are covered by atleast 1X/5X/10X reads out of the coverageBed output?

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Advancing Precision Medicine for Rare Diseases in Children
                    by seqadmin




                    Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                    12-16-2024, 07:57 AM
                  • seqadmin
                    Recent Advances in Sequencing Technologies
                    by seqadmin



                    Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                    Long-Read Sequencing
                    Long-read sequencing has seen remarkable advancements,...
                    12-02-2024, 01:49 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 12-17-2024, 10:28 AM
                  0 responses
                  27 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 12-13-2024, 08:24 AM
                  0 responses
                  43 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 12-12-2024, 07:41 AM
                  0 responses
                  29 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 12-11-2024, 07:45 AM
                  0 responses
                  42 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X