  • Computing Enrichment and RPKM

    I'm conducting analysis of RNA HiSeq data, and we are trying to compute enrichment for a given window of reads in the IP over reads in our control. This window could be an entire gene, or a very small 25 bp segment within an exon. Working with some collaborators, we've been in discussion about specifically how to compute enrichment and whether or not that includes RPKM. I've now thoroughly confused myself and I was wondering if anyone had insight into better ways of computing this.

    My initial method of computing enrichment was the ratio of reads in the IP to the reads in the control, normalized by total number of reads sequenced in each:
    Enrichment = (#IPw / Σ IP) / (#CNTLw / Σ CNTL),
    where #IPw and #CNTLw are the numbers of reads that mapped to the given window w in the IP and the control, and Σ IP and Σ CNTL are the total numbers of reads mapped to the genome in each sample (used as a normalization factor).
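A minimal sketch of this depth-normalized ratio in Python. The function name, argument names, and read counts are hypothetical, chosen only for illustration; a window with no control reads is treated as undefined here, which is one possible choice among several (a pseudocount is another).

```python
def enrichment(ip_window, ip_total, cntl_window, cntl_total):
    """Depth-normalized ratio of IP reads to control reads in one window."""
    if cntl_window == 0:
        # Assumption: leave the ratio undefined rather than add a pseudocount.
        raise ValueError("control window has no reads; ratio is undefined")
    ip_fraction = ip_window / ip_total        # #IPw / sum(IP)
    cntl_fraction = cntl_window / cntl_total  # #CNTLw / sum(CNTL)
    return ip_fraction / cntl_fraction


# Hypothetical counts: 120 of 20M IP reads vs. 30 of 15M control reads.
score = enrichment(120, 20_000_000, 30, 15_000_000)
print(score)  # about 3: the window holds three times the control's share of reads
```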

    However, our collaborators insisted that we incorporate RPKM as a normalization factor (that is, divide by it) to account for differing gene lengths, so our final equation became:
    Enrichment = (#IPw / Σ IP) / (#CNTLw / Σ CNTL) / (10^9 * #CNTLg / Σ CNTL / length),
    where #CNTLg is the number of control reads that map to the gene's exons (so excluding introns) and length is the length of the mature transcript (CDS + UTRs, no introns).

    However, our results are very strange, since low RPKM values (< 1) result in a very high enrichment score, and this doesn't make sense for computing enrichment. Furthermore, through answers on this forum, it sounds like RPKM is used more for differential expression between two samples, e.g., two biological replicates, and not necessarily to be used for computing the enrichment of our IP over the control. We're not trying to find DE genes here, but trying to determine an enrichment of our IP over our control for any given window.

    Discussing this with my PI, we thought that excluding RPKM and normalizing solely by the transcript length might be better. One odd result of dividing the enrichment by RPKM is that you're essentially multiplying by the transcript length, which is the opposite of what I'd think we're trying to achieve.
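The algebra behind that observation can be written out directly from the two formulas above. Here E and L are shorthand introduced only for this sketch: E is the depth-normalized ratio without RPKM, and L is the transcript length.

```latex
\[
E = \frac{\#IP_w / \Sigma IP}{\#CNTL_w / \Sigma CNTL},
\qquad
\mathrm{RPKM}_{CNTL} = \frac{10^{9}\,\#CNTL_g}{\Sigma CNTL \cdot L}
\]
\[
\frac{E}{\mathrm{RPKM}_{CNTL}}
  = E \cdot \frac{\Sigma CNTL \cdot L}{10^{9}\,\#CNTL_g}
\]
```

The length L ends up in the numerator, so the score grows linearly with transcript length, and because the whole expression scales as 1/RPKM, any gene with RPKM below 1 has its score inflated.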

    Another possibility I thought of is to compute the RPKM for the control, compute the RPKM for the IP in the same way, and take the ratio of the two. This at least seems consistent with what RPKM was designed for, if I'm understanding RPKM correctly, but I'm still not sure whether it makes any more sense or is better than the other approaches.
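A quick numeric check of this RPKM-ratio idea, assuming both RPKMs are computed over the same window or transcript (so the same length goes into each). The function names and counts are hypothetical. Under that assumption the length drops out, and the ratio equals the plain depth-normalized read ratio whatever the length is.

```python
import math


def rpkm(reads, total_mapped, length_bp):
    """Reads per kilobase of feature per million mapped reads."""
    return 1e9 * reads / (total_mapped * length_bp)


def rpkm_ratio(ip_reads, ip_total, cntl_reads, cntl_total, length_bp):
    """Ratio of IP RPKM to control RPKM over the same feature."""
    return rpkm(ip_reads, ip_total, length_bp) / rpkm(cntl_reads, cntl_total, length_bp)


# Hypothetical counts: 120 of 20M IP reads vs. 30 of 15M control reads.
plain_ratio = (120 / 20_000_000) / (30 / 15_000_000)

ratios = {}
for length_bp in (25, 1_500, 40_000):
    ratios[length_bp] = rpkm_ratio(120, 20_000_000, 30, 15_000_000, length_bp)
    # The two printed values agree (up to floating-point rounding) for every length.
    print(length_bp, ratios[length_bp], plain_ratio)
```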

    Thank you very much and I greatly appreciate your help if anyone has any ideas!

  • #2
    The division by length is plain wrong. For an enrichment score, you want to divide some measure of signal strength in the IP by the same measure in the CNTL. If your colleagues insist that these measures should be normalized for length, they can do so. However, as both measures are divided by the same length, it cancels out. Incidentally, this is why RPKM is not so useful for differential expression, either. Dividing by length just obscures how much evidence you have: 5 to 2 reads is the same ratio as 500 to 200 reads, but in the latter case you can be more sure that this is a real enrichment and not just chance. This is why the raw number of reads (without normalization) is useful and also why looking at the ratio only is not sufficient.
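To put numbers on the 5:2 versus 500:200 point, here is a small sketch using only the Python standard library. It assumes, purely for illustration, that IP and control were sequenced to equal depth, so that under the null hypothesis each read in the window is equally likely to come from either sample. A one-sided exact binomial test stands in for a proper analysis; it ignores variability between replicates, and the function name is hypothetical.

```python
from math import comb


def upper_tail_p(ip_reads, cntl_reads):
    """P(at least ip_reads of the window's reads are IP reads) under a 50:50 null."""
    n = ip_reads + cntl_reads
    # Exact integer count of outcomes at least as extreme as the observed one.
    favourable = sum(comb(n, k) for k in range(ip_reads, n + 1))
    return favourable / 2**n


p_small = upper_tail_p(5, 2)      # 29/128 = 0.2265625
p_large = upper_tail_p(500, 200)  # same 2.5x ratio, far more reads
print(p_small, p_large)
```

Both windows have a ratio of 2.5, but only the second comes with a small p-value; the ratio alone cannot tell the two apart.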

    BTW, are you talking about CLIP, or how come you have IP and control?

    • #3
      Yes, this is for a form of IP. So I'm trying to gauge the enrichment of the IP over the control in a given window. I've heard that RPKM is apparently not a good measure anymore, and that length normalization actually increases variance, so I agree with your point there.

      So we've opted to just use a read count ratio, normalized by the total number of reads mapped in the IP and the control, respectively. Fisher's exact test returns a p-value of 0 for too many windows, because the enrichment is too strong for the test to quantify.
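If the zeros come from floating-point underflow (an assumption; it depends on the software used), one workaround is to compute the Fisher p-value on a log scale. The sketch below sums the upper hypergeometric tail in log space using only the standard library. The function names and counts are hypothetical, and it does nothing about the test's neglect of variability between samples.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_log10_p(ip_window, ip_total, cntl_window, cntl_total):
    """log10 of the one-sided Fisher exact p-value for enrichment in the IP.

    ip_total and cntl_total are all mapped reads per sample (inside and
    outside the window), so the 2x2 table is window / not-window by sample.
    """
    n_total = ip_total + cntl_total
    window_total = ip_window + cntl_window
    log_denominator = log_choose(n_total, window_total)
    # Upper tail: every table with at least ip_window IP reads in the window.
    log_terms = [
        log_choose(ip_total, x)
        + log_choose(cntl_total, window_total - x)
        - log_denominator
        for x in range(ip_window, min(ip_total, window_total) + 1)
    ]
    # Log-sum-exp keeps the sum representable even if every term underflows.
    peak = max(log_terms)
    log_p = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return log_p / math.log(10)


# Hypothetical window: 5000 IP reads vs. 10 control reads, 20M reads per sample.
extreme = fisher_log10_p(5000, 20_000_000, 10, 20_000_000)
print(extreme)  # far below -308, so the p-value itself would print as 0.0
```

Windows can then be ranked by log10 p even when the p-value itself is not representable as a double.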

      Thanks!

      • #4
        Do you have replicates or any other means to assess sample-to-sample variability? Then, you could use DESeq. (The real reason why Fisher's test does not work is that it implicitly assumes biological and extra-Poisson technical variation to be zero.)
