SEQanswers

Old 03-08-2012, 04:03 PM   #1
ysaletore
Junior Member
 
Location: New York, NY

Join Date: Mar 2012
Posts: 2
Computing Enrichment and RPKM

I'm conducting analysis of RNA HiSeq data, and we are trying to compute enrichment for a given window of reads in the IP over reads in our control. This window could be an entire gene, or a very small 25 bp segment within an exon. Working with some collaborators, we've been in discussion about specifically how to compute enrichment and whether or not that includes RPKM. I've now thoroughly confused myself and I was wondering if anyone had insight into better ways of computing this.

My initial method of computing enrichment was the ratio of reads in the IP to the reads in the control, normalized by total number of reads sequenced in each:
Enrichment = (#IPw / Σ IP) / (#CNTLw / Σ CNTL),
where #IPw and #CNTLw are the numbers of reads that mapped to the given window w in the IP and the control, and Σ IP and Σ CNTL are the total numbers of reads that were mapped to the genome in each (as a normalization factor).
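
A minimal sketch of this ratio in Python, for concreteness. The function and argument names are hypothetical, and the counts are made-up numbers for illustration, not data from this experiment:

```python
def enrichment(ip_window, ip_total, cntl_window, cntl_total):
    """Depth-normalized ratio of window read counts, IP over control."""
    if cntl_window == 0:
        # The ratio is undefined when the control window has no reads.
        raise ValueError("control window has no reads; ratio is undefined")
    return (ip_window / ip_total) / (cntl_window / cntl_total)


# Illustrative numbers only: 500 IP reads and 200 control reads in the window,
# with 20M and 40M mapped reads in total.
score = enrichment(ip_window=500, ip_total=20_000_000,
                   cntl_window=200, cntl_total=40_000_000)
print(score)
```

Note that the window length appears nowhere in this ratio, because the same window is counted in both samples.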

However, our collaborators insisted that we incorporate RPKM as a normalization factor (that is, divide by it) to account for differing gene lengths, so our final equation then became:
Enrichment = (#IPw / Σ IP) / (#CNTLw / Σ CNTL) / (10^9 * #CNTLg / Σ CNTL / length),
where here #CNTLg is the number of reads that map to the gene exons (so excluding introns) and length refers to the length of the mature transcript (CDS + UTRs, no introns).

However, our results are very strange, since low RPKM values (< 1) result in a very high enrichment score, and this doesn't make sense for computing enrichment. Furthermore, through answers on this forum, it sounds like RPKM is used more for differential expression between two samples, e.g., two biological replicates, and not necessarily to be used for computing the enrichment of our IP over the control. We're not trying to find DE genes here, but trying to determine an enrichment of our IP over our control for any given window.

Discussing this with my PI, we thought perhaps excluding RPKM but normalizing solely over the transcript length might be better. One odd result of dividing the enrichment by RPKM is that you're essentially multiplying by the transcript length, which is opposite of what I'd think we're trying to achieve.
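
To make the two odd effects concrete (scores scaling with transcript length, and RPKM values below 1 inflating the score), here is a small Python sketch of the RPKM-divided formula. All names are hypothetical and the numbers are invented for illustration:

```python
def rpkm(reads, total_mapped, length_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * reads / (total_mapped * length_bp)


def enrichment(ip_w, ip_total, cntl_w, cntl_total):
    """Depth-normalized window ratio, IP over control."""
    return (ip_w / ip_total) / (cntl_w / cntl_total)


def enrichment_over_rpkm(ip_w, ip_total, cntl_w, cntl_total,
                         cntl_gene, length_bp):
    """The window ratio divided by the control RPKM of the gene."""
    base = enrichment(ip_w, ip_total, cntl_w, cntl_total)
    return base / rpkm(cntl_gene, cntl_total, length_bp)


# Illustrative counts only; the same counts are reused with two lengths.
counts = dict(ip_w=500, ip_total=20_000_000,
              cntl_w=200, cntl_total=40_000_000, cntl_gene=40)

short = enrichment_over_rpkm(length_bp=2_000, **counts)
longer = enrichment_over_rpkm(length_bp=4_000, **counts)

# Doubling the transcript length doubles the score.
print(longer / short)

# With a control RPKM below 1, the score exceeds the plain ratio.
print(rpkm(40, 40_000_000, 2_000), enrichment(500, 20_000_000, 200, 40_000_000), short)
```

Because RPKM has the length in its denominator, dividing by RPKM puts the length in the numerator of the score, and any RPKM below 1 makes the score larger than the plain ratio.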

Another possibility I thought of is to compute the RPKM for the control, compute the RPKM in the same way for the IP, and take the ratio of the two. This at least seems consistent with what RPKM seems to have been designed for, if I'm understanding RPKM correctly, but I'm still not sure if that makes any more sense or is better than the other approaches.

Thank you very much and I greatly appreciate your help if anyone has any ideas!
Old 03-08-2012, 11:20 PM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 978

The division by length is plain wrong. For an enrichment score, you want to divide some measure of signal strength in IP by the corresponding measure in CNTL. If your colleagues insist that these measures should be normalized for length, they can do so. However, as both measures are divided by the same length, it cancels out. Incidentally, this is why RPKM is not so useful for differential expression, either. Dividing by length just obscures how much evidence you have: 5 to 2 reads is the same ratio as 500 to 200 reads, but in the latter case you can be more sure that this is a real enrichment and not just chance. This is why the raw number of reads (without normalization) is useful and also why looking at the ratio only is not sufficient.
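
Both points can be checked with a short Python sketch. It assumes equal sequencing depth in IP and control and uses a deliberately simple null model in which each read in the window is equally likely to come from either sample. That null ignores sample-to-sample variability, so treat it only as an illustration of how evidence grows with counts, not as a recommended test. The function names and numbers are hypothetical:

```python
from math import comb


def rpkm(reads, total_mapped, length_bp):
    """Reads per kilobase per million mapped reads."""
    return 1e9 * reads / (total_mapped * length_bp)


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


total = 10_000_000   # assumed equal depth in IP and control
length_bp = 2_000    # same window, hence same length, in both samples

# 1) The length cancels: the ratio of RPKMs equals the ratio of
#    depth-normalized counts.
ratio_rpkm = rpkm(500, total, length_bp) / rpkm(200, total, length_bp)
ratio_counts = (500 / total) / (200 / total)
print(ratio_rpkm, ratio_counts)

# 2) Same ratio, very different evidence under the toy null (p = 0.5).
p_small = binom_upper_tail(5, 5 + 2, 0.5)        # 5 IP reads vs 2 control
p_large = binom_upper_tail(500, 500 + 200, 0.5)  # 500 IP reads vs 200 control
print(p_small, p_large)
```

The two ratios agree, while the tail probability for 5 versus 2 reads is unremarkable and the one for 500 versus 200 reads is vanishingly small.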

BTW, are you talking about CLIP, or how come you have IP and control?
Old 03-11-2012, 11:28 AM   #3
ysaletore
Junior Member
 
Location: New York, NY

Join Date: Mar 2012
Posts: 2

Yes, this is for a form of IP. So I'm trying to gauge the enrichment of the IP over the control in a given window. I've heard that RPKM is apparently not a good measure anymore, and that length normalization actually increases variance, so I agree with your point there.

So we've opted to just use a read count ratio, normalized by the total number of reads mapped in the IP and the control, respectively. Using Fisher's exact test produces too many p-values of exactly 0, because the enrichment is too high to be quantified with the test.
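
If the zeros come from floating-point underflow (an assumption; it depends on the software used), one workaround is to compute the one-sided Fisher (hypergeometric) tail in log space. The sketch below uses only the Python standard library; the function names and counts are hypothetical. It only sidesteps the numerical issue and still assumes no variability between samples beyond sampling noise:

```python
from math import lgamma, log, exp, isfinite


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log10_fisher_upper_tail(ip_w, ip_total, cntl_w, cntl_total):
    """log10 of the one-sided hypergeometric P(X >= ip_w).

    The 2x2 table is (in window / not in window) x (IP / control),
    with ip_total and cntl_total the total mapped reads per sample.
    """
    n_window = ip_w + cntl_w
    log_denom = log_choose(ip_total + cntl_total, n_window)
    upper = min(n_window, ip_total)
    terms = [
        log_choose(ip_total, k) + log_choose(cntl_total, n_window - k) - log_denom
        for k in range(ip_w, upper + 1)
    ]
    # Log-sum-exp keeps the sum stable even when each term underflows.
    m = max(terms)
    return (m + log(sum(exp(t - m) for t in terms))) / log(10)


# Illustrative counts only: a strongly enriched window.
print(log10_fisher_upper_tail(5_000, 20_000_000, 10, 20_000_000))
```

The result is a log10 p-value, so very strong enrichments show up as large negative numbers instead of collapsing to 0.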

Thanks!
Old 03-11-2012, 02:39 PM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 978

Do you have replicates or any other means to assess sample-to-sample variability? Then, you could use DESeq. (The real reason why Fisher's test does not work is that it implicitly assumes biological and extra-Poisson technical variation to be zero.)
Tags
enrichment, hiseq, rnaseq, rpkm

All times are GMT -8.