SEQanswers

11-02-2010, 12:05 AM   #1
liuxq
Member
Location: Beijing, China
Join Date: Jun 2010
Posts: 36

How to compute RPKM?

Everyone knows the formula for computing RPKM: RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the transcript, L is the length of the transcript in bases, and N is the total number of reads in the sample.

However, in my RNA-seq analysis pipeline, I have three candidates for N:

1. the total number of reads
2. the number of reads that can be mapped to the reference genome
3. the number of mapped reads that remain after filtering with a repeat mask

Which of these should I use as N when computing RPKM? I find that the three choices give very different results.
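
To make the effect of the three candidate denominators concrete, here is a minimal Python sketch of the formula above. The read counts and the transcript length are invented purely for illustration and do not come from any real sample.

```python
def rpkm(count, length_bp, total_reads):
    """Reads per kilobase of transcript per million reads: 10^9 * C / (N * L)."""
    if length_bp <= 0 or total_reads <= 0:
        raise ValueError("length_bp and total_reads must be positive")
    return 1e9 * count / (total_reads * length_bp)


# Hypothetical numbers, chosen only to illustrate the arithmetic.
C = 500        # reads assigned to one transcript
L = 2_000      # transcript length in bases
candidate_N = {
    "1. all sequenced reads": 20_000_000,
    "2. reads mapped to the genome": 16_000_000,
    "3. mapped reads left after repeat filtering": 12_500_000,
}

rpkm_by_choice = {label: rpkm(C, L, n) for label, n in candidate_N.items()}
for label, value in rpkm_by_choice.items():
    print(f"{label}: RPKM = {value:.3f}")
```

Within a single sample, switching N rescales every transcript by the same constant factor, so the ranking of transcripts does not change. The choice starts to matter when samples are compared, which is why the same definition of N should be used for all of them.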

Thanks very much.

11-02-2010, 06:00 AM   #2
RockChalkJayhawk
Senior Member
Location: Rochester, MN
Join Date: Mar 2009
Posts: 191

If all your experiments use repeat mask, then use option 3. Just make sure to clearly point out this definition when you report FPKM.

11-02-2010, 06:31 AM   #3
liuxq
Member
Location: Beijing, China
Join Date: Jun 2010
Posts: 36

Why is option 3 more reasonable?

11-02-2010, 06:34 AM   #4
RockChalkJayhawk
Senior Member
Location: Rochester, MN
Join Date: Mar 2009
Posts: 191

Option 3 in this scenario represents the last step in processing - or the final number of mapped reads that you will use in your analysis. It is not as informative to use any N other than what passes through your quality control steps.

12-23-2010, 04:17 PM   #5
sameet
Member
Location: Earth
Join Date: Apr 2010
Posts: 34

Hi,
I am a bit confused. What should I use for N: the total number of reads that mapped, or the number of uniquely mapped reads? I cannot afford to discard the repeated reads because they contain some important data.
__________________
Sameet Mehta (Ph.D.),
Visiting Fellow,
National Cancer Institute,
Bethesda,
US.

12-23-2010, 04:39 PM   #6
RockChalkJayhawk
Senior Member
Location: Rochester, MN
Join Date: Mar 2009
Posts: 191

In that case I would use 2, but make sure you clearly state that you haven't removed reads from repeat regions.

12-24-2010, 03:04 AM   #7
sameet
Member
Location: Earth
Join Date: Apr 2010
Posts: 34

Hi,
I was thinking along the same lines. But I want to know how to handle situations where the same read maps to multiple locations, because this happens at a pretty high rate in my samples.

12-24-2010, 04:36 AM   #8
severin
Genome Informatics Facility
Location: Iowa @isugif
Join Date: Sep 2009
Posts: 105

Hi Sameet,
As far as I have seen, there really is no clear rule on what to do with reads that map to multiple locations, which is why many scientists use only uniquely mappable reads for each gene. In the RNA-Seq Atlas for Glycine max, I used the uniquely mappable reads for each gene, then used a total mappable count (N) that includes the multiple alignments. Now, of course, there are programs (Cufflinks or ERANGE) that try to account for multiple mappings, but that doesn't help you decide whether to include them in the first place.
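
The two counting strategies mentioned here can be sketched in Python. This is an illustration only: the read-to-gene assignments are made up, and the 1/n fractional weighting shown for multi-mapped reads is one simple heuristic, not the method implemented by Cufflinks or ERANGE.

```python
from collections import defaultdict

# Hypothetical alignments: each read maps to one or more genes.
alignments = {
    "read1": ["geneA"],
    "read2": ["geneA"],
    "read3": ["geneA", "geneB"],            # multi-mapped read
    "read4": ["geneB"],
    "read5": ["geneB", "geneC", "geneD"],   # multi-mapped read
}


def unique_counts(alignments):
    """Count only reads that map to exactly one gene."""
    counts = defaultdict(float)
    for genes in alignments.values():
        if len(genes) == 1:
            counts[genes[0]] += 1
    return dict(counts)


def fractional_counts(alignments):
    """Give each of a read's n hits a weight of 1/n, so every read sums to 1."""
    counts = defaultdict(float)
    for genes in alignments.values():
        for gene in genes:
            counts[gene] += 1 / len(genes)
    return dict(counts)


uniq = unique_counts(alignments)
frac = fractional_counts(alignments)
n_mapped = len(alignments)            # N including multi-mapped reads
n_unique = int(sum(uniq.values()))    # N restricted to uniquely mapped reads

print("unique:", uniq)
print("fractional:", frac)
print("N (all mapped) =", n_mapped, "| N (unique only) =", n_unique)
```

Genes that receive only multi-mapped reads (geneC and geneD in this toy input) get no count at all under the unique-only strategy, which is the kind of bias worth stating when the method is reported.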

As people above have mentioned, reporting the methodology is very important. Soybean has had two whole-genome duplications, so it has lots of similar genes. Even so, in the Atlas paper, using only the uniquely mappable reads on a non-replicated sample still provided plenty of interesting data that fit what we would expect from a soybean (for example, genes involved in seed filling were still highly expressed during seed filling).

No one method is going to be better than another in every case. It really depends on what you are looking at. Just be aware of the potential biases and include those in your interpretation.

Last edited by severin; 12-24-2010 at 04:39 AM.

12-25-2010, 07:58 AM   #9
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 958

See Robinson and Oshlack's paper (Genome Biol 2010, 11:R25) for some thoughts on why none of the three 'N' values may be a good option, at least if you want to detect differential expression.
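
The core argument of that paper can be illustrated with a toy example: when one gene takes up a large share of the reads in one sample, dividing by any total-read N makes every other gene look down-regulated even though its true abundance is unchanged. The gene names and counts below are invented, and the sketch deliberately does not implement the paper's TMM normalization.

```python
def per_million(counts):
    """Scale raw counts by the sample's total read count (reads per million)."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}


# Hypothetical samples, both sequenced to 1,000,000 reads.
# Assumed true abundances: g1..g3 are 3 units each in both samples,
# while gX goes from 1 unit in sample A to 11 units in sample B.
sample_a = {"g1": 300_000, "g2": 300_000, "g3": 300_000, "gX": 100_000}
# In sample B the induced gX takes reads away from g1..g3.
sample_b = {"g1": 150_000, "g2": 150_000, "g3": 150_000, "gX": 550_000}

rpm_a = per_million(sample_a)
rpm_b = per_million(sample_b)

ratios = {gene: rpm_b[gene] / rpm_a[gene] for gene in sample_a}
for gene, ratio in ratios.items():
    print(f"{gene}: B/A ratio after total-count scaling = {ratio:.2f}")
```

Under the stated assumptions, g1 to g3 come out at a ratio of 0.5 and gX at 5.5, although g1 to g3 are unchanged and gX is up 11-fold. Transcript length cancels in a per-gene ratio between samples, so RPKM shows the same distortion.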

02-02-2011, 05:35 AM   #10
john23
Junior Member
Location: Finland
Join Date: Feb 2011
Posts: 1

RPKM/FPKM is a better option than raw read counts because it normalizes for the total number of reads obtained from each sample. In general, RNA samples are sequenced using different amounts of RNA, which gives very different numbers of reads (a larger quantity of RNA gives a larger number of reads).
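
The depth correction described here can be checked with a short Python sketch. The counts are hypothetical, and the example assumes that sequencing the same library twice as deeply simply doubles every gene's read count.

```python
def rpkm(count, length_bp, total_reads):
    """10^9 * C / (N * L), with L in bases and N the chosen total read count."""
    return 1e9 * count / (total_reads * length_bp)


lengths = {"geneA": 1_500, "geneB": 3_000}       # hypothetical lengths in bases
shallow = {"geneA": 300, "geneB": 1_200}         # hypothetical raw counts
deep = {g: 2 * c for g, c in shallow.items()}    # same library, twice the depth

n_shallow = 10_000_000
n_deep = 2 * n_shallow

rpkm_shallow = {g: rpkm(shallow[g], lengths[g], n_shallow) for g in lengths}
rpkm_deep = {g: rpkm(deep[g], lengths[g], n_deep) for g in lengths}

for gene in lengths:
    print(
        f"{gene}: raw {shallow[gene]} vs {deep[gene]}, "
        f"RPKM {rpkm_shallow[gene]:.2f} vs {rpkm_deep[gene]:.2f}"
    )
```

The raw counts differ by a factor of two between the runs while the RPKM values agree. This holds only when the extra reads are spread proportionally over all genes, which is the assumption built into the example.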