09-23-2008, 05:26 AM   #4
Junior Member
Location: Poznan, Poland

Join Date: Jul 2008
Posts: 6


What I do not like about RPKM is that I do not think it can handle differences in coverage between conditions. Perhaps I can explain it better with an example.

Let's say that we are working with a genome with three genes, A, B and C, all of the same length (not very realistic, I know, but it is just an example), and we want to study their expression in two conditions, 1 and 2.

The real expression of the genes is:
          Condition 1   Condition 2
Gene A    1             1
Gene B    1             1
Gene C    1             0

We run an RNA-seq experiment and get the following numbers of reads:
          Condition 1   Condition 2
Gene A    3*10^5        4.5*10^5
Gene B    3*10^5        4.5*10^5
Gene C    3*10^5        0
Total     9*10^5        9*10^5

Translated to RPKM, since all the genes have the same length, it should be something like:
          Condition 1   Condition 2
Gene A    333333        5*10^5
Gene B    333333        5*10^5
Gene C    333333        0

As you can see, genes A and B now also appear to be differentially expressed, although their real expression is the same in both conditions. This happens because gene C is not expressed in condition 2, so the reads that would have come from it are spread over the other genes and increase their coverage.
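
The arithmetic of the example above can be checked with a small Python sketch. It is only an illustration: the gene length of 1 kb is my assumption (the example only says the genes have equal length, and 1 kb is the length that reproduces the numbers in the table), and the names `rpkm`, `reads` and `GENE_LENGTH_BP` are made up for this sketch.

```python
# Read counts from the example: (condition 1, condition 2) per gene.
reads = {
    "A": (3.0e5, 4.5e5),
    "B": (3.0e5, 4.5e5),
    "C": (3.0e5, 0.0),
}

# Assumption: every gene is 1 kb long (the post only says "same length").
GENE_LENGTH_BP = 1000


def rpkm(count, length_bp, total_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (length_bp * total_reads)


# Total mapped reads in each condition (9*10^5 in both).
totals = [sum(counts[i] for counts in reads.values()) for i in range(2)]

# RPKM per gene and condition.
table = {
    gene: tuple(rpkm(c, GENE_LENGTH_BP, totals[i]) for i, c in enumerate(counts))
    for gene, counts in reads.items()
}

for gene, (cond1, cond2) in table.items():
    print(f"Gene {gene}: {cond1:.0f}  {cond2:.0f}")

# Gene A has the same real expression in both conditions,
# yet its RPKM ratio between conditions is not 1.
apparent_fold_change_A = table["A"][1] / table["A"][0]
print(f"Apparent fold change of gene A: {apparent_fold_change_A:.2f}")
```

With these inputs, gene A gets an apparent fold change of 1.5 between the two conditions, only because the total number of reads is the same while gene C contributes nothing in condition 2.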

Anyway, I think it is always a good idea to normalize the data in some way, especially when you are working with such a low number of replicates.

Last edited by Chema; 09-23-2008 at 05:30 AM. Reason: change format