Old 04-03-2010, 01:55 AM   #12
Simon Anders
Location: Heidelberg, Germany

If you want to test for differential expression, it is a good idea to stay on the level of raw, integer counts and not to use RPKM or related measures that are normalized by transcript length. This is because significance depends on the number of reads that you actually count: if you have low counts, you need to see a high fold change to call significance.
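
To make the point about counts concrete, here is a minimal Python sketch. It is not what DESeq does: it assumes pure Poisson noise (no biological variation) and two libraries of equal depth, so that under the null hypothesis the count in the first sample, given the sum of both, is binomial with p = 0.5. The function name and the example counts are made up for illustration.

```python
from math import comb

def two_sided_binomial_p(k1, k2):
    """Exact two-sided p-value for counts k1 vs k2 of one gene.

    Assumes equal sequencing depth and Poisson noise only, so that
    k1 given n = k1 + k2 follows Binomial(n, 0.5) under the null.
    """
    n = k1 + k2
    lo = min(k1, k2)
    # Probability mass of the lower tail, as an exact integer sum
    tail = sum(comb(n, i) for i in range(lo + 1))
    # The distribution is symmetric, so double the tail and cap at 1
    return min(1.0, 2 * tail / 2**n)

# Same two-fold change, very different evidence
p_low = two_sided_binomial_p(5, 10)     # few reads
p_high = two_sided_binomial_p(50, 100)  # ten times as many reads

print(f"5 vs 10 reads:   p = {p_low:.3g}")
print(f"50 vs 100 reads: p = {p_high:.3g}")
```

With 5 versus 10 reads the p-value is about 0.30, while the same two-fold change at 50 versus 100 reads is clearly significant. A real analysis also has to account for variation between replicates, which this toy test ignores.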

See this thread for more details: http://seqanswers.com/forums/showthread.php?t=4349 (especially from post #6 onwards)

If you work with count data, your testing procedure needs to be aware of the ratios of sequencing depths of the libraries. This functionality is offered by several tools, namely edgeR, DESeq, and cuffdiff. I recommend DESeq, of course, as it is our work. ;-)
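
As a rough illustration of what "being aware of the ratios of sequencing depths" can mean, here is a Python sketch of a median-of-ratios estimate in the spirit of the size factors described in the DESeq paper. It is a simplified re-implementation for illustration, not the package's code; the function name and the toy count table are assumptions.

```python
from math import exp, log
from statistics import median

def size_factors(counts):
    """Estimate one size factor per library.

    counts: list of genes, each a list of raw counts (one per library).
    Each library's factor is the median, over genes, of the ratio of its
    count to the gene's geometric mean across all libraries.
    """
    n_libs = len(counts[0])
    # Genes with a zero count have no usable geometric mean; skip them
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(log(c) for c in row) / n_libs for row in usable]
    factors = []
    for j in range(n_libs):
        log_ratios = [log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(exp(median(log_ratios)))
    return factors

# Toy data: library 2 was sequenced twice as deeply as library 1
counts = [
    [10, 20],
    [100, 200],
    [55, 110],
    [7, 14],
    [30, 300],  # one gene that really changes; the median is robust to it
]
sf = size_factors(counts)
print(sf, sf[1] / sf[0])
```

The ratio of the two factors recovers the two-fold difference in depth even though one gene is strongly differentially expressed, which is the reason for taking a median rather than simply comparing total read numbers.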

Note that this does not alleviate the bias towards longer genes: if two genes have the same expression (the same number of transcript molecules per volume) in two samples and hence the same fold change, the longer one may be called significant and the shorter one not, because the longer one produces more fragments and hence more counts.
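
A small Python sketch of this length effect, with made-up numbers: two genes with the same RPKM in each sample and the same fold change, but a ten-fold difference in length. The gene names, lengths, counts and library size are all hypothetical, and both libraries are assumed to have the same size.

```python
def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length_bp * library_size)

library_size = 10_000_000  # hypothetical, same for both samples

# name: (length in bp, count in sample A, count in sample B)
genes = {
    "short_gene": (1_000, 20, 40),
    "long_gene": (10_000, 200, 400),
}

summary = {}
for name, (length, count_a, count_b) in genes.items():
    summary[name] = {
        "rpkm_a": rpkm(count_a, length, library_size),
        "rpkm_b": rpkm(count_b, length, library_size),
        "fold_change": count_b / count_a,
        "total_reads": count_a + count_b,
    }
    print(name, summary[name])
```

Both genes have identical RPKM values and a fold change of 2, yet the long gene contributes ten times as many reads to the test, so a count-based test has far more power for it.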

If this bothers you, you have a couple of options:
- use Tag-Seq instead of RNA-Seq
- additionally filter with a rather large threshold on log fold change
- for GO enrichment test and the like, use the test by Young et al. (2010), which takes the length bias into account: http://genomebiology.com/2010/11/2/R14
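
The second option above could look roughly like this in Python. This is only a sketch: the record layout, the field names `padj` and `log2_fold_change`, and both thresholds are assumptions chosen for illustration, not recommended values.

```python
def filter_hits(results, padj_cutoff=0.1, min_abs_log2fc=1.0):
    """Keep genes that are significant AND show a large fold change.

    results: list of dicts with (hypothetical) keys 'gene', 'padj'
    and 'log2_fold_change'.
    """
    return [
        r for r in results
        if r["padj"] < padj_cutoff and abs(r["log2_fold_change"]) >= min_abs_log2fc
    ]

# Made-up test results for three genes
results = [
    {"gene": "long_gene", "padj": 0.001, "log2_fold_change": 0.3},
    {"gene": "short_gene", "padj": 0.02, "log2_fold_change": 2.1},
    {"gene": "other_gene", "padj": 0.60, "log2_fold_change": -3.0},
]

hits = filter_hits(results)
print([r["gene"] for r in hits])
```

Here the long gene is significant but dropped because its fold change is small, which is the kind of call the extra threshold is meant to remove.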

Here is a figure that shows how the log fold change required for significance (red dots: genes with significant DE; black dots: other genes) depends on the counts when using DESeq for testing:



For more information, see our paper: http://dx.doi.org/10.1038/npre.2010.4282.1

Last edited by Simon Anders; 04-03-2010 at 01:57 AM.