View Single Post
Old 12-05-2012, 11:06 PM   #2
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

cuffdiff, as well as probably all other differential expression test softwares, use the false discovery rate correction described here: http://en.wikipedia.org/wiki/False_d...berg_procedure

the reasons are probably described pretty well in that Wikipedia article. if you have 100 genes between two conditions you're essentially testing those conditions 100 times (even though each test is between different genes). so the p-values from the tests have to be corrected. my opinion is it's a little weird since that means the correction is influenced by how many genes you test (so you can cheat it a little by excluding genes which you think aren't testable). whatever - statistics don't always make sense.

in your case you're not seeing an issue with the FDR correction. cuffdiff suffers from something else. just like several other tools, cuffdiff uses a parametric statistical test to test for significantly misexpressed genes. in order to do that kind of a test one usually needs a mean and a variance and possibly a number of degrees of freedom. one also needs to know something about the distribution of the metrics that are being compared. so what these tests to is model the expression distributions and then extract variance values from the models and plug them into some type of stat test (cuffdiff ends up using something similar to a t-test). last year people were thinking the poisson distribution was applicable to read-count data but now most of these tools use the negative-binomial distribution. i think in some cases a certain amount of pooled information goes into these models. obviously something is broken because emperically it's completely obvious that a change from 62 to 0.4 in expression is massive and if those were means from normal distributions and you used a t-test you'd have a p-value so small that the FDR correction would have an insignificant impact on its value.

so the problem here is likely cuffdiff's modeling and estimation of the variance for this gene.

the only thing i can recommend is for you to try a different tool to test for differential expression. unfortunately nothing is as tidy as the tophat, cuffdiff pipeline but it will better for you to try something else for this test. i guarantee that you'll be more satisfied with the results. DESeq, is a good tool. I've read that ebseq is a smart new tool as well but I haven't tried it out. both of those run in R and both of them need count data so that means you'll need read counts for your genes. HTSeq count can do a good job of counting hits at genes from your alignments assuming you us an annotation that fits. The author would recommend the ensemble annotation. all i can recommend is for you to use something that gives you a unique name per gene locus.
sdriscoll is offline   Reply With Quote