SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   RNA Sequencing (http://seqanswers.com/forums/forumdisplay.php?f=26)
-   -   cuffdiff q values (http://seqanswers.com/forums/showthread.php?t=25410)

jk1124 12-01-2012 09:06 AM

cuffdiff q values
 
I am having trouble understanding Cuffdiff q-values. I have a gene, for example, that appears to be differentially expressed but is not called so even when using either two or three replicates. In two replicates, the gene levels appear like this:

Condtion1 Rep 1 61.8278 58.4906 65.1649 OK
Condition2 Rep 1 0.416481 0.176026 0.656937 OK

Condition 1 Rep 2 52.4839 50.2009 54.7669 OK
Condition 2 Rep 2 3.07902 2.52377 3.63427 OK

but somehow, even though the p-value is significant, the q-value is almost one.

Can anyone explain to me how this q-value adjustment takes place, and what other information in my data might be causing the adjustment to be so high? This gene seems to me, along with other genes in the data, like it should have been called differentially expressed.

As a note: I ran cuffdiff on these two samples separately, not using replicates (just out of curiosity) and in the second pair, this gene is not called differentially expressed, though it is in the first. What's going on?

sdriscoll 12-05-2012 11:06 PM

cuffdiff, as well as probably all other differential expression test softwares, use the false discovery rate correction described here: http://en.wikipedia.org/wiki/False_d...berg_procedure

the reasons are probably described pretty well in that Wikipedia article. if you have 100 genes between two conditions you're essentially testing those conditions 100 times (even though each test is between different genes). so the p-values from the tests have to be corrected. my opinion is it's a little weird since that means the correction is influenced by how many genes you test (so you can cheat it a little by excluding genes which you think aren't testable). whatever - statistics don't always make sense.

in your case you're not seeing an issue with the FDR correction. cuffdiff suffers from something else. just like several other tools, cuffdiff uses a parametric statistical test to test for significantly misexpressed genes. in order to do that kind of a test one usually needs a mean and a variance and possibly a number of degrees of freedom. one also needs to know something about the distribution of the metrics that are being compared. so what these tests to is model the expression distributions and then extract variance values from the models and plug them into some type of stat test (cuffdiff ends up using something similar to a t-test). last year people were thinking the poisson distribution was applicable to read-count data but now most of these tools use the negative-binomial distribution. i think in some cases a certain amount of pooled information goes into these models. obviously something is broken because emperically it's completely obvious that a change from 62 to 0.4 in expression is massive and if those were means from normal distributions and you used a t-test you'd have a p-value so small that the FDR correction would have an insignificant impact on its value.

so the problem here is likely cuffdiff's modeling and estimation of the variance for this gene.

the only thing i can recommend is for you to try a different tool to test for differential expression. unfortunately nothing is as tidy as the tophat, cuffdiff pipeline but it will better for you to try something else for this test. i guarantee that you'll be more satisfied with the results. DESeq, is a good tool. I've read that ebseq is a smart new tool as well but I haven't tried it out. both of those run in R and both of them need count data so that means you'll need read counts for your genes. HTSeq count can do a good job of counting hits at genes from your alignments assuming you us an annotation that fits. The author would recommend the ensemble annotation. all i can recommend is for you to use something that gives you a unique name per gene locus.

jk1124 12-06-2012 06:52 AM

Thanks for your helpful response. I had looked into DESeq, but it also didn't return what I expected (and also doesn't handle multi-mapping reads, which I have a lot of). I have been thinking our data may be more variable than we can work with, so I'm looking into mask files and this sort of thing, but I'm wondering - is there a standard metric(s) for reporting variability? I'm pretty new at this, and wondering, generally, how I might decide whether or not my samples are usable for this kind of analysis.

oliviera 01-04-2013 05:36 AM

Dear community, dear sdriscoll,
I have a question which could be a related with cuffdiff.
By comparing the output from isoform_exp.diff and gene_exp.diff it looks like their is less trasnscripts which are differentially expressed than genes. This sound like counter intuitive to me as I would expect at least the same number in both cases. Could you explain that?

sdriscoll 01-04-2013 08:38 AM

I'm pretty sure in one of their papers they point out the possibility that there may be differential isoform expression but not differential locus expression. It is possible that there could be isoform switching or activation in a locus with significantly disturbing the total read count level of a locus. I would expect, however, that a differentially expressed locus would also have differentially expressed isoforms in it. Since cuffdiff models the variances separately for the isoform test and the locus level test sometimes it might report a single isoform gene as significant in one test but not the other. That I think is an unfortunate side effect of the statistics used and the variance estimations. The p-values are just a guide. Every gene may actually have different tolerances to differential expression. For some genes a two fold change could be significant regardless of the statistical test. It's important to keep that in mind.


All times are GMT -8. The time now is 02:00 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.