SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
Different fpkm values for cuffdiff and cuffcompare madsaan Bioinformatics 3 12-12-2012 04:14 PM
Different FPKM values of cufflinks and cuffdiff mrfox Bioinformatics 5 10-17-2012 01:10 PM
Cufflinks and cuffdiff FPKM values combiochem Bioinformatics 12 10-13-2012 11:37 PM
Galaxy Cuffdiff Values ccard28 Bioinformatics 1 09-05-2012 06:52 AM
Cuffdiff producing q-values > 1 ocs Bioinformatics 7 08-15-2012 10:26 AM

Reply
 
Thread Tools
Old 12-01-2012, 09:06 AM   #1
jk1124
Member
 
Location: U.S.

Join Date: Oct 2012
Posts: 17
Default cuffdiff q values

I am having trouble understanding Cuffdiff q-values. I have a gene, for example, that appears to be differentially expressed but is not called so even when using either two or three replicates. In two replicates, the gene levels appear like this:

Condtion1 Rep 1 61.8278 58.4906 65.1649 OK
Condition2 Rep 1 0.416481 0.176026 0.656937 OK

Condition 1 Rep 2 52.4839 50.2009 54.7669 OK
Condition 2 Rep 2 3.07902 2.52377 3.63427 OK

but somehow, even though the p-value is significant, the q-value is almost one.

Can anyone explain to me how this q-value adjustment takes place, and what other information in my data might be causing the adjustment to be so high? This gene seems to me, along with other genes in the data, like it should have been called differentially expressed.

As a note: I ran cuffdiff on these two samples separately, not using replicates (just out of curiosity) and in the second pair, this gene is not called differentially expressed, though it is in the first. What's going on?
jk1124 is offline   Reply With Quote
Old 12-05-2012, 11:06 PM   #2
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

cuffdiff, as well as probably all other differential expression test softwares, use the false discovery rate correction described here: http://en.wikipedia.org/wiki/False_d...berg_procedure

the reasons are probably described pretty well in that Wikipedia article. if you have 100 genes between two conditions you're essentially testing those conditions 100 times (even though each test is between different genes). so the p-values from the tests have to be corrected. my opinion is it's a little weird since that means the correction is influenced by how many genes you test (so you can cheat it a little by excluding genes which you think aren't testable). whatever - statistics don't always make sense.

in your case you're not seeing an issue with the FDR correction. cuffdiff suffers from something else. just like several other tools, cuffdiff uses a parametric statistical test to test for significantly misexpressed genes. in order to do that kind of a test one usually needs a mean and a variance and possibly a number of degrees of freedom. one also needs to know something about the distribution of the metrics that are being compared. so what these tests to is model the expression distributions and then extract variance values from the models and plug them into some type of stat test (cuffdiff ends up using something similar to a t-test). last year people were thinking the poisson distribution was applicable to read-count data but now most of these tools use the negative-binomial distribution. i think in some cases a certain amount of pooled information goes into these models. obviously something is broken because emperically it's completely obvious that a change from 62 to 0.4 in expression is massive and if those were means from normal distributions and you used a t-test you'd have a p-value so small that the FDR correction would have an insignificant impact on its value.

so the problem here is likely cuffdiff's modeling and estimation of the variance for this gene.

the only thing i can recommend is for you to try a different tool to test for differential expression. unfortunately nothing is as tidy as the tophat, cuffdiff pipeline but it will better for you to try something else for this test. i guarantee that you'll be more satisfied with the results. DESeq, is a good tool. I've read that ebseq is a smart new tool as well but I haven't tried it out. both of those run in R and both of them need count data so that means you'll need read counts for your genes. HTSeq count can do a good job of counting hits at genes from your alignments assuming you us an annotation that fits. The author would recommend the ensemble annotation. all i can recommend is for you to use something that gives you a unique name per gene locus.
sdriscoll is offline   Reply With Quote
Old 12-06-2012, 06:52 AM   #3
jk1124
Member
 
Location: U.S.

Join Date: Oct 2012
Posts: 17
Default

Thanks for your helpful response. I had looked into DESeq, but it also didn't return what I expected (and also doesn't handle multi-mapping reads, which I have a lot of). I have been thinking our data may be more variable than we can work with, so I'm looking into mask files and this sort of thing, but I'm wondering - is there a standard metric(s) for reporting variability? I'm pretty new at this, and wondering, generally, how I might decide whether or not my samples are usable for this kind of analysis.
jk1124 is offline   Reply With Quote
Old 01-04-2013, 05:36 AM   #4
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Default

Dear community, dear sdriscoll,
I have a question which could be a related with cuffdiff.
By comparing the output from isoform_exp.diff and gene_exp.diff it looks like their is less trasnscripts which are differentially expressed than genes. This sound like counter intuitive to me as I would expect at least the same number in both cases. Could you explain that?
oliviera is offline   Reply With Quote
Old 01-04-2013, 08:38 AM   #5
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

I'm pretty sure in one of their papers they point out the possibility that there may be differential isoform expression but not differential locus expression. It is possible that there could be isoform switching or activation in a locus with significantly disturbing the total read count level of a locus. I would expect, however, that a differentially expressed locus would also have differentially expressed isoforms in it. Since cuffdiff models the variances separately for the isoform test and the locus level test sometimes it might report a single isoform gene as significant in one test but not the other. That I think is an unfortunate side effect of the statistics used and the variance estimations. The p-values are just a guide. Every gene may actually have different tolerances to differential expression. For some genes a two fold change could be significant regardless of the statistical test. It's important to keep that in mind.
sdriscoll is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:09 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO