
jparsons 12-31-2012 07:15 AM

Checking Cuffdiff
I am using an interesting dataset to "test" differential isoform expression programs.

Unfortunately, I am not an expert in every (any?) program, so I could use some sanity checking.

I have three separate tissues: A, B, and C. I want to use (in this case) cuffdiff to identify isoforms that are uniquely expressed in A, B, or C, as I can use other "ground truth" runs to verify these claims.

I ran the program as follows, alternating A, B, and C:

cuffdiff -p 8 -c 10 <ucsc.gtf> A1,A2,A3 B1,B2,B3,C1,C2,C3 -o outdir
I'm not using a cufflinks-derived gtf or (exclusively) tophat-mapped reads. I imagine I'm doing it all wrong. I have two main questions:

1) Can I get away with not using the entire cufflinks pathway here? (If not, why doesn't the program complain?)
2) Am I properly comparing the 3 tissues? Does A vs. B,C return transcripts DE in only A, as I intend it to?
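
For reference, the three one-versus-rest runs described above can be written out programmatically. This is only a sketch: the replicate file names (A1.bam and so on), the GTF path, and the output directory names are hypothetical placeholders, and only the options already shown in the command above (-p, -c, -o) are used. Options are placed before the GTF here, following the usage line in the Cufflinks manual. The commands are built and printed, not run.

```python
# Sketch: build one cuffdiff command per tissue, each comparing that
# tissue's replicates against the pooled replicates of the other two.
# All file names below are hypothetical placeholders.

replicates = {
    "A": ["A1.bam", "A2.bam", "A3.bam"],
    "B": ["B1.bam", "B2.bam", "B3.bam"],
    "C": ["C1.bam", "C2.bam", "C3.bam"],
}


def one_vs_rest_commands(replicates, gtf="ucsc.gtf", threads=8, min_count=10):
    """Return {tissue: argv list} for each one-vs-rest comparison."""
    commands = {}
    for tissue in sorted(replicates):
        target = replicates[tissue]
        rest = [f for other in sorted(replicates) if other != tissue
                for f in replicates[other]]
        commands[tissue] = [
            "cuffdiff",
            "-p", str(threads),
            "-c", str(min_count),
            "-o", f"outdir_{tissue}_vs_rest",
            gtf,
            ",".join(target),  # condition 1: the tissue of interest
            ",".join(rest),    # condition 2: the other two tissues pooled
        ]
    return commands


commands = one_vs_rest_commands(replicates)
for tissue, argv in commands.items():
    print(" ".join(argv))
```

Each printed line is one comparison. Note that the second condition treats the two remaining tissues as a single group of six replicates.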

rboettcher 01-09-2013 05:27 AM

Hello jparsons,

I used cufflinks and cuffdiff with GSNAP alignments and it worked fine, so you do not necessarily need to stick to TopHat as long as the SAM/BAM files have all required columns.
However, I used the cufflinks -> cuffmerge -> cuffdiff variant to check my genes, since that route was suggested by the authors (although it was not very successful for me).

After following some discussions in this forum, I concluded that cufflinks/cuffdiff have a problem in their correction for variance. For my analysis, the bigger my sample groups were, the fewer genes were found significantly DE, until none were left. Therefore I assume that pooling groups B and C will result in a similar problem due to high variance between the two groups.
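
To illustrate the pooling concern with made-up numbers: the sketch below does not reproduce cuffdiff's own variance model, it only shows that treating two tissues with different expression levels as one group inflates the within-group sample variance. The expression values are hypothetical.

```python
from statistics import mean, variance

# Hypothetical expression values for one gene, three replicates per tissue.
group_b = [10.0, 11.0, 12.0]
group_c = [50.0, 52.0, 51.0]
pooled = group_b + group_c  # B and C treated as a single condition

var_b = variance(group_b)        # sample variance within B
var_c = variance(group_c)        # sample variance within C
var_pooled = variance(pooled)    # sample variance of the pooled group

print(f"B:      mean={mean(group_b):.1f}  variance={var_b:.1f}")
print(f"C:      mean={mean(group_c):.1f}  variance={var_c:.1f}")
print(f"pooled: mean={mean(pooled):.1f}  variance={var_pooled:.1f}")
```

With these numbers each tissue is tight on its own, but the pooled group has a variance several hundred times larger, which would make any difference from A harder to call significant.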

Besides that, your command looks fine, so please keep us posted on your progress.

jparsons 01-09-2013 12:23 PM


Thanks for the response. I eventually compared the output from tophat->cufflinks->cuffmerge->cuffdiff to that from only cuffdiff and found that they were (mostly) identical. I am content using cuffdiff without going through the entire pipeline.

I got results for cuffdiff and finally managed to get RSEM to like me for long enough to spit out quantitations. When compared to the "truth" set (sadly only available on the gene level for now), the RSEM/cuffdiff lists are 'decent' individually, coming close to the expected ratio on average, but having numerous outliers. Taking the overlap set of genes called by both RSEM and cuffdiff makes for a much cleaner picture, with far less deviation from the ratio, and fewer false positives.
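
The overlap comparison can be sketched with plain sets. The gene IDs and the "truth" set below are invented for illustration, and precision is just one possible way to summarize false positives; it is not necessarily the metric used for the results above.

```python
# Hypothetical gene-level calls from each tool (gene IDs are made up).
cuffdiff_calls = {"geneA", "geneB", "geneC", "geneD"}
rsem_calls = {"geneB", "geneC", "geneE"}
truth = {"geneB", "geneC", "geneD", "geneF"}  # stand-in for the "truth" set


def precision(calls, truth):
    """Fraction of called genes that are in the truth set."""
    return len(calls & truth) / len(calls) if calls else 0.0


overlap = cuffdiff_calls & rsem_calls  # genes called by both tools

for name, calls in [("cuffdiff", cuffdiff_calls),
                    ("RSEM", rsem_calls),
                    ("overlap", overlap)]:
    print(f"{name}: {len(calls)} calls, precision {precision(calls, truth):.2f}")
```

In this toy example the overlap set is smaller than either list but contains no false positives, which is the pattern described above.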

I'm still working on making metrics that make sense, so 'decent' and 'cleaner' are the best I can offer for now. I imagine I will develop permissive and restrictive "true positive" lists at each ratio and then generate ROCs for each algorithm I can successfully test.
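
A minimal sketch of how such ROC curves might be computed from a ranked list and a true-positive list, using only the standard library. The scores, gene IDs, and both truth lists are hypothetical; the sketch assumes a higher score means a more confident DE call (for example 1 minus a q-value) and does not group tied scores.

```python
def roc_points(scores, positives):
    """Return (FPR, TPR) points, sweeping a threshold from the top score down.

    scores: dict of gene -> score, higher meaning "more confidently DE".
    positives: set of genes treated as true positives.
    """
    n_pos = sum(1 for g in scores if g in positives)
    n_neg = len(scores) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for gene in sorted(scores, key=scores.get, reverse=True):
        if gene in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for five genes and two candidate truth lists.
scores = {"g1": 0.99, "g2": 0.95, "g3": 0.80, "g4": 0.40, "g5": 0.10}
restrictive = {"g1", "g2"}
permissive = {"g1", "g2", "g4"}

for name, truth in [("restrictive", restrictive), ("permissive", permissive)]:
    print(name, round(auc(roc_points(scores, truth)), 3))
```

Running the same function once per algorithm and per truth list gives curves that can be compared directly.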

I'm currently worried about algorithms making calls for downregulated genes or calling them as differentially expressed in cases where the assumption that "A>>B+C or A<<B+C" doesn't hold. I don't know how to handle that yet, and it may be the source of the outliers I mentioned before.
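
One possible first step is simply to flag the genes where the assumption fails, so they can be examined separately from the rest. The sketch below is an assumption-laden illustration: it uses hypothetical per-tissue mean expression values and a plain ordering test, with no fold-change threshold or statistics.

```python
def classify_direction(a, b, c):
    """Classify tissue A's mean expression relative to tissues B and C."""
    if a > max(b, c):
        return "A_high"       # A above both B and C
    if a < min(b, c):
        return "A_low"        # A below both B and C
    return "intermediate"     # A between (or tied with) B and C


# Hypothetical per-tissue mean expression values: (A, B, C).
genes = {
    "g1": (100.0, 5.0, 8.0),
    "g2": (2.0, 40.0, 60.0),
    "g3": (30.0, 5.0, 90.0),
}

flags = {gene: classify_direction(*values) for gene, values in genes.items()}
for gene, flag in flags.items():
    print(gene, flag)
```

Genes flagged "intermediate" are the ones where A sits between B and C, so a call against the pooled B+C group is hard to interpret.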

Overall, I am actually impressed with cuffdiff's performance, given how much grief it gets here. Neither algorithm is even remotely perfect, and neither is obviously superior.
