![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
cuffdiff for differential gene testing | Paul Walker | RNA Sequencing | 4 | 01-26-2015 09:04 AM |
Does Cuffdiff normalize FPKMS before it does differential expression? | greener | Bioinformatics | 0 | 02-25-2011 03:37 PM |
Cufflinks/Cuffdiff significant differential expression | memo | Bioinformatics | 5 | 01-25-2011 10:49 AM |
Cuffdiff differential expression two groups | jamminbeh | Bioinformatics | 0 | 12-09-2010 03:13 PM |
Testing Cufflinks | zezziane | RNA Sequencing | 0 | 06-15-2010 02:24 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Singapore Join Date: Oct 2009
Posts: 44
|
![]()
Hi all,
I have been trying out the new 1.3 version of the cufflinks package to see if the testing for differential splicing/expression etc is improved. Now I expected a more stringent analysis, as announced, but I get mostly failed tests. I used the same reads.bam and annotation and the same parameters (5 biological replicates and 2 groups, 2x 100 bp PE reads, ~ 140mil per sample). This is my result with cuffdiff v1.0.3 for diff. promoter usage: 191 FAIL 13713 NOTEST 4745 OK with 1403 Genes found with sign. diff. promoter usage. Now this is the cuffdiff v 1.3 result: 5719 FAIL 35 LOWDATA 12301 NOTEST 594 OK with 20 Genes found with sign. diff. promoter usage. It seems that the number of false positives is really reduced. The new version find only ~ 3.5% of genes tested to be differential. However, so many fail to be tested in the first place that the result is not really usable. Did anyone experience something similar? Is this failing of testing maybe related to my large dataset or the replicates? All the best, Sebastian |
![]() |
![]() |
![]() |
#2 |
Member
Location: Sheffield, UK Join Date: Dec 2011
Posts: 32
|
![]()
I am also having the same problem in both splicing.diff and promotor.diff.
I have 5 samples in each of two groups, but only 25M reads per sample (100bp paired end). Promoters Cufflinks 1.0.3 OK 11,309 NOTEST 8,637 FAIL 648 with 7,902 significant Cuffdiff 1.3 OK 1,383 NOTEST 10,243 FAIL 8,948 2 significant Splicing Cufflinks 1.0.3 OK 13,996 NOTEST 51,240 FAIL 2,933 8,645 genes with sig. differential splicing Cufflinks 1.3 OK 1,017 NOTEST 28,032 FAIL 37,736 LOWDATA 1,401 6 genes with sig. differential splicing. |
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Boston, MA Join Date: Nov 2008
Posts: 212
|
![]()
Just to be clear - this is happening only for the splicing.diff, promoters.diff (and maybe cds.diff?) files? Note gene_exp.diff or isoform_exp.diff?
Can one or both of you send us (to the support list) a gene's worth of reads in one of these loci that are failing? And a snippet of GTF to run it against? The new code uses a sampling-based approach to estimate a null distribution of relative isoform abundances in each condition, rather than an analytic null model based on the gradient. The upside of this approach is that it's more accurate and more conservative - the downside is that the sampling method can fail under some conditions. I haven't seen this happening in any of my datasets (and I have one that looks extremely similar to yours), so I probably can't do anything about it without a small test data set that reproduces the problem. It's possible it's something easy I overlooked and can fix quickly. |
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Charlottesville, VA Join Date: May 2011
Posts: 112
|
![]()
Just came across this. Anecdotally, a colleague here has seen something similar, switching from the old pipeline (using 1.0.3, with a workflow similar to Jeremy's Galaxy exercise - tophat, cufflinks -g, cuffcompare -R, cuffdiff -N) to the new pipeline (using v1.3 from the protocols paper - tophat -G [gtf], cufflinks (no RABT), cuffmerge -g (RABT), cuffdiff -u -b). Unfortunately I don't have example data to send. Just wondering if you guys or others were able to figure out what was happening.
Last edited by turnersd; 04-04-2012 at 05:31 AM. Reason: link to protocols paper |
![]() |
![]() |
![]() |
#5 |
Member
Location: Innsbruck Join Date: Jul 2010
Posts: 28
|
![]()
Dear all,
I also using the cufflinks package for our RNA-seq analysis and I also ran into the same problem. Upon examination of cuffdiff results we also note a striking amount of transcripts (38%) and genes (25%) with status FAIL. We found such result very hard to be justified. How we can manage to loose the 25% of examined elements? In order to get better results, reducing the FAIL number, we conducted different tests with different conditions. 1) We have three samples, each one with three biological replicas. The plot1 (see Plot1 attached) shows the number of FAIL (cuffdiff's v1.3.0) obtained with 1, 2 or 3 biological replicas. As you can see the number of elements dramatically increases with the number of replicas (FAIL in tracking file). 2) As DerSeb previously shown, I also tested the cuffdiff's behavior using different versions with all the three replicas. The plot (see Plot2 attached) clearly shows very different results. From cuffdiff 1.2.0, a remarkable worsening of the number of FAIL appears. From what I was able to see, there is no improvement with FAIL as the say with the 1.2.0 version. Essentially the number of FAILs increases with the number of biological replicas and with cufflinks versions following the v1.1.0. Do somebody find any possible solution for this issue? Could anybody provide me any explanation behind this results? Thanks Francesco |
![]() |
![]() |
![]() |
#6 |
Senior Member
Location: Boston, MA Join Date: Nov 2008
Posts: 212
|
![]()
Hi francicco,
We've fixed this issue in the upcoming release of Cufflinks 1.4.0, which is right around the corner. We've been a little swamped with the release of TopHat 2 and other items, but we're working hard to get this out because I know several groups have run into this. The explanation of what was going on is a bit complicated, but we were able to reproduce the issue on one of our test sets, and came with a nice fix for it. The newest version produces a handful of FAIL genes at most, and when we've looked at those, the genes are ones where Cuffdiff has flagged a genuine structural problem that prevents us from calling gene expression. |
![]() |
![]() |
![]() |
#7 |
Member
Location: Innsbruck Join Date: Jul 2010
Posts: 28
|
![]()
Dear Cole,
Thank you for you rapid answer! Do you know when the new version will be public available? I offer myself for testing the new version on my data, do you think would be possible? Cheers Francesco |
![]() |
![]() |
![]() |
#8 |
Junior Member
Location: New Haven, CT (USA) Join Date: Nov 2011
Posts: 2
|
![]()
Hi,
I am having a similar issue. I am running Tophat/Cufflinks pipeline. I have two groups of individuals (5 each), test and control. Two tissue samples each individual. Cuffdiff gives only one DE gene and one diff splicing. I do get about 400 with either DESeq or edgeR and about 800 hits for diff exon usage. I am running all latest versions (although the data were mapped with an older Tophat release, I think 1.0.3). Can I use the older version (1.0.3) of Cuffdiff with the files prodced by cuffmerge and cufflinks 1.3.0? Thanks |
![]() |
![]() |
![]() |
#9 |
Member
Location: Innsbruck Join Date: Jul 2010
Posts: 28
|
![]() |
![]() |
![]() |
![]() |
#10 |
Senior Member
Location: Baltimore Join Date: Mar 2012
Posts: 120
|
![]()
Hi guys,
Anyone try this with Cufflinks 2.0? Is the problem resolved? I also have approximately 40% of my genes as NOTEST right now with the old cufflinks |
![]() |
![]() |
![]() |
#11 |
Member
Location: Toronto Join Date: Jan 2008
Posts: 30
|
![]()
http://seqanswers.com/forums/showthr...t=17678&page=2
I got another issue with new CUFFLINK 2: When I directly quantify against ensembl gtf, the cufflinks returned 0 expression for most of them. This only occurred when I used replicates. single sample group is fine. And seems only when transcripts matched to known gene's annotation. #command: cufflinks-2.0.0.Linux_x86_64/cuffdiff -p 8 -L P1,P2 -c 1 -b anFam2.fa -o cuffdiff.P1.P2.ensembl canFam2.67.gtf TOPHAT2.C1.bam,TOPHAT2.C2.bam,TOPHAT2.C3.bam,TOPHAT2.C4.bam TOPHAT2.C5.bam,TOPHAT2.C6.bam,TOPHAT2.C7.bam,TOPHAT2.C8.bam,TOPHAT2.C9.bam Here are the number of genes returned FPKM 0 in cuffdiff: $8 is for treatment P1, $9 is for treatment P2 in output. awk ' $8 ==0 { i++}; END {print i " of " NR " = " i/NR*100 "%"} ' cuffdiff.P1.P2.ensembl/gene_exp.diff cufflinks-2.0.0: P1: 24649 of 24661 = 99.9513% P2: 24645 of 24661 = 99.9351% cufflinks 1.3.0 seems right: P1: 6345 of 24661 = 25.7289% P2: 6564 of 24661 = 26.6169% Now I am using edgeR and DESeq for identifying DE genes, and use cuffddiff (v1.3.0) results (pvalue, FC >=1.5) as additional evidence in filtering. But seems edgeR and DESeq only work on gene level and can not do isoform level analysis. |
![]() |
![]() |
![]() |
#12 |
Member
Location: Innsbruck Join Date: Jul 2010
Posts: 28
|
![]()
I personally do not trust cufflinks 2 results. For instance it gives 0 FPKM to transcript clearly expressed
Developers need to do something, sooner or later... |
![]() |
![]() |
![]() |
#13 |
Member
Location: Aperture Science Join Date: Mar 2012
Posts: 59
|
![]()
I have a similar issue. When I add more replicates the number of sig. genes goes down drastically. Finally after much searching I discovered that the number of FAIL in gene_exp.diff increases with more replicates. I reran everything with tophat2 and cufflinks2 and the results now are 0 sig. genes with all replicates, which it shouldn't be. When I look at the gene_exp.diff file I see that the big majority of status messages was not FAIL this time, but NOTEST.
Here's some statistics to my statement. 2+2 replicates (cufflinks 1.3.0) NOTEST 8130 OK 34495 FAIL 271 3+3 replicates (cufflinks 1.3.0) NOTEST 8271 OK 29908 FAIL 4887 4+4 replicates (cufflinks 1.3.0) NOTEST 8645 OK 25996 FAIL 8823 Notice how the status FAIL increases here with more replicates. Below is the statistics from the cufflinks2 runs with very large number of NOTEST resulting in 0 sig. genes. 4+4 replicates (cufflinks 2) NOTEST 35560 OK 9142 FAIL 9 7+8 replicates (cufflinks 2) NOTEST 38875 OK 6269 FAIL 0 7+8 replicates (cufflinks 2) but without frag-bias-correct, upper-quartile-norm and multiread-correct in the cuffdiff run NOTEST 17534 OK 27558 FAIL 52 I would very much like to know the reason to this and if I can correct it somehow. Last edited by glados; 05-30-2012 at 07:42 AM. |
![]() |
![]() |
![]() |
#14 |
Member
Location: Sheffield, UK Join Date: Dec 2011
Posts: 32
|
![]()
We eventually came to the conclusion that the original problem in Cufflinks 1.3 was being caused by excessive variance between our samples. As more samples were added, the variance was getting bigger - this is why we only saw the problems in datasets with large numbers of samples. This made biological sense for us: our samples were from different patients with each patient given a before and after treatment sample.
In cufflinks 2, the large variance no longer caused the model to fall over, but it didn't find any significant genes: presumably because the variances were so large (which can be seen in the confidence limits on the FPKM estimation). We didn't see the large number of NOTESTs though. |
![]() |
![]() |
![]() |
#15 | |
Senior Member
Location: Boston, MA Join Date: Nov 2008
Posts: 212
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#16 | |
Senior Member
Location: Boston, MA Join Date: Nov 2008
Posts: 212
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#17 | |
Member
Location: Aperture Science Join Date: Mar 2012
Posts: 59
|
![]() Quote:
I reran in cuffdiff two different cufflinks files I had with the same parameters (4+4 replicates) and --min-outlier-p 0, and the result was exactly like I got before without --min-outlier-p. In gene-exp_diff 35560 NOTESTs for one cufflinks and 11240 NOTESTs for the other. The difference between the two cufflinks groups I tested is that the first has been run with --multi-read-correct --upper-quartile-norm --frag-bias-correct both in cufflinks and cuffdiff, and the other group only in cuffdiff (frag-bias-correct in cufflinks). Both has been run with --GTF-guide. I am wondering if this has any influence on number of NOTEST. You can use these parameters in both programs but is better to use them only in one of them, if so which one? However, I still think that 11240 NOTESTs are too high and it gives me practically 0 significant genes which I'm sure is incorrect. edit: I also tested with using these parameters in cufflinks only and not in cuffdiff, and got the same results as I did when I used them in cuffdiff but not in cufflinks, i.e. 11240 NOTESTs. Additionally I wonder why the variance on every gene is huge. I have so many replicates I would expect it to become smaller but the error bars always reach the bottom (i.e. fpkm conf_lo = 0 and fpkm_conf_hi is extremely high). I am wondering if this has anything to do with me not getting any significant DE genes. Last edited by glados; 06-01-2012 at 02:25 AM. |
|
![]() |
![]() |
![]() |
#18 | |
Senior Member
Location: Boston, MA Join Date: Nov 2008
Posts: 212
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#19 | |
Member
Location: Aperture Science Join Date: Mar 2012
Posts: 59
|
![]() Quote:
What I mean is that the parameters --multi-read-correct --upper-quartile-norm and --frag-bias-correct is available for both cufflinks and cuffdiff, so I've tried using them only in cufflinks, only in cuffdiff and in both. I get much more NOTESTs in the gene_exp.diff-file from cuffdiff when I use these parameters in both cufflinks and cuffdiff (35560) than in only one of them (11240), so I wondered if that had anything to do with it, the number of NOTESTs is still high though. My cuffdiff command can be something like this Code:
cuffdiff -o output_path --labels X,Y --num-threads 12 --frag-bias-correct genome.fa --upper-quartile-norm --multi-read-correct merged.gtf X1.bam,X2.bam,X3.bam,X4.bam Y1.bam,Y2.bam,Y3.bam,Y4.bam Last edited by glados; 06-01-2012 at 04:41 AM. |
|
![]() |
![]() |
![]() |
#20 | |
Senior Member
Location: Boston, MA Join Date: Nov 2008
Posts: 212
|
![]() Quote:
Based on your comment that the variances are huge, I'm wondering if the problem is with the assembly. Cuffdiff takes into consideration both cross-replicate variability and fragment assignment uncertainty (disambiguating how many reads came from each isoform). In general, the more isoforms a gene has, the more uncertainty there will be in assigning reads to each isoform, and the more uncertainty there will be in the overall gene expression level. That means more variance, so if you have a ton of isoforms (possibly because of a bad assembly), you'll see very few differentially expressed genes. Another thing to check is whether you still see this when using a reference GTF. Have you tried that as a sanity check? |
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|