**dglemay** · 10-23-2011, 09:51 PM

Hi Jon,
As the number of Fragments increases, does the number of transcripts with non-zero FPKM also increase? This might be driving the increased FAIL rate. I noticed that with one of our samples where something had gone wrong and very few genes were expressed, the FAIL rate was extremely low compared to the other normal samples (with higher FAIL rate).
-Danielle

**rossini** · 11-01-2011, 08:07 AM

Hi

I have the same problem. Should I just ignore their FAIL status and use their FPKM for the analysis anyway?

Thanks.

**dglemay** · 11-01-2011, 08:14 AM

I would like to know what others are doing too. For now, I've been using the FPKMs regardless of Pass/Fail status, with the rationale that the non-zero FPKM, although possibly inaccurate, is more accurate than FPKM=0, which it will effectively be if those genes are ignored.
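For anyone taking this approach, it may help to keep the status flag alongside each value rather than discarding it, so flagged genes can be revisited later. A minimal sketch, assuming the tracking file is tab-delimited with a header row that names `tracking_id`, `FPKM`, and `FPKM_status` (check your own file's header; the three demo rows are made up):

```python
import csv
import io
from collections import Counter

def load_fpkm(handle):
    """Return (rows, status_counts) from a tab-delimited tracking file.

    Assumes header columns named tracking_id, FPKM and FPKM_status.
    """
    rows = []
    counts = Counter()
    for rec in csv.DictReader(handle, delimiter="\t"):
        status = rec["FPKM_status"]
        counts[status] += 1
        # Keep the value AND its flag so nothing is silently dropped.
        rows.append((rec["tracking_id"], float(rec["FPKM"]), status))
    return rows, counts

# Made-up example input, not real data.
demo = io.StringIO(
    "tracking_id\tFPKM\tFPKM_status\n"
    "geneA\t12.5\tOK\n"
    "geneB\t0\tFAIL\n"
    "geneC\t3.1\tLOWDATA\n"
)
rows, counts = load_fpkm(demo)
print(dict(counts))
```

Tallying the statuses per sample this way also makes it easy to check whether the FAIL rate tracks the number of expressed genes.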

**Jon_Keats** · 11-01-2011, 02:18 PM

No Reply and More Unsettling Results

Originally posted by dglemay

I would like to know what others are doing too. For now, I've been using the FPKMs regardless of Pass/Fail status, with the rationale that the non-zero FPKM, although possibly inaccurate, is more accurate than FPKM=0, which it will effectively be if those genes are ignored.

I'd agree that dropping all those "FAIL" genes or applying a value of 0 is a waste of time. I'm a bit uneasy using the values, but they seem to be fairly stable with read counts and read length. I must admit, though, that I'm looking at other options such as MISO, hand-calculated FPKM, or count-based methods like edgeR or DESeq/DEXSeq.
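For reference, the hand calculation is just the standard definition: fragments per kilobase of transcript per million mapped fragments. A minimal sketch, assuming you already have a fragment count per gene and the total number of mapped fragments (the function name and the example numbers are made up, and this ignores the multi-isoform and effective-length corrections that Cufflinks applies):

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# 500 fragments on a 2 kb transcript in a 40 million fragment library
print(fpkm(500, 2000, 40_000_000))  # 6.25
```

This gives a single gene-level number with no isoform deconvolution, which is exactly why it cannot FAIL, and also why it cannot say anything about individual isoforms.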

From my side this seems like a program error in how the flag is applied, not in the calculation of isoform abundance. I've done some more testing, and the number of "FAIL" genes correlates not only with the number of reads aligned but also with the read length.

This seems non-biological and goes against the common sequencing idea that longer and more is better, so I'm guessing it is a programmatic error.

Too bad no one ever replies to the [email protected] emails anymore.... I miss Cole

Attached Files

Fragments_Length_Fail.jpg (53.3 KB, 65 views)

**erikjlar** · 11-12-2011, 11:42 PM

I have this issue too. Abundant genes in genes.fpkm_tracking are very often FAIL, and there are some LOWDATA too, despite RPKM > 150. This is based on ~80 million paired-end reads aligned with TopHat.

Code:

271.616	0	219.218	FAIL
270.067	0	860.157	FAIL
269.393	0	1902.31	FAIL
268.861	0	744.834	FAIL
268.316	0	14399.4	FAIL
265.731	97.6681	433.795	OK
265.582	0	113.429	FAIL
263.137	0	5027.49	FAIL
263.087	0	1078.35	FAIL
261.497	0	243.624	FAIL
260.606	0	1735.99	FAIL
252.388	0	437.81	LOWDATA
251.266	0	421.303	FAIL
247.974	0	110.572	FAIL
247.812	0	123.419	FAIL
247.316	0	213.169	FAIL
245.179	0	469.43	FAIL

Agreed that at the gene level there should be few numeric/algorithmic challenges here, so it has to be a bug? The question is whether the RPKMs are still useful. I wish someone on the Cufflinks team could comment - I'm considering abandoning the package altogether but am not sure what to use instead...

EDIT: Problem disappears when using the -g option (reference plus de novo assembly), or with de novo only. So I assume problem somehow stems from cufflinks being confused when the supplied gene models don't quite fit the data (which probably happens all the time).

**adarob** · 11-13-2011, 01:39 PM

A fix for this is being worked on and should be released later this week. Sorry, but due to the overwhelming number of questions and reports, we are no longer able to respond to all e-mails at tophat.cufflinks. However, all are read and taken into consideration for future updates.

-Adam

**Jon_Keats** · 11-13-2011, 10:18 PM

Thanks Adam

Glad to hear someone is reading the emails. It was starting to feel like a fruitless waste of my time.

**Cole Trapnell** · 11-15-2011, 05:41 AM

A bit more on what FAIL means, and how it can happen. We use FAIL for genes that actually throw a numerical exception during isoform abundance calculation. In Cufflinks and Cuffdiff, there's a couple of calculations that require us to build matrices with either a row per transcript and a column per read (more or less) or a square matrix with a row and column for each transcript. Some of these matrices need to be invertible or positive definite or have other properties in order for the next steps in the algorithm to succeed. However, sometimes (due to things like round-off error) they aren't. Other times, missing data causes trouble. Oddly enough, this is actually more likely to happen the more reads you get overall, because you can see that isoforms are present, but you don't actually have enough data to calculate those abundances. This is the effect you were observing above. So since we can't be sure about the values (and in fact, were we to go ahead and do the calculation anyways, they could be *wildly* off in theory, or even negative), we set them to zero and move on.
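To make the "matrix needs to be positive definite" point concrete, here is a toy illustration. It is not Cufflinks code and the matrices are made up. A small Cholesky factorization succeeds on a matrix for two distinguishable transcripts and raises on one where the two transcripts have identical support, which is the kind of numerical exception that could reasonably be turned into a FAIL status:

```python
import math

def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix.

    Raises ValueError if the matrix is not positive definite.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

# Hypothetical 2x2 matrices, one row/column per transcript.
distinct = [[1.0, 0.5], [0.5, 1.0]]    # transcripts can be told apart
identical = [[1.0, 1.0], [1.0, 1.0]]   # transcripts look the same

cholesky(distinct)                      # factorizes without error
try:
    cholesky(identical)
    status = "OK"
except ValueError:
    status = "FAIL"
print(status)
```

With real data the second case is rarely exactly singular; round-off can push a nearly singular matrix either way, which fits the description of failures appearing somewhat unpredictably.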

In order to make differential expression estimates more conservative, version 1.1.0 really ramped up the checks that are done before these steps so we don't end up reporting false positives that are due to numerical exceptions. However, users (like yourselves) have been pretty frustrated by those changes, so I've spent the last few weeks going back and streamlining the overall algorithm to actually eliminate pieces that require the matrices to have some of those properties. The main offender was our "importance sampling" procedure, which tries to give us a sense (for each gene) of the accuracy of the maximum likelihood estimate of isoform abundances. This procedure was originally meant to improve the robustness of the estimate when one or more isoforms were close to zero, but in practice, we found that it actually hurts as often as it helps. Moreover, this procedure would often FAIL genes, so I removed it altogether. I've compensated on the differential expression side with some other statistical improvements and fixes, and the result is globally more accurate differential analysis (both in terms of fewer FAILs and fewer false positives than 1.1.0).

The upcoming version 1.2.0 should drastically reduce the number of FAIL genes, though there will still be some. If we can't calculate an MLE to begin with, or if for some reason the confidence interval calculation fails, a gene will be marked as FAIL.

Hope this sheds light on things.

**cw11** · 11-15-2011, 06:35 AM

Glad to hear that a fix is on the way! Will the fix address the problem of false positives as well? I'm using CuffDiff 1.1.0 and I've noticed that a lot of the genes I'm getting as differentially expressed output have FPKMs that look (for example) like this (where my 3 conditions are GG, AG, and AA):

GG: 0 (OK)
AG: 11.6888 (OK)
AA: 10.7249 (OK)

CuffDiff will report GG as being differentially expressed relative to AG (and GG as being differentially expressed relative to AA), even though GG has many reads, and the RPKM values that I get with SeqGene look like this:

GG: 2.31, 2.04
AG: 2.63, 2.59, 2.5
AA: 2.32, 2.15, 2.42, 2.33, 3.02

Thanks!

**Cole Trapnell** · 11-15-2011, 08:29 AM

Hopefully. As I mentioned, I fixed a bunch of bugs that can generate false positives. As part of an upcoming paper on Cuffdiff, we've also done a ton of simulation experiments and wet experiments comparing the same RNA on multiple platforms, and it's clear from those data that the new v1.2.0 Cuffdiff is extremely accurate and concordant with other ways of doing DE. Not that the older versions were bad - we've just made things more accurate, and in general more conservative, than previous versions (and other tools for that matter).

**Jon_Keats** · 11-15-2011, 10:56 PM

Hi Cole,

Thanks for the detailed reply. I'd avoided bugging you directly as I'd incorrectly assumed you were no longer directly involved after completing your degree...congrats and well deserved by the way. Thanks for all the hard work.

Jonathan

**erikjlar** · 11-16-2011, 02:40 AM

Thanks for replies - great to hear straight from developers.

The problem is not only in the flag - the actual FPKM values are very weird sometimes. Hope this will get better...

Interesting example: the human MYH11 gene seems uncomplicated (see picture). Gene-level estimates should not be impossible to do using the -G option (non-directional RNA-seq though)? But the result is all zeros, also in isoforms.fpkm_tracking:

ENST00000396324.2 - - ENSG00000133392.9 - - chr16:15796991-15950887 6903 0 0 0 0 FAIL
ENST00000452625.1 - - ENSG00000133392.9 - - chr16:15796991-15950887 6942 0 0 0 0 FAIL
ENST00000396320.3 - - ENSG00000133392.9 - - chr16:15796991-15950887 6942 0 0 0 0 FAIL
ENST00000300036.4 - - ENSG00000133392.9 - - chr16:15796991-15950890 6885 0 0 0 0 FAIL
ENST00000338282.5 - - ENSG00000133392.9 - - chr16:15796991-15950890 6924 0 0 0 0 FAIL

Using -g it's a bit weird - a new gene name assigned but no new isoforms identified. But now there are suddenly FPKMs for two of them!

ENST00000396324.2 - - CUFF.13801 - - chr16:15796991-15950887 6903 0 0 0 0 OK
ENST00000452625.1 - - CUFF.13801 - - chr16:15796991-15950887 6942 0 0 0 0 OK
ENST00000396320.3 - - CUFF.13801 - - chr16:15796991-15950887 6942 0 0 0 0 OK
ENST00000300036.4 - - CUFF.13801 - - chr16:15796991-15950890 6885 72.7214 9.86435 9.52604 10.2027 OK
ENST00000338282.5 - - CUFF.13801 - - chr16:15796991-15950890 6924 86.4312 11.724 11.3666 12.0815 OK

I primarily just want gene-level estimates based on a reference annotation - maybe should go for some less sophisticated solution...

Attached Files

Screen Shot 2011-11-16 at 11.29.59 .png (15.6 KB, 23 views)

**Cole Trapnell** · 11-16-2011, 05:18 AM

Yep - that's what happens in genes that FAIL. The FPKMs for the individual isoforms could be all over the map (one could be zero, the other might have all the expression for the gene, but if you changed the gene's geometry just a little bit, the zero one might suddenly be expressed). This isn't a bug - this is what happens when the importance sampling procedure can't be executed because there's just too many isoforms (that are too similar to one another) and too little data. That's why we mark it FAIL. FAIL means the underlying FPKMs might be nonsense.
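A toy version of why near-identical isoforms give unstable splits, under simplifying assumptions that are mine rather than Cufflinks' actual model: two isoforms of equal length, and each read scored only by which isoforms it is compatible with. The lengths and read counts are made up. When every read fits both isoforms, the likelihood is flat across all splits; a handful of isoform-specific reads is what pins the estimate down.

```python
import math

def log_likelihood(theta_a, reads, len_a, len_b):
    """Log-likelihood of reads under a two-isoform mixture.

    reads: list of (compatible_with_a, compatible_with_b) booleans.
    theta_a: fraction of fragments assumed to come from isoform A.
    """
    total = 0.0
    for in_a, in_b in reads:
        p = theta_a * in_a / len_a + (1.0 - theta_a) * in_b / len_b
        total += math.log(p)
    return total

ambiguous = [(True, True)] * 50                # every read fits both isoforms
informative = ambiguous + [(True, False)] * 5  # a few reads fit only A

for theta in (0.1, 0.5, 0.9):
    print(theta,
          round(log_likelihood(theta, ambiguous, 6900, 6900), 6),
          round(log_likelihood(theta, informative, 6900, 6900), 6))
```

In the ambiguous column the three values are the same, so an optimizer has no reason to prefer one split over another; in the informative column the likelihood rises with the share given to isoform A.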

**Cole Trapnell** · 11-16-2011, 05:24 AM

Originally posted by Jon_Keats

Hi Cole,

Thanks for the detailed reply. I'd avoided bugging you directly as I'd incorrectly assumed you were no longer directly involved after completing your degree...congrats and well deserved by the way. Thanks for all the hard work.

Jonathan

I'm still very much involved - I just need to limit my development time to features that directly support my work in the lab I joined. I'm a postdoc in John Rinn's lab, and we use Cufflinks and its brethren to find new lincRNAs, profile their expression, and analyze their perturbation. I'm also trying to help Steven and Lior (my PhD advisors) train some of their incoming students who've been extending the tools or developing related ones. Those activities unfortunately don't leave me with much time for answering support email and questions on forums like this, but as Adam said, we really do try to fix issues that are reported by users and add questions to the FAQ. Thanks for your patience!


Cufflinks "FAIL" Expression Estimates
