SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
"allele balance ratio" and "quality by depth" in VCF files efoss Bioinformatics 2 10-25-2011 12:13 PM
Relatively large proportion of "LOWDATA", "FAIL" of FPKM_status running cufflink ruben6um Bioinformatics 3 10-12-2011 01:39 AM
The position file formats ".clocs" and "_pos.txt"? Ist there any difference? elgor Illumina/Solexa 0 06-27-2011 08:55 AM
tview error- "fail to load BAM index" iwtwb8 Bioinformatics 2 05-05-2011 11:36 AM
"Systems biology and administration" & "Genome generation: no engineering allowed" seb567 Bioinformatics 0 05-25-2010 01:19 PM

Reply
 
Thread Tools
Old 10-18-2011, 10:54 PM   #1
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default Cufflinks "FAIL" Expression Estimates

In some of my recent tests using cufflinks for transcript/gene expression estimates using a GTF reference I've seen an odd correlation between the number of transcripts/genes listed as "FAIL". I've seen this in two situations: 1) same sample using the first 10 million, 20 million, 30 million, etc... and a set of new samples with various total fragment counts. What I've seen, and can be seen in the figure, is that the number of "FAIL" transcripts/genes increases as the number of fragments increases. To me this seems counter intuitive and suggests some kind of an error. The other worrisome issue is the "FAIL" transcripts/genes are biased to high expressed genes compared to the "OK" genes in genes with multiple known transcript variants.

There was one other post about this issue but I was wondering if others have noticed this behavior. Is it version specific?
Attached Images
File Type: jpg Screen shot 2011-10-18 at 12.16.50 PM.jpg (50.1 KB, 70 views)
Attached Files
File Type: pdf Data 1.pdf (1.31 MB, 22 views)
Jon_Keats is offline   Reply With Quote
Old 10-23-2011, 10:51 PM   #2
dglemay
Member
 
Location: California

Join Date: Feb 2011
Posts: 16
Default

Hi Jon,
As the number of Fragments increases, does the number of transcripts with non-zero FPKM also increase? This might be driving the increased FAIL rate. I noticed that with one of our samples where something had gone wrong and very few genes were expressed, the FAIL rate was extremely low compared to the other normal samples (with higher FAIL rate).
-Danielle
dglemay is offline   Reply With Quote
Old 11-01-2011, 09:07 AM   #3
rossini
Junior Member
 
Location: US

Join Date: Aug 2011
Posts: 3
Default

Hi

I have the same problem. Should I just ignore their FAIL status and use their FPKM for the analysis anyway?

Thanks.
rossini is offline   Reply With Quote
Old 11-01-2011, 09:14 AM   #4
dglemay
Member
 
Location: California

Join Date: Feb 2011
Posts: 16
Default

I would like to know what others are doing too. For now, I've been using the FPKMs regardless of Pass/Fail status, with the rationale that the non-zero FPKM, although possibly inaccurate, is more accurate than FPKM=0, which it will effectively be if those genes are ignored.
dglemay is offline   Reply With Quote
Old 11-01-2011, 03:18 PM   #5
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Angry No Reply and More Unsettling Results

Quote:
Originally Posted by dglemay View Post
I would like to know what others are doing too. For now, I've been using the FPKMs regardless of Pass/Fail status, with the rationale that the non-zero FPKM, although possibly inaccurate, is more accurate than FPKM=0, which it will effectively be if those genes are ignored.
I'd agree that dropping all those "FAIL" genes or applying a value of 0 is a waste of time. I'm a bit uneasy using the values, but they seem to be fairly stable with read counts and read length. But I must admit I'm looking at other options like MISO, hand calculated FPKM, count based methods EdgeR or DESeq/DEXseq

From my side this seems like a program error in how the flag is applied not the calculation of isoform abundance. I've done some more testing and the number of "FAIL" genes correlates not only with the number of reads aligned but also the read length.

This seems non-biological and against all common sequencing idea's that longer and more is better so I'm guessing it is a programatic error.

To bad no one ever replies to the tophat-cufflinks@gmail.com emails anymore.... I miss Cole
Attached Images
File Type: jpg Fragments_Length_Fail.jpg (53.3 KB, 25 views)
Jon_Keats is offline   Reply With Quote
Old 11-12-2011, 11:42 PM   #6
erikjlar
Junior Member
 
Location: sweden

Join Date: Sep 2011
Posts: 3
Default

I have his issue too. Abundant genes in genes.fpkm_tracking are very often FAIL, and there is some LOWDATA too, despite RPKM > 150. This is based on ~80 million paired-end reads aligned with tophat.

Code:
271.616	0	219.218	FAIL
270.067	0	860.157	FAIL
269.393	0	1902.31	FAIL
268.861	0	744.834	FAIL
268.316	0	14399.4	FAIL
265.731	97.6681	433.795	OK
265.582	0	113.429	FAIL
263.137	0	5027.49	FAIL
263.087	0	1078.35	FAIL
261.497	0	243.624	FAIL
260.606	0	1735.99	FAIL
252.388	0	437.81	LOWDATA
251.266	0	421.303	FAIL
247.974	0	110.572	FAIL
247.812	0	123.419	FAIL
247.316	0	213.169	FAIL
245.179	0	469.43	FAIL
Agree that for gene level there should be little in terms of numeric/algorithmic challenges here, so has to be a bug? Question is if RPKMs are still useful. Wish someone in the cufflinks team could comment - i'm considering abandoning the package altogether but not sure what to use instead...

EDIT: Problem disappears when using the -g option (reference plus de novo assembly), or with de novo only. So I assume problem somehow stems from cufflinks being confused when the supplied gene models don't quite fit the data (which probably happens all the time).

Last edited by erikjlar; 11-13-2011 at 06:13 AM. Reason: New observations
erikjlar is offline   Reply With Quote
Old 11-13-2011, 01:39 PM   #7
adarob
Member
 
Location: Berkeley, CA

Join Date: Jul 2010
Posts: 71
Default

A fix for this is being worked on and should be released later this week. Sorry, but due to the overwhelming number of questions and reports, we are no longer able to respond to all e-mails at tophat.cufflinks. However, all are read and taking into consideration for future updates.

-Adam
adarob is offline   Reply With Quote
Old 11-13-2011, 10:18 PM   #8
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Thanks Adam

Glad to hear someone is reading the emails. It was starting to feel like a fruitless waste of my time.
Jon_Keats is offline   Reply With Quote
Old 11-15-2011, 05:41 AM   #9
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

A bit more on what FAIL means, and how it can happen. We use FAIL for genes that actually throw a numerical exception during isoform abundance calculation. In Cufflinks and Cuffdiff, there's a couple of calculations that require us to build matrices with either a row per transcript and a column per read (more or less) or a square matrix with a row and column for each transcript. Some of these matrices need to be invertible or positive definite or have other properties in order for the next steps in the algorithm to succeed. However, sometimes (due to things like round-off error) they aren't. Other times, missing data causes trouble. Oddly enough, this is actually more likely to happen the more reads you get overall, because you can see that isoforms are present, but you don't actually have enough data to calculate those abundances. This is the effect you were observing above. So since we can't be sure about the values (and in fact, were we to go ahead and do the calculation anyways, they could be *wildly* off in theory, or even negative), we set them to zero and move on.

In order to make differential expression estimates more conservative, version 1.1.0 really ramped up the checks that are done before these steps so we don't end up reporting false positives that are due to numerical exceptions. However, users (like yourselves) have been pretty frustrated by those changes, so I've spent the last few weeks going back and streamlining the overall algorithm to actually eliminate pieces that require the matrices to have some of those properties. The main offender was our "importance sampling" procedure, which tries to give us a sense (for each gene) for the accuracy for the maximum likelihood estimate of isoform abundances. This procedure was originally meant to improve the robustness of the estimate when one or more isoforms were close to zero, but in practice, we found that it actually hurts as often as it helps. Moreover, this procedure would often FAIL genes, so I removed it altogether. I've compensated on the differential expression side with some other statistical improvements and fixes, and the result is globally more accurate differential analysis (both in terms of fewer FAILs and fewer false positives than 1.1.0).

The upcoming version 1.2.0 should drastically reduce the number of FAIL genes, though there will still be some. If we can't calculate an MLE to begin with, or if for some reason the confidence interval calculation fails, a gene will be marked as FAIL.

Hope this sheds light on things.
Cole Trapnell is offline   Reply With Quote
Old 11-15-2011, 06:35 AM   #10
cw11
Member
 
Location: Massachusetts

Join Date: Sep 2011
Posts: 12
Default

Glad to hear that a fix is on the way! Will the fix address the problem of false positives as well? I'm using CuffDiff 1.1.0 and I've noticed that a lot of the genes I'm getting as differentially expressed output have FPKMs that look (for example) like this (where my 3 conditions are GG, AG, and AA):

GG: 0 (OK)
AG: 11.6888 (OK)
AA: 10.7249 (OK)


CuffDiff will report GG as being differentially expressed relative to AG (and GG as being differentially expressed relative to AA), even though GG has many reads, and the RPKM values that I get with SeqGene look like this:

GG: 2.31, 2.04
AG: 2.63, 2.59, 2.5
AA: 2.32, 2.15, 2.42, 2.33, 3.02

Thanks!

Last edited by cw11; 11-15-2011 at 06:56 AM.
cw11 is offline   Reply With Quote
Old 11-15-2011, 08:29 AM   #11
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Hopefully. As I mentioned, I fixed a bunch of bugs that can generate false positives. As part of an upcoming paper on Cuffdiff, we've also done a ton of simulation experiments and wet experiments comparing the same RNA on multiple platforms, and it's clear from those data that the new v1.2.0 Cuffdiff is extremely accurate and concordant with other ways of doing DE. Not that the older versions were bad - we've just made things more accurate, and in general more conservative, than previous versions (and other tools for that matter).
Cole Trapnell is offline   Reply With Quote
Old 11-15-2011, 10:56 PM   #12
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Hi Cole,

Thanks for the detailed reply. I'd avoided bugging you directly as I'd incorrectly assumed you were no longer directly involved after completing your degree...congrats and well deserved by the way. Thanks for all the hard work.

Jonathan
Jon_Keats is offline   Reply With Quote
Old 11-16-2011, 02:40 AM   #13
erikjlar
Junior Member
 
Location: sweden

Join Date: Sep 2011
Posts: 3
Default

Thanks for replies - great to hear straight from developers.

The problem is not only in the flag - the actual FPKM values are very weird sometimes. Hope this will get better...

Interesting example: the human MYH11 gene seems uncomplicated (see picture). Gene-level estimates should not be impossible to do using -G option (non-directional RNA-seq though)? But result is all zeros, also in isoforms.fpkm.tracking:

ENST00000396324.2 - - ENSG00000133392.9 - - chr16:15796991-15950887 6903 0 0 0 0 FAIL
ENST00000452625.1 - - ENSG00000133392.9 - - chr16:15796991-15950887 6942 0 0 0 0 FAIL
ENST00000396320.3 - - ENSG00000133392.9 - - chr16:15796991-15950887 6942 0 0 0 0 FAIL
ENST00000300036.4 - - ENSG00000133392.9 - - chr16:15796991-15950890 6885 0 0 0 0 FAIL
ENST00000338282.5 - - ENSG00000133392.9 - - chr16:15796991-15950890 6924 0 0 0 0 FAIL

Using -g it's a bit weird - a new gene name assigned but no new isoforms identified. But now there are suddenly FPKMs for two of them!

ENST00000396324.2 - - CUFF.13801 - - chr16:15796991-15950887 6903 0 0 0 0 OK
ENST00000452625.1 - - CUFF.13801 - - chr16:15796991-15950887 6942 0 0 0 0 OK
ENST00000396320.3 - - CUFF.13801 - - chr16:15796991-15950887 6942 0 0 0 0 OK
ENST00000300036.4 - - CUFF.13801 - - chr16:15796991-15950890 6885 72.7214 9.86435 9.52604 10.2027 OK
ENST00000338282.5 - - CUFF.13801 - - chr16:15796991-15950890 6924 86.4312 11.724 11.3666 12.0815 OK

I primarily just want gene-level estimates based on a reference annotation - maybe should go for some less sophisticated solution...
Attached Images
File Type: png Screen Shot 2011-11-16 at 11.29.59 .png (15.6 KB, 23 views)
erikjlar is offline   Reply With Quote
Old 11-16-2011, 05:18 AM   #14
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Yep - that's what happens in genes that FAIL. The FPKMs for the individual isoforms could be all over the map (one could be zero, the other might have all the expression for the gene, but if you changed the gene's geometry just a little bit, the zero one might suddenly be expressed). This isn't a bug - this is what happens when the importance sampling procedure can't be executed because there's just too many isoforms (that are too similar to one another) and too little data. That's why we mark it FAIL. FAIL means the underlying FPKMs might be nonsense.
Cole Trapnell is offline   Reply With Quote
Old 11-16-2011, 05:24 AM   #15
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by Jon_Keats View Post
Hi Cole,

Thanks for the detailed reply. I'd avoided bugging you directly as I'd incorrectly assumed you were no longer directly involved after completing your degree...congrats and well deserved by the way. Thanks for all the hard work.

Jonathan
I'm still very much involved - I just need to limit my development time to features that directly support my work in the lab I joined. I'm a postdoc in John Rinn's lab, and we use Cufflinks and its brethren to find new lincRNAs, profile their expression, and analyze their perturbation. I'm also trying to help Steven and Lior (my PhD advisors) train some of their incoming students who've been extending the tools or developing related ones. Those activities unfortunately don't leave me with much time for answering support email and questions on forums like this, but as Adam said, we really do try to fix issues that are reported by users and add questions to the FAQ. Thanks for your patience!
Cole Trapnell is offline   Reply With Quote
Old 11-16-2011, 06:21 AM   #16
erikjlar
Junior Member
 
Location: sweden

Join Date: Sep 2011
Posts: 3
Default

Thanks again for the input. I don't want to bother Cole with more questions - but maybe someone else can comment on alternative solutions if one just wants to do a simple gene-level summarization of reads.

It seems to me that for this scenario, the sophistication of cufflinks can get in the way of getting a robust expression estimate (see my previous post and picture to see what I mean).

Basically, for the example above I would be happy to just get FPKM values, or even read counts, based on all MYH11 exons - but excluding positions that ambiguously overlap with other genes.

Someone must be doing this sort of "array-style" abundance estimates?
erikjlar is offline   Reply With Quote
Old 12-05-2011, 07:12 AM   #17
dglemay
Member
 
Location: California

Join Date: Feb 2011
Posts: 16
Default gene-level estimates

You could use the genes.fpkm_tracking files to get the FPKMs summarized at the gene level.
dglemay is offline   Reply With Quote
Old 12-07-2011, 07:31 PM   #18
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default Update

I've had a chance to check my original dataset for the frequency of "FAIL" genes now with the newest version of Cufflinks. It is much improved with the number of FAIL genes now being between 73-238 compared to the previous version where the "FAIL" genes were between 8000-12,000 genes. There is still a trend for more "FAIL" genes as the number of aligned reads increases which I find odd but at this frequency it seems acceptable.
Attached Images
File Type: jpg Fail2.jpg (33.6 KB, 15 views)
Jon_Keats is offline   Reply With Quote
Reply

Tags
cufflinks deseq edger, tophat

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:41 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO