Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks "FAIL" Expression Estimates

    In some of my recent tests using cufflinks for transcript/gene expression estimates using a GTF reference I've seen an odd correlation between the number of transcripts/genes listed as "FAIL". I've seen this in two situations: 1) same sample using the first 10 million, 20 million, 30 million, etc... and a set of new samples with various total fragment counts. What I've seen, and can be seen in the figure, is that the number of "FAIL" transcripts/genes increases as the number of fragments increases. To me this seems counter intuitive and suggests some kind of an error. The other worrisome issue is the "FAIL" transcripts/genes are biased to high expressed genes compared to the "OK" genes in genes with multiple known transcript variants.

    There was one other post about this issue but I was wondering if others have noticed this behavior. Is it version specific?
    Attached Files

  • #2
    Hi Jon,
    As the number of Fragments increases, does the number of transcripts with non-zero FPKM also increase? This might be driving the increased FAIL rate. I noticed that with one of our samples where something had gone wrong and very few genes were expressed, the FAIL rate was extremely low compared to the other normal samples (with higher FAIL rate).
    -Danielle

    Comment


    • #3
      Hi

      I have the same problem. Should I just ignore their FAIL status and use their FPKM for the analysis anyway?

      Thanks.

      Comment


      • #4
        I would like to know what others are doing too. For now, I've been using the FPKMs regardless of Pass/Fail status, with the rationale that the non-zero FPKM, although possibly inaccurate, is more accurate than FPKM=0, which it will effectively be if those genes are ignored.

        Comment


        • #5
          No Reply and More Unsettling Results

          Originally posted by dglemay View Post
          I would like to know what others are doing too. For now, I've been using the FPKMs regardless of Pass/Fail status, with the rationale that the non-zero FPKM, although possibly inaccurate, is more accurate than FPKM=0, which it will effectively be if those genes are ignored.
          I'd agree that dropping all those "FAIL" genes or applying a value of 0 is a waste of time. I'm a bit uneasy using the values, but they seem to be fairly stable with read counts and read length. But I must admit I'm looking at other options like MISO, hand calculated FPKM, count based methods EdgeR or DESeq/DEXseq

          From my side this seems like a program error in how the flag is applied not the calculation of isoform abundance. I've done some more testing and the number of "FAIL" genes correlates not only with the number of reads aligned but also the read length.

          This seems non-biological and against all common sequencing idea's that longer and more is better so I'm guessing it is a programatic error.

          To bad no one ever replies to the [email protected] emails anymore.... I miss Cole
          Attached Files

          Comment


          • #6
            I have his issue too. Abundant genes in genes.fpkm_tracking are very often FAIL, and there is some LOWDATA too, despite RPKM > 150. This is based on ~80 million paired-end reads aligned with tophat.

            Code:
            271.616	0	219.218	FAIL
            270.067	0	860.157	FAIL
            269.393	0	1902.31	FAIL
            268.861	0	744.834	FAIL
            268.316	0	14399.4	FAIL
            265.731	97.6681	433.795	OK
            265.582	0	113.429	FAIL
            263.137	0	5027.49	FAIL
            263.087	0	1078.35	FAIL
            261.497	0	243.624	FAIL
            260.606	0	1735.99	FAIL
            252.388	0	437.81	LOWDATA
            251.266	0	421.303	FAIL
            247.974	0	110.572	FAIL
            247.812	0	123.419	FAIL
            247.316	0	213.169	FAIL
            245.179	0	469.43	FAIL
            Agree that for gene level there should be little in terms of numeric/algorithmic challenges here, so has to be a bug? Question is if RPKMs are still useful. Wish someone in the cufflinks team could comment - i'm considering abandoning the package altogether but not sure what to use instead...

            EDIT: Problem disappears when using the -g option (reference plus de novo assembly), or with de novo only. So I assume problem somehow stems from cufflinks being confused when the supplied gene models don't quite fit the data (which probably happens all the time).
            Last edited by erikjlar; 11-13-2011, 06:13 AM. Reason: New observations

            Comment


            • #7
              A fix for this is being worked on and should be released later this week. Sorry, but due to the overwhelming number of questions and reports, we are no longer able to respond to all e-mails at tophat.cufflinks. However, all are read and taking into consideration for future updates.

              -Adam

              Comment


              • #8
                Thanks Adam

                Glad to hear someone is reading the emails. It was starting to feel like a fruitless waste of my time.

                Comment


                • #9
                  A bit more on what FAIL means, and how it can happen. We use FAIL for genes that actually throw a numerical exception during isoform abundance calculation. In Cufflinks and Cuffdiff, there's a couple of calculations that require us to build matrices with either a row per transcript and a column per read (more or less) or a square matrix with a row and column for each transcript. Some of these matrices need to be invertible or positive definite or have other properties in order for the next steps in the algorithm to succeed. However, sometimes (due to things like round-off error) they aren't. Other times, missing data causes trouble. Oddly enough, this is actually more likely to happen the more reads you get overall, because you can see that isoforms are present, but you don't actually have enough data to calculate those abundances. This is the effect you were observing above. So since we can't be sure about the values (and in fact, were we to go ahead and do the calculation anyways, they could be *wildly* off in theory, or even negative), we set them to zero and move on.

                  In order to make differential expression estimates more conservative, version 1.1.0 really ramped up the checks that are done before these steps so we don't end up reporting false positives that are due to numerical exceptions. However, users (like yourselves) have been pretty frustrated by those changes, so I've spent the last few weeks going back and streamlining the overall algorithm to actually eliminate pieces that require the matrices to have some of those properties. The main offender was our "importance sampling" procedure, which tries to give us a sense (for each gene) for the accuracy for the maximum likelihood estimate of isoform abundances. This procedure was originally meant to improve the robustness of the estimate when one or more isoforms were close to zero, but in practice, we found that it actually hurts as often as it helps. Moreover, this procedure would often FAIL genes, so I removed it altogether. I've compensated on the differential expression side with some other statistical improvements and fixes, and the result is globally more accurate differential analysis (both in terms of fewer FAILs and fewer false positives than 1.1.0).

                  The upcoming version 1.2.0 should drastically reduce the number of FAIL genes, though there will still be some. If we can't calculate an MLE to begin with, or if for some reason the confidence interval calculation fails, a gene will be marked as FAIL.

                  Hope this sheds light on things.

                  Comment


                  • #10
                    Glad to hear that a fix is on the way! Will the fix address the problem of false positives as well? I'm using CuffDiff 1.1.0 and I've noticed that a lot of the genes I'm getting as differentially expressed output have FPKMs that look (for example) like this (where my 3 conditions are GG, AG, and AA):

                    GG: 0 (OK)
                    AG: 11.6888 (OK)
                    AA: 10.7249 (OK)


                    CuffDiff will report GG as being differentially expressed relative to AG (and GG as being differentially expressed relative to AA), even though GG has many reads, and the RPKM values that I get with SeqGene look like this:

                    GG: 2.31, 2.04
                    AG: 2.63, 2.59, 2.5
                    AA: 2.32, 2.15, 2.42, 2.33, 3.02

                    Thanks!
                    Last edited by cw11; 11-15-2011, 06:56 AM.

                    Comment


                    • #11
                      Hopefully. As I mentioned, I fixed a bunch of bugs that can generate false positives. As part of an upcoming paper on Cuffdiff, we've also done a ton of simulation experiments and wet experiments comparing the same RNA on multiple platforms, and it's clear from those data that the new v1.2.0 Cuffdiff is extremely accurate and concordant with other ways of doing DE. Not that the older versions were bad - we've just made things more accurate, and in general more conservative, than previous versions (and other tools for that matter).

                      Comment


                      • #12
                        Hi Cole,

                        Thanks for the detailed reply. I'd avoided bugging you directly as I'd incorrectly assumed you were no longer directly involved after completing your degree...congrats and well deserved by the way. Thanks for all the hard work.

                        Jonathan

                        Comment


                        • #13
                          Thanks for replies - great to hear straight from developers.

                          The problem is not only in the flag - the actual FPKM values are very weird sometimes. Hope this will get better...

                          Interesting example: the human MYH11 gene seems uncomplicated (see picture). Gene-level estimates should not be impossible to do using -G option (non-directional RNA-seq though)? But result is all zeros, also in isoforms.fpkm.tracking:

                          ENST00000396324.2 - - ENSG00000133392.9 - - chr16:15796991-15950887 6903 0 0 0 0 FAIL
                          ENST00000452625.1 - - ENSG00000133392.9 - - chr16:15796991-15950887 6942 0 0 0 0 FAIL
                          ENST00000396320.3 - - ENSG00000133392.9 - - chr16:15796991-15950887 6942 0 0 0 0 FAIL
                          ENST00000300036.4 - - ENSG00000133392.9 - - chr16:15796991-15950890 6885 0 0 0 0 FAIL
                          ENST00000338282.5 - - ENSG00000133392.9 - - chr16:15796991-15950890 6924 0 0 0 0 FAIL

                          Using -g it's a bit weird - a new gene name assigned but no new isoforms identified. But now there are suddenly FPKMs for two of them!

                          ENST00000396324.2 - - CUFF.13801 - - chr16:15796991-15950887 6903 0 0 0 0 OK
                          ENST00000452625.1 - - CUFF.13801 - - chr16:15796991-15950887 6942 0 0 0 0 OK
                          ENST00000396320.3 - - CUFF.13801 - - chr16:15796991-15950887 6942 0 0 0 0 OK
                          ENST00000300036.4 - - CUFF.13801 - - chr16:15796991-15950890 6885 72.7214 9.86435 9.52604 10.2027 OK
                          ENST00000338282.5 - - CUFF.13801 - - chr16:15796991-15950890 6924 86.4312 11.724 11.3666 12.0815 OK

                          I primarily just want gene-level estimates based on a reference annotation - maybe should go for some less sophisticated solution...
                          Attached Files

                          Comment


                          • #14
                            Yep - that's what happens in genes that FAIL. The FPKMs for the individual isoforms could be all over the map (one could be zero, the other might have all the expression for the gene, but if you changed the gene's geometry just a little bit, the zero one might suddenly be expressed). This isn't a bug - this is what happens when the importance sampling procedure can't be executed because there's just too many isoforms (that are too similar to one another) and too little data. That's why we mark it FAIL. FAIL means the underlying FPKMs might be nonsense.

                            Comment


                            • #15
                              Originally posted by Jon_Keats View Post
                              Hi Cole,

                              Thanks for the detailed reply. I'd avoided bugging you directly as I'd incorrectly assumed you were no longer directly involved after completing your degree...congrats and well deserved by the way. Thanks for all the hard work.

                              Jonathan
                              I'm still very much involved - I just need to limit my development time to features that directly support my work in the lab I joined. I'm a postdoc in John Rinn's lab, and we use Cufflinks and its brethren to find new lincRNAs, profile their expression, and analyze their perturbation. I'm also trying to help Steven and Lior (my PhD advisors) train some of their incoming students who've been extending the tools or developing related ones. Those activities unfortunately don't leave me with much time for answering support email and questions on forums like this, but as Adam said, we really do try to fix issues that are reported by users and add questions to the FAQ. Thanks for your patience!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X