Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • drdna
    Member
    • May 2012
    • 76

    #31
    Just to give you an example of how idiotic cufflinks is: For one my RNAseq runs, it reported one gene with an FPKM of 2,085,893,530. Interesting how you can get 2 BILLION fragments per million mapped reads!

    Comment

    • Cole Trapnell
      Senior Member
      • Nov 2008
      • 213

      #32
      Originally posted by drdna View Post
      So, using artificially-generated control datasets, I find that cufflinks is flawed in two ways:

      First, it's FPKM values are inflated. Problem is the magnitude of inflation varies from gene-to-gene - there is no consistency in the error.

      Second the "locus" interval defined in the cuffdiff output is often just plain wrong. In many instances, the reported "locus" frequently spans multiple transcripts and intergenic regions, even though the dataset contains reads from only one transcript. In other words, neither the .gtf file, nor the input sequence data support expansion of the "locus" to cover multiple genes.
      As far as I can tell, both of the "flaws" you have described are actually misunderstandings on your part about how Cufflinks and Cuffdiff work.

      1) FPKM values are inflated beyond a direct counting of reads, but that is by design. Because most RNA-Seq libraries are size selected to exclude library fragments below a certain size (typically 120bp or so), the very end of a transcript is less likely to be covered by library fragments. This is because reverse transcriptase primes and the elongates in one direction, and near one end of the transcript, RT will actually fall off of the mRNA, producing a short fragment. These fragments are probably going to be subtracted during size selection, which makes seeing them in your data less likely than seeing longer library fragments. Normally, this isn't a big problem - most transcripts are long, and so this "edge effect" doesn't really change the number of fragments you get from the transcript too much. However, for short transcripts, the "edge", where RT is less likely to produce fragments that get included in the final library, actually comprises quite a bit of the overall transcript, and so there's a bias for short transcripts, causing them to generate less fragments in the library than if there were no size selection, etc. Cufflinks and Cuffdiff (and other tools - I think RSEM does this as well) account for this effect by "inflating" the FPKM upwards, according to the overall size distribution of library fragments. You can read about this correction (which we call "effective length correction") in the supplement to the Cufflinks paper. The example of the SnoRNA in the above post perfectly illustrates this idea - the RNA is only 52 bases long, which means, in all likelihood, that the *vast* majority of cDNA fragments coming from it have actually been excluded from the library, meaning that its true abundance in your sample is far higher than the library fragment count might lead you to believe. That said, it's my opinion that standard RNA-Seq preps (with polyA selection, size selection, etc) are not appropriate for measuring small RNA (less than the median fragment size of the library) abundance. It's just not the right assay.

      That said, the upcoming version 2.0.1 includes an option --no-effective-length-correction that can be used to disable this feature, for those who want to better compare with their own raw counting or simulation frameworks.

      2) The "locus" column is NOT meant to tell you anything about transcript structure. That field is there for two reasons:
      a) it tells you that Cufflinks or Cuffdiff processed all the transcripts in that locus together, as part of a single isoform deconvolution.
      b) it gives you an easy way to jump to that genome location in your favorite browser (via copy-paste).
      So it's just a convenience column. I think we've answered this question before, and users are sometimes confused about this behavior. I'll either put an entry on the FAQ or change the behavior at some point.

      As far as the transcript merging issues that others have mentioned: Cufflinks will merge transcripts, especially in particularly gene-dense organisms like some fungi, because it tries to fill in coverage gaps (by default of 50bp or less) in order to make more full length structures in low-abundance genes. It's a tradeoff - by not filling in such gaps, your overall transcriptome assembly will be more fragmented, but you'll get fewer erroneous merges. You can disable this behavior by setting --olap-radius 0 in the Cufflinks assembler. We haven't exposed this option yet for cuffmerge but we can do so in a future release.

      Comment

      • drdna
        Member
        • May 2012
        • 76

        #33
        Cole,
        Thanks very much for your comprehensive explanation of how FPKM calculations and locus intervals are defined - it does a lot to demystify the inner workings of the programs. However, most of the transcripts I have looked at are considerably longer than the average fragment length and, so, should not be overly prone to the edge effects that you described. Moreover, the control dataset that I used for one analysis was generated by sampling fragments evenly across the transcript and, hence, had no bias in read distribution. In general, with the exception of very short RNAs which, obviously, will be excluded from size-selected libraries, there should be no selection-associated bias against transcript "edges." The only bias should come from partially degraded RNA preps, or incomplete reverse transcription. I was wondering if you have examined read distribution along a set of known transcripts to quantify the supposed "edge" effect? In addition, I draw you attention to the fact that the manner of shearing has a major effect on sequence read distribution. Specifically, mechanical shearing results in a massive overrepresentation of end fragments (Schwartz and Farman, 2010; BMC Genomics 11:87). For reverse transcripts, this would produce a superabundance of 3' end fragments.

        In your explanation of transcript loci, you stated that cufflinks processes all the transcripts in a "locus" together, as part of a single isoform deconvolution. However the loci that I am referring to span several different genes. Do I understand, then, that cufflinks processes multiple genes "as part of a single isoform deconvolution"?

        A more concerning issue, however, is that cufflinks and its associated programs appear to be processing "loci" that have no reads mapping within them and assigning FPKM values to those loci.

        For example, I have a gene_exp.diff output line that reads:

        XLOC_000005 XLOC_000005 - chr1:7945468-7967929 EG_LVNT EG_LV1 OK 420.778 415.167 -0.0193665 0.0561998 0.955183 0.9758 no

        This suggests that there are 420+ fragments per kb per million mapped reads, yet when I look at the chr1:7945468-7967929 region in the bam file, there are no fragments mapping in this region at all. Furthermore, there are only 8 reads mapping within 20 kb of the boundaries of this region.
        Last edited by drdna; 06-26-2012, 12:20 PM.

        Comment

        • cufflinksman
          Junior Member
          • Jun 2012
          • 2

          #34
          cufflinks

          To Portah. Excellent discover, I'm going to go examine the cds_exp.diff data declare my operates and see if it should you choose. Regardless, there is still a problem with transcript/reference annotation consolidating.
          cufflinks | New York cufflinks

          Comment

          • jy123
            Junior Member
            • Nov 2010
            • 8

            #35
            Please explain how cufflinks judge a read

            Hi Cole,

            Can you explain how a read be counted as a read belonging to an exon? How much overlap percentage used in cufflinks?

            A read may map to multiple exons? in this case, what are the criteria to assign a read to one exon and not others?

            Thank you.

            Comment

            • Portah
              Member
              • Jan 2012
              • 14

              #36
              Cole,
              Thanks for your explanation, after your post I decided read not only articles but also "supplementary methods for the paper" "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation." I'm not a great statistician so not everything for me is clear enough so everything that I'm going to tell is just my view which can be wrong. If some one can approve or reject that I'll tell will be nice.

              I don't know should I ask this question in this thread or open the new one by the way I'll put all my speculations in one post.

              About FPKMs they are not just reads in position, there complicated model that approximate number of fragments which are also approximated by number of reads and this reads should have "mate pair" and model came from that not all fragments have equal lengths supposedly lengths of fragments distributed normally. My simple view of FPKMs.

              About cuffdiff I found 2 different ways of how software should work so I'll use 3'd way function
              bool test_diffexp(const FPKMContext& curr,const FPKMContext& prev, SampleDifference& test) from differential.cpp which I think do calculation of p value.

              All steps in that function mostly explained on cufflinks web page and some of transformation of equation become clear after supplementary paper. What I can't find and explain for myself how FPKMContext.FPKM_variance is calculated. If it possible can you explain on example with no replicates and one isoform? For example total length of transript T is L, total number of reads is X inside all exons of T. Or if you can point to the function that do this calculation inside your codes. I'm interested in calculation under condition with no replicates.

              I found this explanation on web site for . If there are no replicates, the variance is measured across the conditions under the assumption that most transcripts are not differentially expressed. But for me it is really not enough.

              Inside differential.cpp I found next string:
              if (isnan(t1) || isinf(t1) || isnan(t2) || isnan(t2))
              I highlight by bold position where I think is small mistake there probably should be isnan(t2) || isinf(t2) ?

              Thanks,
              Andrey
              Last edited by Portah; 06-28-2012, 11:17 AM.

              Comment

              • Portah
                Member
                • Jan 2012
                • 14

                #37
                A couple days ago new cufflinks released 2.0.2 but there still persist that problem in differential.cpp file in function:

                // This performs a between-group test on an isoform or TSS grouping, on two
                // different samples.
                bool test_diffexp(const FPKMContext& curr,
                const FPKMContext& prev,
                SampleDifference& test)

                there still persist an error at the followed condition:

                if (isnan(t1) || isinf(t1) || isnan(t2) || isnan(t2))
                {

                //fprintf(stderr, "Warning: test statistic is NaN! %s (samples %lu and %lu)\n", test.locus_desc.c_str(), test.sample_1, test.sample_2);
                p_value = 1.0;
                }

                I highlighted place which for me has no sense because condition (A or A) logically is equal to A, so (isnan(t2) || isnan(t2)) equal to just isnan(t2) so perhaps, there should be other function, maybe isinf(t2).

                Comment

                • colinmolter
                  Member
                  • Jul 2011
                  • 11

                  #38
                  Hi, I do have a problem that may be related:
                  I processed 8 breast cancer samples, 4ER+ and 4ER- using tophat2/cufflinks2/cuffmerge2/cuffdiff2/cummeRbund.

                  The expression of the gene GREB1 is nearly equal to 0 for the 4 ER-
                  and ~100 for ER+:

                  cuff <- readCufflinks()
                  expr=repFpkmMatrix(genes(cuff))
                  expr[rownames(expr)=='XLOC_011466',]
                  Neg_0 Neg_1 Neg_2 Neg_3 Pos_0 Pos_1 Pos_2 Pos_3
                  XLOC_011466 0.182529 0.791468 0.721075 0.15861 35.9609 231.247 64.1414 39.4893

                  However, the result of cuffdiff is:
                  diffData(getGene(cuff, 'GREB1'))
                  gene_id sample_1 sample_2 status value_1 value_2
                  log2_fold_change test_stat p_value q_value significant
                  1 XLOC_011466 Neg Pos OK 0.460534 92.726
                  7.65352 -1.1267 0.25987 0.999954 no

                  How can you explain this. It seems not consistent for me. I would have
                  expected a nice q value, a significant test, etc.

                  thanks for feedbacks,
                  colin

                  Comment

                  • Portah
                    Member
                    • Jan 2012
                    • 14

                    #39
                    Colin,
                    No that error not connected with this example, that error simply some data should be excluded from testing.

                    Your example just shows that FDR calculation is correct, try to exclude second experiment from pipeline.
                    Last edited by Portah; 07-13-2012, 11:32 AM.

                    Comment

                    • colinmolter
                      Member
                      • Jul 2011
                      • 11

                      #40
                      Hi,
                      thank for your reply, unfortunately i don't understand what you mean by :"try to exclude second experiment from pipeline".
                      If I look to expression, it seems that there is clear differential expression:

                      expr[rownames(expr)=='XLOC_011466',]
                      Neg_0 Neg_1 Neg_2 Neg_3 Pos_0 Pos_1 Pos_2 Pos_3
                      XLOC_011466 0.182529 0.791468 0.721075 0.15861 35.9609 231.247 64.1414 39.4893

                      do u suggest that I should remove the sample Pos_1 which has expression of 231?

                      Comment

                      • Portah
                        Member
                        • Jan 2012
                        • 14

                        #41
                        Yes, I think that 231 is almost 4 times more than 64 and a much bigger then 35, so software decide that your data are not consistent in replicates that is why FDR is big.

                        Comment

                        • colinmolter
                          Member
                          • Jul 2011
                          • 11

                          #42
                          the problem is that that sample corresponds to cell line MCF7 and expression looks ok. I would have preferred to find a way to keep it.

                          Comment

                          • Portah
                            Member
                            • Jan 2012
                            • 14

                            #43
                            Sorry, perhaps I misunderstand you, I though that you run cuffdiff like you have 4 replicates for ER+ and ER-, but you have run it with different samples, I didn't work with more than two samples, maybe somebody other knows the answer.

                            Comment

                            • tyomach
                              Junior Member
                              • Nov 2011
                              • 4

                              #44
                              The numbers are not significantly different even in t-test : p>0.14, thus it's reasonable that cuffdiff does not consider the gene to be differentially expressed. You can consider removing the outlier if you want.

                              Comment

                              • drdna
                                Member
                                • May 2012
                                • 76

                                #45
                                I've come to the conclusion that the tophat, cufflinks, cuffdiff suite absolutely cannot be trusted to provide reliable results. Just recently I've been mapping paired end RNAseq reads to a bacterial genome using the -r and --mate-std-dev options. In the "bowtie.[left/right]_kept_reads.fixmap.log" files it states that the number of unpaired reads was 100%. However, the accepted_hits.bam file contains millions of pairs of lines that represent paired end data. Can we really trust that the programs aren't treating these reads as independent, if they don't even know they are dealing with paired ends? I'm not prepared to take the risk and will have to resort to throwing away the paired-end data and using reads coming from just one end. Either that or switch to CLC Bio...

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                  by SEQadmin2


                                  I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                                  Here are nine questions we think about, in roughly the order they matter, before...
                                  06-18-2026, 07:11 AM
                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  06-02-2026, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, 06-17-2026, 06:09 AM
                                0 responses
                                21 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-09-2026, 11:58 AM
                                0 responses
                                38 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-05-2026, 10:09 AM
                                0 responses
                                45 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-04-2026, 08:59 AM
                                0 responses
                                49 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...