View Single Post
Old 06-15-2012, 09:37 AM   #32
Cole Trapnell
Senior Member
Location: Boston, MA

Join Date: Nov 2008
Posts: 212

Originally Posted by drdna View Post
So, using artificially-generated control datasets, I find that cufflinks is flawed in two ways:

First, it's FPKM values are inflated. Problem is the magnitude of inflation varies from gene-to-gene - there is no consistency in the error.

Second the "locus" interval defined in the cuffdiff output is often just plain wrong. In many instances, the reported "locus" frequently spans multiple transcripts and intergenic regions, even though the dataset contains reads from only one transcript. In other words, neither the .gtf file, nor the input sequence data support expansion of the "locus" to cover multiple genes.
As far as I can tell, both of the "flaws" you have described are actually misunderstandings on your part about how Cufflinks and Cuffdiff work.

1) FPKM values are inflated beyond a direct counting of reads, but that is by design. Because most RNA-Seq libraries are size selected to exclude library fragments below a certain size (typically 120bp or so), the very end of a transcript is less likely to be covered by library fragments. This is because reverse transcriptase primes and the elongates in one direction, and near one end of the transcript, RT will actually fall off of the mRNA, producing a short fragment. These fragments are probably going to be subtracted during size selection, which makes seeing them in your data less likely than seeing longer library fragments. Normally, this isn't a big problem - most transcripts are long, and so this "edge effect" doesn't really change the number of fragments you get from the transcript too much. However, for short transcripts, the "edge", where RT is less likely to produce fragments that get included in the final library, actually comprises quite a bit of the overall transcript, and so there's a bias for short transcripts, causing them to generate less fragments in the library than if there were no size selection, etc. Cufflinks and Cuffdiff (and other tools - I think RSEM does this as well) account for this effect by "inflating" the FPKM upwards, according to the overall size distribution of library fragments. You can read about this correction (which we call "effective length correction") in the supplement to the Cufflinks paper. The example of the SnoRNA in the above post perfectly illustrates this idea - the RNA is only 52 bases long, which means, in all likelihood, that the *vast* majority of cDNA fragments coming from it have actually been excluded from the library, meaning that its true abundance in your sample is far higher than the library fragment count might lead you to believe. That said, it's my opinion that standard RNA-Seq preps (with polyA selection, size selection, etc) are not appropriate for measuring small RNA (less than the median fragment size of the library) abundance. It's just not the right assay.

That said, the upcoming version 2.0.1 includes an option --no-effective-length-correction that can be used to disable this feature, for those who want to better compare with their own raw counting or simulation frameworks.

2) The "locus" column is NOT meant to tell you anything about transcript structure. That field is there for two reasons:
a) it tells you that Cufflinks or Cuffdiff processed all the transcripts in that locus together, as part of a single isoform deconvolution.
b) it gives you an easy way to jump to that genome location in your favorite browser (via copy-paste).
So it's just a convenience column. I think we've answered this question before, and users are sometimes confused about this behavior. I'll either put an entry on the FAQ or change the behavior at some point.

As far as the transcript merging issues that others have mentioned: Cufflinks will merge transcripts, especially in particularly gene-dense organisms like some fungi, because it tries to fill in coverage gaps (by default of 50bp or less) in order to make more full length structures in low-abundance genes. It's a tradeoff - by not filling in such gaps, your overall transcriptome assembly will be more fragmented, but you'll get fewer erroneous merges. You can disable this behavior by setting --olap-radius 0 in the Cufflinks assembler. We haven't exposed this option yet for cuffmerge but we can do so in a future release.
Cole Trapnell is offline   Reply With Quote