Unconfigured Ad

**drdna** · 06-13-2012, 08:01 PM

Just to give you an example of how idiotic cufflinks is: For one my RNAseq runs, it reported one gene with an FPKM of 2,085,893,530. Interesting how you can get 2 BILLION fragments per million mapped reads!

**Cole Trapnell** · 06-15-2012, 09:37 AM

Originally posted by drdna View Post

So, using artificially-generated control datasets, I find that cufflinks is flawed in two ways:

First, it's FPKM values are inflated. Problem is the magnitude of inflation varies from gene-to-gene - there is no consistency in the error.

Second the "locus" interval defined in the cuffdiff output is often just plain wrong. In many instances, the reported "locus" frequently spans multiple transcripts and intergenic regions, even though the dataset contains reads from only one transcript. In other words, neither the .gtf file, nor the input sequence data support expansion of the "locus" to cover multiple genes.

As far as I can tell, both of the "flaws" you have described are actually misunderstandings on your part about how Cufflinks and Cuffdiff work.

1) FPKM values are inflated beyond a direct counting of reads, but that is by design. Because most RNA-Seq libraries are size selected to exclude library fragments below a certain size (typically 120bp or so), the very end of a transcript is less likely to be covered by library fragments. This is because reverse transcriptase primes and the elongates in one direction, and near one end of the transcript, RT will actually fall off of the mRNA, producing a short fragment. These fragments are probably going to be subtracted during size selection, which makes seeing them in your data less likely than seeing longer library fragments. Normally, this isn't a big problem - most transcripts are long, and so this "edge effect" doesn't really change the number of fragments you get from the transcript too much. However, for short transcripts, the "edge", where RT is less likely to produce fragments that get included in the final library, actually comprises quite a bit of the overall transcript, and so there's a bias for short transcripts, causing them to generate less fragments in the library than if there were no size selection, etc. Cufflinks and Cuffdiff (and other tools - I think RSEM does this as well) account for this effect by "inflating" the FPKM upwards, according to the overall size distribution of library fragments. You can read about this correction (which we call "effective length correction") in the supplement to the Cufflinks paper. The example of the SnoRNA in the above post perfectly illustrates this idea - the RNA is only 52 bases long, which means, in all likelihood, that the *vast* majority of cDNA fragments coming from it have actually been excluded from the library, meaning that its true abundance in your sample is far higher than the library fragment count might lead you to believe. That said, it's my opinion that standard RNA-Seq preps (with polyA selection, size selection, etc) are not appropriate for measuring small RNA (less than the median fragment size of the library) abundance. It's just not the right assay.

That said, the upcoming version 2.0.1 includes an option --no-effective-length-correction that can be used to disable this feature, for those who want to better compare with their own raw counting or simulation frameworks.

2) The "locus" column is NOT meant to tell you anything about transcript structure. That field is there for two reasons:
a) it tells you that Cufflinks or Cuffdiff processed all the transcripts in that locus together, as part of a single isoform deconvolution.
b) it gives you an easy way to jump to that genome location in your favorite browser (via copy-paste).
So it's just a convenience column. I think we've answered this question before, and users are sometimes confused about this behavior. I'll either put an entry on the FAQ or change the behavior at some point.

As far as the transcript merging issues that others have mentioned: Cufflinks will merge transcripts, especially in particularly gene-dense organisms like some fungi, because it tries to fill in coverage gaps (by default of 50bp or less) in order to make more full length structures in low-abundance genes. It's a tradeoff - by not filling in such gaps, your overall transcriptome assembly will be more fragmented, but you'll get fewer erroneous merges. You can disable this behavior by setting --olap-radius 0 in the Cufflinks assembler. We haven't exposed this option yet for cuffmerge but we can do so in a future release.

**drdna** · 06-26-2012, 12:15 PM

Cole,
Thanks very much for your comprehensive explanation of how FPKM calculations and locus intervals are defined - it does a lot to demystify the inner workings of the programs. However, most of the transcripts I have looked at are considerably longer than the average fragment length and, so, should not be overly prone to the edge effects that you described. Moreover, the control dataset that I used for one analysis was generated by sampling fragments evenly across the transcript and, hence, had no bias in read distribution. In general, with the exception of very short RNAs which, obviously, will be excluded from size-selected libraries, there should be no selection-associated bias against transcript "edges." The only bias should come from partially degraded RNA preps, or incomplete reverse transcription. I was wondering if you have examined read distribution along a set of known transcripts to quantify the supposed "edge" effect? In addition, I draw you attention to the fact that the manner of shearing has a major effect on sequence read distribution. Specifically, mechanical shearing results in a massive overrepresentation of end fragments (Schwartz and Farman, 2010; BMC Genomics 11:87). For reverse transcripts, this would produce a superabundance of 3' end fragments.

In your explanation of transcript loci, you stated that cufflinks processes all the transcripts in a "locus" together, as part of a single isoform deconvolution. However the loci that I am referring to span several different genes. Do I understand, then, that cufflinks processes multiple genes "as part of a single isoform deconvolution"?

A more concerning issue, however, is that cufflinks and its associated programs appear to be processing "loci" that have no reads mapping within them and assigning FPKM values to those loci.

For example, I have a gene_exp.diff output line that reads:

XLOC_000005 XLOC_000005 - chr1:7945468-7967929 EG_LVNT EG_LV1 OK 420.778 415.167 -0.0193665 0.0561998 0.955183 0.9758 no

This suggests that there are 420+ fragments per kb per million mapped reads, yet when I look at the chr1:7945468-7967929 region in the bam file, there are no fragments mapping in this region at all. Furthermore, there are only 8 reads mapping within 20 kb of the boundaries of this region.

**cufflinksman** · 06-26-2012, 09:58 PM

cufflinks

To Portah. Excellent discover, I'm going to go examine the cds_exp.diff data declare my operates and see if it should you choose. Regardless, there is still a problem with transcript/reference annotation consolidating.

**jy123** · 06-27-2012, 08:42 AM

Please explain how cufflinks judge a read

Hi Cole,

Can you explain how a read be counted as a read belonging to an exon? How much overlap percentage used in cufflinks?

A read may map to multiple exons? in this case, what are the criteria to assign a read to one exon and not others?

Thank you.

**Portah** · 06-28-2012, 11:13 AM

Cole,
Thanks for your explanation, after your post I decided read not only articles but also "supplementary methods for the paper" "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation." I'm not a great statistician so not everything for me is clear enough so everything that I'm going to tell is just my view which can be wrong. If some one can approve or reject that I'll tell will be nice.

I don't know should I ask this question in this thread or open the new one by the way I'll put all my speculations in one post.

About FPKMs they are not just reads in position, there complicated model that approximate number of fragments which are also approximated by number of reads and this reads should have "mate pair" and model came from that not all fragments have equal lengths supposedly lengths of fragments distributed normally. My simple view of FPKMs.

About cuffdiff I found 2 different ways of how software should work so I'll use 3'd way function
bool test_diffexp(const FPKMContext& curr,const FPKMContext& prev, SampleDifference& test) from differential.cpp which I think do calculation of p value.

All steps in that function mostly explained on cufflinks web page and some of transformation of equation become clear after supplementary paper. What I can't find and explain for myself how FPKMContext.FPKM_variance is calculated. If it possible can you explain on example with no replicates and one isoform? For example total length of transript T is L, total number of reads is X inside all exons of T. Or if you can point to the function that do this calculation inside your codes. I'm interested in calculation under condition with no replicates.

I found this explanation on web site for . If there are no replicates, the variance is measured across the conditions under the assumption that most transcripts are not differentially expressed. But for me it is really not enough.

Inside differential.cpp I found next string:
if (isnan(t1) || isinf(t1) || isnan(t2) || isnan(t2))
I highlight by bold position where I think is small mistake there probably should be isnan(t2) || isinf(t2) ?

Thanks,
Andrey

**Portah** · 07-13-2012, 07:35 AM

A couple days ago new cufflinks released 2.0.2 but there still persist that problem in differential.cpp file in function:

// This performs a between-group test on an isoform or TSS grouping, on two
// different samples.
bool test_diffexp(const FPKMContext& curr,
const FPKMContext& prev,
SampleDifference& test)

there still persist an error at the followed condition:

if (isnan(t1) || isinf(t1) || isnan(t2) || isnan(t2))
{

//fprintf(stderr, "Warning: test statistic is NaN! %s (samples %lu and %lu)\n", test.locus_desc.c_str(), test.sample_1, test.sample_2);
p_value = 1.0;
}

I highlighted place which for me has no sense because condition (A or A) logically is equal to A, so (isnan(t2) || isnan(t2)) equal to just isnan(t2) so perhaps, there should be other function, maybe isinf(t2).

**colinmolter** · 07-13-2012, 09:56 AM

Hi, I do have a problem that may be related:
I processed 8 breast cancer samples, 4ER+ and 4ER- using tophat2/cufflinks2/cuffmerge2/cuffdiff2/cummeRbund.

The expression of the gene GREB1 is nearly equal to 0 for the 4 ER-
and ~100 for ER+:

cuff <- readCufflinks()
expr=repFpkmMatrix(genes(cuff))
expr[rownames(expr)=='XLOC_011466',]
Neg_0 Neg_1 Neg_2 Neg_3 Pos_0 Pos_1 Pos_2 Pos_3
XLOC_011466 0.182529 0.791468 0.721075 0.15861 35.9609 231.247 64.1414 39.4893

However, the result of cuffdiff is:
diffData(getGene(cuff, 'GREB1'))
gene_id sample_1 sample_2 status value_1 value_2
log2_fold_change test_stat p_value q_value significant
1 XLOC_011466 Neg Pos OK 0.460534 92.726
7.65352 -1.1267 0.25987 0.999954 no

How can you explain this. It seems not consistent for me. I would have
expected a nice q value, a significant test, etc.

thanks for feedbacks,
colin

**Portah** · 07-13-2012, 11:18 AM

Colin,
No that error not connected with this example, that error simply some data should be excluded from testing.

Your example just shows that FDR calculation is correct, try to exclude second experiment from pipeline.

**colinmolter** · 07-13-2012, 11:27 AM

Hi,
thank for your reply, unfortunately i don't understand what you mean by :"try to exclude second experiment from pipeline".
If I look to expression, it seems that there is clear differential expression:

expr[rownames(expr)=='XLOC_011466',]
Neg_0 Neg_1 Neg_2 Neg_3 Pos_0 Pos_1 Pos_2 Pos_3
XLOC_011466 0.182529 0.791468 0.721075 0.15861 35.9609 231.247 64.1414 39.4893

do u suggest that I should remove the sample Pos_1 which has expression of 231?

**Portah** · 07-13-2012, 11:31 AM

Yes, I think that 231 is almost 4 times more than 64 and a much bigger then 35, so software decide that your data are not consistent in replicates that is why FDR is big.

**colinmolter** · 07-13-2012, 11:33 AM

the problem is that that sample corresponds to cell line MCF7 and expression looks ok. I would have preferred to find a way to keep it.

**Portah** · 07-13-2012, 11:46 AM

Sorry, perhaps I misunderstand you, I though that you run cuffdiff like you have 4 replicates for ER+ and ER-, but you have run it with different samples, I didn't work with more than two samples, maybe somebody other knows the answer.

**tyomach** · 07-13-2012, 12:32 PM

The numbers are not significantly different even in t-test : p>0.14, thus it's reasonable that cuffdiff does not consider the gene to be differentially expressed. You can consider removing the outlier if you want.

**drdna** · 07-25-2012, 08:22 PM

I've come to the conclusion that the tophat, cufflinks, cuffdiff suite absolutely cannot be trusted to provide reliable results. Just recently I've been mapping paired end RNAseq reads to a bacterial genome using the -r and --mate-std-dev options. In the "bowtie.[left/right]_kept_reads.fixmap.log" files it states that the number of unpaired reads was 100%. However, the accepted_hits.bam file contains millions of pairs of lines that represent paired end data. Can we really trust that the programs aren't treating these reads as independent, if they don't even know they are dealing with paired ends? I'm not prepared to take the risk and will have to resort to throwing away the paired-end data and using reads coming from just one end. Either that or switch to CLC Bio...

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, Yesterday, 11:08 AM	0 responses 6 views 0 reactions	Last Post by SEQadmin2 Yesterday, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 11 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 19 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 53 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News