**dpryan** · 05-08-2014, 06:27 AM

miRNAs are going to be less than 1kb, so...

**Ohad** · 05-09-2014, 02:55 AM

Thanks

But do those values represent a good value?

**dpryan** · 05-09-2014, 03:03 AM

I haven't a clue what "good" would mean in this context. An expressed microRNA is going to have a high FPKM, so if that's what you mean then yes.
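To make the length effect concrete, here is a minimal sketch of the plain FPKM formula (fragments per kilobase of transcript per million mapped fragments). The fragment counts, transcript lengths and library size below are made-up numbers for illustration, and this is the textbook formula rather than Cufflinks' actual estimation procedure.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)

# Hypothetical example: the same 50 fragments in a 20-million-fragment library.
mirna_fpkm = fpkm(50, 80, 20_000_000)      # 80 bp miRNA precursor
mrna_fpkm = fpkm(50, 2_000, 20_000_000)    # 2 kb mRNA

# The short transcript gets a 25-fold higher FPKM from identical counts.
print(mirna_fpkm, mrna_fpkm, mirna_fpkm / mrna_fpkm)
```

Dividing by a length well under 1 kb multiplies the value, which is why a short transcript can show a very large FPKM from a modest number of fragments.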

**Ohad** · 05-09-2014, 03:09 AM

That is what I mean

I wanted to know whether those values represent true expression, as it is the first time I have gotten such high values from RNA-seq.

I was worried they might be contamination of some sort and should be removed before running Cufflinks.

Anyway, those microRNAs are cancer-related according to NCBI, and since we used HeLa cells I guess those numbers could reflect true microRNA expression.

Thank you Ryan

**blancha** · 05-09-2014, 03:31 AM

You've just discovered one of the wonderful quirks of Cufflinks. :D

During RNA-Seq library preparation, the short RNA molecules (e.g. miRNAs) get filtered out. Given that the majority of the very short reads get filtered out, the Cufflinks programmers assume that any short sequences that do make it through are actually representative of a much larger population. So they decided, somewhat arbitrarily and without properly documenting their decision, to assign extremely high FPKM values to very short sequences even if the number of reads actually aligning to these very short sequences is very low.

Given that most short sequences are lost during library preparation, the best solution is to simply ignore them in the analysis. Researchers who are interested in small RNAs can do small RNA-Seq, which does not include any filtering step (resulting in a lot of junk, but that is another problem).

Please do not waste any time analyzing the small RNAs in RNA-Seq. I haven't seen any papers analyzing small RNAs from RNA-Seq but I'm sure they must exist. Any paper analyzing small RNAs from RNA-Seq data should be dismissed. I do know countless people who have wasted time trying to make sense of small RNA counts from Cufflinks results.

I just add the gene biotypes with BioMart to my FPKM counts so that researchers can identify the small RNAs (miRNAs, snoRNAs, ...) and know to treat the counts with extreme caution.
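A rough sketch of that kind of annotation step is shown below. It assumes the biotypes have already been exported (for example from BioMart) into a gene-to-biotype mapping; the function name, the set of biotype labels and the FPKM values are all hypothetical and only illustrate the idea of flagging small RNAs.

```python
# Assumed set of biotype labels to flag; adjust to the annotation actually used.
SMALL_RNA_BIOTYPES = {"miRNA", "snoRNA", "snRNA", "scaRNA"}

def annotate_with_biotype(fpkm_by_gene, biotype_by_gene):
    """Attach a biotype and a small-RNA warning flag to each gene's FPKM."""
    rows = []
    for gene, value in sorted(fpkm_by_gene.items()):
        biotype = biotype_by_gene.get(gene, "unknown")
        rows.append({
            "gene": gene,
            "fpkm": value,
            "biotype": biotype,
            "small_rna": biotype in SMALL_RNA_BIOTYPES,
        })
    return rows

# Hypothetical values for illustration only.
fpkm_by_gene = {"MIR21": 250000.0, "GAPDH": 1800.0}
biotype_by_gene = {"MIR21": "miRNA", "GAPDH": "protein_coding"}

rows = annotate_with_biotype(fpkm_by_gene, biotype_by_gene)
for row in rows:
    print(row)
```

Rows with `small_rna` set to `True` are the ones whose FPKM values should be treated with caution.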

The htseq-count and DESeq pipeline does not have this issue. Ultimately, actually examining the alignment file in IGV or the UCSC genome browser is always the best solution for individual genes.

Here is the full justification from Cole Trapnell. I should say that he did take the time to post on seqanswers.com. I do like his software and all the work the team has put into Cufflinks, even though I may appear to be a bit frustrated with some of their opaque decision making process regarding FPKM values in this post.

"This issue has been discussed elsewhere on this board. As Nicholas points out, RNA-Seq really isn't reliable for very short transcripts. The reason is that all the fragments that map to these transcripts come from the "tail" of the distribution of library fragment lengths. That is, fragments that map to microRNAs are much, much shorter than most fragments in the library - by design in the RNA-Seq protocol, which size selects away very short inserts. Thus, Cufflinks infers that even though relatively few fragments actually mapped to the microRNAs, there were probably TONS of individual microRNA molecules in the transcriptome before all of the various size selection parts of the protocol kicked in. Cufflinks accordingly increases the FPKM of these short transcripts to compensate for the bias against short fragments in the library.

This compensation was designed to improve accuracy for transcripts that are in the 500bp-1kb range - for longer transcripts, the "edge effects" due to library fragment size aren't much of an issue. However, I wouldn't trust FPKM values for transcripts shorter than your average fragment length. There's really just not enough data in most standard RNA-Seq libraries to say much about small RNA abundance.

I should also point out that other methods use this same bias correction technique (RSEM for example). As far as I'm aware, the "count-based" methods don't, but that doesn't mean they shouldn't. Most of those methods are strictly for differential analysis, where any edge effects are assumed to be affecting each condition the same way. That may or may not be the case in your data.

In any case, the quick answer to this problem is to simply remove or ignore transcripts shorter than around 300bp from your GTF. In a future version, we will be flagging these transcripts as too short for reliable quantification where appropriate."
Cole Trapnell
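The "remove transcripts shorter than around 300bp from your GTF" suggestion could be sketched roughly as follows. This is an illustrative script, not part of Cufflinks: it assumes transcript length is the sum of the `exon` features sharing a `transcript_id`, and the function names and the example records are made up.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def transcript_lengths(gtf_lines):
    """Sum exon lengths per transcript_id (GTF coordinates are 1-based, inclusive)."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            lengths[match.group(1)] += int(fields[4]) - int(fields[3]) + 1
    return lengths

def filter_short_transcripts(gtf_lines, min_length=300):
    """Drop every line belonging to a transcript shorter than min_length bp."""
    gtf_lines = list(gtf_lines)
    lengths = transcript_lengths(gtf_lines)
    kept = []
    for line in gtf_lines:
        match = TRANSCRIPT_ID.search(line)
        # Lines without a transcript_id, or without exon records, are kept as-is.
        if match and match.group(1) in lengths and lengths[match.group(1)] < min_length:
            continue
        kept.append(line)
    return kept

# Made-up records: a 72 bp miRNA and a two-exon, 700 bp transcript.
example = [
    'chr1\tsrc\texon\t100\t171\t.\t+\t.\tgene_id "g1"; transcript_id "mir1";\n',
    'chr1\tsrc\texon\t1000\t1499\t.\t+\t.\tgene_id "g2"; transcript_id "tx2";\n',
    'chr1\tsrc\texon\t2000\t2199\t.\t+\t.\tgene_id "g2"; transcript_id "tx2";\n',
]
filtered = filter_short_transcripts(example)
print("".join(filtered), end="")
```

For a real annotation, the lines would come from an open file handle and the kept lines would be written to a new GTF that is then passed to the quantification step.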

very high RPKM values from Cufflink - SEQanswers

http://seqanswers.com/forums/archive/index.php/t-17404.html

**Ohad** · 05-09-2014, 03:56 AM

Oh my

I think you both won the discussion

I guess everyone agrees these values just say: be aware that those miRNAs are highly expressed, but don't assume the numbers are reliable.

Thanks a lot, blancha!

**N00bSeq** · 05-09-2014, 04:21 AM

Is there any way of turning off this feature? I noticed the options

--no-effective-length-correction

and

--no-length-correction

in the cufflinks manual, which may or may not do this. Though I am not entirely sure if that truly is what those options are intended to do (and in that case, which of the two options I should use). I have "regular" RNA-seq data (non-microRNA-specific sequencing), but the sequencing company claimed they had not done any fragment size selection. Therefore, I suspect that it would be better to turn off this "correction".
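For context on what an "effective length" correction generally means, here is a simplified sketch. It uses one common textbook definition, the expected number of positions at which a fragment could start, summed over an assumed fragment length distribution. The truncated-normal distribution and all parameter values are assumptions for illustration, and the sketch does not claim to reproduce what Cufflinks does internally or what either option changes.

```python
import math

def fragment_length_pmf(mean, sd, max_len):
    """Discretized, truncated normal over fragment lengths 1..max_len (assumed model)."""
    weights = [math.exp(-0.5 * ((l - mean) / sd) ** 2) for l in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def effective_length(transcript_length, pmf):
    """Sum over fragment lengths l <= L of p(l) * (L - l + 1)."""
    return sum(
        p * (transcript_length - l + 1)
        for l, p in enumerate(pmf, start=1)
        if l <= transcript_length
    )

# Assumed library: mean fragment length 200 bp, standard deviation 30 bp.
pmf = fragment_length_pmf(mean=200, sd=30, max_len=500)
for length in (80, 300, 2000):
    print(length, effective_length(length, pmf))
```

Under these assumptions the effective length of an 80 bp transcript is far below one base, so dividing even a handful of fragments by it yields an enormous abundance estimate, while a 2 kb transcript loses only about one mean fragment length.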

**Ohad** · 05-09-2014, 04:26 AM

From what I understand, --no-length-correction just gives FPM instead of FPKM.

If no fragmentation was done, there is no need to add the per-kilobase normalization.

**N00bSeq** · 05-09-2014, 04:59 AM

Originally posted by Ohad:

> From what I understand, --no-length-correction just gives FPM instead of FPKM.
>
> If no fragmentation was done, there is no need to add the per-kilobase normalization.

I'm pretty sure fragmentation was done, just no size selection on those fragments.

**blancha** · 05-09-2014, 05:07 AM

I would double-check that there was no "size selection".
Small RNAs are filtered out during the standard RNA-Seq library preparation protocol.
I've verified this with the technician who prepares our samples.
There is just no specific step in the protocol labelled size selection, so the "sequencing company" may not even be aware that most small RNAs were removed, and may not inform the customers of the impact on the downstream bioinformatics analysis.
I've had terrible correlation between replicates on RNA-Seq results for very short reads so I've learnt to disregard RNA-Seq results for short RNA sequences.

**N00bSeq** · 05-09-2014, 05:22 AM

Originally posted by blancha:

> I would double-check that there was no "size selection".
> Small RNAs are filtered out during the standard RNA-Seq library preparation protocol.
> I've verified this with the technician who prepares our samples.
> There is just no specific step in the protocol labelled size selection, so the "sequencing company" may not even be aware that most small RNAs were removed, and may not inform the customers of the impact on the downstream bioinformatics analysis.
> I've had terrible correlation between replicates on RNA-Seq results for very short reads so I've learnt to disregard RNA-Seq results for short RNA sequences.

You may be right. The company in question did at first make a false claim that the library preparation method they used was not strand specific (which I have later confirmed it to be), so I would not be surprised if they are wrong about this as well. What makes me believe that there was indeed no size selection, though, is the fact that my average mate inner distance (the TopHat "-r" option, inferred with RSeQC and independently by calculations on TLEN in the SAM files from the alignment) is -30. So at least most insert sizes must have been really small.

**Ohad** · 05-09-2014, 05:35 AM

* What I meant was that --no-length-correction should be used WHEN no fragmentation was done. Sorry for the confusion.

N00bSeq, you should add the read lengths of both mates to that -30 and view the distribution around the average to spot whether size selection took place. Keep in mind that TopHat may not have included bigger fragments in your SAM file, as it may have labeled them as "not proper pair".
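One rough way to look at that distribution directly from the alignments is to tabulate the TLEN column of the SAM file, as in the sketch below. The function name and the example records are made up. Note that for spliced alignments TLEN spans introns, so it can overstate the true fragment length.

```python
from collections import Counter

def fragment_length_counts(sam_lines, proper_pairs_only=True):
    """Count observed template lengths (SAM column 9, TLEN), once per pair."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag, tlen = int(fields[1]), int(fields[8])
        if proper_pairs_only and not flag & 0x2:   # 0x2 = "properly paired" bit
            continue
        if tlen > 0:                      # the leftmost mate carries the positive TLEN
            counts[tlen] += 1
    return counts

# Made-up records: one proper pair (TLEN 170) and one mate not flagged as proper.
sam = [
    "@HD\tVN:1.0\n",
    "r1\t99\tchr1\t100\t50\t100M\t=\t170\t170\t*\t*\n",
    "r1\t147\tchr1\t170\t50\t100M\t=\t100\t-170\t*\t*\n",
    "r2\t65\tchr1\t500\t50\t100M\t=\t5400\t5000\t*\t*\n",
]
proper = fragment_length_counts(sam)
everything = fragment_length_counts(sam, proper_pairs_only=False)
print(proper, everything)
```

Comparing the two tallies gives a first impression of whether longer fragments are present but excluded from the proper pairs.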

**N00bSeq** · 05-09-2014, 06:13 AM

Originally posted by Ohad:

> * What I meant was that --no-length-correction should be used WHEN no fragmentation was done. Sorry for the confusion.
>
> N00bSeq, you should add the read lengths of both mates to that -30 and view the distribution around the average to spot whether size selection took place. Keep in mind that TopHat may not have included bigger fragments in your SAM file, as it may have labeled them as "not proper pair".

OK, thank you for clarifying. In order to get the fragment length I would then add 2*100, yielding an average fragment size of 170. This is less than the sum of the read lengths (2*100), and thus the reads overlap. The cause for this overlap would be a too short insert size (or analogously, fragment size). My distribution of insert sizes is shown in the attached file. Compare with the "typical" distribution from the RSeQC manual http://rseqc.sourceforge.net/#inner-distance-py.
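The arithmetic above, written out as a small sketch. The helper names are made up, and the inner distance is taken to be the gap between the two mates, so a negative value means the mates overlap.

```python
def fragment_length(inner_distance, read_length):
    """Fragment length = inner distance between mates + both read lengths."""
    return inner_distance + 2 * read_length

def mate_overlap(inner_distance):
    """Number of bases shared by the two mates (0 if they do not overlap)."""
    return max(0, -inner_distance)

print(fragment_length(-30, 100))  # 170
print(mate_overlap(-30))          # 30
```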

I don't know if TopHat has excluded a lot of longer fragments, I may need to look into that. But the distribution looks suspicious enough as it is. I also get lots of artifacts ("novel" isoforms) in cufflinks, which is another reason to take a closer look at these bias correction parameters.

Attached Files

inner_distance.pdf (8.3 KB, 35 views)

**Ohad** · 05-09-2014, 07:11 AM

To my eye your distribution looks fine, and I think that an average of 170 is fine as well.
I don't understand why novel transcripts make you suspicious about the fragment distribution.

I suggest that you post your questions in a new thread so that people can find them in future searches.


High values of FPKM on cuffdiff
