SEQanswers

Old 05-08-2014, 04:59 AM   #1
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default High values of FPKM on cuffdiff

Hi all,

I ran cuffdiff on my control vs. knock-down samples to check whether the mRNA expression of the gene I knocked down (in HeLa cells) is lower than in my control.

Happily, the knock-down sample does show a lower FPKM value for that specific gene. Great.

However, I noticed very high values for some microRNAs:
NR_106781 has a value of 1,460,000
NR_039666 has a value of 995,697
NR_039828, NR_002574 and NR_037428 are around 15,000
and around 15 more transcripts are above 4,000

The command line I used:

cuffdiff -p 6 -b hg19/fasta/ -u --no-update-check -v -L control,condition -o /cuffdiff hg19refseq.gtf control/accepted_hits.bam condition/accepted_hits.bam

* I used the RefSeq hg19 annotation downloaded from the UCSC Table Browser

So these results obviously raise some questions:

Why are those microRNA FPKM values so high?
And how can FPKM values be over a million, considering that FPKM stands for fragments per kilobase of transcript per million mapped fragments?

Anyone?
Old 05-08-2014, 06:27 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

miRNAs are going to be less than 1kb, so...
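
To spell out the arithmetic behind that hint, here is a minimal sketch of the plain FPKM formula. The fragment counts, the 80 bp transcript length and the library size are made-up numbers for illustration, not values from the data in this thread, and the sketch ignores any effective length correction a tool may apply on top.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Plain FPKM: fragments per kilobase of transcript per million mapped fragments."""
    # Equivalent to fragments / (length_bp / 1e3) / (total_mapped_fragments / 1e6).
    return fragments * 1e9 / (length_bp * total_mapped_fragments)

# Hypothetical library with 20 million mapped fragments.
total = 20_000_000

# Same fragment count, very different transcript lengths.
mrna_fpkm = fpkm(2_000, 2_000, total)   # a 2 kb mRNA
mirna_fpkm = fpkm(2_000, 80, total)     # an 80 bp miRNA precursor

# Upper bound: every mapped fragment comes from the 80 bp transcript.
ceiling_fpkm = fpkm(total, 80, total)

print(mrna_fpkm, mirna_fpkm, ceiling_fpkm)
```

Because the length is expressed in kilobases, a transcript shorter than 1 kb divides the count by a number below 1, so nothing in the formula caps FPKM at one million.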
Old 05-09-2014, 02:55 AM   #3
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default

Thanks

But do those values represent a good value?
Old 05-09-2014, 03:03 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I haven't a clue what "good" would mean in this context. An expressed microRNA is going to have a high FPKM, so if that's what you mean then yes.
Old 05-09-2014, 03:09 AM   #5
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default

That is what I mean.

I wanted to know whether those values represent true expression, as this is the first time I have gotten such high values in RNA-seq.

I was worried they might be contamination of some sort that should be removed before running Cufflinks.

Anyway, those microRNAs are cancer-related according to NCBI, and since we used HeLa cells, I guess those numbers could reflect true microRNA expression.

Thank you Ryan
Old 05-09-2014, 03:31 AM   #6
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

You've just discovered one of the wonderful quirks of Cufflinks. :D

During RNA-Seq library preparation, the short RNA molecules (e.g. miRNAs) get filtered out. Given that the majority of the very short reads get filtered out, the Cufflinks programmers assume that any short sequences that do make it through are actually representative of a much larger population. So they decided, somewhat arbitrarily and without properly documenting their decision, to assign extremely high FPKM values to very short sequences even if the number of reads actually aligning to these very short sequences is very low.
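
To illustrate how a correction of this kind can blow up values for short transcripts, here is a rough sketch of the general "effective length" idea. This is not Cufflinks' actual code: the normal fragment length distribution (mean 200, sd 20) and the function names are assumptions made for the example.

```python
import math

def fragment_length_pmf(mean, sd, max_len):
    """Discretized normal distribution over fragment lengths 1..max_len (assumed shape)."""
    weights = [math.exp(-0.5 * ((l - mean) / sd) ** 2) for l in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def effective_length(transcript_len, pmf):
    """Sum over fragment lengths l <= transcript_len of
    P(l) * (number of positions where a fragment of length l can start)."""
    return sum(
        p * (transcript_len - l + 1)
        for l, p in enumerate(pmf, start=1)
        if l <= transcript_len
    )

# Assumed library: fragments centred on 200 bp.
pmf = fragment_length_pmf(mean=200, sd=20, max_len=1000)

for length in (2000, 500, 150):
    eff = effective_length(length, pmf)
    # length / eff is the factor by which dividing by the effective length,
    # instead of the real length, inflates the abundance estimate.
    print(length, eff, length / eff)
```

For a long transcript the effective length is close to the real length minus the mean fragment length, so the correction is mild. For a transcript shorter than nearly all fragments, only the far tail of the distribution fits, the effective length collapses to a tiny fraction of the real length, and a handful of fragments turns into a huge estimate.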

Given that most short sequences are lost during library preparation, the best solution is to simply ignore them in the analysis. Researchers interested in small RNAs can run small RNA-Seq, which does not include any filtering step (resulting in a lot of junk, but that is another problem).

Please do not waste any time analyzing the small RNAs in RNA-Seq. I haven't seen any papers analyzing small RNAs from RNA-Seq but I'm sure they must exist. Any paper analyzing small RNAs from RNA-Seq data should be dismissed. I do know countless people who have wasted time trying to make sense of small RNA counts from Cufflinks results.

I just add the gene biotypes with BioMart to my FPKM counts so that researchers can identify the small RNAs (miRNAs, snoRNAs, ...) and know to treat the counts with extreme caution.

The htseq-count and DESeq pipeline does not have this issue. Ultimately, actually examining the alignment file in IGV or the UCSC genome browser is always the best solution for individual genes.

Here is the full justification from Cole Trapnell. I should say that he did take the time to post on seqanswers.com. I do like his software and all the work the team has put into Cufflinks, even though in this post I may appear a bit frustrated with some of their opaque decision-making regarding FPKM values.

"This issue has been discussed elsewhere on this board. As Nicholas points out, RNA-Seq really isn't reliable for very short transcripts. The reason is that all the fragments that map to these transcripts come from the "tail" of the distribution of library fragment lengths. That is, fragments that map to microRNAs are much, much shorter than most fragments in the library - by design in the RNA-Seq protocol, which size selects away very short inserts. Thus, Cufflinks infers that even though relatively few fragments actually mapped to the microRNAs, there were probably TONS of individual microRNA molecules in the transcriptome before all of the various size selection parts of the protocol kicked in. Cufflinks accordingly increases the FPKM of these short transcripts to compensate for the bias against short fragments in the library.

This compensation was designed to improve accuracy for transcripts that are in the 500bp-1kb range - for longer transcripts, the "edge effects" due to library fragment size aren't much of an issue. However, I wouldn't trust FPKM values for transcripts shorter than your average fragment length. There's really just not enough data in most standard RNA-Seq libraries to say much about small RNA abundance.

I should also point out that other methods use this same bias correction technique (RSEM for example). As far as I'm aware, the "count-based" methods don't, but that doesn't mean they shouldn't. Most of those methods are strictly for differential analysis, where any edge effects are assumed to be affecting each condition the same way. That may or may not be the case in your data.

In any case, the quick answer to this problem is to simply remove or ignore transcripts shorter than around 300bp from your GTF. In a future version, we will be flagging these transcripts as too short for reliable quantification where appropriate."
Cole Trapnell
http://seqanswers.com/forums/archive...p/t-17404.html
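
The "remove short transcripts from your GTF" suggestion could be sketched along these lines. This is a minimal sketch, assuming a tab-separated GTF with `exon` records carrying a `transcript_id` attribute; the function name, the 300 bp default and the demo records are illustrative, not part of any Cufflinks tool.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def filter_short_transcripts(gtf_lines, min_len=300):
    """Keep only GTF records whose transcript has a summed exon length >= min_len."""
    lines = [ln.rstrip("\n") for ln in gtf_lines]

    # First pass: total exon length per transcript (GTF coordinates are 1-based, inclusive).
    exon_len = defaultdict(int)
    for ln in lines:
        fields = ln.split("\t")
        if ln.startswith("#") or len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            exon_len[match.group(1)] += int(fields[4]) - int(fields[3]) + 1

    # Second pass: drop every record that belongs to a too-short transcript.
    kept = []
    for ln in lines:
        fields = ln.split("\t")
        match = TRANSCRIPT_ID.search(fields[8]) if len(fields) >= 9 else None
        # Comments and records without a transcript_id are passed through unchanged.
        if match is None or exon_len.get(match.group(1), 0) >= min_len:
            kept.append(ln)
    return kept

# Made-up records: one two-exon transcript (700 bp) and one 80 bp transcript.
demo = [
    'chr1\tdemo\texon\t1000\t1499\t.\t+\t.\tgene_id "g1"; transcript_id "tx_long";',
    'chr1\tdemo\texon\t2000\t2199\t.\t+\t.\tgene_id "g1"; transcript_id "tx_long";',
    'chr1\tdemo\texon\t5000\t5079\t.\t+\t.\tgene_id "g2"; transcript_id "tx_mir";',
]
kept = filter_short_transcripts(demo)
print("\n".join(kept))
```

In practice the lines would come from the annotation file and the kept records would be written to a new GTF that is then passed to cuffdiff. Note that a transcript with no `exon` records at all counts as length 0 here and is dropped.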
Old 05-09-2014, 03:56 AM   #7
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default

Oh my

I think you both won the discussion

I guess everyone agrees these values just say: be aware that those miRNAs are highly expressed, but don't assume the numbers are reliable.

Thanks a lot blancha !
Old 05-09-2014, 04:21 AM   #8
N00bSeq
Member
 
Location: Sweden

Join Date: Mar 2014
Posts: 12
Default

Is there any way of turning off this feature? I noticed the options

--no-effective-length-correction

and

--no-length-correction

in the cufflinks manual, which may or may not do this. Though I am not entirely sure if that truly is what those options are intended to do (and in that case, which of the two options I should use). I have "regular" RNA-seq data (non-microRNA-specific sequencing), but the sequencing company claimed they had not done any fragment size selection. Therefore, I suspect that it would be better to turn off this "correction".
Old 05-09-2014, 04:26 AM   #9
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default

From what I understand, --no-length-correction just gives you the FPM part of FPKM, without the per-kilobase normalization.

If no fragmentation was done, there is no need for the per-kilobase normalization.

Last edited by Ohad; 05-09-2014 at 04:28 AM.
Old 05-09-2014, 04:59 AM   #10
N00bSeq
Member
 
Location: Sweden

Join Date: Mar 2014
Posts: 12
Default

Quote:
Originally Posted by Ohad View Post
From what I understand, --no-length-correction just gives you the FPM part of FPKM, without the per-kilobase normalization.

If no fragmentation was done, there is no need for the per-kilobase normalization.
I'm pretty sure fragmentation was done, just no size selection on those fragments.
Old 05-09-2014, 05:07 AM   #11
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

I would double-check that there was no "size selection".
Small RNAs are filtered out during the standard RNA-Seq library preparation protocol.
I've verified this with the technician who prepares our samples.
There is just no specific step in the protocol labelled size selection, so the "sequencing company" may not even be aware that most small RNAs were removed, and may not inform the customers of the impact on the downstream bioinformatics analysis.
I've had terrible correlation between replicates on RNA-Seq results for very short reads so I've learnt to disregard RNA-Seq results for short RNA sequences.
Old 05-09-2014, 05:22 AM   #12
N00bSeq
Member
 
Location: Sweden

Join Date: Mar 2014
Posts: 12
Default

Quote:
Originally Posted by blancha View Post
I would double-check that there was no "size selection".
Small RNAs are filtered out during the standard RNA-Seq library preparation protocol.
I've verified this with the technician who prepares our samples.
There is just no specific step in the protocol labelled size selection, so the "sequencing company" may not even be aware that most small RNAs were removed, and may not inform the customers of the impact on the downstream bioinformatics analysis.
I've had terrible correlation between replicates on RNA-Seq results for very short reads so I've learnt to disregard RNA-Seq results for short RNA sequences.
You may be right. The company in question did at first make a false claim that the library preparation method they used was not strand specific (which I have later confirmed it to be), so I would not be surprised if they are wrong about this as well. Though what makes me believe that there was indeed no size selection is the fact that my average mate inner distance (TopHat "-r" option, inferred with RSeQC (and independently by calculations on TLEN in the SAM files from alignment)) is -30. So at least most insert sizes must have been really small.
Old 05-09-2014, 05:35 AM   #13
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default

* What I meant was that --no-length-correction should be used WHEN no fragmentation was done. Sorry for the confusion.

N00bSeq, you should add the read lengths of both mates to that -30 and look at the distribution around the average to spot whether size selection took place. Also keep in mind that TopHat may not have included bigger fragments in your SAM file, as it labeled them as "not proper pair".
Old 05-09-2014, 06:13 AM   #14
N00bSeq
Member
 
Location: Sweden

Join Date: Mar 2014
Posts: 12
Default

Quote:
Originally Posted by Ohad View Post
* What I meant was that --no-length-correction should be used WHEN no fragmentation was done. Sorry for the confusion.

N00bSeq, you should add the read lengths of both mates to that -30 and look at the distribution around the average to spot whether size selection took place. Also keep in mind that TopHat may not have included bigger fragments in your SAM file, as it labeled them as "not proper pair".
OK, thank you for clarifying. In order to get the fragment length I would then add 2*100, yielding an average fragment size of 170. This is less than the sum of the read lengths (2*100), and thus the reads overlap. The cause for this overlap would be a too short insert size (or analogously, fragment size). My distribution of insert sizes is shown in the attached file. Compare with the "typical" distribution from the RSeQC manual http://rseqc.sourceforge.net/#inner-distance-py.
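
The arithmetic above, written out as a small sketch. The function names are just for illustration, and the numbers are the ones quoted in this post (2 x 100 bp reads, mean inner distance of -30).

```python
def fragment_length(inner_distance, read_len_1, read_len_2):
    """Fragment length = both mate lengths plus the gap between them.
    A negative inner distance means the mates overlap."""
    return read_len_1 + inner_distance + read_len_2

def mate_overlap(inner_distance):
    """Number of bases covered by both mates (0 if there is a gap)."""
    return max(0, -inner_distance)

mean_fragment = fragment_length(-30, 100, 100)
overlap = mate_overlap(-30)
print(mean_fragment, overlap)
```

With these inputs the mean fragment comes out at 170 bp with a 30 bp overlap between mates, which matches the numbers above.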

I don't know if TopHat has excluded a lot of longer fragments, I may need to look into that. But the distribution looks suspicious enough as it is. I also get lots of artifacts ("novel" isoforms) in cufflinks, which is another reason to take a closer look at these bias correction parameters.
Attached Files
File Type: pdf inner_distance.pdf (8.3 KB, 8 views)
Old 05-09-2014, 07:11 AM   #15
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default

To my eye your distribution looks fine, and I think that an average of 170 is fine as well.
I don't understand why the novel transcripts look suspicious to you with regard to the fragment distribution.

I suggest that you post your questions in a new thread so that people can find them in future searches.