SEQanswers

Old 02-03-2012, 10:01 PM   #1
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default very high RPKM values from Cufflinks

I ran an RNA-seq experiment and processed it with TopHat > Cufflinks. It is a time-series experiment. When I looked at the RPKM values, some transcripts go as high as 32765.04908 or 2073.978485. Is this reasonable? I also looked at the BAM files, and there were certainly a very large number of reads. Any feedback or suggestions on how to explain these high RPKM values?
Old 02-04-2012, 06:47 AM   #2
peromhc
Senior Member
 
Location: Durham, NH

Join Date: Sep 2009
Posts: 108
Default

I think that these values should be log10-transformed. This is not documented; it is only my suspicion.

log10 values from Cufflinks roughly equal FPKM values from Cuffdiff.
Old 02-04-2012, 07:39 PM   #3
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default very high RPKM values from 4.5 to several thousands

However, the problem I have is that the RPKM values are high: approx. 5% are > 1000 RPKM (like 3245, 4356 and so on) in the same sample. If I change to log10, then what will happen to values around zero? Is it a usual method to log-transform RPKM values?
Any feedback is welcome.
Thanks
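
On the question of values around zero: a common convention (an assumption here, not something this thread or Cufflinks prescribes) is to add a small pseudocount before taking the log, so that an RPKM of 0 maps to 0 instead of minus infinity. A minimal Python sketch, with the function name and the pseudocount of 1 chosen only for illustration:

```python
import math


def log10_rpkm(value, pseudocount=1.0):
    """Return log10(value + pseudocount).

    The pseudocount (assumed to be 1 here) keeps an RPKM of 0 finite.
    """
    if value < 0:
        raise ValueError("RPKM/FPKM values cannot be negative")
    return math.log10(value + pseudocount)


# Values taken from this thread, plus 0 and 0.5 to show the low end.
for rpkm in (0.0, 0.5, 4.5, 2073.978485, 32765.04908):
    print(f"{rpkm:>12.3f} -> {log10_rpkm(rpkm):.3f}")
```

The transform only compresses the scale for plotting or correlation; it does not change which transcripts are high or low.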
Old 02-04-2012, 07:44 PM   #4
peromhc
Senior Member
 
Location: Durham, NH

Join Date: Sep 2009
Posts: 108
Default

Honey, I could be totally wrong here about the log10 thing, but I don't think I am.

Can you look at the mappings for some of those transcripts where the 'raw' FPKM is about 0? Do they have few reads mapped?

See:
http://seqanswers.com/forums/showthread.php?t=16962
Old 02-04-2012, 07:57 PM   #5
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default High RPKM

Thanks. I looked at the BAM files and can say that there are very few reads wherever the RPKM value is 0. Where the values are very high, there are hot spots with a large number of reads. Now the question is whether this is an artifact: with both very high and very low RPKM values, how do we handle the two extremes together?
Old 02-06-2012, 03:52 AM   #6
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

honey, it might be a good idea to look a bit more in depth into that specific gene. You can certainly get high FPKMs mapping to genes like actin that make up a lot of the mRNA percentage of a cell. I had huge numbers of reads mapping to one region of a miRNA gene once that all turned out to be within a LINE and a SINE. For that gene at least, it was clear the repeat regions skewed the results.
Old 02-07-2012, 03:40 AM   #7
yumtaoist
Member
 
Location: lanzhou

Join Date: Dec 2011
Posts: 10
Default

Is your reference very short?
Old 02-07-2012, 10:02 AM   #8
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default very large RPKM

It is the human genome, so it is not small.
The genes that have very high RPKM values are relevant to the biology of the tissue samples, but my problem is how to provide a scientific rationale that our results are not nonspecific.
Thanks for the input
Old 02-07-2012, 10:34 AM   #9
Nicolas
Member
 
Location: new york city

Join Date: Apr 2009
Posts: 40
Default

Quote:
Originally Posted by peromhc View Post
I think that these values should be log10-transformed. This is not documented; it is only my suspicion.

log10 values from Cufflinks roughly equal FPKM values from Cuffdiff.
That does not make sense to me, unless it is an option in either Cufflinks or Cuffdiff; I have never seen a log relationship between Cufflinks and Cuffdiff outputs.

Honey, how did you run Cufflinks? RABT mode or simple "quantification" mode? How long are the genes with super-high RPKM?

It seems to me that Cufflinks has a tendency to report super-high RPKM for very short transcripts (such as microRNA). I now routinely filter out the transcripts shorter than the expected fragment size (from the GTF annotation file). I think there is a good rationale to filter them out, because they cannot be accurately captured by the RNA-Seq protocol.

In RABT mode, Cufflinks also reports a large number of short transcripts with crazy high values. A solution could be to re-quantify the discovered transcripts with something like BEDtools or HTSeq-count.
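
Once a count-based tool has produced raw fragment counts, they can be turned back into FPKM with the plain definition (fragments per kilobase of transcript per million mapped fragments). The sketch below is only that textbook formula, with no bias or fragment-length correction; the function name and the example numbers are made up for illustration:

```python
def fpkm_from_count(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Plain FPKM: count * 1e9 / (length in bp * total mapped fragments)."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical abundant transcript: 60,000 fragments on 1,500 bp
# in a library of 20 million mapped fragments.
print(fpkm_from_count(60_000, 1_500, 20_000_000))
```

With these assumed numbers the transcript takes only 0.3% of the library yet already reaches an FPKM of 2000, so values in the thousands are not impossible for a highly expressed gene of ordinary length.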
Old 02-07-2012, 01:06 PM   #10
Xiaobin
Junior Member
 
Location: Baltimore

Join Date: Jun 2011
Posts: 2
Default

Are those genes very short?
Because Cufflinks subtracts the fragment length from the gene length when calculating FPKM, it sometimes gives this kind of result.
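
A rough worked example of that effect, using the common simplification that the effective length is the transcript length minus the mean fragment length plus one (an approximation for illustration, not the exact Cufflinks calculation; the 200 bp mean fragment length is assumed):

```latex
\[
\mathrm{FPKM} = \frac{10^{9}\, X}{\tilde{\ell}\, N},
\qquad
\tilde{\ell} \approx \ell - \mu + 1
\]
% X: fragments mapped to the transcript, N: total mapped fragments,
% \ell: transcript length, \mu: mean fragment length (assumed 200 bp)
\[
\ell = 250:\quad \tilde{\ell} = 51,\quad \frac{\ell}{\tilde{\ell}} \approx 4.9
\qquad\qquad
\ell = 5000:\quad \tilde{\ell} = 4801,\quad \frac{\ell}{\tilde{\ell}} \approx 1.04
\]
```

So for the same count, a 250 bp transcript is reported almost five times higher than it would be with its full length in the denominator, while a 5 kb transcript barely changes.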
Old 02-07-2012, 01:27 PM   #11
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default very high RPKM

Here are three examples with genomic coordinates

CGA: 87795222-87804824
KISS1: 204159469-204165619
TFPI2: 93515745-93520065

Should I then go back to the count method?

Thanks for all your help.
Old 02-07-2012, 01:29 PM   #12
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default High RPKM

Quote:
Originally Posted by Nicolas View Post
That does not make sense to me, unless it is an option in either Cufflinks or Cuffdiff; I have never seen a log relationship between Cufflinks and Cuffdiff outputs.

Honey, how did you run Cufflinks? RABT mode or simple "quantification" mode? How long are the genes with super-high RPKM?

It seems to me that Cufflinks has a tendency to report super-high RPKM for very short transcripts (such as microRNA). I now routinely filter out the transcripts shorter than the expected fragment size (from the GTF annotation file). I think there is a good rationale to filter them out, because they cannot be accurately captured by the RNA-Seq protocol.

In RABT mode, Cufflinks also reports a large number of short transcripts with crazy high values. A solution could be to re-quantify the discovered transcripts with something like BEDtools or HTSeq-count.
I used simple quantification.

So you mean the count method is probably better?
Old 02-07-2012, 01:41 PM   #13
Xiaobin
Junior Member
 
Location: Baltimore

Join Date: Jun 2011
Posts: 2
Default

These genes don't seem to be that short. There must be other reasons.
I suggest you try the count method first. Cufflinks is just too complex to understand.
Old 02-28-2012, 03:45 PM   #14
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

This issue has been discussed elsewhere on this board. As Nicolas points out, RNA-Seq really isn't reliable for very short transcripts. The reason is that all the fragments that map to these transcripts come from the "tail" of the distribution of library fragment lengths. That is, fragments that map to microRNAs are much, much shorter than most fragments in the library - by design in the RNA-Seq protocol, which size-selects away very short inserts. Thus, Cufflinks infers that even though relatively few fragments actually mapped to the microRNAs, there were probably TONS of individual microRNA molecules in the transcriptome before all of the various size selection parts of the protocol kicked in. Cufflinks accordingly increases the FPKM of these short transcripts to compensate for the bias against short fragments in the library.

This compensation was designed to improve accuracy for transcripts that are in the 500bp-1kb range - for longer transcripts, the "edge effects" due to library fragment size aren't much of an issue. However, I wouldn't trust FPKM values for transcripts shorter than your average fragment length. There's really just not enough data in most standard RNA-Seq libraries to say much about small RNA abundance.
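
The compensation described above can be sketched as an "effective length": for each possible fragment length, count the positions where a fragment of that length fits in the transcript, and weight by how common that fragment length is in the library. The code below is a simplified illustration of that idea and not the Cufflinks implementation; the normal fragment-length distribution (mean 200 bp, SD 20 bp) and the function names are assumptions.

```python
import math


def fragment_length_pmf(mean, sd, max_len):
    """Discretised normal distribution over fragment lengths 1..max_len (assumed shape)."""
    weights = [math.exp(-0.5 * ((length - mean) / sd) ** 2)
               for length in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def effective_length(transcript_len, pmf):
    """Sum over fragment lengths l <= transcript_len of P(l) * (transcript_len - l + 1)."""
    return sum(p * (transcript_len - frag_len + 1)
               for frag_len, p in enumerate(pmf, start=1)
               if frag_len <= transcript_len)


pmf = fragment_length_pmf(mean=200, sd=20, max_len=400)
for length in (100, 250, 500, 1000, 5000):
    eff = effective_length(length, pmf)
    # length / eff is the factor by which the FPKM is raised relative to
    # dividing by the full transcript length.
    print(f"{length:>5} bp  effective length {eff:10.4f}  inflation x{length / eff:.3g}")
```

Under these assumptions the effective length of a 100 bp transcript collapses to almost nothing, so even a handful of mapped fragments turns into an enormous FPKM, while a 5 kb transcript is essentially unaffected.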

I should also point out that other methods use this same bias correction technique (RSEM for example). As far as I'm aware, the "count-based" methods don't, but that doesn't mean they shouldn't. Most of those methods are strictly for differential analysis, where any edge effects are assumed to be affecting each condition the same way. That may or may not be the case in your data.

In any case, the quick answer to this problem is to simply remove or ignore transcripts shorter than around 300bp from your GTF. In a future version, we will be flagging these transcripts as too short for reliable quantification where appropriate.
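
The filtering suggested above could look roughly like the following Python sketch. It assumes a standard 9-column GTF in which every transcript has `exon` records carrying a `transcript_id` attribute, and it measures transcript length as the sum of exon lengths; the function names and the toy records are hypothetical.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_lengths(gtf_lines):
    """Sum exon lengths per transcript_id (GTF coordinates are 1-based, inclusive)."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            lengths[match.group(1)] += int(fields[4]) - int(fields[3]) + 1
    return dict(lengths)


def drop_short_transcripts(gtf_lines, min_length=300):
    """Return the lines whose transcript is at least min_length bp long.

    Lines without a transcript_id, or whose transcript has no exon
    records, are kept unchanged.
    """
    lengths = transcript_lengths(gtf_lines)
    kept = []
    for line in gtf_lines:
        match = TRANSCRIPT_ID.search(line)
        if match and lengths.get(match.group(1), min_length) < min_length:
            continue
        kept.append(line)
    return kept


# Toy records: one 81 bp transcript and one 500 bp two-exon transcript.
example = [
    'chr1\tdemo\texon\t100\t180\t.\t+\t.\tgene_id "g1"; transcript_id "short_tx";',
    'chr1\tdemo\texon\t1000\t1299\t.\t+\t.\tgene_id "g2"; transcript_id "long_tx";',
    'chr1\tdemo\texon\t2000\t2199\t.\t+\t.\tgene_id "g2"; transcript_id "long_tx";',
]
for line in drop_short_transcripts(example, min_length=300):
    print(line)
```

For a real annotation the lines would be read from the GTF file and the kept lines written to a new file that is then passed to the quantification step.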
Old 02-29-2012, 07:47 AM   #15
epi
Member
 
Location: USA

Join Date: Jan 2012
Posts: 38
Default

Hi Cole, thanks for your post. I keep reading your comments here, which are useful for many, including me. I asked a similar question, with a twist, here: http://seqanswers.com/forums/showthread.php?t=17992

Can you comment, please? In short, it is about how to deal with larger (>300 bp) transcripts with high FPKMs.




Old 03-01-2012, 01:52 AM   #16
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default short genes

hello cole and epi,

i have made some comparisons of FPKM values (calculated from count data, not with cufflinks) with corresponding microarray data with regard to gene length, and found some interesting details.

it is true that FPKM values from genes with length below 500 bp correlate much less with expression values derived from microarrays, but the differences between tissues (i.e. normal vs. cancer) from small genes correlated even better (NGS vs microarray) than the differences from larger genes (interestingly, the correlations go down continuously; see figure).

and in most study designs, the differences between two conditions are important, not the absolute expression values. therefore, i would not exclude small genes from statistical analysis!

in green and blue are the correlations of lg2_FPKM values from 12 normal and 12 cancer tissues, respectively, with corresponding lg2_microarray_expression values from 26 normal and 26 cancer tissues. in red the correlations of the differences (lg2_FPKM_NORMAL - lg2_FPKM_CANCER vs. lg2_microarray_NORMAL - lg2_microarray_CANCER) are shown. on the x-axis genes are grouped according to gene length (and the number of genes in each bin is shown), e.g. 190 genes are below 500 bp length.
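
One way to see how differences can agree even when absolute values do not: if a gene-specific bias multiplies a gene's FPKM by roughly the same factor in both conditions, it largely cancels in the log2 difference. The toy Python sketch below uses entirely made-up numbers for four genes (not the data behind the figure), and the helper names and the pseudocount of 1 are assumptions:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log2_values(fpkms, pseudocount=1.0):
    """log2(FPKM + pseudocount) for each value."""
    return [math.log2(v + pseudocount) for v in fpkms]


# Made-up data: microarray values are already log2; the FPKM values carry
# a different, gene-specific bias for each of the four genes.
array_normal = [8.0, 9.0, 10.0, 7.0]
array_cancer = [9.0, 8.0, 10.5, 6.0]
fpkm_normal = [800.0, 16.0, 64.0, 160.0]
fpkm_cancer = [1600.0, 8.0, 90.5, 80.0]

lg2_normal = log2_values(fpkm_normal)
lg2_cancer = log2_values(fpkm_cancer)

# Correlation of absolute expression (normal tissue only).
r_abs = pearson(lg2_normal, array_normal)

# Correlation of the normal-minus-cancer differences.
seq_diff = [n - c for n, c in zip(lg2_normal, lg2_cancer)]
array_diff = [n - c for n, c in zip(array_normal, array_cancer)]
r_diff = pearson(seq_diff, array_diff)

print(f"absolute: r = {r_abs:.3f}   differences: r = {r_diff:.3f}")
```

In this constructed example the absolute values disagree between platforms while the differences agree closely, which is the pattern described for the short genes.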
Old 03-01-2012, 05:00 AM   #17
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default

@dietmar

What method did you use to calculate expression on the microarrays and what kind of microarrays were they?
Old 03-01-2012, 06:35 AM   #18
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default @arvid

publicly available: GSE25070

http://www.ncbi.nlm.nih.gov/projects...i?acc=GSE25070
Old 03-02-2012, 03:10 AM   #19
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

Quote:
Originally Posted by dietmar13 View Post
Thanks! Is the corresponding RNA-seq data available too by any chance?
Old 03-02-2012, 04:48 AM   #20
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default steven, you are lucky,

neither dataset is from us, i am only playing around...

12 normal vs 12 colon cancer, paired:
sra:
SRP007584