SEQanswers

Old 05-16-2013, 12:50 AM   #1
wupengpro
Junior Member
 
Location: China

Join Date: Jun 2012
Posts: 5
Default How to choose a FPKM cut-off

Dear all,
I have a question about how to choose an FPKM cut-off when using cufflinks. Could anyone give some detailed suggestions on how to choose an appropriate FPKM cut-off to judge whether a given gene is expressed or not? My RNA-seq data (90PE, generated on an Illumina HiSeq 2000) has about 40 million reads per sample.
Thx.
Old 05-16-2013, 01:10 AM   #2
wupengpro
Junior Member
 
Location: China

Join Date: Jun 2012
Posts: 5
Default

Maybe my understanding of this question is not correct, but what I really want is to find genes that are specifically expressed in one sample and not in the others.
Thanks for any suggestions.
Old 05-16-2013, 08:22 AM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 868
Default

My lab includes Ambion ERCC spike-ins in our RNA samples. We use those to judge how far down expression is still quantitative.
Old 05-16-2013, 08:25 AM   #4
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 253
Default

That's really not an answerable question. Lowly expressed genes could be important, especially for things like transcription factors. So it's pretty hard to come up with a number that matches biological significance.

Anyway, what is it you're trying to get from this RNAseq data? If you're looking for candidates to investigate further, I would suggest coming up with some sort of ranking based on adjusted p-value, fold change, FPKM (for cutoffs or rankings I often just use the highest FPKM across all the samples), GO classes, any biological knowledge currently available, or anything else you can think of.
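
One way to picture the ranking idea above is the short Python sketch below. It is only an illustration: the gene names, values, the `rank_candidates` helper and the cutoff of 1.0 are all invented here, and the choice of sort keys is an assumption rather than a recommendation from this thread.

```python
# Hypothetical candidate table; gene names and values are invented for illustration.
candidates = [
    {"gene": "geneA", "padj": 0.001, "log2fc": 3.2, "fpkm": [0.4, 5.1, 0.2]},
    {"gene": "geneB", "padj": 0.04, "log2fc": 1.1, "fpkm": [120.0, 250.0, 110.0]},
    {"gene": "geneC", "padj": 0.001, "log2fc": -4.0, "fpkm": [0.01, 0.002, 0.0]},
]


def rank_candidates(rows, min_max_fpkm=1.0):
    """Keep genes whose highest FPKM across samples reaches the cutoff,
    then sort by adjusted p-value, breaking ties by absolute fold change."""
    kept = [r for r in rows if max(r["fpkm"]) >= min_max_fpkm]
    return sorted(kept, key=lambda r: (r["padj"], -abs(r["log2fc"])))


ranked = rank_candidates(candidates)
print([r["gene"] for r in ranked])
```

Here geneC drops out because its highest FPKM never reaches the cutoff, even though its adjusted p-value and fold change look strong. Other criteria (GO classes, prior biological knowledge) would have to be added by hand.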
Old 05-16-2013, 09:57 AM   #5
colaneri
Member
 
Location: Durham

Join Date: Jul 2012
Posts: 20
Default

I agree that lowly expressed genes can be important; however, it is still important to me to understand whether an FPKM < 0.0001 makes any sense. Example: you have two conditions (A and B) with 3 replicates each, and for a lowly expressed gene you get the following FPKMs:
A1: 0.00001 A2: 0.000012 A3: 0.000015 and B1: 0.001 B2: 0.0012 B3: 0.0015

A statistical comparison (blind to biological meaning) will find that the biological replicates are consistent, so the gene will be accepted as differentially regulated (rejection of the null hypothesis).
The fold change will also be large (100 in this example).
However, although the difference is real in the library, it does not necessarily represent a difference in gene expression. It is known that the composition of very abundant genes can affect the composition of the entire library, and poorly expressed genes will be the most affected. So the differences in this example could be the result of a change in expression of very abundant genes.
I would still like to know how to select a meaningful cut-off.
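
The composition effect described above can be shown with a toy Python sketch. All numbers and gene names are invented, and the sketch assumes every gene has the same length, so that a gene's share of the library stands in for its FPKM.

```python
def relative_share(abundances):
    """Return each gene's fraction of the library, which is what FPKM
    measures up to transcript length and a constant scaling factor."""
    total = sum(abundances.values())
    return {gene: amount / total for gene, amount in abundances.items()}


# Hypothetical absolute transcript amounts; 'rare' is identical in both conditions.
cond_a = {"abundant": 9000.0, "other": 999.0, "rare": 1.0}
cond_b = {"abundant": 1000.0, "other": 999.0, "rare": 1.0}

share_a = relative_share(cond_a)["rare"]
share_b = relative_share(cond_b)["rare"]

# The rare gene did not change, yet its share of the library did.
apparent_fold_change = share_b / share_a
print(apparent_fold_change)
```

In this toy case the rare gene appears roughly 5-fold up in B purely because the abundant gene went down. The sketch shows the direction of the artifact only; it says nothing about how large the effect is in a real library.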
Old 05-16-2013, 11:14 AM   #6
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 253
Default

If you're worried about changes in the most highly expressed genes being a major driving influence, you can use upper quartile normalization in cufflinks. That is the whole reason the developers created that option.

Otherwise, I don't know that anyone will be able to tell you what you want to hear. Personally, I wouldn't worry about a differentially expressed gene under an FPKM of ~1 unless I had good reason to care about it. So it all comes back to why you want to set this cutoff in the first place. Are you coming up with candidates to screen for something? Is it just to create DE gene heatmaps and do GO/KEGG analysis? The biological reason for the cutoff is important, so I could help more if you answered my earlier question: what is it you're trying to get from this RNAseq data?
Old 05-16-2013, 11:26 AM   #7
colaneri
Member
 
Location: Durham

Join Date: Jul 2012
Posts: 20
Default

I'm comparing the responses of two genotypes to a hormone. I extracted total RNA from whole seedlings. I suspect that the mutant genotype has an altered response in the translational switch, so I cannot get rid of highly expressed genes (differences in ribosomal genes can be important to me). In addition, the mutant gene must be important in the meristem (which is a small population of cells within the whole organism). So you would expect tissue-specific, differentially expressed genes to be lowly represented in the library but still important to me.
Old 05-16-2013, 11:39 AM   #8
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 253
Default

First, I don't think upper quartile normalization removes the highly expressed genes; it just doesn't use the reads that map to them for the "per million reads" part of the FPKM. It is different from giving cufflinks a GTF file of genes to completely exclude.
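
To make the difference between the two denominators concrete, here is a minimal Python sketch. It is not cufflinks' code: the function names and counts are invented, and taking the 75th percentile of non-zero per-gene counts is an assumption about how an upper-quartile denominator could be computed.

```python
import statistics


def total_count_factor(counts):
    """Classic denominator: all mapped fragments in the library."""
    return sum(counts)


def upper_quartile_factor(counts):
    """Sketch of an upper-quartile denominator: the 75th percentile of the
    non-zero per-gene counts. Cufflinks' internal details may differ."""
    nonzero = [c for c in counts if c > 0]
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


# Hypothetical per-gene counts; only the most abundant gene differs between samples.
sample_1 = [10, 20, 30, 40, 1000]
sample_2 = [10, 20, 30, 40, 5000]

print(total_count_factor(sample_1), total_count_factor(sample_2))
print(upper_quartile_factor(sample_1), upper_quartile_factor(sample_2))
```

In this toy data the total-count denominator changes a lot between the two samples, while the upper-quartile denominator stays the same. The highly expressed gene is still in the data and still gets its own value; it simply stops driving the scaling of every other gene.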

And for the broader biological question, if you're specifically interested in the meristem, why not do RNAseq specifically on that tissue (do they call it tissue in plants?). Or you could at least come up with candidates of lowly expressed, but differentially expressed, genes to do qPCR on in the meristem. Otherwise, how are you going to know that the DE call for a gene actually comes from changes in the meristem? Once you do this a couple of times, especially if you have some positive controls you already expect to be specific to the meristem and differentially expressed there, you might come up with an FPKM range to expect from similar genes. But of course you'd still have to validate this meristem connection.

Sorry, I just don't see a very clear way to get to where you want to be from your current data.
Old 05-16-2013, 11:51 AM   #9
colaneri
Member
 
Location: Durham

Join Date: Jul 2012
Posts: 20
Default

Wally, thanks for your time and answers!
I had the wrong idea about quartile normalization! I thought that highly expressed genes were completely excluded from the analysis.
Yes, you can call it meristematic tissue in plants. I gave up on the idea of taking samples from the meristem for technical reasons. You picked up on my intention: "identify differentially expressed putative meristematic candidates".
But going back to the bioinformatic problem.
You said that you would not care about genes with an FPKM of less than one. OK, maybe I will take that as a starting point, but why FPKM < 1?
Old 05-16-2013, 12:18 PM   #10
sdriscoll
I like code
 
Location: La Mesa, CA, USA

Join Date: Sep 2009
Posts: 338
Default

I kind of think about this in two ways with the researchers I work with.

First, presumably, the FPKM value tells you something about how rare or frequent a gene's transcripts are within your sample. So, depending on your sample, it could be true that a gene is expressed at a level low enough that it's not really making an impact. That kind of assumes that there are expressed genes floating around serving absolutely no purpose, which is probably not the case. So although a gene has low expression, it could still be important, especially since the nature of experimentation is to learn about things previously unknown.

Second, there's a technical aspect here, but you need access to the read count data. Read counts below a certain level are pretty unreliable. For example, you might find very different counts for genes with fewer than 20 hits depending solely on which aligner you use and which method you use to count those hits. As the count levels go up they tend to become more stable. So clearly this is directly tied to your total read depth in the experiment.

A quick example of the technical aspect:
Assume a 1,000 bp transcript. Experiment 1 has 5,000,000 total reads and this transcript received 5 hits. This calculates out to an FPKM of 1.0. But that FPKM is based on only 5 hits, which is entirely unreliable. Experiment 2 has 100,000,000 total reads and this transcript has 100 hits. This also calculates out to an FPKM of 1.0; however, this FPKM is much more reliable as it's based on 100 hits, which is a more stable count level. Aligner error and counting method might change that count by only 5%, whereas the count of 5 could vary by 80% or more.
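
The arithmetic in this example can be checked with a few lines of Python. The `fpkm` function is the standard formula; the second function is an added assumption that treats hits as a Poisson count, which is a different (and simpler) noise source than the aligner and counting-method variation mentioned above.

```python
import math


def fpkm(hits, transcript_length_bp, total_reads):
    """Fragments per kilobase of transcript per million mapped reads."""
    return hits * 1e9 / (transcript_length_bp * total_reads)


def poisson_relative_error(hits):
    """Relative standard deviation of a Poisson count: 1 / sqrt(hits).
    Assumption for illustration; real variability has other sources too."""
    return 1.0 / math.sqrt(hits)


# (hits, total reads) for experiment 1 and experiment 2, 1,000 bp transcript.
for hits, total in [(5, 5_000_000), (100, 100_000_000)]:
    print(hits, fpkm(hits, 1000, total), round(poisson_relative_error(hits), 2))
```

Both experiments give an FPKM of 1.0, but under the Poisson assumption the count of 5 carries a relative error of about 45% from sampling alone, against about 10% for the count of 100.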

I believe that cuffdiff/cufflinks attempts to red-flag this kind of thing with its LOWDATA flag, but I haven't really tried to confirm that.

So there IS a way to apply an FPKM cutoff: apply the technical limitations of your experiment to the count data. There isn't, however, a biologically relevant cutoff that I know of, nor, I'd think, would you want one.
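
As a rough sketch of that idea, a minimum hit count can be converted into a per-transcript FPKM floor. The helper name is invented, the floor of 20 hits is borrowed from the example earlier in this post, and the 40 million figure assumes the original poster's depth counts mapped fragments, which may not hold for paired-end data.

```python
def fpkm_cutoff_from_min_hits(min_hits, transcript_length_bp, total_reads):
    """FPKM that corresponds to exactly `min_hits` hits on a transcript
    of the given length at the given sequencing depth."""
    return min_hits * 1e9 / (transcript_length_bp * total_reads)


MIN_HITS = 20             # assumed reliability floor, not a universal constant
TOTAL_READS = 40_000_000  # assumed number of mapped fragments per sample

for length_bp in (500, 1000, 4000):
    cutoff = fpkm_cutoff_from_min_hits(MIN_HITS, length_bp, TOTAL_READS)
    print(length_bp, cutoff)
```

Under these assumptions the floor works out to an FPKM of 1.0 for a 500 bp transcript, 0.5 for 1,000 bp and 0.125 for 4,000 bp, so a single FPKM cutoff does not correspond to a single count level across transcripts of different lengths.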
Old 05-16-2013, 05:19 PM   #11
wupengpro
Junior Member
 
Location: China

Join Date: Jun 2012
Posts: 5
Default

Quote:
Originally Posted by Wallysb01 View Post
If you're worried about changes in the most highly expressed genes being a major driving influence, you can use upper quartile normalization in cufflinks. That is the whole reason the developers created that option.

Otherwise, I don't know that anyone will be able to tell you what you want to hear. Personally, I wouldn't worry about a differentially expressed gene under an FPKM of ~1 unless I had good reason to care about it. So it all comes back to why you want to set this cutoff in the first place. Are you coming up with candidates to screen for something? Is it just to create DE gene heatmaps and do GO/KEGG analysis? The biological reason for the cutoff is important, so I could help more if you answered my earlier question: what is it you're trying to get from this RNAseq data?
Thank you for the kind answers. My data include 3 distinct cells. I am trying to identify cell-specific genes (expressed in only one cell).
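
A minimal Python sketch of that kind of filter is shown below. Everything in it is hypothetical: the gene and cell names, the FPKM values, and in particular the two thresholds, which would need to be justified from the data (for example from read counts or spike-ins, as discussed earlier in the thread).

```python
def cell_specific_genes(fpkm_table, on_cutoff=1.0, off_cutoff=0.1):
    """Return {gene: cell} for genes that reach `on_cutoff` in exactly one
    cell and stay at or below `off_cutoff` in every other cell."""
    specific = {}
    for gene, by_cell in fpkm_table.items():
        on = [cell for cell, value in by_cell.items() if value >= on_cutoff]
        off = [cell for cell, value in by_cell.items() if value <= off_cutoff]
        if len(on) == 1 and len(off) == len(by_cell) - 1:
            specific[gene] = on[0]
    return specific


# Hypothetical FPKM values for three cells.
table = {
    "gene1": {"cellA": 12.0, "cellB": 0.02, "cellC": 0.0},
    "gene2": {"cellA": 8.0, "cellB": 7.5, "cellC": 9.1},
    "gene3": {"cellA": 0.05, "cellB": 0.5, "cellC": 3.0},
}

result = cell_specific_genes(table)
print(result)
```

Using two separate thresholds leaves a grey zone between "off" and "on": gene3 is highest in cellC but is not called specific, because its value in cellB is neither clearly on nor clearly off.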

Tags
cufflinks, cutoff, expressed, fpkm
