
**wupengpro** · 05-16-2013, 01:10 AM

Maybe my understanding of this question is wrong, but I really want to find genes that are specifically expressed in distinct samples.
Thanks for any suggestions.

**swbarnes2** · 05-16-2013, 08:22 AM

My lab includes Ambion ERCC spike-ins in our RNA samples. We use those to judge how far down expression is still quantitative.

**Wallysb01** · 05-16-2013, 08:25 AM

That's really not an answerable question. Lowly expressed genes could be important, especially for things like transcription factors. So it's pretty hard to come up with a number that matches biological significance.

Anyway, what is it you're trying to get from this RNAseq data? If you're looking for candidates to investigate further, I would suggest coming up with some sort of ranking based on adjusted p-value, fold change, FPKM (for cutoffs or rankings I often just use the highest FPKM across all the samples), GO classes, any biological knowledge currently available, or anything else you can think of.
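A minimal sketch of such a ranking, assuming each gene is a dictionary with hypothetical keys `padj`, `log2fc`, and `fpkm` (one FPKM per sample). The example values are made up for illustration, and the choice of sort keys is only one of many reasonable options.

```python
def rank_candidates(genes, max_padj=0.05):
    """Keep genes passing the adjusted p-value filter, then sort them by
    absolute log2 fold change, using the highest FPKM as a tie-breaker."""
    kept = [g for g in genes if g["padj"] <= max_padj]
    return sorted(
        kept,
        key=lambda g: (abs(g["log2fc"]), max(g["fpkm"])),
        reverse=True,
    )

# Hypothetical input, not real data.
genes = [
    {"id": "geneA", "padj": 0.001, "log2fc": 3.0, "fpkm": [0.5, 4.0]},
    {"id": "geneB", "padj": 0.20, "log2fc": 5.0, "fpkm": [10.0, 320.0]},
    {"id": "geneC", "padj": 0.01, "log2fc": -3.0, "fpkm": [80.0, 10.0]},
]
ranked = [g["id"] for g in rank_candidates(genes)]
print(ranked)
```

Here geneB is dropped by the p-value filter, and geneC ranks ahead of geneA because the two tie on absolute fold change and geneC has the higher maximum FPKM.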

**colaneri** · 05-16-2013, 09:57 AM

I agree that lowly expressed genes can be important; however, it is still important to me to understand whether an FPKM < 0.0001 makes any sense. Example: you have two conditions (A and B) with 3 replicates each, and for a lowly expressed gene you get the following FPKMs:
A1: 0.00001 A2: 0.000012 A3: 0.000015 and B1: 0.001 B2: 0.0012 B3: 0.0015

A statistical comparison (blind to biological meaning) will find that the biological replicates are consistent, so the gene will be accepted as differentially regulated (the null hypothesis is rejected).
The fold change will also be big (100 in this example).
However, although the difference is real in the library, it does not necessarily represent a difference in gene expression. It is known that changes in very abundant genes can affect the composition of the entire library, and poorly expressed genes will be the most affected. So the differences in this example could be the result of a change in expression of very abundant genes.
I would still like to know how to select a meaningful cutoff.
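The composition effect described above can be illustrated with a toy calculation. This is only a sketch: the molecule numbers are hypothetical, and it assumes all transcripts have the same length, so a gene's share of fragments equals its share of molecules.

```python
def expected_fpkm(molecules, gene, length_bp=1000):
    """Expected FPKM of `gene`, assuming fragments are sampled in proportion
    to molecule abundance and every transcript is `length_bp` long."""
    share = molecules[gene] / sum(molecules.values())
    return share * 1e9 / length_bp

# Hypothetical molecule counts: only the abundant gene changes between conditions.
cond_a = {"abundant": 900_000, "others": 99_990, "low": 10}
cond_b = {"abundant": 100_000, "others": 99_990, "low": 10}

fpkm_a = expected_fpkm(cond_a, "low")
fpkm_b = expected_fpkm(cond_b, "low")
apparent_fold_change = fpkm_b / fpkm_a
print(fpkm_a, fpkm_b, apparent_fold_change)
```

The "low" gene has the same number of molecules in both conditions, yet its FPKM rises about 5-fold in B simply because the abundant gene takes up a smaller share of the library.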

**Wallysb01** · 05-16-2013, 11:14 AM

If you're worried about changes in the most highly expressed genes being a major driving influence, you can do an upper-quartile normalization in cufflinks. That is the whole reason the developers created that option.

Otherwise, I don't know that anyone will be able to tell you what you want to hear. To me, I wouldn't worry about a differentially expressed gene under an FPKM of ~1, unless I had good reason to care about it. So it all comes back to why you want to set this cutoff in the first place. Are you coming up with candidates to screen for something? Is it just to create DE gene heatmaps and do GO/KEGG analysis? The biological reason for the cutoff is important. So I could help more if you answered my earlier question: what is it you're trying to get from this RNAseq data?

**colaneri** · 05-16-2013, 11:26 AM

I'm comparing the responses of two genotypes to a hormone. I extracted total RNA from whole seedlings. I suspect that the mutant genotype has an altered response in the translational switch, so I cannot get rid of highly expressed genes (differences in ribosomal genes could be important to me). In addition, the mutant gene must be important in the meristem (which is a small population of cells within the whole organism). So you would expect genes that are tissue-specific and differentially expressed to be lowly represented in the library, but they are still important to me.

**Wallysb01** · 05-16-2013, 11:39 AM

First, I don't think the upper quartile normalization removes the highly expressed genes, it just doesn't use the reads that map to them for the "per million reads" part of the FPKM. It is different from giving cufflinks a GTF file of genes to completely exclude.
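To illustrate why a scaling factor based on the upper quartile is less sensitive to a few very abundant genes than the total read count, here is a toy sketch. It is not Cufflinks' actual implementation: the nearest-rank percentile, the restriction to non-zero counts, and the example counts are all assumptions made for illustration.

```python
import math

def upper_quartile(counts):
    """75th percentile of the non-zero counts (nearest-rank method)."""
    nonzero = sorted(c for c in counts if c > 0)
    rank = max(1, math.ceil(0.75 * len(nonzero)))
    return nonzero[rank - 1]

# Hypothetical per-gene counts; only the last (very abundant) gene differs.
counts_a = [10, 20, 30, 40, 50, 60, 70, 900_000]
counts_b = [10, 20, 30, 40, 50, 60, 70, 100_000]

total_a, total_b = sum(counts_a), sum(counts_b)
uq_a, uq_b = upper_quartile(counts_a), upper_quartile(counts_b)
print(total_a, total_b)  # the total-count scaling factors differ a lot
print(uq_a, uq_b)        # the upper-quartile scaling factors are identical
```

No gene is removed from the data in either case; only the denominator used for scaling changes.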

And for the broader biological question, if you're specifically interested in the meristem, why not do RNAseq specifically on that tissue (do they call it tissue in plants?). Or you could at least come up with candidates of lowly expressed, but differentially expressed, genes to do qPCR on in the meristem. Otherwise, how are you going to know that the DE call for this gene is actually from changes in the meristem? Once you do this a couple of times, especially if you have some positive controls you already expect to be specific to the meristem and differentially expressed in the meristem, you might come up with an FPKM range to expect from similar genes. But of course you'd still have to validate this meristem connection.

Sorry, I just don't see a very clear way to get to where you want to be from your current data.

**colaneri** · 05-16-2013, 11:51 AM

Wally, thanks for your time and answers!
I had the wrong idea about quartile normalization! I thought that highly expressed genes were completely excluded from the analysis.
Yes, you can call the meristem a tissue in plants. I gave up on the idea of sampling the meristem for technical reasons. You picked up on my intention: "identify differentially expressed putative meristematic candidates".
But going back to the bioinformatics problem:
you said that you would not care about genes with an FPKM less than one. OK, maybe I will take that as a starting point, but why FPKM < 1?

**sdriscoll** · 05-16-2013, 12:18 PM

I kind of think about this in two ways with the researchers I work with.

First, presumably, the FPKM value tells you something about how rare or frequent a gene's transcript is within your sample. So it could be true, depending on your sample, that a gene is expressed at a level low enough that it's not really making an impact. That kind of assumes that there are expressed genes floating around serving absolutely no purpose, which is probably not the case. So although a gene has low expression, it could still be important, especially since the nature of experimentation is to learn about things previously unknown.

Second, there's a technical aspect here, but you need access to the read count data. Read counts below a certain level are pretty unreliable. For example, you might find very different counts for genes with fewer than 20 hits depending solely on which aligner you use and which method you use to count those hits. As the count levels go up they tend to become more stable. So this is directly tied to your total read depth in the experiment.

A quick example of the technical aspect:
Assume a 1,000 bp transcript. Experiment 1 has 5,000,000 total reads and this transcript received 5 hits. This calculates out to an FPKM of 1.0, but that FPKM is based on only 5 hits, which is entirely unreliable. Experiment 2 has 100,000,000 total reads and this transcript has 100 hits. This also calculates out to an FPKM of 1.0; however, this FPKM is much more reliable as it's based on 100 hits, which is a more stable count level. Aligner error and count methods might only vary that count value by 5%, whereas the count of 5 could vary by 80% or more.
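The arithmetic in this example can be checked with a few lines of Python. The `fpkm` helper is just the standard definition (hits per kilobase of transcript per million reads). The relative-noise figure is an extra illustration that assumes pure Poisson counting noise, which is an assumption not made in the post and separate from the aligner and counting effects described there.

```python
import math

def fpkm(hits, length_bp, total_reads):
    """Hits per kilobase of transcript per million mapped reads."""
    return hits / ((length_bp / 1e3) * (total_reads / 1e6))

def poisson_relative_sd(hits):
    """Relative standard deviation of a count under Poisson sampling (assumed)."""
    return 1 / math.sqrt(hits)

exp1 = fpkm(5, 1_000, 5_000_000)        # 5 hits in 5 million reads
exp2 = fpkm(100, 1_000, 100_000_000)    # 100 hits in 100 million reads
print(exp1, exp2)                        # both come out to 1.0
print(poisson_relative_sd(5), poisson_relative_sd(100))
```

Both experiments give the same FPKM, but the count behind the first one is far noisier even before aligner and counting differences are considered.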

I believe that cuffdiff/cufflinks attempts to red-flag these kinds of cases with its LOWDATA status, but I haven't really tried to confirm that.

So there IS a way to apply a cutoff to FPKMs, by applying the technical limitations of your experiment to the count data. There isn't, however, a biologically relevant cutoff that I know of... nor, I'd think, would you want one.
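One way to sketch such a technical cutoff is to pick a minimum hit count you trust and convert it into the FPKM it corresponds to for a given transcript length and read depth. The threshold of 20 hits below is borrowed from the example earlier in this post and is an assumption, not a recommended value.

```python
def fpkm_cutoff_for_count(min_hits, length_bp, total_reads):
    """FPKM that corresponds to `min_hits` hits on a transcript of
    `length_bp` in a library of `total_reads` mapped reads."""
    return min_hits / ((length_bp / 1e3) * (total_reads / 1e6))

MIN_HITS = 20  # assumed reliability threshold, for illustration only

cutoff_shallow = fpkm_cutoff_for_count(MIN_HITS, 1_000, 5_000_000)
cutoff_deep = fpkm_cutoff_for_count(MIN_HITS, 1_000, 100_000_000)
print(cutoff_shallow, cutoff_deep)
```

For a 1,000 bp transcript, the same 20-hit threshold corresponds to an FPKM of about 4 at 5 million reads but only about 0.2 at 100 million reads, so the cutoff depends on depth and transcript length rather than being a single universal number.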

**wupengpro** · 05-16-2013, 05:19 PM

> Originally posted by Wallysb01:
>
> If you're worried about changes in the most highly expressed genes being a major driving influence, you can do an upper-quartile normalization in cufflinks. That is the whole reason the developers created that option.
>
> Otherwise, I don't know that anyone will be able to tell you what you want to hear. To me, I wouldn't worry about a differentially expressed gene under an FPKM of ~1, unless I had good reason to care about it. So it all comes back to why you want to set this cutoff in the first place. Are you coming up with candidates to screen for something? Is it just to create DE gene heatmaps and do GO/KEGG analysis? The biological reason for the cutoff is important. So I could help more if you answered my earlier question: what is it you're trying to get from this RNAseq data?

Thank you for your kind answers. My data now include 3 distinct cells. I am trying to find cell-specific genes (genes expressed in only one cell).
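A minimal sketch of one way to flag cell-specific genes: call a gene specific when its FPKM is at or above an "on" cutoff in exactly one cell and at or below an "off" cutoff in all the others. Both cutoffs, the cell names, and the example values are hypothetical, and as discussed in the thread the cutoffs should reflect the read counts and depth of the actual experiment.

```python
def specific_cell(fpkm_by_cell, on_cutoff=1.0, off_cutoff=0.1):
    """Return the single cell in which the gene is 'on' while being 'off'
    in every other cell, or None if the gene is not cell-specific."""
    on = [cell for cell, v in fpkm_by_cell.items() if v >= on_cutoff]
    off = [cell for cell, v in fpkm_by_cell.items() if v <= off_cutoff]
    if len(on) == 1 and len(off) == len(fpkm_by_cell) - 1:
        return on[0]
    return None

# Hypothetical FPKM values for three cells.
gene_x = {"cell1": 12.0, "cell2": 0.0, "cell3": 0.05}
gene_y = {"cell1": 12.0, "cell2": 3.0, "cell3": 0.0}
gene_z = {"cell1": 12.0, "cell2": 0.5, "cell3": 0.0}

print(specific_cell(gene_x), specific_cell(gene_y), specific_cell(gene_z))
```

The gap between the two cutoffs leaves genes like `gene_z`, which sits between "off" and "on" in one cell, unclassified instead of forcing a call.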


How to choose a FPKM cut-off
