SEQanswers cuffdiff p value for 2 conditions without replicates
 Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 Similar Threads Thread Thread Starter Forum Replies Last Post foxriverlin RNA Sequencing 0 06-21-2012 09:02 AM guzhi100 RNA Sequencing 1 05-06-2012 12:30 PM Jane M RNA Sequencing 0 09-01-2011 12:42 AM PFS Bioinformatics 1 06-14-2011 06:51 PM Jouneau Luc RNA Sequencing 1 04-05-2011 03:00 AM

 09-24-2012, 02:08 PM #1 hug2001 Junior Member   Location: SF Join Date: Feb 2012 Posts: 4 cuffdiff p value for 2 conditions without replicates Hi guys, Can anyone help explain 1) how p value is calculated for the differential expression using cuffdiff for two conditions without replicates? 2) is this p value useful or we can just look at the fold change to select DE genes? It is easy to understand p values with replicates, but doesn't make much sense to me to use p or q values for 2 samples without replicates. Many thanks!
09-24-2012, 11:13 PM   #2
masterpiece
Member

Location: malaysia

Join Date: Mar 2009
Posts: 40

Hi there,

I'm not a statistician, so I can't answer you how the software count the p-value. But I can suggest you reads this thread http://seqanswers.com/forums/newrepl...wreply&p=83270 ( look for mbblack comment #10). He did mentioned bout doing differential expression analysis without replicate which I agree on his opinion.

Quote:
 Originally Posted by mbblack If you actually have no replicates, then it really is pointless to even bother computing the statistics. In that worst case scenario, you'd do best by simply ranking genes by normalized expression or raw counts, and pick those with the greatest difference in observed values (and then validate them independently). So you have to interpret your results in light of your experimental limitations, as well as what your goal from the analysis was, and adjust things as the situation calls for. The stats are just tools to guide you and add some rigor to your analysis.

 09-25-2012, 03:01 PM #3 hug2001 Junior Member   Location: SF Join Date: Feb 2012 Posts: 4 Thanks masterpiece. It makes sense. I also read the FAQ on cufflinks website. It says that Cuffdiff compares the log ratio of FPKM in two conditions (i.e. log(FPKMa/FPKMb)) in the following test statistic, which is approximately normally distributed: T=E[log(FPKMa/FPKMb)] / Var[log(FPKMa/FPKMb)] ~= log(FPKMa/FPKMb) /sqrt(Var(FPKMa)/(FPKMa)^2 + Var(FPKMb)/(FPKMb)^2) So once we get the statistic we can get a corresponding p value. To get a statistic T, we need to get the Var values. If there are replicates in each condition, then the Var can be calculated from the replicates; If there are no replicates, then the Var is calculated under the assumption that the genes/transcripts are not differentially expressed. I guess here the Var is calculated as the Var between the two samples (one in each condition). But definitely Var would be big if the gene is actually differentially expressed in the two conditions and this will make an underestimation of p value under the assumption of "not differentially expressed". So I think the p values would not be precise for no replicates experiments, as masterpiece and mbblack has commented. More discussion is appreciated if you had such no replicate RNAseq data analyzed with Cuffdiff. How do you feel about the p values?
 09-26-2012, 04:41 AM #4 mbblack Senior Member   Location: Research Triangle Park, NC Join Date: Aug 2009 Posts: 245 hug2001 - you've pretty much hit the nail. Without replicates, you cannot estimate variance, thus you cannot adequately estimate your test statistic in the first place. Thus any p-value derived from that test statistic is meaningless. And in conventional statistical differential gene expression, the central issue is being able to estimate the variance in each of your comparison groups so as to compute robust statistical significance levels for the observed mean differences. Without replicates you simply cannot do that. That is why I suggest if you do not have replicates, then forget about the statistics altogether, and just focus on the observed difference in magnitude as your means of differentiating genes. __________________ Michael Black, Ph.D. ScitoVation LLC. RTP, N.C.
 09-26-2012, 08:38 AM #5 hug2001 Junior Member   Location: SF Join Date: Feb 2012 Posts: 4 Agree with mbblack. Magnitude of changes is more meaningful. We are doing PCR/WB to validate some genes showing DE. Thanks for your discussion, masterpiece and mbblack
 10-16-2012, 02:39 AM #6 IBseq Member   Location: uk Join Date: Jul 2012 Posts: 56 Hi guys, just a question to follow up the conversation: if I do not consider at all the p and/or q values, and only look at the fold change, how this can be relieable considerigtn that there a re genes turned off in oneof the two samples? I had the output from cuffdiff, filtered the reusts by significnace (thus I only took those with "yes") and then filtered by fold chnage and the top/bottoem ones were the up pr downregulated genes. Is this approach correct? If I don't dop this, then I have fold chnage values as infinite becuase some genes have fpkm=0 in one of the two conditions,.. any suggestion is much appreciated! thanks, ib
 10-16-2012, 07:56 AM #7 mbblack Senior Member   Location: Research Triangle Park, NC Join Date: Aug 2009 Posts: 245 I would discount any gene where the FPKM was zero in either condition. That simply indicates a lack of data, so one should not make any inference about such genes (you cannot estimate the significance of something that was not measured in the first place). Without replicates, the statistical significance values (pValue and FDR) are meaningless. So drop those genes where FPKM is zero, then rank order the ones that remain by fold change and select from those with the greatest apparent fold change. In performing a differential gene expression analysis, it's probably wiser to exclude genes with too little data for proper abundance estimation right at the beginning. A common cutoff used in published papers is only working with genes with a raw read count of at least > 10 (or I've also seen a cutoff using an RPKM > 0.1 published as well). __________________ Michael Black, Ph.D. ScitoVation LLC. RTP, N.C.
 10-16-2012, 08:57 AM #8 IBseq Member   Location: uk Join Date: Jul 2012 Posts: 56 thanks mbblack, one thing i don't get...so you suggest to rank by fpkm value, exclude those where this is=0 and then rank the rest by fold change? so this means that we do not consider whether they are significant? thanks, ib
 10-16-2012, 09:06 AM #9 mbblack Senior Member   Location: Research Triangle Park, NC Join Date: Aug 2009 Posts: 245 Without biological replicates, you cannot compute reliable nor robust statistical significance. To do that, you need to sample the within-group biological variation so you can estimate variance around each mean, which is impossible without replicates. Without replicates, you cannot even compute a mean. You just have two numbers, and the difference between them, with no idea of the variation about those two numbers. So yes, I don't see any point in even computing statistics for the situation where you have no replicates - the numbers are meaningless and thus so is the inference of significance or non-significance (by statistical criteria at least). Really, the only interpretable measure you have without replicates is the difference between your pairs of numbers. So, rank order those differences in observed abundance, and ignore those with insufficient data to adequately compute an estimated abundance (ie. the FPKM = 0 ones). __________________ Michael Black, Ph.D. ScitoVation LLC. RTP, N.C.
 10-17-2012, 02:20 AM #10 IBseq Member   Location: uk Join Date: Jul 2012 Posts: 56 Hi, I was going through the cuffdiff manual where it says "If there are no replicates, the variance is measured across the conditions under the assumption that most transcripts are not differentially expressed However, if your experiment has replicates for each condition, Cuffdiff can better estimate the variance for fragment counts and thus provides more accurate differential expression calls." I would appreciate your opinion abt this.. best, ib
10-17-2012, 04:52 AM   #11
mbblack
Senior Member

Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

Quote:
 Originally Posted by IBseq Hi, I was going through the cuffdiff manual where it says "If there are no replicates, the variance is measured across the conditions under the assumption that most transcripts are not differentially expressed However, if your experiment has replicates for each condition, Cuffdiff can better estimate the variance for fragment counts and thus provides more accurate differential expression calls." I would appreciate your opinion abt this.. best, ib
In the absence of replicates, CuffDiff uses a heuristic approach to basically make the most of the data (or lack of it). In my opinion, it is in no way a reliable replacement for actually having biological replicates for a differential gene expression analysis. The few statisticians I work with also seem of the opinion that there is no real algorithmic substitution for proper biological sampling of variance.

So, again I will say, in my opinion and experience, in the absence of biological replicates any statistical significance measure is meaningless. Without replicates, you do not have the appropriate data to compute statistical significance (and no manipulation nor permutation of the limited non-replicated data you do have can truly make up for that lack of appropriate replicate data).

That is why I suggest ignoring any statistics altogether. Just use what you actually do have, which is difference (NOT mean difference, just difference) between your two samples.

(P.S. you always need to be skeptical of the fact that a mathematical formula or computational algorithm can produce some kind of number you want. Generally speaking, such things will always produce a result because that is what they are designed to do - they churn through the input values and spit out a result. The mere fact that you can run an algorithm to completion does not mean that you should be doing so with any given set of data, or that you should be trusting or relying on the results it produces. For example, most software will run t-tests on simple pairs of numbers without replicates, but the fact the software produces output does not mean the statistical test was a mathematically valid one to perform in the first place. A t-Test needs to be run on differences of means, and if you do not have replicates in your data, you cannot compute means nor reliable estimates of variance, and thus a t-test is an inapporpriate test, despite the fact the algorithm ran properly and produced output values).
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.

Last edited by mbblack; 10-17-2012 at 05:02 AM.

 10-17-2012, 05:17 AM #12 IBseq Member   Location: uk Join Date: Jul 2012 Posts: 56 Hi mbblack, thanks. I think I understood your point of view. I shall look at the data by simply looking at the fold change and ignoring the fpkm values equal to zero from both samples. Thanks a lot for your time. irene
 01-22-2013, 09:42 PM #13 bvb1909 Junior Member   Location: Perth Join Date: Feb 2012 Posts: 2 Hi Folks, ignoring fkpm values close to zero in one condition might actually is not good for genes that are true on/off responses. I have a couple of those and these are probably the most interesting ones...
01-23-2013, 04:16 AM   #14
mbblack
Senior Member

Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

Quote:
 Originally Posted by bvb1909 Hi Folks, ignoring fkpm values close to zero in one condition might actually is not good for genes that are true on/off responses. I have a couple of those and these are probably the most interesting ones...
Except the problem is you really do not know if those FPKM=0 values are real or not. At low counts (and by low, that may mean anything with a feature count of less than at least a 100 or more), the variance of the estimated transcript abundance (even with many biological replicates) is very large, making your estimate of low count features far less reliable than for high(er) count features.

And an FPKM=0 simply means you failed to detect that transcript, not that it was not actually expressed. Including those values puts your results into, at best, the realm of speculation, not inference based on actual data. No data is not evidence of non-expression. So you cannot make the inference of a true on/off feature, since you literally have no data for the feature in one condition. You would need to independently confirm that the non-detected feature was actually absent because it was not expressed, as opposed to simply being so rare your RNA-Seq experiment failed to detect it.
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.

Last edited by mbblack; 01-23-2013 at 04:18 AM.

 01-23-2013, 04:41 PM #15 bvb1909 Junior Member   Location: Perth Join Date: Feb 2012 Posts: 2 I think that a low read count is in many cases a pretty good indication of low expression. Simply excluding those on the basis that I cannot statistically be sure means you are missing out on a lot of differentially expressed genes. Just as an example, a well known differentially expressed gene under our condition would be ruled out by you because the statistics tells you so (because FPKM = 0 under control condition), biology tells us it is the one of the most important genes.... Guess it is also a matter of what you want to get from the data
 01-23-2013, 04:48 PM #16 Jeremy Senior Member   Location: Pathum Thani, Thailand Join Date: Nov 2009 Posts: 190 It might be better to use normalized raw reads. That way you can see how reliable the fold difference actually is, lower read counts are more prone to noise induced errors. For example if sample A has 5 reads and sample B has 0 reads then it may not be so reliable. Whereas if sample A has 3000 reads and sample B has 0 it would be more reliable. FPKM can change the numbers making low read count small fragments look more reliable than they actually are (not sure if that has changed, it's been a while since I tried FPKM).
 01-23-2013, 04:49 PM #17 chadn737 Senior Member   Location: US Join Date: Jan 2009 Posts: 392 I don't like FPKM/RPKM because it can mask how many reads there actually are, even inflating values when there are actually very few reads. That being said, I disagree with filtering out genes that have a 0 FPKM or read count in one condition. I have seen clear examples where one conditions will have dozens, even hundreds of reads, while the other has zero. Clearly you can observe expression in one condition and call this as differentially expressed. Where things become murky is when one condition has maybe 2-4 reads and the other 0. A better filter then would take into account this possibility. Last edited by chadn737; 01-24-2013 at 05:54 AM.
01-24-2013, 04:59 AM   #18
mbblack
Senior Member

Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

Quote:
 Originally Posted by bvb1909 I think that a low read count is in many cases a pretty good indication of low expression. Simply excluding those on the basis that I cannot statistically be sure means you are missing out on a lot of differentially expressed genes. Just as an example, a well known differentially expressed gene under our condition would be ruled out by you because the statistics tells you so (because FPKM = 0 under control condition), biology tells us it is the one of the most important genes.... Guess it is also a matter of what you want to get from the data
The problem with low count data is that it results in a very high false positive error rate for differentially expressed genes. I suppose it then depends on how tolerant your study is to false positives and what your objective is with the data. However, I will make a final comment that in the published RNA-Seq DGE papers of the last 2 or 3 years, there seems to be a clear and growing consensus that low read count data should be excluded from DGE analyses to avoid the bias of much higher false positive errors especially in low expressors. Published papers seem to vary when using FPKM/RPKM normalization, with cutoffs varying from 0.1 to 0.5 (one paper I seem to recall even using higher, but I remember the read depth was quite low in that work as well). However, I too am becoming a non-fan of RPKM or similar methods, as they can be very misleading for some genes.

In my own work, we have settled on excluding raw counts less than 11 (so I actually filter on count > 10), and then normalize what remains. Even then, It's simply though to plot and show that genes with raw counts between 11 and about 150 or so, have very high variance in their transcript abundance estimates, while for those with counts > about 150, the variance tightens up dramatically. We also always run 5 biological replicates for all treatments and controls.

Working in toxicology and particularly with risk assessment type studies, we do not have the option of dismissing statistical significance, and in fact almost always base our DGE assessments on simultaneously filtering results for statistical significance and minimal fold change difference (although for initial exploratory analyses, we may relax those criteria - as you say, it depends on what one's goals for the data are).

Just as an aside, in the limited qPCR validations series that I've run, we get very poor correspondence with RNA-Seq results base solely on statistical significance or solely on fold change. Correspondence (using ABI TaqMan rtPCR assays) improves dramatically when comparing genes that were both statistically significant and met minimum fold change differences (I usually filter for genes with FDR < 0.05 and FC > +/- 1.5). Nothing novel in that result, and of course, the same applies for microarray data for that matter: combining statistical significance and some minimum magnitude of relative change proves a far more robust estimator of differential gene expression than either cutoff alone. The problem with the original post that started this thread, is that you cannot compute statistical significance in the absence or replicates, so you are left with just raw differences in magnitude based on single measures of abundance.
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.