
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
cuffdiff: # SD genes come down with more replicates  foxriverlin  RNA Sequencing  0  06212012 09:02 AM 
Cuffdiff replicates  guzhi100  RNA Sequencing  1  05062012 12:30 PM 
using Cuffdiff with biological replicates  Jane M  RNA Sequencing  0  09012011 12:42 AM 
cuffdiff with biological replicates  PFS  Bioinformatics  1  06142011 06:51 PM 
RNASeq experiments with 2 conditions and more than 3 replicates  Jouneau Luc  RNA Sequencing  1  04052011 03:00 AM 

Thread Tools 
09242012, 02:08 PM  #1 
Junior Member
Location: SF Join Date: Feb 2012
Posts: 4

cuffdiff p value for 2 conditions without replicates
Hi guys,
Can anyone help explain 1) how p value is calculated for the differential expression using cuffdiff for two conditions without replicates? 2) is this p value useful or we can just look at the fold change to select DE genes? It is easy to understand p values with replicates, but doesn't make much sense to me to use p or q values for 2 samples without replicates. Many thanks! 
09242012, 11:13 PM  #2  
Member
Location: malaysia Join Date: Mar 2009
Posts: 40

Hi there,
I'm not a statistician, so I can't answer you how the software count the pvalue. But I can suggest you reads this thread http://seqanswers.com/forums/newrepl...wreply&p=83270 ( look for mbblack comment #10). He did mentioned bout doing differential expression analysis without replicate which I agree on his opinion. Quote:


09252012, 03:01 PM  #3 
Junior Member
Location: SF Join Date: Feb 2012
Posts: 4

Thanks masterpiece. It makes sense.
I also read the FAQ on cufflinks website. It says that Cuffdiff compares the log ratio of FPKM in two conditions (i.e. log(FPKMa/FPKMb)) in the following test statistic, which is approximately normally distributed: T=E[log(FPKMa/FPKMb)] / Var[log(FPKMa/FPKMb)] ~= log(FPKMa/FPKMb) /sqrt(Var(FPKMa)/(FPKMa)^2 + Var(FPKMb)/(FPKMb)^2) So once we get the statistic we can get a corresponding p value. To get a statistic T, we need to get the Var values. If there are replicates in each condition, then the Var can be calculated from the replicates; If there are no replicates, then the Var is calculated under the assumption that the genes/transcripts are not differentially expressed. I guess here the Var is calculated as the Var between the two samples (one in each condition). But definitely Var would be big if the gene is actually differentially expressed in the two conditions and this will make an underestimation of p value under the assumption of "not differentially expressed". So I think the p values would not be precise for no replicates experiments, as masterpiece and mbblack has commented. More discussion is appreciated if you had such no replicate RNAseq data analyzed with Cuffdiff. How do you feel about the p values? 
09262012, 04:41 AM  #4 
Senior Member
Location: Research Triangle Park, NC Join Date: Aug 2009
Posts: 245

hug2001  you've pretty much hit the nail. Without replicates, you cannot estimate variance, thus you cannot adequately estimate your test statistic in the first place. Thus any pvalue derived from that test statistic is meaningless. And in conventional statistical differential gene expression, the central issue is being able to estimate the variance in each of your comparison groups so as to compute robust statistical significance levels for the observed mean differences. Without replicates you simply cannot do that.
That is why I suggest if you do not have replicates, then forget about the statistics altogether, and just focus on the observed difference in magnitude as your means of differentiating genes.
__________________
Michael Black, Ph.D. ScitoVation LLC. RTP, N.C. 
09262012, 08:38 AM  #5 
Junior Member
Location: SF Join Date: Feb 2012
Posts: 4

Agree with mbblack. Magnitude of changes is more meaningful.
We are doing PCR/WB to validate some genes showing DE. Thanks for your discussion, masterpiece and mbblack 
10162012, 02:39 AM  #6 
Member
Location: uk Join Date: Jul 2012
Posts: 56

Hi guys,
just a question to follow up the conversation: if I do not consider at all the p and/or q values, and only look at the fold change, how this can be relieable considerigtn that there a re genes turned off in oneof the two samples? I had the output from cuffdiff, filtered the reusts by significnace (thus I only took those with "yes") and then filtered by fold chnage and the top/bottoem ones were the up pr downregulated genes. Is this approach correct? If I don't dop this, then I have fold chnage values as infinite becuase some genes have fpkm=0 in one of the two conditions,.. any suggestion is much appreciated! thanks, ib 
10162012, 07:56 AM  #7 
Senior Member
Location: Research Triangle Park, NC Join Date: Aug 2009
Posts: 245

I would discount any gene where the FPKM was zero in either condition. That simply indicates a lack of data, so one should not make any inference about such genes (you cannot estimate the significance of something that was not measured in the first place).
Without replicates, the statistical significance values (pValue and FDR) are meaningless. So drop those genes where FPKM is zero, then rank order the ones that remain by fold change and select from those with the greatest apparent fold change. In performing a differential gene expression analysis, it's probably wiser to exclude genes with too little data for proper abundance estimation right at the beginning. A common cutoff used in published papers is only working with genes with a raw read count of at least > 10 (or I've also seen a cutoff using an RPKM > 0.1 published as well).
__________________
Michael Black, Ph.D. ScitoVation LLC. RTP, N.C. 
10162012, 08:57 AM  #8 
Member
Location: uk Join Date: Jul 2012
Posts: 56

thanks mbblack,
one thing i don't get...so you suggest to rank by fpkm value, exclude those where this is=0 and then rank the rest by fold change? so this means that we do not consider whether they are significant? thanks, ib 
10162012, 09:06 AM  #9 
Senior Member
Location: Research Triangle Park, NC Join Date: Aug 2009
Posts: 245

Without biological replicates, you cannot compute reliable nor robust statistical significance. To do that, you need to sample the withingroup biological variation so you can estimate variance around each mean, which is impossible without replicates. Without replicates, you cannot even compute a mean. You just have two numbers, and the difference between them, with no idea of the variation about those two numbers.
So yes, I don't see any point in even computing statistics for the situation where you have no replicates  the numbers are meaningless and thus so is the inference of significance or nonsignificance (by statistical criteria at least). Really, the only interpretable measure you have without replicates is the difference between your pairs of numbers. So, rank order those differences in observed abundance, and ignore those with insufficient data to adequately compute an estimated abundance (ie. the FPKM = 0 ones).
__________________
Michael Black, Ph.D. ScitoVation LLC. RTP, N.C. 
10172012, 02:20 AM  #10 
Member
Location: uk Join Date: Jul 2012
Posts: 56

Hi,
I was going through the cuffdiff manual where it says "If there are no replicates, the variance is measured across the conditions under the assumption that most transcripts are not differentially expressed However, if your experiment has replicates for each condition, Cuffdiff can better estimate the variance for fragment counts and thus provides more accurate differential expression calls." I would appreciate your opinion abt this.. best, ib 
10172012, 04:52 AM  #11  
Senior Member
Location: Research Triangle Park, NC Join Date: Aug 2009
Posts: 245

Quote:
So, again I will say, in my opinion and experience, in the absence of biological replicates any statistical significance measure is meaningless. Without replicates, you do not have the appropriate data to compute statistical significance (and no manipulation nor permutation of the limited nonreplicated data you do have can truly make up for that lack of appropriate replicate data). That is why I suggest ignoring any statistics altogether. Just use what you actually do have, which is difference (NOT mean difference, just difference) between your two samples. (P.S. you always need to be skeptical of the fact that a mathematical formula or computational algorithm can produce some kind of number you want. Generally speaking, such things will always produce a result because that is what they are designed to do  they churn through the input values and spit out a result. The mere fact that you can run an algorithm to completion does not mean that you should be doing so with any given set of data, or that you should be trusting or relying on the results it produces. For example, most software will run ttests on simple pairs of numbers without replicates, but the fact the software produces output does not mean the statistical test was a mathematically valid one to perform in the first place. A tTest needs to be run on differences of means, and if you do not have replicates in your data, you cannot compute means nor reliable estimates of variance, and thus a ttest is an inapporpriate test, despite the fact the algorithm ran properly and produced output values).
__________________
Michael Black, Ph.D. ScitoVation LLC. RTP, N.C. Last edited by mbblack; 10172012 at 05:02 AM. 

10172012, 05:17 AM  #12 
Member
Location: uk Join Date: Jul 2012
Posts: 56

Hi mbblack,
thanks. I think I understood your point of view. I shall look at the data by simply looking at the fold change and ignoring the fpkm values equal to zero from both samples. Thanks a lot for your time. irene 
01222013, 09:42 PM  #13 
Junior Member
Location: Perth Join Date: Feb 2012
Posts: 2

Hi Folks,
ignoring fkpm values close to zero in one condition might actually is not good for genes that are true on/off responses. I have a couple of those and these are probably the most interesting ones... 
01232013, 04:16 AM  #14  
Senior Member
Location: Research Triangle Park, NC Join Date: Aug 2009
Posts: 245

Quote:
And an FPKM=0 simply means you failed to detect that transcript, not that it was not actually expressed. Including those values puts your results into, at best, the realm of speculation, not inference based on actual data. No data is not evidence of nonexpression. So you cannot make the inference of a true on/off feature, since you literally have no data for the feature in one condition. You would need to independently confirm that the nondetected feature was actually absent because it was not expressed, as opposed to simply being so rare your RNASeq experiment failed to detect it.
__________________
Michael Black, Ph.D. ScitoVation LLC. RTP, N.C. Last edited by mbblack; 01232013 at 04:18 AM. 

01232013, 04:41 PM  #15 
Junior Member
Location: Perth Join Date: Feb 2012
Posts: 2

I think that a low read count is in many cases a pretty good indication of low expression. Simply excluding those on the basis that I cannot statistically be sure means you are missing out on a lot of differentially expressed genes. Just as an example, a well known differentially expressed gene under our condition would be ruled out by you because the statistics tells you so (because FPKM = 0 under control condition), biology tells us it is the one of the most important genes.... Guess it is also a matter of what you want to get from the data

01232013, 04:48 PM  #16 
Senior Member
Location: Pathum Thani, Thailand Join Date: Nov 2009
Posts: 190

It might be better to use normalized raw reads. That way you can see how reliable the fold difference actually is, lower read counts are more prone to noise induced errors. For example if sample A has 5 reads and sample B has 0 reads then it may not be so reliable. Whereas if sample A has 3000 reads and sample B has 0 it would be more reliable. FPKM can change the numbers making low read count small fragments look more reliable than they actually are (not sure if that has changed, it's been a while since I tried FPKM).

01232013, 04:49 PM  #17 
Senior Member
Location: US Join Date: Jan 2009
Posts: 392

I don't like FPKM/RPKM because it can mask how many reads there actually are, even inflating values when there are actually very few reads. That being said, I disagree with filtering out genes that have a 0 FPKM or read count in one condition. I have seen clear examples where one conditions will have dozens, even hundreds of reads, while the other has zero. Clearly you can observe expression in one condition and call this as differentially expressed. Where things become murky is when one condition has maybe 24 reads and the other 0. A better filter then would take into account this possibility.
Last edited by chadn737; 01242013 at 05:54 AM. 
01242013, 04:59 AM  #18  
Senior Member
Location: Research Triangle Park, NC Join Date: Aug 2009
Posts: 245

Quote:
In my own work, we have settled on excluding raw counts less than 11 (so I actually filter on count > 10), and then normalize what remains. Even then, It's simply though to plot and show that genes with raw counts between 11 and about 150 or so, have very high variance in their transcript abundance estimates, while for those with counts > about 150, the variance tightens up dramatically. We also always run 5 biological replicates for all treatments and controls. Working in toxicology and particularly with risk assessment type studies, we do not have the option of dismissing statistical significance, and in fact almost always base our DGE assessments on simultaneously filtering results for statistical significance and minimal fold change difference (although for initial exploratory analyses, we may relax those criteria  as you say, it depends on what one's goals for the data are). Just as an aside, in the limited qPCR validations series that I've run, we get very poor correspondence with RNASeq results base solely on statistical significance or solely on fold change. Correspondence (using ABI TaqMan rtPCR assays) improves dramatically when comparing genes that were both statistically significant and met minimum fold change differences (I usually filter for genes with FDR < 0.05 and FC > +/ 1.5). Nothing novel in that result, and of course, the same applies for microarray data for that matter: combining statistical significance and some minimum magnitude of relative change proves a far more robust estimator of differential gene expression than either cutoff alone. The problem with the original post that started this thread, is that you cannot compute statistical significance in the absence or replicates, so you are left with just raw differences in magnitude based on single measures of abundance.
__________________
Michael Black, Ph.D. ScitoVation LLC. RTP, N.C. 

Thread Tools  

