**Simon Anders** · 06-09-2012, 12:54 AM

Hi Justin

You stumbled over a dirty little secret in the theory of generalized linear models (GLMs).

As you probably know, a t test accounts for the uncertainty of the estimate for the standard deviation. The same t value gives higher p values if the standard deviation has been estimated from fewer samples, because then you compare with the t distribution for fewer degrees of freedom. In the limit of many samples, in contrast, a t test is the same as a z test (where you simply compare the ratio of (log) fold change to standard deviation with a normal distribution). In an ANOVA setting, one uses the F statistic instead of the t statistic, but this is just a reformulation. Instead of a t distribution, one compares against an F distribution, which becomes a chi^2 distribution for large numbers of samples (or DoFs).
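A small simulation can make the t-versus-z point concrete. The sketch below is not taken from DESeq; it assumes two groups of three normally distributed samples with no true difference, and the names `pooled_t` and `null_rejection_rates` are made up for illustration. It counts how often the t statistic exceeds the 5% cutoff of the normal distribution versus that of the t distribution with 4 degrees of freedom.

```python
import random
import statistics

N_PER_GROUP = 3     # samples per condition (an assumption for this sketch)
T_CRIT_DF4 = 2.776  # two-sided 5% critical value of t with 2*3 - 2 = 4 DoF


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    std_err = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def null_rejection_rates(n_sim=20000, seed=1):
    """Fraction of null data sets rejected at the 5% level by each reference."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
    z_hits = t_hits = 0
    for _ in range(n_sim):
        # Both groups come from the same distribution, so the null is true.
        a = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
        b = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
        t = abs(pooled_t(a, b))
        z_hits += t > z_crit       # compare with the normal distribution
        t_hits += t > T_CRIT_DF4   # compare with the t distribution, 4 DoF
    return z_hits / n_sim, t_hits / n_sim


if __name__ == "__main__":
    z_rate, t_rate = null_rejection_rates()
    print(f"normal reference: {z_rate:.3f}  t reference: {t_rate:.3f}")
```

With three samples per group, the normal reference should reject clearly more than 5% of the null data sets (about 12% in theory), while the t reference should stay close to 5%. Ignoring the uncertainty of the variance estimate therefore gives too low p values when samples are few.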

In generalized linear models, we use the deviance instead of the F statistic, and because residuals are no longer normally distributed, it is hard to say what the null distribution for the deviance is. It is easy to show, though, that asymptotically, for many samples, the deviance behaves as the F, i.e., it converges to chi^2, and so, we always compare with that.

In summary: If your data follow normal distributions, you work with ordinary least squares (OLS), and there, you can easily account for the effect of uncertainty on your variance estimate thanks to Student's (Gosset's) work. Once you do not have normality, we do not have an easy solution, and people pretend that this issue (that everybody acknowledges to be very important in the OLS case) is not that crucial and that it should be fine to ignore it.

My guess is that this is because, outside genomics, GLMs are rarely used for data with fewer than, say, 10 or 20 samples anyway.

How do we get out of this? In DESeq, we argue that it is fine to assume that the dispersion estimate is much more precise than what one should expect according to the number of samples, because we look at many genes and fit a line. So, if we are right with the assumption that genes with similar means have similar dispersion, we should get correct p values.

Whenever a gene has unusually high dispersion, though, relying on the fitted line alone gives it a too low p value, and it appears wrongly at the top of the list of hits. Hence our maximum rule: we believe high estimates but pull low estimates up to the fitted line, so that we are, overall, overly conservative. This is why DESeq p values tend to show the right-skew that you observe.

With sharingMode="gene-est-only", you get the textbook approach, which, as explained above, fails for small sample numbers and gives you a left-skewed p value histogram even in the null case.

With sharingMode="fit-only", the p-value histogram for the null case often tends to look nice and flat, but this turned out to be deceptive: The high-dispersion outliers will often pollute the list of highly significant genes.

Hence, with the "maximum" mode, we err on the conservative side. This is not ideal, and there should be a way to get back the power that we lose this way by finding a compromise between these approaches. edgeR, by the way, claims to have solved this issue with its weighted maximum likelihood method, but at least in my simulations, it still gives too low p values. This is why we do not consider the issue solved and are still working on it. I think we already have some good ideas.
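The three sharing modes described above can be sketched in a few lines. This is only an illustration of the rule as stated in this post, not DESeq's actual code (DESeq is an R package); the function name `dispersion_for_test` and the example numbers are hypothetical.

```python
def dispersion_for_test(gene_estimate, fitted_value, sharing_mode="maximum"):
    """Pick the dispersion used for testing one gene.

    gene_estimate: dispersion estimated from this gene's own counts
    fitted_value:  value of the fitted mean-dispersion line at this gene's mean
    """
    if sharing_mode == "maximum":
        # Believe high per-gene estimates, pull low ones up to the fitted line.
        return max(gene_estimate, fitted_value)
    if sharing_mode == "fit-only":
        # Trust the fitted line alone; high-dispersion outliers are missed.
        return fitted_value
    if sharing_mode == "gene-est-only":
        # Textbook approach: use only the per-gene estimate.
        return gene_estimate
    raise ValueError(f"unknown sharing mode: {sharing_mode!r}")


if __name__ == "__main__":
    # Made-up numbers: one gene below the fitted line, one far above it.
    for gene_est in (0.02, 0.50):
        for mode in ("maximum", "fit-only", "gene-est-only"):
            print(gene_est, mode, dispersion_for_test(gene_est, 0.10, mode))
```

In "maximum" mode the dispersion is never smaller than in either of the other two modes, which is why that mode is the conservative choice.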

**Justin AC Powell** · 06-11-2012, 02:14 AM

Hi Simon,

Thanks for that answer - it's very useful and clears up a couple of things I've been wondering about!

One further question related to this:- Will the FDR calculation still be roughly accurate with such a p-value distribution (assuming genuine change is detected)?

By the way I think the support you guys provide on this forum is great, it really helps with getting to grips with some of the trickier areas of this type of analysis.

**Simon Anders** · 06-11-2012, 03:04 AM

> Originally posted by Justin AC Powell:
>
> One further question related to this:- Will the FDR calculation still be roughly accurate with such a p-value distribution (assuming genuine change is detected)?

Well, as the p values are a bit conservative, the FDR calculation will be so, too, but as long as you err on the conservative side, you are always safe.
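To illustrate, here is a minimal sketch that assumes the FDR calculation is the Benjamini-Hochberg adjustment (the post does not name the method). The function name and the p values are made up; the second list simply doubles each p value, capped at 1, to mimic conservative p values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.001, 0.01, 0.2, 0.6]                   # made-up p values
    conservative = [min(1.0, 2 * p) for p in raw]   # same genes, inflated p
    print(benjamini_hochberg(raw))
    print(benjamini_hochberg(conservative))
```

In this example every adjusted value computed from the inflated p values is at least as large as the one computed from the raw p values, so fewer genes pass a given FDR cutoff, not more.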

Simon

DESeq - sharing mode question, and an odd p-value distribution
