  • DESeq - sharing mode question, and an odd p-value distribution

    I have a couple of questions about DESeq (and my particular analysis)

    1) I have a data set with three controls and three treated samples. I've seen it said that sharingMode="gene-est-only" is not appropriate for a data set with this few replicates (even with method="pooled") because under-estimation of the variance causes false positives. My experience supports this: with "gene-est-only" I get many more very low adjusted p-values than I'd expect, even if I permute the samples between control and treated, which should remove genuinely significant results. This is not thought to be a problem with array-based experiments of this size when just doing a t-test, and indeed if I take my RNA-seq data, convert to RPKM and then do a straight t-test (using R's t.test) I DON'T see this: the p-value distribution is flat. My question is: why is this? Are negative binomial based methods like DESeq and edgeR more susceptible to false positives caused by small sample sizes than a t-test? Or in other words, how come for a 3 control, 3 treated data set it seems *essential* to use the variance pooling approaches with edgeR/DESeq when it isn't with a t-test? Do I have something figured out wrong?

    2) What would explain the upward slope in the following p-value distribution for the above experiment using DESeq with standard parameters (sharingMode="maximum", method="pooled")? My guess is that it results from the actual variance often being lower than what DESeq estimates, meaning that the results are closer to the null distribution than would be expected by random chance. But how does that happen? One possibility I am investigating is that this data set is actually the result of immunoprecipitation from tagged ribosomes, and I think the variance may partly be a function of the enrichment level for each gene. In that case one can't rely on genes of similar expression levels having similar sized errors (as they may have very different levels of enrichment), which may be a problem when sharing variance information. However, I can't switch to "gene-est-only" because there are not enough replicates. Currently this is only a guess, so I'd be interested in any insight as to the causes of such a p-value distribution...

    [Attached image: pvals.JPG]

    Best regards,

    Justin

  • #2
    Hi Justin

    You stumbled over a dirty little secret in the theory of generalized linear models (GLMs).

    As you probably know, a t test accounts for the uncertainty of the estimate for the standard deviation. The same t value gives higher p values if the standard deviation has been estimated from fewer samples, because then you compare with the t distribution for fewer degrees of freedom. In the limit of many samples, in contrast, a t test is the same as a z test (where you simply compare the ratio of (log) fold change to standard deviation with a normal distribution). In an ANOVA setting, one uses the F statistic instead of the t statistic, but this is just a reformulation. Instead of a t distribution, one compares against an F distribution, which becomes a chi^2 distribution for large numbers of samples (or DoFs).
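    The effect described above can be checked with a small simulation. The following is only a sketch in Python (standard library only) on normally distributed null data with 3 vs 3 samples; it is not DESeq code, and the function names, number of simulated genes and seed are arbitrary choices made for this illustration. It compares the same pooled t statistic against a normal (z) reference and against Student's t with 4 degrees of freedom.

```python
import random
import statistics


def mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / (n - 1)


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal group sizes)."""
    n = len(a)
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    standard_error = ((var_a + var_b) / 2 * 2 / n) ** 0.5
    return (mean_a - mean_b) / standard_error


def false_positive_rates(n_genes=20000, seed=1):
    """Simulate 3 vs 3 null data; count calls at a nominal two-sided 5% level."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
    t_crit = 2.776  # 97.5% quantile of Student's t with 4 degrees of freedom
    z_hits = t_hits = 0
    for _ in range(n_genes):
        a = [rng.gauss(0, 1) for _ in range(3)]
        b = [rng.gauss(0, 1) for _ in range(3)]
        t = abs(pooled_t(a, b))
        if t > z_crit:
            z_hits += 1
        if t > t_crit:
            t_hits += 1
    return z_hits / n_genes, t_hits / n_genes


z_rate, t_rate = false_positive_rates()
print(f"z reference: {z_rate:.3f}   t reference: {t_rate:.3f}")
```

    Under these assumptions the z reference should reject roughly 12% of null genes at a nominal 5% level, while the t reference should stay close to 5%. That gap is the small-sample correction the t distribution provides.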

    In generalized linear models, we use the deviance instead of the F statistic, and because residuals are no longer normally distributed, it is hard to say what the null distribution for the deviance is. It is easy to show, though, that asymptotically, for many samples, the deviance behaves as the F, i.e., it converges to chi^2, and so, we always compare with that.

    In summary: If your data follow normal distributions, you work with ordinary least squares (OLS), and there you can easily account for the effect of uncertainty in your variance estimate thanks to Student's (Gosset's) work. Once you do not have normality, we do not have an easy solution, and people pretend that this issue (that everybody acknowledges to be very important in the OLS case) is not that crucial and that it should be fine to ignore it.

    My guess is that this is because, outside genomics, GLMs are rarely used for data with fewer than, say, 10 or 20 samples anyway.

    How do we get out of this? In DESeq, we argue that it is fine to assume that the dispersion estimate is much more precise than what one should expect according to the number of samples, because we look at many genes and fit a line. So, if we are right with the assumption that genes with similar means have similar dispersion, we should get correct p values.

    Whenever a gene has unusually high dispersion, however, the fitted line underestimates it, so the gene will get a too low p value and appear wrongly at the top of the list of hits. Hence our maximum rule: we believe high estimates but pull up low estimates to the fitted line, so that we are, overall, overly conservative. This is why DESeq p values tend to show the right-skew (an excess of p values near 1) that you observe.

    With sharingMode="gene-est-only", you get the textbook approach, which, as explained above, fails for small sample numbers and gives you a left-skewed p value histogram (an excess of small p values) even in the null case.
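    To get a feeling for how unreliable a per-gene estimate from so few replicates is, here is a rough sketch in Python (standard library only). It uses normally distributed data as a stand-in for the negative binomial case, so it is an analogy rather than a model of DESeq's dispersion estimator; the function names, the threshold of one half, and the seed are choices made for this illustration. It asks how often a pooled variance estimate from 3 vs 3 samples comes out below half the true variance.

```python
import math
import random


def sample_variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def fraction_underestimated(n_genes=20000, factor=0.5, seed=2):
    """Fraction of simulated genes whose pooled variance estimate is
    below factor times the true variance (the true variance is 1)."""
    rng = random.Random(seed)
    low = 0
    for _ in range(n_genes):
        a = [rng.gauss(0, 1) for _ in range(3)]
        b = [rng.gauss(0, 1) for _ in range(3)]
        pooled = (sample_variance(a) + sample_variance(b)) / 2
        if pooled < factor:
            low += 1
    return low / n_genes


simulated = fraction_underestimated()

# With 4 degrees of freedom, 4 * s^2 follows a chi^2_4 distribution, whose
# CDF is 1 - exp(-x / 2) * (1 + x / 2). The event s^2 < 0.5 is x < 2.
exact = 1 - math.exp(-1) * 2

print(f"simulated: {simulated:.3f}   exact: {exact:.3f}")
```

    In this toy setting about a quarter of the genes get a variance estimate that is too small by a factor of two or more. With thousands of genes tested, those are the ones that end up with the smallest p values if each gene's own estimate is trusted.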

    With sharingMode="fit-only", the p-value histogram for the null case often tends to look nice and flat, but this turned out to be deceptive: The high-dispersion outliers will often pollute the list of highly significant genes.

    Hence, with the "maximum" mode, we err on the conservative side. This is not ideal, and there should be a way to get back the power that we lose this way by finding a compromise between these approaches. edgeR, by the way, claims to have solved this issue with its weighted maximum likelihood method, but at least in my simulations, it still gives too low p values. This is why we do not consider the issue solved and are still working on it. I think we already have some good ideas.
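    The three sharing modes discussed above boil down to a simple per-gene rule. The sketch below, in Python, only illustrates that rule as described in this post; it is not DESeq's implementation, the function name shared_dispersion is made up, and the dispersion values are hypothetical numbers chosen to show one low outlier, one typical gene and one high outlier.

```python
SHARING_MODES = ("gene-est-only", "fit-only", "maximum")


def shared_dispersion(gene_estimate, fitted_value, sharing_mode="maximum"):
    """Pick the dispersion used for testing one gene.

    gene_estimate: dispersion estimated from this gene's replicates alone
    fitted_value:  value of the fitted mean-dispersion trend for this gene
    """
    if sharing_mode == "gene-est-only":
        return gene_estimate
    if sharing_mode == "fit-only":
        return fitted_value
    if sharing_mode == "maximum":
        # Believe high estimates, pull low estimates up to the fitted line.
        return max(gene_estimate, fitted_value)
    raise ValueError(f"unknown sharing mode: {sharing_mode!r}")


# Hypothetical (gene_estimate, fitted_value) pairs, for illustration only.
genes = {
    "low_outlier": (0.01, 0.10),
    "typical": (0.09, 0.10),
    "high_outlier": (0.60, 0.10),
}

for name, (estimate, fitted) in genes.items():
    chosen = {mode: shared_dispersion(estimate, fitted, mode)
              for mode in SHARING_MODES}
    print(name, chosen)
```

    In this sketch "fit-only" assigns the high outlier a dispersion of 0.10 although its own data suggest 0.60, which is how such genes can end up with too low p values. "gene-est-only" keeps the implausibly small 0.01 for the low outlier, and "maximum" avoids both at the price of being conservative.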


    • #3
      Hi Simon,

      Thanks for that answer - it's very useful and clears up a couple of things I've been wondering about!

      One further question related to this:- Will the FDR calculation still be roughly accurate with such a p-value distribution (assuming genuine change is detected)?

      By the way I think the support you guys provide on this forum is great, it really helps with getting to grips with some of the trickier areas of this type of analysis.


      • #4
        Originally posted by Justin AC Powell
        One further question related to this:- Will the FDR calculation still be roughly accurate with such a p-value distribution (assuming genuine change is detected)?
        Well, as the p values are a bit conservative, the FDR calculation will be so, too, but as long as you err on the conservative side, you are always safe.
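        As a small illustration of this point, here is a sketch in Python, assuming the FDR in question is controlled with the Benjamini-Hochberg procedure. The function name bh_adjust and the p values are made up for the example, and doubling every p value is only a crude stand-in for "conservative" p values.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest so that the adjusted
    # values are non-decreasing in the raw p values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p values, and a "conservative" version where each one is doubled.
pvals = [0.001, 0.008, 0.02, 0.04, 0.30, 0.70]
conservative = [min(1.0, 2 * p) for p in pvals]

padj = bh_adjust(pvals)
padj_conservative = bh_adjust(conservative)

hits = sum(p < 0.05 for p in padj)
hits_conservative = sum(p < 0.05 for p in padj_conservative)
print(hits, hits_conservative)
```

        Because the adjustment is monotone, making p values larger can only make the adjusted values larger. The conservative version therefore calls no more genes than the original at the same threshold, so power is lost but the false discovery rate is not inflated.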

        Simon
