I have a couple of questions about DESeq (and my particular analysis).
1) I have a data set with three control and three treated samples. I've seen it said that sharingMode="gene-est-only" is not appropriate for a data set with this few replicates (even with method="pooled") because under-estimation of the variance causes false positives. My experience supports this: with "gene-est-only" I get many more very low adjusted p-values than I'd expect, even if I permute the samples between control and treated, which should remove genuinely significant results. This is not thought to be a problem with array-based experiments of this size when just doing a t-test, and indeed if I take my RNA-seq data, convert to RPKM and then do a straight t-test (using R's t.test) I DON'T see this - the p-value distribution is flat. My question is: why is this? Are negative binomial based methods like DESeq and edgeR more susceptible to false positives caused by small sample sizes than a t-test? In other words, how come for a 3 control, 3 treated data set it seems *essential* to use the variance pooling approaches with edgeR/DESeq when it isn't with a t-test? Do I have something figured out wrong?
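To make the variance under-estimation point in (1) concrete, here is a minimal sketch in Python (not DESeq itself, and not my real data). It assumes normally distributed null data with true variance 1 rather than negative binomial counts, and the function name and parameters are made up for illustration. For each simulated gene it estimates the variance from 3 vs 3 replicates and then tests the difference as if that estimate were the known variance, which is roughly what I suspect happens when a per-gene estimate is plugged straight into the test.

```python
import math
import random


def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def simulate_plugin_test(n_genes=20000, n_rep=3, alpha=0.05, seed=0):
    """Null simulation: no gene is truly different between groups.

    Returns (fraction of genes whose variance estimate is below the
    true variance of 1, fraction of genes called significant when the
    per-gene estimate is treated as the known variance).
    """
    rng = random.Random(seed)
    n_below = 0
    n_rejected = 0
    for _ in range(n_genes):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_rep)]
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_rep)]
        # Per-gene pooled variance estimate from only 3 + 3 samples.
        s2 = (sample_variance(control) + sample_variance(treated)) / 2.0
        if s2 < 1.0:
            n_below += 1
        diff = sum(treated) / n_rep - sum(control) / n_rep
        z = diff / math.sqrt(2.0 * s2 / n_rep)
        # Two-sided p-value from the normal distribution, i.e. ignoring
        # the uncertainty in s2 (a t-test would use heavier tails here).
        p = math.erfc(abs(z) / math.sqrt(2.0))
        if p < alpha:
            n_rejected += 1
    return n_below / n_genes, n_rejected / n_genes


frac_below, false_positive_rate = simulate_plugin_test()
print("fraction of variance estimates below truth:", frac_below)
print("false positive rate at alpha = 0.05:", false_positive_rate)
```

In this toy setting more than half of the per-gene estimates fall below the true variance, and the plug-in test rejects well above the nominal 5%. A t-test on the same data would compensate through the t distribution's heavier tails, which may be part of the answer, but whether the same reasoning carries over to the negative binomial test is exactly what I am asking.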
2) What would explain the upward slope in the following p-value distribution for the above experiment using DESeq with standard parameters (sharingMode="maximum", method="pooled")? My guess is that it results from the actual variance often being lower than what DESeq estimates, meaning that the results are closer to the null distribution than would be expected by random chance - but how does that happen? One possibility I am investigating is that this data set is actually the result of immunoprecipitation from tagged ribosomes, and I think the variance may partly be a function of the enrichment level for each gene. In that case one can't rely on genes of similar expression levels having similar sized errors (as they may have very different levels of enrichment), which may be a problem when sharing variance information. However, I can't switch to "gene-est-only" because there are not enough replicates. Currently this is only a guess, so I'd be interested in any insight as to the causes of such a p-value distribution...
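As a sanity check on the guess in (2), here is a small sketch in Python of what an over-estimated variance does to null p-values. It is a toy model and not DESeq: it assumes normal test statistics, and the function name and the `inflation` parameter (the factor by which the assumed variance exceeds the true one) are invented for illustration.

```python
import math
import random


def null_pvalue_histogram(inflation=2.0, n_genes=20000, n_bins=10, seed=1):
    """Histogram of two-sided null p-values when the variance used in
    the test is `inflation` times the true variance."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_genes):
        z = rng.gauss(0.0, 1.0)            # correctly scaled null statistic
        z_used = z / math.sqrt(inflation)  # shrunk by the inflated variance
        p = math.erfc(abs(z_used) / math.sqrt(2.0))
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


print("variance correct:      ", null_pvalue_histogram(inflation=1.0))
print("variance over-estimated:", null_pvalue_histogram(inflation=2.0))
```

With the correct variance the bins come out roughly equal, while with an inflated variance the counts rise towards p = 1, which is the same upward slope I see. That is consistent with my guess but does not explain why DESeq would over-estimate the variance for so many genes.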
Best regards,
Justin