Seqanswers Leaderboard Ad

**dpryan** · 08-04-2014, 01:25 AM

It's more likely that you have an outlier sample and this gene got flagged because of that. Have a look at section 1.4.2 of the vignette and then section 3.5 ("Dealing with count outliers").

I should note that I've also been bitten by this once, where a hemizygous gene in one group simply had a weird distribution due to the genetic modification. So, the outlier detection isn't exactly infallible (not that that'll surprise anyone).

**Wolfgang Huber** · 08-04-2014, 01:26 AM

Dear ExMachina

Have a look at the documentation of the package, i.e. the vignette or the man page of the results function. There it says:

By default, independent filtering is performed to select a set of genes for multiple test correction which will optimize the number of adjusted p-values less than a given critical value alpha (by default 0.1). The adjusted p-values for the genes which do not pass the filter threshold are set to NA. By default, the mean of normalized counts is used to perform this filtering, though other statistics can be provided. Several arguments from the filtered_p function of genefilter are provided here to control or turn off the independent filtering behavior.

In addition, results by default assigns a p-value of NA to genes containing count outliers, as identified using Cook's distance. See the cooksCutoff argument for control of this behavior. Cook's distances for each sample are accessible as a matrix "cooks" stored in the assays() list. This measure is useful for identifying rows where the observed counts might not fit to a negative binomial distribution.

In your case, it sounds like the second one applies?

PS I assume you are using a recent version of the package (always post the output of sessionInfo() for such posts).

Kind regards
Wolfgang

**ExMachina** · 08-04-2014, 04:02 AM

Thanks Wolfgang and dpryan.

Yes, it seems like it is getting flagged for not fitting the NegBin dist. So it sounds like adjusting the Cook's cutoff is what I need to look into?

Is there a simplified explanation of what this really "means"?--I'm shortly going to have to explain to a bunch of people why a gene that appears so obviously differentially expressed (in terms the distributions and magnitudes of the read counts) is being flagged by DESeq2. The complicating factor is that the DE of this gene is very well established in one of the cases I'm looking at; consequently, I'm sure many will raise a concern about the interpretation (and appropriateness) of "NA" genes in the cases of other genes.

FYI, sessionInfo:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] gplots_2.14.1 RColorBrewer_1.0-5 vsn_3.32.0 Biobase_2.24.0
[5] DESeq2_1.4.5 RcppArmadillo_0.4.320.0 Rcpp_0.11.2 GenomicRanges_1.16.4
[9] GenomeInfoDb_1.0.2 IRanges_1.22.10 BiocGenerics_0.10.0 BiocInstaller_1.14.2

loaded via a namespace (and not attached):
[1] affy_1.42.3 affyio_1.32.0 annotate_1.42.1 AnnotationDbi_1.26.0
[5] bitops_1.0-6 caTools_1.17 DBI_0.2-7 gdata_2.13.3
[9] genefilter_1.46.1 geneplotter_1.42.0 grid_3.1.1 gtools_3.4.1
[13] KernSmooth_2.23-12 lattice_0.20-29 limma_3.20.8 locfit_1.5-9.1
[17] preprocessCore_1.26.1 RSQLite_0.11.4 splines_3.1.1 stats4_3.1.1
[21] survival_2.37-7 tools_3.1.1 XML_3.98-1.1 xtable_1.7-3
[25] XVector_0.4.0 zlibbioc_1.10.0

**Michael Love** · 08-04-2014, 04:16 AM

hi,

You can find plain language explanations for the Cook's filtering in two places: the vignette has a simplified explanation in the 'Theory' section and the citation of the package also explains it.

We use the Cook's distance as a diagnostic to tell if a single sample has a count which has a disproportionate impact on the log fold change and p-values. You can turn off the filtering and inspect the genes manually with:

results(dds, cooksCutoff=FALSE)

The maximum of Cook's distances for a row is in the column:

mcols(dds)$maxCooks

Or the matrix for all genes and samples:

assays(dds)[["cooks"]]

**ExMachina** · 08-04-2014, 05:18 AM

What I meant to ask was, is an "NA" in this case best interpreted as as potential false positive requiring further inspection, or is it a warning that any assigned p-value may very well be meaningless? The former is what it sounds like it means, but I know of some people taking a Cook's Cutoff assigned "NA" as an indication that that gene somehow failed to show DE.

Is it wrong-headed (or simply naive) to tinker with the Cooks cutoff until the cutoff is able to accommodate a gene that is known to be DE? Or, if a sample is truly causing a problem, would it be better to just remove that sample than risk letting a lot of artifacts leak through.

Thanks again. The answers here have already been very helpful.

**Michael Love** · 08-04-2014, 05:39 AM

The NA p-value is a warning about a potential outlier. It's not "failed to show DE".

I myself have turned off Cook's cutoff and inspected the counts of rows with high maxCooks manually, for an experiment which had low variance of counts for all but one condition, which had highly variable counts due to details of the protocol.

Additionally, working with this and a few other datasets with one highly variable condition, I have made a small change in the development branch (will be released in 2 months) which helps to avoid too much outlier calling in such cases.

**ExMachina** · 08-04-2014, 06:52 AM

Thank you for the additional perspective. So basically a good starting strategy would seem be to run DESeq2 twice (once with and once without Cooks) and then append the Cooks flags to the non-Cooks output. Then, visually inspect from there. Then perhaps rerun with an adjusted cutoff (or a sample discarded). Sound reasonable?

**Michael Love** · 08-04-2014, 08:05 AM

Note that you don't have to rerun DESeq(), you can simply adjust with results(). otherwise, yes, this makes sense for certain datasets with unequal dispersion between groups until the development code is released.

**Wolfgang Huber** · 08-04-2014, 09:43 AM

Originally posted by ExMachina View Post

My gene of interest has counts ranging from 20,000 to over 100,000 reads in my "case" samples, where as the same gene has counts only in the 50-200 reads in my "control" sample. I have 4 biological replicates in each group. The log2 fold change is nearly 4 but the p-value is "NA".

Note that something is missing in what you describe: if the lowest count in your cases is really 20000 and the highest in your controls 200, then that's a fold change of over 100, or >6.5 on the log2 scale. Perhaps you can send us a plot of your gene's values in the 8 samples.

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 10 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 57 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

DESeq2: interpreting a "NA" p-value?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News