SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DESeq2:could not find function "Error" learnholickyle Bioinformatics 9 02-27-2014 04:59 PM
DEXSeq error in estimateDispersions: match.arg(start.method, c("log(y)", "mean")) fpadilla Bioinformatics 14 07-03-2013 02:11 PM
The position file formats ".clocs" and "_pos.txt"? Ist there any difference? elgor Illumina/Solexa 0 06-27-2011 07:55 AM
"Systems biology and administration" & "Genome generation: no engineering allowed" seb567 Bioinformatics 0 05-25-2010 12:19 PM
SEQanswers second "publication": "How to map billions of short reads onto genomes" ECO Literature Watch 0 06-29-2009 11:49 PM

Reply
 
Thread Tools
Old 08-03-2014, 05:02 PM   #1
ExMachina
Member
 
Location: AATGCATCTCATGTTATCCTTCTCGAG

Join Date: Mar 2010
Posts: 14
Default DESeq2: interpreting a "NA" p-value?

My gene of interest has counts ranging from 20,000 to over 100,000 reads in my "case" samples, where as the same gene has counts only in the 50-200 reads in my "control" sample. I have 4 biological replicates in each group. The log2 fold change is nearly 4 but the p-value is "NA". I'm guessing that the calculation has reached some sort of a limit in this case but would like to know how to interpret/report this value?

Thanks

Last edited by ExMachina; 08-03-2014 at 05:22 PM.
ExMachina is offline   Reply With Quote
Old 08-04-2014, 01:25 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

It's more likely that you have an outlier sample and this gene got flagged because of that. Have a look at section 1.4.2 of the vignette and then section 3.5 ("Dealing with count outliers").

I should note that I've also been bitten by this once, where a hemizygous gene in one group simply had a weird distribution due to the genetic modification. So, the outlier detection isn't exactly infallible (not that that'll surprise anyone).
dpryan is offline   Reply With Quote
Old 08-04-2014, 01:26 AM   #3
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 109
Default

Dear ExMachina

Have a look at the documentation of the package, i.e. the vignette or the man page of the results function. There it says:
By default, independent filtering is performed to select a set of genes for multiple test correction which will optimize the number of adjusted p-values less than a given critical value alpha (by default 0.1). The adjusted p-values for the genes which do not pass the filter threshold are set to NA. By default, the mean of normalized counts is used to perform this filtering, though other statistics can be provided. Several arguments from the filtered_p function of genefilter are provided here to control or turn off the independent filtering behavior.

In addition, results by default assigns a p-value of NA to genes containing count outliers, as identified using Cook's distance. See the cooksCutoff argument for control of this behavior. Cook's distances for each sample are accessible as a matrix "cooks" stored in the assays() list. This measure is useful for identifying rows where the observed counts might not fit to a negative binomial distribution.
In your case, it sounds like the second one applies?

PS I assume you are using a recent version of the package (always post the output of sessionInfo() for such posts).

Kind regards
Wolfgang
__________________
Wolfgang Huber
EMBL
Wolfgang Huber is offline   Reply With Quote
Old 08-04-2014, 04:02 AM   #4
ExMachina
Member
 
Location: AATGCATCTCATGTTATCCTTCTCGAG

Join Date: Mar 2010
Posts: 14
Default

Thanks Wolfgang and dpryan.

Yes, it seems like it is getting flagged for not fitting the NegBin dist. So it sounds like adjusting the Cook's cutoff is what I need to look into?

Is there a simplified explanation of what this really "means"?--I'm shortly going to have to explain to a bunch of people why a gene that appears so obviously differentially expressed (in terms the distributions and magnitudes of the read counts) is being flagged by DESeq2. The complicating factor is that the DE of this gene is very well established in one of the cases I'm looking at; consequently, I'm sure many will raise a concern about the interpretation (and appropriateness) of "NA" genes in the cases of other genes.

FYI, sessionInfo:
Quote:
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] gplots_2.14.1 RColorBrewer_1.0-5 vsn_3.32.0 Biobase_2.24.0
[5] DESeq2_1.4.5 RcppArmadillo_0.4.320.0 Rcpp_0.11.2 GenomicRanges_1.16.4
[9] GenomeInfoDb_1.0.2 IRanges_1.22.10 BiocGenerics_0.10.0 BiocInstaller_1.14.2

loaded via a namespace (and not attached):
[1] affy_1.42.3 affyio_1.32.0 annotate_1.42.1 AnnotationDbi_1.26.0
[5] bitops_1.0-6 caTools_1.17 DBI_0.2-7 gdata_2.13.3
[9] genefilter_1.46.1 geneplotter_1.42.0 grid_3.1.1 gtools_3.4.1
[13] KernSmooth_2.23-12 lattice_0.20-29 limma_3.20.8 locfit_1.5-9.1
[17] preprocessCore_1.26.1 RSQLite_0.11.4 splines_3.1.1 stats4_3.1.1
[21] survival_2.37-7 tools_3.1.1 XML_3.98-1.1 xtable_1.7-3
[25] XVector_0.4.0 zlibbioc_1.10.0

Last edited by ExMachina; 08-04-2014 at 04:23 AM.
ExMachina is offline   Reply With Quote
Old 08-04-2014, 04:16 AM   #5
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi,

You can find plain language explanations for the Cook's filtering in two places: the vignette has a simplified explanation in the 'Theory' section and the citation of the package also explains it.

We use the Cook's distance as a diagnostic to tell if a single sample has a count which has a disproportionate impact on the log fold change and p-values. You can turn off the filtering and inspect the genes manually with:

results(dds, cooksCutoff=FALSE)

The maximum of Cook's distances for a row is in the column:

mcols(dds)$maxCooks

Or the matrix for all genes and samples:

assays(dds)[["cooks"]]
Michael Love is offline   Reply With Quote
Old 08-04-2014, 05:18 AM   #6
ExMachina
Member
 
Location: AATGCATCTCATGTTATCCTTCTCGAG

Join Date: Mar 2010
Posts: 14
Default

What I meant to ask was, is an "NA" in this case best interpreted as as potential false positive requiring further inspection, or is it a warning that any assigned p-value may very well be meaningless? The former is what it sounds like it means, but I know of some people taking a Cook's Cutoff assigned "NA" as an indication that that gene somehow failed to show DE.

Is it wrong-headed (or simply naive) to tinker with the Cooks cutoff until the cutoff is able to accommodate a gene that is known to be DE? Or, if a sample is truly causing a problem, would it be better to just remove that sample than risk letting a lot of artifacts leak through.

Thanks again. The answers here have already been very helpful.
ExMachina is offline   Reply With Quote
Old 08-04-2014, 05:39 AM   #7
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

The NA p-value is a warning about a potential outlier. It's not "failed to show DE".

I myself have turned off Cook's cutoff and inspected the counts of rows with high maxCooks manually, for an experiment which had low variance of counts for all but one condition, which had highly variable counts due to details of the protocol.

Additionally, working with this and a few other datasets with one highly variable condition, I have made a small change in the development branch (will be released in 2 months) which helps to avoid too much outlier calling in such cases.
Michael Love is offline   Reply With Quote
Old 08-04-2014, 06:52 AM   #8
ExMachina
Member
 
Location: AATGCATCTCATGTTATCCTTCTCGAG

Join Date: Mar 2010
Posts: 14
Default

Thank you for the additional perspective. So basically a good starting strategy would seem be to run DESeq2 twice (once with and once without Cooks) and then append the Cooks flags to the non-Cooks output. Then, visually inspect from there. Then perhaps rerun with an adjusted cutoff (or a sample discarded). Sound reasonable?
ExMachina is offline   Reply With Quote
Old 08-04-2014, 08:05 AM   #9
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Note that you don't have to rerun DESeq(), you can simply adjust with results(). otherwise, yes, this makes sense for certain datasets with unequal dispersion between groups until the development code is released.

Last edited by Michael Love; 08-04-2014 at 08:06 AM. Reason: var => disp
Michael Love is offline   Reply With Quote
Old 08-04-2014, 09:43 AM   #10
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 109
Default

Quote:
Originally Posted by ExMachina View Post
My gene of interest has counts ranging from 20,000 to over 100,000 reads in my "case" samples, where as the same gene has counts only in the 50-200 reads in my "control" sample. I have 4 biological replicates in each group. The log2 fold change is nearly 4 but the p-value is "NA".
Note that something is missing in what you describe: if the lowest count in your cases is really 20000 and the highest in your controls 200, then that's a fold change of over 100, or >6.5 on the log2 scale. Perhaps you can send us a plot of your gene's values in the 8 samples.
__________________
Wolfgang Huber
EMBL
Wolfgang Huber is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:58 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO