Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • ExMachina
    Member
    • Mar 2010
    • 14

    DESeq2: interpreting a "NA" p-value?

    My gene of interest has counts ranging from 20,000 to over 100,000 reads in my "case" samples, where as the same gene has counts only in the 50-200 reads in my "control" sample. I have 4 biological replicates in each group. The log2 fold change is nearly 4 but the p-value is "NA". I'm guessing that the calculation has reached some sort of a limit in this case but would like to know how to interpret/report this value?

    Thanks
    Last edited by ExMachina; 08-03-2014, 05:22 PM.
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    It's more likely that you have an outlier sample and this gene got flagged because of that. Have a look at section 1.4.2 of the vignette and then section 3.5 ("Dealing with count outliers").

    I should note that I've also been bitten by this once, where a hemizygous gene in one group simply had a weird distribution due to the genetic modification. So, the outlier detection isn't exactly infallible (not that that'll surprise anyone).

    Comment

    • Wolfgang Huber
      Senior Member
      • Aug 2009
      • 109

      #3
      Dear ExMachina

      Have a look at the documentation of the package, i.e. the vignette or the man page of the results function. There it says:
      By default, independent filtering is performed to select a set of genes for multiple test correction which will optimize the number of adjusted p-values less than a given critical value alpha (by default 0.1). The adjusted p-values for the genes which do not pass the filter threshold are set to NA. By default, the mean of normalized counts is used to perform this filtering, though other statistics can be provided. Several arguments from the filtered_p function of genefilter are provided here to control or turn off the independent filtering behavior.

      In addition, results by default assigns a p-value of NA to genes containing count outliers, as identified using Cook's distance. See the cooksCutoff argument for control of this behavior. Cook's distances for each sample are accessible as a matrix "cooks" stored in the assays() list. This measure is useful for identifying rows where the observed counts might not fit to a negative binomial distribution.

      In your case, it sounds like the second one applies?

      PS I assume you are using a recent version of the package (always post the output of sessionInfo() for such posts).

      Kind regards
      Wolfgang
      Wolfgang Huber
      EMBL

      Comment

      • ExMachina
        Member
        • Mar 2010
        • 14

        #4
        Thanks Wolfgang and dpryan.

        Yes, it seems like it is getting flagged for not fitting the NegBin dist. So it sounds like adjusting the Cook's cutoff is what I need to look into?

        Is there a simplified explanation of what this really "means"?--I'm shortly going to have to explain to a bunch of people why a gene that appears so obviously differentially expressed (in terms the distributions and magnitudes of the read counts) is being flagged by DESeq2. The complicating factor is that the DE of this gene is very well established in one of the cases I'm looking at; consequently, I'm sure many will raise a concern about the interpretation (and appropriateness) of "NA" genes in the cases of other genes.

        FYI, sessionInfo:
        > sessionInfo()
        R version 3.1.1 (2014-07-10)
        Platform: x86_64-w64-mingw32/x64 (64-bit)

        locale:
        [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
        [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
        [5] LC_TIME=English_United States.1252

        attached base packages:
        [1] parallel stats graphics grDevices utils datasets methods base

        other attached packages:
        [1] gplots_2.14.1 RColorBrewer_1.0-5 vsn_3.32.0 Biobase_2.24.0
        [5] DESeq2_1.4.5 RcppArmadillo_0.4.320.0 Rcpp_0.11.2 GenomicRanges_1.16.4
        [9] GenomeInfoDb_1.0.2 IRanges_1.22.10 BiocGenerics_0.10.0 BiocInstaller_1.14.2

        loaded via a namespace (and not attached):
        [1] affy_1.42.3 affyio_1.32.0 annotate_1.42.1 AnnotationDbi_1.26.0
        [5] bitops_1.0-6 caTools_1.17 DBI_0.2-7 gdata_2.13.3
        [9] genefilter_1.46.1 geneplotter_1.42.0 grid_3.1.1 gtools_3.4.1
        [13] KernSmooth_2.23-12 lattice_0.20-29 limma_3.20.8 locfit_1.5-9.1
        [17] preprocessCore_1.26.1 RSQLite_0.11.4 splines_3.1.1 stats4_3.1.1
        [21] survival_2.37-7 tools_3.1.1 XML_3.98-1.1 xtable_1.7-3
        [25] XVector_0.4.0 zlibbioc_1.10.0
        Last edited by ExMachina; 08-04-2014, 04:23 AM.

        Comment

        • Michael Love
          Senior Member
          • Jul 2013
          • 333

          #5
          hi,

          You can find plain language explanations for the Cook's filtering in two places: the vignette has a simplified explanation in the 'Theory' section and the citation of the package also explains it.

          We use the Cook's distance as a diagnostic to tell if a single sample has a count which has a disproportionate impact on the log fold change and p-values. You can turn off the filtering and inspect the genes manually with:

          results(dds, cooksCutoff=FALSE)

          The maximum of Cook's distances for a row is in the column:

          mcols(dds)$maxCooks

          Or the matrix for all genes and samples:

          assays(dds)[["cooks"]]

          Comment

          • ExMachina
            Member
            • Mar 2010
            • 14

            #6
            What I meant to ask was, is an "NA" in this case best interpreted as as potential false positive requiring further inspection, or is it a warning that any assigned p-value may very well be meaningless? The former is what it sounds like it means, but I know of some people taking a Cook's Cutoff assigned "NA" as an indication that that gene somehow failed to show DE.

            Is it wrong-headed (or simply naive) to tinker with the Cooks cutoff until the cutoff is able to accommodate a gene that is known to be DE? Or, if a sample is truly causing a problem, would it be better to just remove that sample than risk letting a lot of artifacts leak through.

            Thanks again. The answers here have already been very helpful.

            Comment

            • Michael Love
              Senior Member
              • Jul 2013
              • 333

              #7
              The NA p-value is a warning about a potential outlier. It's not "failed to show DE".

              I myself have turned off Cook's cutoff and inspected the counts of rows with high maxCooks manually, for an experiment which had low variance of counts for all but one condition, which had highly variable counts due to details of the protocol.

              Additionally, working with this and a few other datasets with one highly variable condition, I have made a small change in the development branch (will be released in 2 months) which helps to avoid too much outlier calling in such cases.

              Comment

              • ExMachina
                Member
                • Mar 2010
                • 14

                #8
                Thank you for the additional perspective. So basically a good starting strategy would seem be to run DESeq2 twice (once with and once without Cooks) and then append the Cooks flags to the non-Cooks output. Then, visually inspect from there. Then perhaps rerun with an adjusted cutoff (or a sample discarded). Sound reasonable?

                Comment

                • Michael Love
                  Senior Member
                  • Jul 2013
                  • 333

                  #9
                  Note that you don't have to rerun DESeq(), you can simply adjust with results(). otherwise, yes, this makes sense for certain datasets with unequal dispersion between groups until the development code is released.
                  Last edited by Michael Love; 08-04-2014, 08:06 AM. Reason: var => disp

                  Comment

                  • Wolfgang Huber
                    Senior Member
                    • Aug 2009
                    • 109

                    #10
                    Originally posted by ExMachina View Post
                    My gene of interest has counts ranging from 20,000 to over 100,000 reads in my "case" samples, where as the same gene has counts only in the 50-200 reads in my "control" sample. I have 4 biological replicates in each group. The log2 fold change is nearly 4 but the p-value is "NA".
                    Note that something is missing in what you describe: if the lowest count in your cases is really 20000 and the highest in your controls 200, then that's a fold change of over 100, or >6.5 on the log2 scale. Perhaps you can send us a plot of your gene's values in the 8 samples.
                    Wolfgang Huber
                    EMBL

                    Comment

                    Latest Articles

                    Collapse

                    • SEQadmin2
                      Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                      by SEQadmin2


                      I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                      Here are nine questions we think about, in roughly the order they matter, before...
                      06-18-2026, 07:11 AM
                    • SEQadmin2
                      From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                      by SEQadmin2


                      Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                      The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                      ...
                      06-02-2026, 10:05 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by SEQadmin2, 06-17-2026, 06:09 AM
                    0 responses
                    31 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-09-2026, 11:58 AM
                    0 responses
                    97 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-05-2026, 10:09 AM
                    0 responses
                    117 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-04-2026, 08:59 AM
                    0 responses
                    109 views
                    0 reactions
                    Last Post SEQadmin2  
                    Working...