Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2: interpreting a "NA" p-value?

    My gene of interest has counts ranging from 20,000 to over 100,000 reads in my "case" samples, where as the same gene has counts only in the 50-200 reads in my "control" sample. I have 4 biological replicates in each group. The log2 fold change is nearly 4 but the p-value is "NA". I'm guessing that the calculation has reached some sort of a limit in this case but would like to know how to interpret/report this value?

    Thanks
    Last edited by ExMachina; 08-03-2014, 05:22 PM.

  • #2
    It's more likely that you have an outlier sample and this gene got flagged because of that. Have a look at section 1.4.2 of the vignette and then section 3.5 ("Dealing with count outliers").

    I should note that I've also been bitten by this once, where a hemizygous gene in one group simply had a weird distribution due to the genetic modification. So, the outlier detection isn't exactly infallible (not that that'll surprise anyone).

    Comment


    • #3
      Dear ExMachina

      Have a look at the documentation of the package, i.e. the vignette or the man page of the results function. There it says:
      By default, independent filtering is performed to select a set of genes for multiple test correction which will optimize the number of adjusted p-values less than a given critical value alpha (by default 0.1). The adjusted p-values for the genes which do not pass the filter threshold are set to NA. By default, the mean of normalized counts is used to perform this filtering, though other statistics can be provided. Several arguments from the filtered_p function of genefilter are provided here to control or turn off the independent filtering behavior.

      In addition, results by default assigns a p-value of NA to genes containing count outliers, as identified using Cook's distance. See the cooksCutoff argument for control of this behavior. Cook's distances for each sample are accessible as a matrix "cooks" stored in the assays() list. This measure is useful for identifying rows where the observed counts might not fit to a negative binomial distribution.

      In your case, it sounds like the second one applies?

      PS I assume you are using a recent version of the package (always post the output of sessionInfo() for such posts).

      Kind regards
      Wolfgang
      Wolfgang Huber
      EMBL

      Comment


      • #4
        Thanks Wolfgang and dpryan.

        Yes, it seems like it is getting flagged for not fitting the NegBin dist. So it sounds like adjusting the Cook's cutoff is what I need to look into?

        Is there a simplified explanation of what this really "means"?--I'm shortly going to have to explain to a bunch of people why a gene that appears so obviously differentially expressed (in terms the distributions and magnitudes of the read counts) is being flagged by DESeq2. The complicating factor is that the DE of this gene is very well established in one of the cases I'm looking at; consequently, I'm sure many will raise a concern about the interpretation (and appropriateness) of "NA" genes in the cases of other genes.

        FYI, sessionInfo:
        > sessionInfo()
        R version 3.1.1 (2014-07-10)
        Platform: x86_64-w64-mingw32/x64 (64-bit)

        locale:
        [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
        [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
        [5] LC_TIME=English_United States.1252

        attached base packages:
        [1] parallel stats graphics grDevices utils datasets methods base

        other attached packages:
        [1] gplots_2.14.1 RColorBrewer_1.0-5 vsn_3.32.0 Biobase_2.24.0
        [5] DESeq2_1.4.5 RcppArmadillo_0.4.320.0 Rcpp_0.11.2 GenomicRanges_1.16.4
        [9] GenomeInfoDb_1.0.2 IRanges_1.22.10 BiocGenerics_0.10.0 BiocInstaller_1.14.2

        loaded via a namespace (and not attached):
        [1] affy_1.42.3 affyio_1.32.0 annotate_1.42.1 AnnotationDbi_1.26.0
        [5] bitops_1.0-6 caTools_1.17 DBI_0.2-7 gdata_2.13.3
        [9] genefilter_1.46.1 geneplotter_1.42.0 grid_3.1.1 gtools_3.4.1
        [13] KernSmooth_2.23-12 lattice_0.20-29 limma_3.20.8 locfit_1.5-9.1
        [17] preprocessCore_1.26.1 RSQLite_0.11.4 splines_3.1.1 stats4_3.1.1
        [21] survival_2.37-7 tools_3.1.1 XML_3.98-1.1 xtable_1.7-3
        [25] XVector_0.4.0 zlibbioc_1.10.0
        Last edited by ExMachina; 08-04-2014, 04:23 AM.

        Comment


        • #5
          hi,

          You can find plain language explanations for the Cook's filtering in two places: the vignette has a simplified explanation in the 'Theory' section and the citation of the package also explains it.

          We use the Cook's distance as a diagnostic to tell if a single sample has a count which has a disproportionate impact on the log fold change and p-values. You can turn off the filtering and inspect the genes manually with:

          results(dds, cooksCutoff=FALSE)

          The maximum of Cook's distances for a row is in the column:

          mcols(dds)$maxCooks

          Or the matrix for all genes and samples:

          assays(dds)[["cooks"]]

          Comment


          • #6
            What I meant to ask was, is an "NA" in this case best interpreted as as potential false positive requiring further inspection, or is it a warning that any assigned p-value may very well be meaningless? The former is what it sounds like it means, but I know of some people taking a Cook's Cutoff assigned "NA" as an indication that that gene somehow failed to show DE.

            Is it wrong-headed (or simply naive) to tinker with the Cooks cutoff until the cutoff is able to accommodate a gene that is known to be DE? Or, if a sample is truly causing a problem, would it be better to just remove that sample than risk letting a lot of artifacts leak through.

            Thanks again. The answers here have already been very helpful.

            Comment


            • #7
              The NA p-value is a warning about a potential outlier. It's not "failed to show DE".

              I myself have turned off Cook's cutoff and inspected the counts of rows with high maxCooks manually, for an experiment which had low variance of counts for all but one condition, which had highly variable counts due to details of the protocol.

              Additionally, working with this and a few other datasets with one highly variable condition, I have made a small change in the development branch (will be released in 2 months) which helps to avoid too much outlier calling in such cases.

              Comment


              • #8
                Thank you for the additional perspective. So basically a good starting strategy would seem be to run DESeq2 twice (once with and once without Cooks) and then append the Cooks flags to the non-Cooks output. Then, visually inspect from there. Then perhaps rerun with an adjusted cutoff (or a sample discarded). Sound reasonable?

                Comment


                • #9
                  Note that you don't have to rerun DESeq(), you can simply adjust with results(). otherwise, yes, this makes sense for certain datasets with unequal dispersion between groups until the development code is released.
                  Last edited by Michael Love; 08-04-2014, 08:06 AM. Reason: var => disp

                  Comment


                  • #10
                    Originally posted by ExMachina View Post
                    My gene of interest has counts ranging from 20,000 to over 100,000 reads in my "case" samples, where as the same gene has counts only in the 50-200 reads in my "control" sample. I have 4 biological replicates in each group. The log2 fold change is nearly 4 but the p-value is "NA".
                    Note that something is missing in what you describe: if the lowest count in your cases is really 20000 and the highest in your controls 200, then that's a fold change of over 100, or >6.5 on the log2 scale. Perhaps you can send us a plot of your gene's values in the 8 samples.
                    Wolfgang Huber
                    EMBL

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    22 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    24 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    20 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    52 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X