Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2: dealing with count outliers and interpretation of the results

    Sorry for repeating my message - just so that others can follow the discussion:

    Hello,

    I have a question regarding replaceOutliers function. More precisely, it looks like after using replaceOutliers, there are cases that non-outlier becomes an outlier.

    I have 9 samples: 3 replicates for 3 conditions (6 days, 12 days and 28 days, as factors).

    Here is what I do and what I get:
    > des2.dataset <- DESeqDataSetFromMatrix(countData = countDF, colData = pheno[colnames(countDF),], design = ~ days)

    countDF is just a data frame of raw counts. pheno is a data frame with information for each sample, including column "days".

    > des2.mf <- DESeq(des2.dataset)

    > des2.replace <- DESeq(replaceOutliers(des2.mf,minReplicates = 3))

    Now, I have 160 genes, that by default detection method were not detected to be outliers. But after outlier replacement, they became outliers.

    Here is an example of one such gene:

    > assays(des2.mf)[["cooks"]]["12484",]
    RME180 RME181 RME182 RME183 RME184 RME185 RME186
    0.1072452 6.6523800 2.3861998 0.1974414 0.1226845 3.0418715 0.2286905
    RME187 RMN013
    3.1039154 11.1255294

    > assays(des2.replace)[["cooks"]]["12484",]
    RME180 RME181 RME182 RME183 RME184 RME185
    0.07305022 7.57479726 2.35945901 0.12521243 0.00511494 3.05473959
    RME186 RME187 RMN013
    0.53951600 3.10378440 13.54871243


    The cooks distance has changed, and now sample RMN013 has become an outlier... But the counts didn't:

    > assays(des2.mf)[["counts"]]["12484",]
    RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
    18078 28989 1894 28078 4692 27 2373 0 1954
    > assays(des2.replace)[["counts"]]["12484",]
    RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
    18078 28989 1894 28078 4692 27 2373 0 1954

    So my question would be exactly that: Why did cooks distance change?
    Another is merely concern over interpretation (given that this is intended behaviour): if I used replaceOutliers, its hard to justify having outliers in the data.... Furthermore, 5 out of these 160 cases were actually significant before outlier replacement. And I would definitely like to have these 5 in my hit-list.


    --- reply to Michael:
    I did see that in the vignette it is recommended to use replaceOultiers with 7 or more replicates, but for my particular data I see no much choice: with option cooksCutoff=FALSE, I get 2752 differentially expressed genes, including some like following:

    > counts(des2.mf, normalized = TRUE)["73284",]
    RME180 RME181 RME182 RMN013 RME183 RME187
    0.000000 0.000000 0.000000 0.000000 3811.433548 2.043781
    RME184 RME185 RME186
    0.000000 0.000000 0.000000
    > results(des2.mf, contrast = c("days","12","6"), cooksCutoff = FALSE)["73284",]
    baseMean log2FoldChange lfcSE stat pvalue padj
    73284 423.7197 9.376287 2.229498 4.205559 0.00002604376 0.001053495

    Obviously, I would rather exclude this and similar cases. However, when I use default cooksCutoff, I get only 271 differentially expressed genes. Many of those do have outlier counts, but the general pattern in the data is clear:

    > counts(des2.mf, normalized = TRUE)["71803",]
    RME180 RME181 RME182 RMN013 RME183 RME187 RME184
    0.0000 0.0000 0.0000 259.4740 5340.1329 54641.5051 552.1897
    RME185 RME186
    1047.5708 569.0070

    That is why I opted out to using replaceOutliers, as the replacement procedure is conservative and should not lead to increase in false positives. With replaceOutliers I get 1512 differentially expressed genes - much more similar to expected number of differences.

    But the question that I had remains-
    How did it happen that the cooks distance changed while counts did not change?
    how can explain the presence of outliers after doing outlier Replace procedure?

  • #2
    hi,

    The distances changed slightly because other rows had count outliers replaced. This can change all of the experiment-wide estimates like dispersion trend and the distribution of fold changes which are used for estimating the Cook's distance.

    I still wouldn't recommend replacing counts with 3 replicates per condition, as there is too little information to know which count is the outlier. Instead, I'd recommend examining the distribution of mcols(dds)$maxCooks and finding a cutoff from visual inspection which makes sense for your experiment.

    I should adjust the documentation to make clear that it's better to use the DESeq() argument for outlier replacement than replaceOutliers(). DESeq calls replaceOutliers() internally (depending on the argument minReplicatesForReplace), and keeps track of which rows had outlier replaced.

    Comment


    • #3
      Originally posted by Michael Love View Post
      hi,

      The distances changed slightly because other rows had count outliers replaced. This can change all of the experiment-wide estimates like dispersion trend and the distribution of fold changes which are used for estimating the Cook's distance.

      I still wouldn't recommend replacing counts with 3 replicates per condition, as there is too little information to know which count is the outlier. Instead, I'd recommend examining the distribution of mcols(dds)$maxCooks and finding a cutoff from visual inspection which makes sense for your experiment.

      I should adjust the documentation to make clear that it's better to use the DESeq() argument for outlier replacement than replaceOutliers(). DESeq calls replaceOutliers() internally (depending on the argument minReplicatesForReplace), and keeps track of which rows had outlier replaced.
      Hi Michael,

      Thank you for your quick replies. I like very much the idea of creating a cooksCutoff suitable for my particular experiment. I think it would be really great to have such guidelines in the user guide. Until then, if you don't mind, just to make sure I interpret it correctly:

      So I looked at the distribution of log2(maxCooks), and I also plot a current cooksCutoff for my experiment:

      Is it expected(good) that I have such a bi-modal distribution? Should I just then use a cut-off that splits two populations in this distribution?
      Attached Files

      Comment


      • #4
        I wouldn't make a decision based on the bimodality, but instead by looking at the max Cook's distance for your example genes, and then having a sense how many genes have higher values than a given cutoff. Note that the example gene that you want to preserve has a really high count (54641) in the same group as two much smaller counts (259, 5340). So this sample will have a large Cook's distance. As you don't want to filter this gene, you will have to set the Cook's cutoff higher than this.

        Comment


        • #5
          Hi Michael,

          Thanks for your helpful explanation. I have a somewhat similar question regarding outliers in DESeq2. I know that I can access the cook's distance matrix by
          assay(dds)[["cook"]], but is there a way to know which genes where flagged as outliers according to cook's distance? As you mentioned in your reply, "DESeq calls replaceOutliers() internally and keeps track of which rows had outlier replaced", so I'm wondering whether I could access the row numbers somehow?

          Thanks a lot in advance!
          Golsheed





          Originally posted by Michael Love View Post
          hi,

          The distances changed slightly because other rows had count outliers replaced. This can change all of the experiment-wide estimates like dispersion trend and the distribution of fold changes which are used for estimating the Cook's distance.

          I still wouldn't recommend replacing counts with 3 replicates per condition, as there is too little information to know which count is the outlier. Instead, I'd recommend examining the distribution of mcols(dds)$maxCooks and finding a cutoff from visual inspection which makes sense for your experiment.

          I should adjust the documentation to make clear that it's better to use the DESeq() argument for outlier replacement than replaceOutliers(). DESeq calls replaceOutliers() internally (depending on the argument minReplicatesForReplace), and keeps track of which rows had outlier replaced.

          Comment


          • #6
            hi Golsheed,

            The actual filtering is done by results(), and the cooksCutoff can be set by the user, so the 'dds' object cannot know in advance the Cook's-distance-filtered rows.

            If an outlier is replaced by DESeq(), there is a column: mcols(dds)$replace which notes these.

            I've considered adding a column 'outlier' to the results table, but this table already has so many columns that I hesitated to do this.

            You can add this column manually with:

            Code:
            res$outlier = res$baseMean > 0 & is.na(res$pvalue)
            similarly for independent filtering:

            Code:
            res$indepfilt = !is.na(res$pvalue) & is.na(res$padj)

            Comment


            • #7
              Thanks a lot. It's very helpful. Just to make sure I'm doing things right, would you mind answering two more questions please:

              (1) If I run DESeq(dds) once, and then use the function replaceOutliersWithTrimmedMean(dds),
              will it replace "all" the outliers, which are flagged by the cook's distance, with the trimmed mean over all the samples (adjusted by size factors)?

              (2) After using the replaceOutliersWithTrimmedMean(dds) function, I still detect genes with res$pvalue="NA" and res$basemean>0. I don't completely understand why this the case. Is it because after the replacement (with trimmed means) is done, the cook's distance is calculated once again, resulting in some genes begin detected as outliers?

              Thanks,
              Golsheed




              Originally posted by Michael Love View Post
              hi Golsheed,

              The actual filtering is done by results(), and the cooksCutoff can be set by the user, so the 'dds' object cannot know in advance the Cook's-distance-filtered rows.

              If an outlier is replaced by DESeq(), there is a column: mcols(dds)$replace which notes these.

              I've considered adding a column 'outlier' to the results table, but this table already has so many columns that I hesitated to do this.

              You can add this column manually with:

              Code:
              res$outlier = res$baseMean > 0 & is.na(res$pvalue)
              similarly for independent filtering:

              Code:
              res$indepfilt = !is.na(res$pvalue) & is.na(res$padj)

              Comment


              • #8
                sorry for so many replies! I forgot to mention I have a very large sample size (and hence, many degrees of freedom), so that's why I'm using the replaceOutliersWithTrimmedMean
                function, as stated in the vignette.

                Golsheed


                Originally posted by Michael Love View Post
                hi Golsheed,

                The actual filtering is done by results(), and the cooksCutoff can be set by the user, so the 'dds' object cannot know in advance the Cook's-distance-filtered rows.

                If an outlier is replaced by DESeq(), there is a column: mcols(dds)$replace which notes these.

                I've considered adding a column 'outlier' to the results table, but this table already has so many columns that I hesitated to do this.

                You can add this column manually with:

                Code:
                res$outlier = res$baseMean > 0 & is.na(res$pvalue)
                similarly for independent filtering:

                Code:
                res$indepfilt = !is.na(res$pvalue) & is.na(res$padj)

                Comment


                • #9
                  hi,

                  "(1) If I run DESeq(dds) once, and then use the function replaceOutliersWithTrimmedMean(dds),
                  will it replace "all" the outliers, which are flagged by the cook's distance, with the trimmed mean over all the samples (adjusted by size factors)?"


                  DESeq() internally calls replaceOutliers() (which does what you quote above), so you shouldn't use replaceOutliers() after a DESeq() call. I no longer include replaceOutliers() in the demonstration code for this reason. And in the help page for replaceOutliers (in the most recent release) it says, "Note that this function is called within DESeq, so is not necessary to call on top of a DESeq call."

                  (2) "After using the replaceOutliersWithTrimmedMean(dds) function, I still detect genes with res$pvalue="NA" and res$basemean>0. I don't completely understand why this the case. Is it because after the replacement (with trimmed means) is done, the cook's distance is calculated once again, resulting in some genes begin detected as outliers?"

                  Yes, it is because Cook's is recalculated, which is not a good thing. If you use DESeq() only, you should not encounter this problem.

                  Comment


                  • #10
                    Thanks a lot for your help. I just realized that there's new versin of the vignette out.


                    Originally posted by Michael Love View Post
                    hi,

                    "(1) If I run DESeq(dds) once, and then use the function replaceOutliersWithTrimmedMean(dds),
                    will it replace "all" the outliers, which are flagged by the cook's distance, with the trimmed mean over all the samples (adjusted by size factors)?"


                    DESeq() internally calls replaceOutliers() (which does what you quote above), so you shouldn't use replaceOutliers() after a DESeq() call. I no longer include replaceOutliers() in the demonstration code for this reason. And in the help page for replaceOutliers (in the most recent release) it says, "Note that this function is called within DESeq, so is not necessary to call on top of a DESeq call."

                    (2) "After using the replaceOutliersWithTrimmedMean(dds) function, I still detect genes with res$pvalue="NA" and res$basemean>0. I don't completely understand why this the case. Is it because after the replacement (with trimmed means) is done, the cook's distance is calculated once again, resulting in some genes begin detected as outliers?"

                    Yes, it is because Cook's is recalculated, which is not a good thing. If you use DESeq() only, you should not encounter this problem.

                    Comment


                    • #11
                      Hi Michael,
                      Thanks again for your insightful comments before. I have another question if you don't mind:
                      Suppose that I use the following function:
                      dds <- nbinomLRT(dds, full= ~ Batch + AFcomp, reduced= ~ Batch)
                      and in the results I get a subset of genes that are significant by the LRT; i.e., the full model does a better job of explaining the observed read counts.
                      My goal is actually to verify whether AFcomp has a significant effect on gene expression (read counts), and to this end, I choose the genes with, say, padj<0.05.
                      In DESeq2 terminology and "model-wise", is that equivalent to using
                      ddsW <- nbinomWaldTest(dds),
                      where design(dds)= ~ Batch + AFcomp, and then looking for the genes with, say, padj<0.05 in results(dds, name="AFcomp")? Or is this statistically wrong and the LRT is a better option?
                      I'd much appreciate your help,
                      Golsheed



                      Originally posted by Michael Love View Post
                      hi Golsheed,

                      The actual filtering is done by results(), and the cooksCutoff can be set by the user, so the 'dds' object cannot know in advance the Cook's-distance-filtered rows.

                      If an outlier is replaced by DESeq(), there is a column: mcols(dds)$replace which notes these.

                      I've considered adding a column 'outlier' to the results table, but this table already has so many columns that I hesitated to do this.

                      You can add this column manually with:

                      Code:
                      res$outlier = res$baseMean > 0 & is.na(res$pvalue)
                      similarly for independent filtering:

                      Code:
                      res$indepfilt = !is.na(res$pvalue) & is.na(res$padj)

                      Comment


                      • #12
                        The Wald test and the LRT on a factor with only two levels are getting after the same set of genes, but with different statistical distributions behind the two. Wikipedia has explanations for both.

                        Comment


                        • #13
                          Hi Michael,

                          Thanks for all this details in these thread, I have a simple first question :
                          Is it possible to simply remove those outliers from the analysis ? I have a 25 vs 25 case control study, I have no problem loosing a few subjects for some genes !

                          Comment


                          • #14
                            Hi Raphael,

                            The default behavior of DESeq() with many replicates is to replace with a value predicted by the null (See the section of the manuscript on outlier handling, and the vignette). We don't currently have functionality to ignore the observation for outlier samples.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Current Approaches to Protein Sequencing
                              by seqadmin


                              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                              04-04-2024, 04:25 PM
                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 04-11-2024, 12:08 PM
                            0 responses
                            25 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 10:19 PM
                            0 responses
                            28 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 09:21 AM
                            0 responses
                            24 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-04-2024, 09:00 AM
                            0 responses
                            52 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X