**Michael Love** · 08-04-2014, 07:30 AM

hi,

The distances changed slightly because other rows had count outliers replaced. This can change all of the experiment-wide estimates like dispersion trend and the distribution of fold changes which are used for estimating the Cook's distance.

I still wouldn't recommend replacing counts with 3 replicates per condition, as there is too little information to know which count is the outlier. Instead, I'd recommend examining the distribution of mcols(dds)$maxCooks and finding a cutoff from visual inspection which makes sense for your experiment.

I should adjust the documentation to make clear that it's better to use the DESeq() argument for outlier replacement than to call replaceOutliers() directly. DESeq() calls replaceOutliers() internally (depending on the argument minReplicatesForReplace), and keeps track of which rows had outliers replaced.
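A minimal sketch of the inspection described above, assuming simulated data from makeExampleDESeqDataSet() in place of a real experiment. The variable names and the factor of 2 used for the custom cutoff are illustrative assumptions, not recommendations. The default cutoff in results() is documented as the 0.99 quantile of the F(p, m - p) distribution, where p is the number of model parameters and m the number of samples.

```r
library(DESeq2)

# Simulated stand-in for a real data set: 3 replicates per condition
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Replacement needs 7 or more replicates by default, so it is already off here;
# minReplicatesForReplace = Inf simply makes that explicit
dds <- DESeq(dds, minReplicatesForReplace = Inf)

# Per-gene maximum Cook's distance recorded by DESeq()
maxCooks <- mcols(dds)$maxCooks

# Default cutoff used by results(): 0.99 quantile of F(p, m - p)
m <- ncol(dds)
p <- ncol(model.matrix(design(dds), as.data.frame(colData(dds))))
defaultCutoff <- qf(0.99, p, m - p)

# Visual inspection on the log2 scale, with the default cutoff marked
hist(log2(maxCooks), breaks = 50, main = "log2(maxCooks)", xlab = "log2(maxCooks)")
abline(v = log2(defaultCutoff), lty = 2)

# Number of genes above the default cutoff
sum(maxCooks > defaultCutoff, na.rm = TRUE)

# Apply a custom cutoff when building the results table
# (the factor of 2 is arbitrary and only for illustration)
myCutoff <- 2 * defaultCutoff
res <- results(dds, cooksCutoff = myCutoff)
```

Setting cooksCutoff = FALSE in results() turns the Cook's distance filter off entirely.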

**Tatsiana_by** · 08-04-2014, 09:21 AM

Hi Michael,

Thank you for your quick replies. I like very much the idea of creating a cooksCutoff suitable for my particular experiment. I think it would be really great to have such guidelines in the user guide. Until then, if you don't mind, just to make sure I interpret it correctly:

So I looked at the distribution of log2(maxCooks), and I also plotted the current cooksCutoff for my experiment:

Is it expected (good) that I have such a bimodal distribution? Should I then just use a cutoff that splits the two populations in this distribution?

Attached file: maxCooks.png

**Michael Love** · 08-04-2014, 10:40 AM

I wouldn't make a decision based on the bimodality. Instead, look at the max Cook's distance for your example genes, and then get a sense of how many genes have higher values than a given cutoff. Note that the example gene that you want to preserve has a really high count (54641) in the same group as two much smaller counts (259, 5340), so this sample will have a large Cook's distance. As you don't want to filter this gene, you will have to set the Cook's cutoff higher than this.
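One way to get that sense is to tabulate how many genes exceed each candidate cutoff and whether a gene of interest survives. The sketch below uses a made-up maxCooks vector and a made-up gene index purely for illustration; with real data, maxCooks would come from mcols(dds)$maxCooks.

```r
# Hypothetical stand-in for mcols(dds)$maxCooks (simulated, not real data)
set.seed(42)
maxCooks <- rlnorm(2000, meanlog = -1, sdlog = 1.5)

# Hypothetical gene we want to keep, with an assumed max Cook's distance of 50
geneOfInterest <- 10
maxCooks[geneOfInterest] <- 50

# Candidate cutoffs to compare (arbitrary values)
candidates <- c(5, 10, 20, 50, 100)

# Genes with maxCooks above the cutoff are the ones that would be flagged
nAbove <- sapply(candidates, function(k) sum(maxCooks > k, na.rm = TRUE))

summaryTable <- data.frame(
  cutoff       = candidates,
  genesFlagged = nAbove,
  keepsGene    = maxCooks[geneOfInterest] <= candidates
)
print(summaryTable)
```

The chosen value can then be passed to results() through its cooksCutoff argument.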

**Golsheed** · 10-20-2014, 12:16 PM

Hi Michael,

Thanks for your helpful explanation. I have a somewhat similar question regarding outliers in DESeq2. I know that I can access the Cook's distance matrix with assays(dds)[["cooks"]], but is there a way to know which genes were flagged as outliers according to Cook's distance? As you mentioned in your reply, "DESeq calls replaceOutliers() internally and keeps track of which rows had outlier replaced", so I'm wondering whether I could access the row numbers somehow?

Thanks a lot in advance!
Golsheed

**Michael Love** · 10-20-2014, 12:26 PM

hi Golsheed,

The actual filtering is done by results(), and the cooksCutoff can be set by the user, so the 'dds' object cannot know in advance the Cook's-distance-filtered rows.

If an outlier is replaced by DESeq(), there is a column: mcols(dds)$replace which notes these.

I've considered adding a column 'outlier' to the results table, but this table already has so many columns that I hesitated to do this.

You can add this column manually with:

Code:

res$outlier = res$baseMean > 0 & is.na(res$pvalue)

similarly for independent filtering:

Code:

res$indepfilt = !is.na(res$pvalue) & is.na(res$padj)
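To see how the two expressions separate the NA patterns, here is a small sketch on a mock table. The data frame, its row names, and its values are invented for illustration; a real 'res' would come from results(dds).

```r
# Mock results table with the three NA patterns (values are made up)
res <- data.frame(
  row.names = c("allZero", "cooksOutlier", "lowCount", "tested"),
  baseMean  = c(0,  850.2, 1.3,  420.7),
  pvalue    = c(NA, NA,    0.21, 0.003),
  padj      = c(NA, NA,    NA,   0.02)
)

# Genes with counts but no p-value: filtered by Cook's distance
res$outlier <- res$baseMean > 0 & is.na(res$pvalue)

# Genes with a p-value but no adjusted p-value: removed by independent filtering
res$indepfilt <- !is.na(res$pvalue) & is.na(res$padj)

print(res)
```

The all-zero row is flagged by neither column, because its baseMean is 0 and it has no p-value to begin with.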

**Golsheed** · 10-20-2014, 12:48 PM

Thanks a lot. It's very helpful. Just to make sure I'm doing things right, would you mind answering two more questions please:

(1) If I run DESeq(dds) once, and then use the function replaceOutliersWithTrimmedMean(dds), will it replace "all" the outliers, which are flagged by the Cook's distance, with the trimmed mean over all the samples (adjusted by size factors)?

(2) After using the replaceOutliersWithTrimmedMean(dds) function, I still detect genes with res$pvalue = NA and res$baseMean > 0. I don't completely understand why this is the case. Is it because after the replacement (with trimmed means) is done, the Cook's distance is calculated once again, resulting in some genes being detected as outliers?

Thanks,
Golsheed

**Golsheed** · 10-20-2014, 12:59 PM

Sorry for so many replies! I forgot to mention that I have a very large sample size (and hence many degrees of freedom), which is why I'm using the replaceOutliersWithTrimmedMean function, as stated in the vignette.

Golsheed

**Michael Love** · 10-20-2014, 01:05 PM

hi,

"(1) If I run DESeq(dds) once, and then use the function replaceOutliersWithTrimmedMean(dds),
will it replace "all" the outliers, which are flagged by the cook's distance, with the trimmed mean over all the samples (adjusted by size factors)?"

DESeq() internally calls replaceOutliers() (which does what you quote above), so you shouldn't use replaceOutliers() after a DESeq() call. I no longer include replaceOutliers() in the demonstration code for this reason. And in the help page for replaceOutliers (in the most recent release) it says, "Note that this function is called within DESeq, so is not necessary to call on top of a DESeq call."

(2) "After using the replaceOutliersWithTrimmedMean(dds) function, I still detect genes with res$pvalue = NA and res$baseMean > 0. I don't completely understand why this is the case. Is it because after the replacement (with trimmed means) is done, the Cook's distance is calculated once again, resulting in some genes being detected as outliers?"

Yes, it is because Cook's is recalculated, which is not a good thing. If you use DESeq() only, you should not encounter this problem.
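A sketch of the DESeq()-only workflow, assuming simulated data with 7 replicates per condition and one artificially inflated count. The planted value and the variable names are illustrative assumptions, and whether that particular count is actually flagged and replaced depends on the fit, so the accessors are guarded.

```r
library(DESeq2)

# Simulated stand-in: 7 replicates per condition, enough for replacement by default
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 14)

# Plant one extreme count before fitting (illustrative only)
counts(dds)[1, 1] <- 100000L

# A single DESeq() call handles replacement internally; no separate
# replaceOutliers() call afterwards
dds <- DESeq(dds)

# Rows in which at least one count was replaced
replaced <- mcols(dds)$replace
if (!is.null(replaced)) {
  print(table(replaced, useNA = "ifany"))
  print(head(which(replaced)))
}

# counts() still returns the original counts; the counts used for the refit,
# if stored, are kept in a separate assay
if ("replaceCounts" %in% assayNames(dds)) {
  print(counts(dds)[1, 1:3])
  print(assays(dds)[["replaceCounts"]][1, 1:3])
}

res <- results(dds)
```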

**Golsheed** · 10-20-2014, 01:55 PM

Thanks a lot for your help. I just realized that there's a new version of the vignette out.

**Golsheed** · 10-21-2014, 11:08 AM

Hi Michael,
Thanks again for your insightful comments before. I have another question if you don't mind:
Suppose that I use the following function:
dds <- nbinomLRT(dds, full= ~ Batch + AFcomp, reduced= ~ Batch)
and in the results I get a subset of genes that are significant by the LRT; i.e., the full model does a better job of explaining the observed read counts.
My goal is actually to verify whether AFcomp has a significant effect on gene expression (read counts), and to this end, I choose the genes with, say, padj<0.05.
In DESeq2 terminology and "model-wise", is that equivalent to using
ddsW <- nbinomWaldTest(dds),
where design(dds)= ~ Batch + AFcomp, and then looking for the genes with, say, padj<0.05 in results(dds, name="AFcomp")? Or is this statistically wrong and the LRT is a better option?
I'd much appreciate your help,
Golsheed

**Michael Love** · 10-21-2014, 11:21 AM

The Wald test and the LRT on a factor with only two levels are getting after the same set of genes, but with different statistical distributions behind the two. Wikipedia has explanations for both.
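A rough way to check this on a given data set is to run both tests and compare the calls. The sketch below assumes simulated data, a made-up two-level Batch factor, and the simulated 'condition' factor standing in for AFcomp; none of these come from the thread.

```r
library(DESeq2)

# Simulated stand-in with a two-level condition (levels A and B)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)

# Hypothetical batch factor, balanced within each condition
dds$Batch <- factor(rep(c("a", "b"), times = 6))
design(dds) <- ~ Batch + condition

# Likelihood ratio test: full model versus the model without condition
ddsLRT <- DESeq(dds, test = "LRT", reduced = ~ Batch)
resLRT <- results(ddsLRT)

# Wald test on the condition coefficient in the same full model
ddsWald <- DESeq(dds)
resWald <- results(ddsWald, contrast = c("condition", "B", "A"))

# Cross-tabulate the genes called at padj < 0.05 by each test
sigLRT  <- !is.na(resLRT$padj)  & resLRT$padj  < 0.05
sigWald <- !is.na(resWald$padj) & resWald$padj < 0.05
print(table(LRT = sigLRT, Wald = sigWald))

# Rank agreement of the raw p-values
print(cor(resLRT$pvalue, resWald$pvalue, use = "complete.obs", method = "spearman"))
```

For a factor with more than two levels the comparison no longer applies directly, since the LRT then tests all levels at once while each Wald test covers a single coefficient.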

**raphael123** · 11-10-2014, 05:22 PM

Hi Michael,

Thanks for all the details in this thread. I have a simple first question:
Is it possible to simply remove those outliers from the analysis? I have a 25 vs 25 case-control study, so I have no problem losing a few subjects for some genes!

**Michael Love** · 11-10-2014, 05:33 PM

Hi Raphael,

The default behavior of DESeq() with many replicates is to replace with a value predicted by the null (See the section of the manuscript on outlier handling, and the vignette). We don't currently have functionality to ignore the observation for outlier samples.

DESeq2: dealing with count outliers and interpretation of the results
