SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
HTseq-count to DEseq2: Need a little help.. sindrle Bioinformatics 22 04-01-2016 02:48 AM
DESeq2 question in dealing with time course data. liuse Bioinformatics 12 03-19-2015 03:10 AM
six huge read count outliers on neighboring transcripts on one lane - theories? samhokin RNA Sequencing 1 11-12-2013 08:39 AM
interpretation of alignment results carolW Bioinformatics 9 04-30-2013 07:01 PM
RNA-seq results interpretation - help needed rebrendi Bioinformatics 11 09-02-2012 03:36 AM

Reply
 
Thread Tools
Old 08-04-2014, 06:40 AM   #1
Tatsiana_by
Junior Member
 
Location: Innsbruck

Join Date: Jul 2014
Posts: 3
Default DESeq2: dealing with count outliers and interpretation of the results

Sorry for repeating my message - just so that others can follow the discussion:

Hello,

I have a question regarding replaceOutliers function. More precisely, it looks like after using replaceOutliers, there are cases that non-outlier becomes an outlier.

I have 9 samples: 3 replicates for 3 conditions (6 days, 12 days and 28 days, as factors).

Here is what I do and what I get:
> des2.dataset <- DESeqDataSetFromMatrix(countData = countDF, colData = pheno[colnames(countDF),], design = ~ days)

countDF is just a data frame of raw counts. pheno is a data frame with information for each sample, including column "days".

> des2.mf <- DESeq(des2.dataset)

> des2.replace <- DESeq(replaceOutliers(des2.mf,minReplicates = 3))

Now, I have 160 genes, that by default detection method were not detected to be outliers. But after outlier replacement, they became outliers.

Here is an example of one such gene:

> assays(des2.mf)[["cooks"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186
0.1072452 6.6523800 2.3861998 0.1974414 0.1226845 3.0418715 0.2286905
RME187 RMN013
3.1039154 11.1255294

> assays(des2.replace)[["cooks"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185
0.07305022 7.57479726 2.35945901 0.12521243 0.00511494 3.05473959
RME186 RME187 RMN013
0.53951600 3.10378440 13.54871243


The cooks distance has changed, and now sample RMN013 has become an outlier... But the counts didn't:

> assays(des2.mf)[["counts"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
18078 28989 1894 28078 4692 27 2373 0 1954
> assays(des2.replace)[["counts"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
18078 28989 1894 28078 4692 27 2373 0 1954

So my question would be exactly that: Why did cooks distance change?
Another is merely concern over interpretation (given that this is intended behaviour): if I used replaceOutliers, its hard to justify having outliers in the data.... Furthermore, 5 out of these 160 cases were actually significant before outlier replacement. And I would definitely like to have these 5 in my hit-list.


--- reply to Michael:
I did see that in the vignette it is recommended to use replaceOultiers with 7 or more replicates, but for my particular data I see no much choice: with option cooksCutoff=FALSE, I get 2752 differentially expressed genes, including some like following:

> counts(des2.mf, normalized = TRUE)["73284",]
RME180 RME181 RME182 RMN013 RME183 RME187
0.000000 0.000000 0.000000 0.000000 3811.433548 2.043781
RME184 RME185 RME186
0.000000 0.000000 0.000000
> results(des2.mf, contrast = c("days","12","6"), cooksCutoff = FALSE)["73284",]
baseMean log2FoldChange lfcSE stat pvalue padj
73284 423.7197 9.376287 2.229498 4.205559 0.00002604376 0.001053495

Obviously, I would rather exclude this and similar cases. However, when I use default cooksCutoff, I get only 271 differentially expressed genes. Many of those do have outlier counts, but the general pattern in the data is clear:

> counts(des2.mf, normalized = TRUE)["71803",]
RME180 RME181 RME182 RMN013 RME183 RME187 RME184
0.0000 0.0000 0.0000 259.4740 5340.1329 54641.5051 552.1897
RME185 RME186
1047.5708 569.0070

That is why I opted out to using replaceOutliers, as the replacement procedure is conservative and should not lead to increase in false positives. With replaceOutliers I get 1512 differentially expressed genes - much more similar to expected number of differences.

But the question that I had remains-
How did it happen that the cooks distance changed while counts did not change?
how can explain the presence of outliers after doing outlier Replace procedure?
Tatsiana_by is offline   Reply With Quote
Old 08-04-2014, 07:30 AM   #2
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi,

The distances changed slightly because other rows had count outliers replaced. This can change all of the experiment-wide estimates like dispersion trend and the distribution of fold changes which are used for estimating the Cook's distance.

I still wouldn't recommend replacing counts with 3 replicates per condition, as there is too little information to know which count is the outlier. Instead, I'd recommend examining the distribution of mcols(dds)$maxCooks and finding a cutoff from visual inspection which makes sense for your experiment.

I should adjust the documentation to make clear that it's better to use the DESeq() argument for outlier replacement than replaceOutliers(). DESeq calls replaceOutliers() internally (depending on the argument minReplicatesForReplace), and keeps track of which rows had outlier replaced.
Michael Love is offline   Reply With Quote
Old 08-04-2014, 09:21 AM   #3
Tatsiana_by
Junior Member
 
Location: Innsbruck

Join Date: Jul 2014
Posts: 3
Default

Quote:
Originally Posted by Michael Love View Post
hi,

The distances changed slightly because other rows had count outliers replaced. This can change all of the experiment-wide estimates like dispersion trend and the distribution of fold changes which are used for estimating the Cook's distance.

I still wouldn't recommend replacing counts with 3 replicates per condition, as there is too little information to know which count is the outlier. Instead, I'd recommend examining the distribution of mcols(dds)$maxCooks and finding a cutoff from visual inspection which makes sense for your experiment.

I should adjust the documentation to make clear that it's better to use the DESeq() argument for outlier replacement than replaceOutliers(). DESeq calls replaceOutliers() internally (depending on the argument minReplicatesForReplace), and keeps track of which rows had outlier replaced.
Hi Michael,

Thank you for your quick replies. I like very much the idea of creating a cooksCutoff suitable for my particular experiment. I think it would be really great to have such guidelines in the user guide. Until then, if you don't mind, just to make sure I interpret it correctly:

So I looked at the distribution of log2(maxCooks), and I also plot a current cooksCutoff for my experiment:

Is it expected(good) that I have such a bi-modal distribution? Should I just then use a cut-off that splits two populations in this distribution?
Attached Images
File Type: png maxCooks.png (9.1 KB, 35 views)
Tatsiana_by is offline   Reply With Quote
Old 08-04-2014, 10:40 AM   #4
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

I wouldn't make a decision based on the bimodality, but instead by looking at the max Cook's distance for your example genes, and then having a sense how many genes have higher values than a given cutoff. Note that the example gene that you want to preserve has a really high count (54641) in the same group as two much smaller counts (259, 5340). So this sample will have a large Cook's distance. As you don't want to filter this gene, you will have to set the Cook's cutoff higher than this.
Michael Love is offline   Reply With Quote
Old 10-20-2014, 12:16 PM   #5
Golsheed
Member
 
Location: Montreal

Join Date: Oct 2014
Posts: 49
Default

Hi Michael,

Thanks for your helpful explanation. I have a somewhat similar question regarding outliers in DESeq2. I know that I can access the cook's distance matrix by
assay(dds)[["cook"]], but is there a way to know which genes where flagged as outliers according to cook's distance? As you mentioned in your reply, "DESeq calls replaceOutliers() internally and keeps track of which rows had outlier replaced", so I'm wondering whether I could access the row numbers somehow?

Thanks a lot in advance!
Golsheed





Quote:
Originally Posted by Michael Love View Post
hi,

The distances changed slightly because other rows had count outliers replaced. This can change all of the experiment-wide estimates like dispersion trend and the distribution of fold changes which are used for estimating the Cook's distance.

I still wouldn't recommend replacing counts with 3 replicates per condition, as there is too little information to know which count is the outlier. Instead, I'd recommend examining the distribution of mcols(dds)$maxCooks and finding a cutoff from visual inspection which makes sense for your experiment.

I should adjust the documentation to make clear that it's better to use the DESeq() argument for outlier replacement than replaceOutliers(). DESeq calls replaceOutliers() internally (depending on the argument minReplicatesForReplace), and keeps track of which rows had outlier replaced.
Golsheed is offline   Reply With Quote
Old 10-20-2014, 12:26 PM   #6
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi Golsheed,

The actual filtering is done by results(), and the cooksCutoff can be set by the user, so the 'dds' object cannot know in advance the Cook's-distance-filtered rows.

If an outlier is replaced by DESeq(), there is a column: mcols(dds)$replace which notes these.

I've considered adding a column 'outlier' to the results table, but this table already has so many columns that I hesitated to do this.

You can add this column manually with:

Code:
res$outlier = res$baseMean > 0 & is.na(res$pvalue)
similarly for independent filtering:

Code:
res$indepfilt = !is.na(res$pvalue) & is.na(res$padj)
Michael Love is offline   Reply With Quote
Old 10-20-2014, 12:48 PM   #7
Golsheed
Member
 
Location: Montreal

Join Date: Oct 2014
Posts: 49
Default

Thanks a lot. It's very helpful. Just to make sure I'm doing things right, would you mind answering two more questions please:

(1) If I run DESeq(dds) once, and then use the function replaceOutliersWithTrimmedMean(dds),
will it replace "all" the outliers, which are flagged by the cook's distance, with the trimmed mean over all the samples (adjusted by size factors)?

(2) After using the replaceOutliersWithTrimmedMean(dds) function, I still detect genes with res$pvalue="NA" and res$basemean>0. I don't completely understand why this the case. Is it because after the replacement (with trimmed means) is done, the cook's distance is calculated once again, resulting in some genes begin detected as outliers?

Thanks,
Golsheed




Quote:
Originally Posted by Michael Love View Post
hi Golsheed,

The actual filtering is done by results(), and the cooksCutoff can be set by the user, so the 'dds' object cannot know in advance the Cook's-distance-filtered rows.

If an outlier is replaced by DESeq(), there is a column: mcols(dds)$replace which notes these.

I've considered adding a column 'outlier' to the results table, but this table already has so many columns that I hesitated to do this.

You can add this column manually with:

Code:
res$outlier = res$baseMean > 0 & is.na(res$pvalue)
similarly for independent filtering:

Code:
res$indepfilt = !is.na(res$pvalue) & is.na(res$padj)
Golsheed is offline   Reply With Quote
Old 10-20-2014, 12:59 PM   #8
Golsheed
Member
 
Location: Montreal

Join Date: Oct 2014
Posts: 49
Default

sorry for so many replies! I forgot to mention I have a very large sample size (and hence, many degrees of freedom), so that's why I'm using the replaceOutliersWithTrimmedMean
function, as stated in the vignette.

Golsheed


Quote:
Originally Posted by Michael Love View Post
hi Golsheed,

The actual filtering is done by results(), and the cooksCutoff can be set by the user, so the 'dds' object cannot know in advance the Cook's-distance-filtered rows.

If an outlier is replaced by DESeq(), there is a column: mcols(dds)$replace which notes these.

I've considered adding a column 'outlier' to the results table, but this table already has so many columns that I hesitated to do this.

You can add this column manually with:

Code:
res$outlier = res$baseMean > 0 & is.na(res$pvalue)
similarly for independent filtering:

Code:
res$indepfilt = !is.na(res$pvalue) & is.na(res$padj)
Golsheed is offline   Reply With Quote
Old 10-20-2014, 01:05 PM   #9
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi,

"(1) If I run DESeq(dds) once, and then use the function replaceOutliersWithTrimmedMean(dds),
will it replace "all" the outliers, which are flagged by the cook's distance, with the trimmed mean over all the samples (adjusted by size factors)?"


DESeq() internally calls replaceOutliers() (which does what you quote above), so you shouldn't use replaceOutliers() after a DESeq() call. I no longer include replaceOutliers() in the demonstration code for this reason. And in the help page for replaceOutliers (in the most recent release) it says, "Note that this function is called within DESeq, so is not necessary to call on top of a DESeq call."

(2) "After using the replaceOutliersWithTrimmedMean(dds) function, I still detect genes with res$pvalue="NA" and res$basemean>0. I don't completely understand why this the case. Is it because after the replacement (with trimmed means) is done, the cook's distance is calculated once again, resulting in some genes begin detected as outliers?"

Yes, it is because Cook's is recalculated, which is not a good thing. If you use DESeq() only, you should not encounter this problem.
Michael Love is offline   Reply With Quote
Old 10-20-2014, 01:55 PM   #10
Golsheed
Member
 
Location: Montreal

Join Date: Oct 2014
Posts: 49
Default

Thanks a lot for your help. I just realized that there's new versin of the vignette out.


Quote:
Originally Posted by Michael Love View Post
hi,

"(1) If I run DESeq(dds) once, and then use the function replaceOutliersWithTrimmedMean(dds),
will it replace "all" the outliers, which are flagged by the cook's distance, with the trimmed mean over all the samples (adjusted by size factors)?"


DESeq() internally calls replaceOutliers() (which does what you quote above), so you shouldn't use replaceOutliers() after a DESeq() call. I no longer include replaceOutliers() in the demonstration code for this reason. And in the help page for replaceOutliers (in the most recent release) it says, "Note that this function is called within DESeq, so is not necessary to call on top of a DESeq call."

(2) "After using the replaceOutliersWithTrimmedMean(dds) function, I still detect genes with res$pvalue="NA" and res$basemean>0. I don't completely understand why this the case. Is it because after the replacement (with trimmed means) is done, the cook's distance is calculated once again, resulting in some genes begin detected as outliers?"

Yes, it is because Cook's is recalculated, which is not a good thing. If you use DESeq() only, you should not encounter this problem.
Golsheed is offline   Reply With Quote
Old 10-21-2014, 11:08 AM   #11
Golsheed
Member
 
Location: Montreal

Join Date: Oct 2014
Posts: 49
Default

Hi Michael,
Thanks again for your insightful comments before. I have another question if you don't mind:
Suppose that I use the following function:
dds <- nbinomLRT(dds, full= ~ Batch + AFcomp, reduced= ~ Batch)
and in the results I get a subset of genes that are significant by the LRT; i.e., the full model does a better job of explaining the observed read counts.
My goal is actually to verify whether AFcomp has a significant effect on gene expression (read counts), and to this end, I choose the genes with, say, padj<0.05.
In DESeq2 terminology and "model-wise", is that equivalent to using
ddsW <- nbinomWaldTest(dds),
where design(dds)= ~ Batch + AFcomp, and then looking for the genes with, say, padj<0.05 in results(dds, name="AFcomp")? Or is this statistically wrong and the LRT is a better option?
I'd much appreciate your help,
Golsheed



Quote:
Originally Posted by Michael Love View Post
hi Golsheed,

The actual filtering is done by results(), and the cooksCutoff can be set by the user, so the 'dds' object cannot know in advance the Cook's-distance-filtered rows.

If an outlier is replaced by DESeq(), there is a column: mcols(dds)$replace which notes these.

I've considered adding a column 'outlier' to the results table, but this table already has so many columns that I hesitated to do this.

You can add this column manually with:

Code:
res$outlier = res$baseMean > 0 & is.na(res$pvalue)
similarly for independent filtering:

Code:
res$indepfilt = !is.na(res$pvalue) & is.na(res$padj)
Golsheed is offline   Reply With Quote
Old 10-21-2014, 11:21 AM   #12
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

The Wald test and the LRT on a factor with only two levels are getting after the same set of genes, but with different statistical distributions behind the two. Wikipedia has explanations for both.
Michael Love is offline   Reply With Quote
Old 11-10-2014, 04:22 PM   #13
raphael123
Member
 
Location: Mc Gill -- Montreal

Join Date: Dec 2013
Posts: 37
Default

Hi Michael,

Thanks for all this details in these thread, I have a simple first question :
Is it possible to simply remove those outliers from the analysis ? I have a 25 vs 25 case control study, I have no problem loosing a few subjects for some genes !
raphael123 is offline   Reply With Quote
Old 11-10-2014, 04:33 PM   #14
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Hi Raphael,

The default behavior of DESeq() with many replicates is to replace with a value predicted by the null (See the section of the manuscript on outlier handling, and the vignette). We don't currently have functionality to ignore the observation for outlier samples.
Michael Love is offline   Reply With Quote
Reply

Tags
deseq2, outliers

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:43 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO