Sorry for repeating my message - just so that others can follow the discussion:
Hello,
I have a question regarding replaceOutliers function. More precisely, it looks like after using replaceOutliers, there are cases that non-outlier becomes an outlier.
I have 9 samples: 3 replicates for 3 conditions (6 days, 12 days and 28 days, as factors).
Here is what I do and what I get:
> des2.dataset <- DESeqDataSetFromMatrix(countData = countDF, colData = pheno[colnames(countDF),], design = ~ days)
countDF is just a data frame of raw counts. pheno is a data frame with information for each sample, including column "days".
> des2.mf <- DESeq(des2.dataset)
> des2.replace <- DESeq(replaceOutliers(des2.mf,minReplicates = 3))
Now, I have 160 genes, that by default detection method were not detected to be outliers. But after outlier replacement, they became outliers.
Here is an example of one such gene:
> assays(des2.mf)[["cooks"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186
0.1072452 6.6523800 2.3861998 0.1974414 0.1226845 3.0418715 0.2286905
RME187 RMN013
3.1039154 11.1255294
> assays(des2.replace)[["cooks"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185
0.07305022 7.57479726 2.35945901 0.12521243 0.00511494 3.05473959
RME186 RME187 RMN013
0.53951600 3.10378440 13.54871243
The cooks distance has changed, and now sample RMN013 has become an outlier... But the counts didn't:
> assays(des2.mf)[["counts"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
18078 28989 1894 28078 4692 27 2373 0 1954
> assays(des2.replace)[["counts"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
18078 28989 1894 28078 4692 27 2373 0 1954
So my question would be exactly that: Why did cooks distance change?
Another is merely concern over interpretation (given that this is intended behaviour): if I used replaceOutliers, its hard to justify having outliers in the data.... Furthermore, 5 out of these 160 cases were actually significant before outlier replacement. And I would definitely like to have these 5 in my hit-list.
--- reply to Michael:
I did see that in the vignette it is recommended to use replaceOultiers with 7 or more replicates, but for my particular data I see no much choice: with option cooksCutoff=FALSE, I get 2752 differentially expressed genes, including some like following:
> counts(des2.mf, normalized = TRUE)["73284",]
RME180 RME181 RME182 RMN013 RME183 RME187
0.000000 0.000000 0.000000 0.000000 3811.433548 2.043781
RME184 RME185 RME186
0.000000 0.000000 0.000000
> results(des2.mf, contrast = c("days","12","6"), cooksCutoff = FALSE)["73284",]
baseMean log2FoldChange lfcSE stat pvalue padj
73284 423.7197 9.376287 2.229498 4.205559 0.00002604376 0.001053495
Obviously, I would rather exclude this and similar cases. However, when I use default cooksCutoff, I get only 271 differentially expressed genes. Many of those do have outlier counts, but the general pattern in the data is clear:
> counts(des2.mf, normalized = TRUE)["71803",]
RME180 RME181 RME182 RMN013 RME183 RME187 RME184
0.0000 0.0000 0.0000 259.4740 5340.1329 54641.5051 552.1897
RME185 RME186
1047.5708 569.0070
That is why I opted out to using replaceOutliers, as the replacement procedure is conservative and should not lead to increase in false positives. With replaceOutliers I get 1512 differentially expressed genes - much more similar to expected number of differences.
But the question that I had remains-
How did it happen that the cooks distance changed while counts did not change?
how can explain the presence of outliers after doing outlier Replace procedure?
Hello,
I have a question regarding replaceOutliers function. More precisely, it looks like after using replaceOutliers, there are cases that non-outlier becomes an outlier.
I have 9 samples: 3 replicates for 3 conditions (6 days, 12 days and 28 days, as factors).
Here is what I do and what I get:
> des2.dataset <- DESeqDataSetFromMatrix(countData = countDF, colData = pheno[colnames(countDF),], design = ~ days)
countDF is just a data frame of raw counts. pheno is a data frame with information for each sample, including column "days".
> des2.mf <- DESeq(des2.dataset)
> des2.replace <- DESeq(replaceOutliers(des2.mf,minReplicates = 3))
Now, I have 160 genes, that by default detection method were not detected to be outliers. But after outlier replacement, they became outliers.
Here is an example of one such gene:
> assays(des2.mf)[["cooks"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186
0.1072452 6.6523800 2.3861998 0.1974414 0.1226845 3.0418715 0.2286905
RME187 RMN013
3.1039154 11.1255294
> assays(des2.replace)[["cooks"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185
0.07305022 7.57479726 2.35945901 0.12521243 0.00511494 3.05473959
RME186 RME187 RMN013
0.53951600 3.10378440 13.54871243
The cooks distance has changed, and now sample RMN013 has become an outlier... But the counts didn't:
> assays(des2.mf)[["counts"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
18078 28989 1894 28078 4692 27 2373 0 1954
> assays(des2.replace)[["counts"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
18078 28989 1894 28078 4692 27 2373 0 1954
So my question would be exactly that: Why did cooks distance change?
Another is merely concern over interpretation (given that this is intended behaviour): if I used replaceOutliers, its hard to justify having outliers in the data.... Furthermore, 5 out of these 160 cases were actually significant before outlier replacement. And I would definitely like to have these 5 in my hit-list.
--- reply to Michael:
I did see that in the vignette it is recommended to use replaceOultiers with 7 or more replicates, but for my particular data I see no much choice: with option cooksCutoff=FALSE, I get 2752 differentially expressed genes, including some like following:
> counts(des2.mf, normalized = TRUE)["73284",]
RME180 RME181 RME182 RMN013 RME183 RME187
0.000000 0.000000 0.000000 0.000000 3811.433548 2.043781
RME184 RME185 RME186
0.000000 0.000000 0.000000
> results(des2.mf, contrast = c("days","12","6"), cooksCutoff = FALSE)["73284",]
baseMean log2FoldChange lfcSE stat pvalue padj
73284 423.7197 9.376287 2.229498 4.205559 0.00002604376 0.001053495
Obviously, I would rather exclude this and similar cases. However, when I use default cooksCutoff, I get only 271 differentially expressed genes. Many of those do have outlier counts, but the general pattern in the data is clear:
> counts(des2.mf, normalized = TRUE)["71803",]
RME180 RME181 RME182 RMN013 RME183 RME187 RME184
0.0000 0.0000 0.0000 259.4740 5340.1329 54641.5051 552.1897
RME185 RME186
1047.5708 569.0070
That is why I opted out to using replaceOutliers, as the replacement procedure is conservative and should not lead to increase in false positives. With replaceOutliers I get 1512 differentially expressed genes - much more similar to expected number of differences.
But the question that I had remains-
How did it happen that the cooks distance changed while counts did not change?
how can explain the presence of outliers after doing outlier Replace procedure?
Comment