Hello,
I have an RNAseq experiment from two strains of mice, that have been exposed to two different environments (three biological replicates of each combination). So my experimental setup is
> colData
sample genetics environment
black2 B6 black
black3 B6 black
black5 B6 black
black1 B6 yellow
black4 B6 yellow
black6 B6 yellow
agouti1 129 yellow
agouti4 129 yellow
agouti6 129 yellow
agouti2 129 black
agouti3 129 black
agouti5 129 black
I want to test which genes change their expression when the environment changes, controlling for the genetic background. I am using DESeq2 for this, with the formula ~ genetics + environment + genetics:environment.
However, the choice of the base level for the factors in my experimental design changes the genes that are significant. If I leave the default reference levels (129 for genetics and black for env) I get more DE genes that if I set as reference B6 for genetics and yellow for environemnt.
I thought the reference level is used to decide how to calculate the fold change and is necessary to correctly interpret the coefficients, but shouldn't the results be the same?
Any thoughts on what am I missing?
For cases where you have a control and mutant, it makes sense to use the control as the base level and compare everything to that, but in this case, I don't have any valid reason to chose one strain over the other. My results are different depending on this, so what is the correct form of doing this sort of analysis?
Thank you very much in advance.
My code is below.
## 1 ##
> levels(colData$genetics)
[1] "129" "B6"
> levels(colData$environment)
[1] "black" "yellow"
> dds <- DESeqDataSetFromMatrix(countData = data[,8:19], colData = colData, design = ~ genetics + environment + genetics:environment)
> dds <- DESeq(dds)
> resG <- results(dds, alpha=0.05, name="genetics_B6_vs_129")
> resE <- results(dds, alpha=0.05, name="environment_yellow_vs_black")
> resGE <- results(dds, alpha=0.05, name="geneticsB6.environmentyellow")
> dim(subset(resG, resG$padj < 0.05))
[1] 3675 6
> dim(subset(resE, resE$padj < 0.05))
[1] 20 6
> dim(subset(resGE, resGE$padj < 0.05))
[1] 24 6
## 2 ##
> colData2 <- colData
> colData2$genetics <- relevel(colData2$genetics, ref="B6")
> colData2$environment <- relevel(colData2$environment, ref="yellow")
> levels(colData2$genetics)
[1] "B6" "129"
> levels(colData2$environment)
[1] "yellow" "black"
> dds2 <- DESeqDataSetFromMatrix(countData = data[,8:19], colData = colData2, design = ~ genetics + environment + genetics:environment)
> dds2 <- DESeq(dds2)
> res2G <- results(dds2, alpha=0.05, name="genetics_129_vs_B6")
> res2E <- results(dds2, alpha=0.05, name="environment_black_vs_yellow")
> res2GE <- results(dds2, alpha=0.05, name="genetics129.environmentblack")
> dim(subset(res2G, resG$padj < 0.05))
[1] 3127 6
> dim(subset(res2E, resE$padj < 0.05))
[1] 2 6
> dim(subset(res2GE, resGE$padj < 0.05))
[1] 0 6
##
> g1 <- subset(resG, resG$padj < 0.05)
> g2 <- subset(res2G, res2G$padj < 0.05)
> length(intersect(rownames(g1), rownames(g2)))
[1] 2640
-- output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.2.10 RcppArmadillo_0.4.320.0 Rcpp_0.11.2
[4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.7
[7] BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] annotate_1.40.1 AnnotationDbi_1.24.0 Biobase_2.22.0 DBI_0.2-7
[5] genefilter_1.44.0 grid_3.0.2 lattice_0.20-23 locfit_1.5-9.1
[9] RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2
[13] survival_2.37-7 tools_3.0.2 XML_3.95-0.2 xtable_1.7-3
I have an RNAseq experiment from two strains of mice, that have been exposed to two different environments (three biological replicates of each combination). So my experimental setup is
> colData
sample genetics environment
black2 B6 black
black3 B6 black
black5 B6 black
black1 B6 yellow
black4 B6 yellow
black6 B6 yellow
agouti1 129 yellow
agouti4 129 yellow
agouti6 129 yellow
agouti2 129 black
agouti3 129 black
agouti5 129 black
I want to test which genes change their expression when the environment changes, controlling for the genetic background. I am using DESeq2 for this, with the formula ~ genetics + environment + genetics:environment.
However, the choice of the base level for the factors in my experimental design changes the genes that are significant. If I leave the default reference levels (129 for genetics and black for env) I get more DE genes that if I set as reference B6 for genetics and yellow for environemnt.
I thought the reference level is used to decide how to calculate the fold change and is necessary to correctly interpret the coefficients, but shouldn't the results be the same?
Any thoughts on what am I missing?
For cases where you have a control and mutant, it makes sense to use the control as the base level and compare everything to that, but in this case, I don't have any valid reason to chose one strain over the other. My results are different depending on this, so what is the correct form of doing this sort of analysis?
Thank you very much in advance.
My code is below.
## 1 ##
> levels(colData$genetics)
[1] "129" "B6"
> levels(colData$environment)
[1] "black" "yellow"
> dds <- DESeqDataSetFromMatrix(countData = data[,8:19], colData = colData, design = ~ genetics + environment + genetics:environment)
> dds <- DESeq(dds)
> resG <- results(dds, alpha=0.05, name="genetics_B6_vs_129")
> resE <- results(dds, alpha=0.05, name="environment_yellow_vs_black")
> resGE <- results(dds, alpha=0.05, name="geneticsB6.environmentyellow")
> dim(subset(resG, resG$padj < 0.05))
[1] 3675 6
> dim(subset(resE, resE$padj < 0.05))
[1] 20 6
> dim(subset(resGE, resGE$padj < 0.05))
[1] 24 6
## 2 ##
> colData2 <- colData
> colData2$genetics <- relevel(colData2$genetics, ref="B6")
> colData2$environment <- relevel(colData2$environment, ref="yellow")
> levels(colData2$genetics)
[1] "B6" "129"
> levels(colData2$environment)
[1] "yellow" "black"
> dds2 <- DESeqDataSetFromMatrix(countData = data[,8:19], colData = colData2, design = ~ genetics + environment + genetics:environment)
> dds2 <- DESeq(dds2)
> res2G <- results(dds2, alpha=0.05, name="genetics_129_vs_B6")
> res2E <- results(dds2, alpha=0.05, name="environment_black_vs_yellow")
> res2GE <- results(dds2, alpha=0.05, name="genetics129.environmentblack")
> dim(subset(res2G, resG$padj < 0.05))
[1] 3127 6
> dim(subset(res2E, resE$padj < 0.05))
[1] 2 6
> dim(subset(res2GE, resGE$padj < 0.05))
[1] 0 6
##
> g1 <- subset(resG, resG$padj < 0.05)
> g2 <- subset(res2G, res2G$padj < 0.05)
> length(intersect(rownames(g1), rownames(g2)))
[1] 2640
-- output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.2.10 RcppArmadillo_0.4.320.0 Rcpp_0.11.2
[4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.7
[7] BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] annotate_1.40.1 AnnotationDbi_1.24.0 Biobase_2.22.0 DBI_0.2-7
[5] genefilter_1.44.0 grid_3.0.2 lattice_0.20-23 locfit_1.5-9.1
[9] RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2
[13] survival_2.37-7 tools_3.0.2 XML_3.95-0.2 xtable_1.7-3
Comment