Hi,
I am working with an RNA-Seq dataset and used DESeq2 to perform differential gene expression analysis using a multi-factor design.
In my results I noticed that the log2FC values were different compared to if I would just take the log ratio of the condition means (as suggested in the following post):
I checked my design matrices and all the data and I just can't understand why this is different.
Below is the code I used:
library(DESeq2)
coldata <- read.table("Run1_run2_deseq2_design.txt", sep="\t", header=T, row.names=1)
coldata <- DataFrame(coldata)
countdata <- read.table("Run1_run2_raw_counts.counts", sep="\t", header=T, row.names=1)
countdata.run1.run2 <- cbind(countdata[,c(2:4,6:8,10:12,14:16)])
colnames(countdata.run1.run2) <- rownames(coldata)
ddsFromCounts <- DESeqDataSetFromMatrix(countData=countdata.run1.run2, colData=coldata, design=~Type+Condition+Type:Condition)
ddsFromCounts$Condition <- factor(ddsFromCounts$Condition, levels=c("U","E"))
dds.run1.run2 <- DESeq(ddsFromCounts)
> colData(dds.run1.run2)
DataFrame with 12 rows and 3 columns
Type Condition sizeFactor
<factor> <factor> <numeric>
U_U_1 U U 0.8218577
M_U_1 M U 1.1845552
M_E_1 M E 1.2826040
U_E_1 U E 1.1260139
E_U_1 E U 0.9145150
... ... ... ...
M_U_2 M U 1.4016941
M_E_2 M E 0.7907137
U_E_2 U E 1.4520507
E_U_2 E U 0.4845640
E_E_2 E E 0.9343742
These are the design groups that I can use (resultsNames):
"TypeE" "TypeM" "TypeU" "ConditionU" "ConditionE" "TypeE.ConditionU" "TypeM.ConditionU" "TypeU.ConditionU" "TypeE.ConditionE" "TypeM.ConditionE" "TypeU.ConditionE"
Here are the first six rows of my results for a particular contrast:
> head(results(dds.run1.run2, contrast=list("TypeM.ConditionU","TypeE.ConditionU")))
log2 fold change (MAP): TypeM.ConditionU vs TypeE.ConditionU
Wald test p-value: TypeM.ConditionU vs TypeE.ConditionU
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##### 6.83635425 -0.29831487 0.25660567 -1.1625420 0.245015380 NA
##### 0.77856172 0.02932029 0.12641235 0.2319417 0.816583317 NA
##### 0.10539002 -0.03229292 0.06085041 -0.5306935 0.595631173 NA
##### 2.52917312 0.08576176 0.19506918 0.4396479 0.660192139 NA
##### 320.61150606 0.30625330 0.11398239 2.6868476 0.007212986 0.1360898
##### 0.07568229 0.01855059 0.05945254 0.3120235 0.755022638 NA
Following is some diagnostic code to compare the "results" fold change with the log of condition means for the 5th feature from the results table (rowMeans = 320):
> nc.test <- round(counts(dds.run1.run2, normalized=TRUE)[5,])
> data.frame(type=colData(dds.run1.run2)$Type, condition=colData(dds.run1.run2)$Condition, normcounts=nc.test)
type condition normcounts
U_U_1 U U 301
M_U_1 M U 365
M_E_1 M E 297
U_E_1 U E 346
E_U_1 E U 272
E_E_1 E E 361
U_U_2 U U 337
M_U_2 M U 327
M_E_2 M E 287
U_E_2 U E 345
E_U_2 E U 262
E_E_2 E E 347
> log2(mean(nc.test[colData(dds.run1.run2)$Type == "M" & colData(dds.run1.run2)$Condition == "U"])/mean(nc.test[colData(dds.run1.run2)$Type == "E" & colData(dds.run1.run2)$Condition == "U"]))
[1] 0.3739323
As you'll notice that the log2FC value from results is 0.30625330 whereas it is 0.3739323 from the log of condition mean ratio. I am just not sure what might be affecting this.
I will appreciate any help on this.
Thank you!
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.6.2 RcppArmadillo_0.4.550.1.0 Rcpp_0.11.3 GenomicRanges_1.18.3 GenomeInfoDb_1.2.3 IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 Biobase_2.26.0 BiocParallel_1.0.0 brew_1.0-6 checkmate_1.5.1 cluster_1.15.3
[12] codetools_0.2-9 colorspace_1.2-4 DBI_0.3.1 digest_0.6.6 fail_1.2 foreach_1.4.2 foreign_0.8-61 Formula_1.1-2 genefilter_1.48.1 geneplotter_1.44.0 ggplot2_1.0.0
[23] grid_3.1.2 gtable_0.1.2 Hmisc_3.14-6 iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 locfit_1.5-9.1 MASS_7.3-35 munsell_0.4.2 nnet_7.3-8 plyr_1.8.1
[34] proto_0.3-10 RColorBrewer_1.1-2 reshape2_1.4.1 rpart_4.1-8 RSQLite_1.0.0 scales_0.2.4 sendmailR_1.2-1 splines_3.1.2 stringr_0.6.2 survival_2.37-7 tools_3.1.2
[45] XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0
I am working with an RNA-Seq dataset and used DESeq2 to perform differential gene expression analysis using a multi-factor design.
In my results I noticed that the log2FC values were different compared to if I would just take the log ratio of the condition means (as suggested in the following post):
I checked my design matrices and all the data and I just can't understand why this is different.
Below is the code I used:
library(DESeq2)
coldata <- read.table("Run1_run2_deseq2_design.txt", sep="\t", header=T, row.names=1)
coldata <- DataFrame(coldata)
countdata <- read.table("Run1_run2_raw_counts.counts", sep="\t", header=T, row.names=1)
countdata.run1.run2 <- cbind(countdata[,c(2:4,6:8,10:12,14:16)])
colnames(countdata.run1.run2) <- rownames(coldata)
ddsFromCounts <- DESeqDataSetFromMatrix(countData=countdata.run1.run2, colData=coldata, design=~Type+Condition+Type:Condition)
ddsFromCounts$Condition <- factor(ddsFromCounts$Condition, levels=c("U","E"))
dds.run1.run2 <- DESeq(ddsFromCounts)
> colData(dds.run1.run2)
DataFrame with 12 rows and 3 columns
Type Condition sizeFactor
<factor> <factor> <numeric>
U_U_1 U U 0.8218577
M_U_1 M U 1.1845552
M_E_1 M E 1.2826040
U_E_1 U E 1.1260139
E_U_1 E U 0.9145150
... ... ... ...
M_U_2 M U 1.4016941
M_E_2 M E 0.7907137
U_E_2 U E 1.4520507
E_U_2 E U 0.4845640
E_E_2 E E 0.9343742
These are the design groups that I can use (resultsNames):
"TypeE" "TypeM" "TypeU" "ConditionU" "ConditionE" "TypeE.ConditionU" "TypeM.ConditionU" "TypeU.ConditionU" "TypeE.ConditionE" "TypeM.ConditionE" "TypeU.ConditionE"
Here are the first six rows of my results for a particular contrast:
> head(results(dds.run1.run2, contrast=list("TypeM.ConditionU","TypeE.ConditionU")))
log2 fold change (MAP): TypeM.ConditionU vs TypeE.ConditionU
Wald test p-value: TypeM.ConditionU vs TypeE.ConditionU
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##### 6.83635425 -0.29831487 0.25660567 -1.1625420 0.245015380 NA
##### 0.77856172 0.02932029 0.12641235 0.2319417 0.816583317 NA
##### 0.10539002 -0.03229292 0.06085041 -0.5306935 0.595631173 NA
##### 2.52917312 0.08576176 0.19506918 0.4396479 0.660192139 NA
##### 320.61150606 0.30625330 0.11398239 2.6868476 0.007212986 0.1360898
##### 0.07568229 0.01855059 0.05945254 0.3120235 0.755022638 NA
Following is some diagnostic code to compare the "results" fold change with the log of condition means for the 5th feature from the results table (rowMeans = 320):
> nc.test <- round(counts(dds.run1.run2, normalized=TRUE)[5,])
> data.frame(type=colData(dds.run1.run2)$Type, condition=colData(dds.run1.run2)$Condition, normcounts=nc.test)
type condition normcounts
U_U_1 U U 301
M_U_1 M U 365
M_E_1 M E 297
U_E_1 U E 346
E_U_1 E U 272
E_E_1 E E 361
U_U_2 U U 337
M_U_2 M U 327
M_E_2 M E 287
U_E_2 U E 345
E_U_2 E U 262
E_E_2 E E 347
> log2(mean(nc.test[colData(dds.run1.run2)$Type == "M" & colData(dds.run1.run2)$Condition == "U"])/mean(nc.test[colData(dds.run1.run2)$Type == "E" & colData(dds.run1.run2)$Condition == "U"]))
[1] 0.3739323
As you'll notice that the log2FC value from results is 0.30625330 whereas it is 0.3739323 from the log of condition mean ratio. I am just not sure what might be affecting this.
I will appreciate any help on this.
Thank you!
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.6.2 RcppArmadillo_0.4.550.1.0 Rcpp_0.11.3 GenomicRanges_1.18.3 GenomeInfoDb_1.2.3 IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 Biobase_2.26.0 BiocParallel_1.0.0 brew_1.0-6 checkmate_1.5.1 cluster_1.15.3
[12] codetools_0.2-9 colorspace_1.2-4 DBI_0.3.1 digest_0.6.6 fail_1.2 foreach_1.4.2 foreign_0.8-61 Formula_1.1-2 genefilter_1.48.1 geneplotter_1.44.0 ggplot2_1.0.0
[23] grid_3.1.2 gtable_0.1.2 Hmisc_3.14-6 iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 locfit_1.5-9.1 MASS_7.3-35 munsell_0.4.2 nnet_7.3-8 plyr_1.8.1
[34] proto_0.3-10 RColorBrewer_1.1-2 reshape2_1.4.1 rpart_4.1-8 RSQLite_1.0.0 scales_0.2.4 sendmailR_1.2-1 splines_3.1.2 stringr_0.6.2 survival_2.37-7 tools_3.1.2
[45] XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0
Comment