Hello,
I make a differential expression analysis between treated samples (factors) and untreated (controls) at different states (vegetative, early, inter, late).
The problem is that I lack replicates.
-For the vegetative state I have no replicates at all.
Contrast*:
EZL1.veg (factor) vs ICL7.veg (control)
-For the early state I have partially replicates (only 2 for the control)
Contrast*:
EZL1.T0 (factor) vs ICL7.T0 & ICL7.T5 (control)
-For the inter state I have partially replicates (only 2 for the factor)
Contrast*:
EZL1.T10 & EZL1.T20 vs ICL7.T20
-For the late state I have partially replicates (only 2 for the factor)
Contrast*:
EZL1.T35 & EZL1.T50 vs ICL7.T35
My questions are :
1) How Deseq calculates the dispersion in the case of the vegetative state where I have no replicates at all*? (given that I apply a DESEq analysis with design=~group and then I ask results(ddsEZL1_ICL7, contrast=c("group", "controls0", "factors0")))
2) How Deseq calculates the dispersion in all other cases where I have partially replicates*?
Thank you in advance for any ideas*!
Here is the code*:
# 0:Vegetative, 1:Early, 2:Inter, 3:Late
> times=factor(c(1, rep(2,2),rep(3,2),0,rep(1,2), rep(2,1),rep(3,1),0), levels=c(0,1,2,3))
>
> strain = factor(c(rep("factors", 6), rep("controls",5)), levels=c("controls","factors"))
>
> colData = data.frame( strain= strain, times=times)
> rownames(colData)=colnames(duo)
> colData$group <- factor(paste0(colData$strain,colData$times))
> colData
strain times group
EZL1.T0 factors 1 factors1
EZL1.T10 factors 2 factors2
EZL1.T20 factors 2 factors2
EZL1.T35 factors 3 factors3
EZL1.T50 factors 3 factors3
EZL1.veg factors 0 factors0
ICL7.T0 controls 1 controls1
ICL7.T5 controls 1 controls1
ICL7.T20 controls 2 controls2
ICL7.T35 controls 3 controls3
ICL7.veg controls 0 controls0
>
> ddsEZL1_ICL7 <-DESeqDataSetFromMatrix(countData= as.matrix(duo), colData= colData, design=~group)
> ddsEZL1_ICL7 <-DESeq(ddsEZL1_ICL7)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
>
>
> resVEG<-results(ddsEZL1_ICL7, contrast=c("group", "controls0", "factors0"))
> resVEG<- resVEG[!is.na(resVEG$padj) & resVEG$padj <= 0.05,]
> list_of_differentially_expressed_genes_for_EZL1vsICL7_VEG<-rownames(resVEG)
> length(list_of_differentially_expressed_genes_for_EZL1vsICL7_VEG)
[1] 15
>
> resEARLY <- results(ddsEZL1_ICL7, contrast=c("group", "controls1", "factors1"))
> resEARLY <- resEARLY[!is.na(resEARLY$padj) & resEARLY$padj <= 0.05,]
> list_of_differentially_expressed_genes_for_EZL1vsICL7_EARLY<-rownames(resEARLY)
> length(list_of_differentially_expressed_genes_for_EZL1vsICL7_EARLY)
[1] 412
>
> resINTER <- results(ddsEZL1_ICL7, contrast=c("group", "controls2", "factors2"))
> resINTER <- resINTER[!is.na(resINTER$padj) & resINTER$padj <= 0.05,]
> list_of_differentially_expressed_genes_for_EZL1vsICL7_INTER<-rownames(resINTER)
> length(list_of_differentially_expressed_genes_for_EZL1vsICL7_INTER)
[1] 1087
>
> resLATE <- results(ddsEZL1_ICL7, contrast= c("group", "controls3", "factors3"))
> resLATE <- resLATE[!is.na(resLATE$padj) & resLATE$padj <= 0.05,]
> list_of_differentially_expressed_genes_for_EZL1vsICL7_LATE<-rownames(resLATE)
> length(list_of_differentially_expressed_genes_for_EZL1vsICL7_LATE)
[1] 2921
Note*:
When I do a simple DESeq just for the vegetative state DESeq ignores the condition labels (factor and control) and estimates the variance by treating all samples (EZL1.veg and ICL7.veg) as if they were replicates of the same
condition, in order to estimate the dispersions. But this method gives me different results at the end (much more conservative which is normal):
f <- list("~/EZL1_results/EZL1-veg_mRNA_GCCAAT.TOPHAT.pt_51.pe.sorted.bam_fragments.csv",
"~/ICL7_results/ICL7-veg_RNA_ATCACG.TOPHAT.pt_51.pe.sorted.bam_fragments.csv")
….
….
> str(df_points_EARLY)
'data.frame': 41533 obs. of 2 variables:
$ EZL1.veg: int 104 105 44 0 58 84 111 22 0 105 ...
$ ICL7.veg: int 280 235 106 0 85 446 185 64 1 168 ...
> strain = factor(c(rep("EZL1",1), rep("ICL7",1)), levels=c("EZL1","ICL7"))
> colData = data.frame( strain= strain )
> rownames(colData)=colnames(df_points_EARLY)
> dds<-DESeqDataSetFromMatrix(countData= as.matrix(df_points_EARLY), colData= colData, design=~strain)
> dds <-DESeq(dds)
> resultsNames(dds)
[1] "Intercept" "strainEZL1" "strainICL7"
> resEZL1_ICL7 <- results(dds, contrast=list("strainICL7","strainEZL1"))
> resEZL1_ICL7<- resEZL1_ICL7[!is.na(resEZL1_ICL7$padj) & resEZL1_ICL7$padj <= 0.05,]
> list_of_differentially_expressed_genes_for_EZL1_ICL7_VFG<-rownames(resEZL1_ICL7)
> length(list_of_differentially_expressed_genes_for_EZL1_ICL7_VEG)
[1] 0
I make a differential expression analysis between treated samples (factors) and untreated (controls) at different states (vegetative, early, inter, late).
The problem is that I lack replicates.
-For the vegetative state I have no replicates at all.
Contrast*:
EZL1.veg (factor) vs ICL7.veg (control)
-For the early state I have partially replicates (only 2 for the control)
Contrast*:
EZL1.T0 (factor) vs ICL7.T0 & ICL7.T5 (control)
-For the inter state I have partially replicates (only 2 for the factor)
Contrast*:
EZL1.T10 & EZL1.T20 vs ICL7.T20
-For the late state I have partially replicates (only 2 for the factor)
Contrast*:
EZL1.T35 & EZL1.T50 vs ICL7.T35
My questions are :
1) How Deseq calculates the dispersion in the case of the vegetative state where I have no replicates at all*? (given that I apply a DESEq analysis with design=~group and then I ask results(ddsEZL1_ICL7, contrast=c("group", "controls0", "factors0")))
2) How Deseq calculates the dispersion in all other cases where I have partially replicates*?
Thank you in advance for any ideas*!
Here is the code*:
# 0:Vegetative, 1:Early, 2:Inter, 3:Late
> times=factor(c(1, rep(2,2),rep(3,2),0,rep(1,2), rep(2,1),rep(3,1),0), levels=c(0,1,2,3))
>
> strain = factor(c(rep("factors", 6), rep("controls",5)), levels=c("controls","factors"))
>
> colData = data.frame( strain= strain, times=times)
> rownames(colData)=colnames(duo)
> colData$group <- factor(paste0(colData$strain,colData$times))
> colData
strain times group
EZL1.T0 factors 1 factors1
EZL1.T10 factors 2 factors2
EZL1.T20 factors 2 factors2
EZL1.T35 factors 3 factors3
EZL1.T50 factors 3 factors3
EZL1.veg factors 0 factors0
ICL7.T0 controls 1 controls1
ICL7.T5 controls 1 controls1
ICL7.T20 controls 2 controls2
ICL7.T35 controls 3 controls3
ICL7.veg controls 0 controls0
>
> ddsEZL1_ICL7 <-DESeqDataSetFromMatrix(countData= as.matrix(duo), colData= colData, design=~group)
> ddsEZL1_ICL7 <-DESeq(ddsEZL1_ICL7)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
>
>
> resVEG<-results(ddsEZL1_ICL7, contrast=c("group", "controls0", "factors0"))
> resVEG<- resVEG[!is.na(resVEG$padj) & resVEG$padj <= 0.05,]
> list_of_differentially_expressed_genes_for_EZL1vsICL7_VEG<-rownames(resVEG)
> length(list_of_differentially_expressed_genes_for_EZL1vsICL7_VEG)
[1] 15
>
> resEARLY <- results(ddsEZL1_ICL7, contrast=c("group", "controls1", "factors1"))
> resEARLY <- resEARLY[!is.na(resEARLY$padj) & resEARLY$padj <= 0.05,]
> list_of_differentially_expressed_genes_for_EZL1vsICL7_EARLY<-rownames(resEARLY)
> length(list_of_differentially_expressed_genes_for_EZL1vsICL7_EARLY)
[1] 412
>
> resINTER <- results(ddsEZL1_ICL7, contrast=c("group", "controls2", "factors2"))
> resINTER <- resINTER[!is.na(resINTER$padj) & resINTER$padj <= 0.05,]
> list_of_differentially_expressed_genes_for_EZL1vsICL7_INTER<-rownames(resINTER)
> length(list_of_differentially_expressed_genes_for_EZL1vsICL7_INTER)
[1] 1087
>
> resLATE <- results(ddsEZL1_ICL7, contrast= c("group", "controls3", "factors3"))
> resLATE <- resLATE[!is.na(resLATE$padj) & resLATE$padj <= 0.05,]
> list_of_differentially_expressed_genes_for_EZL1vsICL7_LATE<-rownames(resLATE)
> length(list_of_differentially_expressed_genes_for_EZL1vsICL7_LATE)
[1] 2921
Note*:
When I do a simple DESeq just for the vegetative state DESeq ignores the condition labels (factor and control) and estimates the variance by treating all samples (EZL1.veg and ICL7.veg) as if they were replicates of the same
condition, in order to estimate the dispersions. But this method gives me different results at the end (much more conservative which is normal):
f <- list("~/EZL1_results/EZL1-veg_mRNA_GCCAAT.TOPHAT.pt_51.pe.sorted.bam_fragments.csv",
"~/ICL7_results/ICL7-veg_RNA_ATCACG.TOPHAT.pt_51.pe.sorted.bam_fragments.csv")
….
….
> str(df_points_EARLY)
'data.frame': 41533 obs. of 2 variables:
$ EZL1.veg: int 104 105 44 0 58 84 111 22 0 105 ...
$ ICL7.veg: int 280 235 106 0 85 446 185 64 1 168 ...
> strain = factor(c(rep("EZL1",1), rep("ICL7",1)), levels=c("EZL1","ICL7"))
> colData = data.frame( strain= strain )
> rownames(colData)=colnames(df_points_EARLY)
> dds<-DESeqDataSetFromMatrix(countData= as.matrix(df_points_EARLY), colData= colData, design=~strain)
> dds <-DESeq(dds)
> resultsNames(dds)
[1] "Intercept" "strainEZL1" "strainICL7"
> resEZL1_ICL7 <- results(dds, contrast=list("strainICL7","strainEZL1"))
> resEZL1_ICL7<- resEZL1_ICL7[!is.na(resEZL1_ICL7$padj) & resEZL1_ICL7$padj <= 0.05,]
> list_of_differentially_expressed_genes_for_EZL1_ICL7_VFG<-rownames(resEZL1_ICL7)
> length(list_of_differentially_expressed_genes_for_EZL1_ICL7_VEG)
[1] 0