SEQanswers DESEQ2: How to use dispersion estimated from one dataset in another dataset
 Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 Similar Threads Thread Thread Starter Forum Replies Last Post Parharn Bioinformatics 4 09-18-2014 07:23 AM ysnapus Bioinformatics 4 03-10-2014 04:00 PM mbk0asis Bioinformatics 0 03-10-2014 12:07 AM arkilis General 2 07-18-2013 05:09 PM ips Bioinformatics 6 03-21-2011 04:29 PM

 06-26-2015, 02:12 AM #1 EVELE Junior Member   Location: France Join Date: Jun 2015 Posts: 9 DESEQ2: How to use dispersion estimated from one dataset in another dataset Hi all, I have recently started using DeSeq2 and I would like to know if it is possible to do something like this: I have 3 time course experiments (1 control and 2 treated (A and B)) but I do not have biological replicates for B treated experiment. So I calculated the dispersion for experiment A versus control and I wonder if there is a way to use the same dispersion for the comparison of B versus control. How to do that ? Is it biologically, scientifically "legal" to do that ? By the way, there is evidence that they have similar dispersion and I'd like to avoid using the "blind" method. I would appreciate it to let me know your ideas about it.
 06-26-2015, 04:03 AM #2 Michael Love Senior Member   Location: Boston Join Date: Jul 2013 Posts: 333 The best approach would be to combine both analyses into one, by using a DESeqDataSet with all the samples. The design would still be something like: ~ time + condition + condition:time. In order to test the two treatments separately, you can use the likelihood ratio test with different design matrices presented to the reduced argument (here you have to build the matrices outside of the DESeq() function, and pass them in, as is done with limma for example). Something like: full.mm <- model.matrix(~ time + condition + condition:time, data=colData(dds)) # now contruct 2 reduced matrices for treatment A and treatment B, by removing the columns for the interaction terms which contain treatment A and B respectively. The for each reduced matrix, you run these lines to get a results table: ddsA <- DESeq(dds, full=full.mm, reduced=reduced.mmA, test="LRT") resA <- results(ddsA)
06-30-2015, 01:16 AM   #3
EVELE
Junior Member

Location: France

Join Date: Jun 2015
Posts: 9

Hi Michael,

I tried to apply your suggestions.
As I am not sure for my understanding and the interpretations I give to my results, I expose (in bold) some of my considerations and questions arising through the steps.
I would appreciate it a lot , if you could answer some of them.

# times*: 0,1,2 represent stages*: vegetative, early, late respectively (early and late in a sexual cycle)
> times=factor(c(rep(1,2), rep(2,3), 1, 0, 1, rep(2,3), 1, 2, 0, 1, rep(2,2), 1, rep(2,2), 1, rep(0,2), 1, rep(2,5), 0), levels=c(0,1,2))

> condition = factor(c(rep("A", 14), rep("Controls",9), rep("B",7)), levels=c("Controls","A","B"))

Consideration 1
I consider that the log2 Fold Change results will be*: treatment A over Controls and B over Controls* (the first element of the vector levels=c("Controls","A","B") is always the denominator)...

> colData = data.frame( condition=condition, times=times)

# quarteto is my data frame containing the counts
> rownames(colData)=colnames(quarteto)
> dds<-DESeqDataSetFromMatrix(countData= as.matrix(quarteto), colData= colData ,design=formula(~times+condition+condition:times))
> colData
condition times
ND7.T0 A 1
ND7.T10 A 1
ND7.T20 A 2
ND7.T30 A 2
ND7.T40 A 2
ND7.T5 A 1
ND7.V A 0
ICL7.T0 A 1
ICL7.T10 A 2
ICL7.T20 A 2
ICL7.T35 A 2
ICL7.T5 A 1
ICL7.T50 A 2
ICL7.veg A 0
T0.2 Controls 1
T20.2 Controls 2
T20.3 Controls 2
T3 Controls 1
T30 Controls 2
T45 Controls 2
T5.2 Controls 1
V1.2 Controls 0
VK1 Controls 0
EZL1.T0 B 1
EZL1.T10 B 2
EZL1.T20 B 2
EZL1.T35 B 2
EZL1.T5 B 2
EZL1.T50 B 2
EZL1.veg B 0
>
>
> full.mm <- model.matrix(~ times + condition + condition:times, data=colData(dds))
(Intercept) times1 times2 conditionA conditionB times1:conditionA
ND7.T0 1 1 0 1 0 1
ND7.T10 1 1 0 1 0 1
ND7.T20 1 0 1 1 0 0
ND7.T30 1 0 1 1 0 0
ND7.T40 1 0 1 1 0 0
ND7.T5 1 1 0 1 0 1
times2:conditionA times1:conditionB times2:conditionB
ND7.T0 0 0 0
ND7.T10 0 0 0
ND7.T20 1 0 0
ND7.T30 1 0 0
ND7.T40 1 0 0
ND7.T5 0 0 0

> # contruction of 2 reduced matrices for treatment A and treatment B, by removing the columns for the interaction terms which contain treatment A and B respectively.

Consideration 2
Do I consider right that the only columns concerning the interaction terms of conditionA are*:
times1:conditionA and times2:conditionA*?

> reducedA.mm <- full.mm[ ,!colnames(full.mm) %in% c("times1:conditionA", "times2:conditionA")]

> reducedB.mm <- full.mm[ ,!colnames(full.mm) %in% c("times1:conditionB", "times2:conditionB")]

(Intercept) times1 times2 conditionA conditionB times1:conditionB
ND7.T0 1 1 0 1 0 0
ND7.T10 1 1 0 1 0 0
ND7.T20 1 0 1 1 0 0
ND7.T30 1 0 1 1 0 0
ND7.T40 1 0 1 1 0 0
ND7.T5 1 1 0 1 0 0
times2:conditionB
ND7.T0 0
ND7.T10 0
ND7.T20 0
ND7.T30 0
ND7.T40 0
ND7.T5 0

> # treated_B vs Controls
> ddsB <- DESeq(dds, full=full.mm, reduced=reducedB.mm, test="LRT")
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 4444 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
>
> resultsNames(ddsB)
[1] "Intercept" "times1" "times2"
[4] "conditionA" "conditionB" "times1.conditionA"
[7] "times2.conditionA" "times1.conditionB" "times2.conditionB"
>
> resB_1 <- results(ddsB, name="times1.conditionB")
log2 fold change (MLE): times1.conditionB
LRT p-value: full vs reduced
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
PTET.51.1.G0010001 146.5136783 -0.62466508 0.7898300 0.6319728 0.7290694
PTET.51.1.G0010002 122.5083339 -0.68161198 0.8059814 1.1202526 0.5711369
PTET.51.1.G0010003 100.0616624 -0.53628223 0.5763633 2.2757097 0.3205058
PTET.51.1.G0010004 0.2590626 -5.89587066 10.5174621 0.4181308 0.8113422
PTET.51.1.G0010005 111.5651704 -0.13174595 0.6950362 1.1823154 0.5536859
PTET.51.1.G0010006 115.1214038 0.07501774 0.9922750 0.9200433 0.6312700
<numeric>
PTET.51.1.G0010001 0.99998
PTET.51.1.G0010002 0.99998
PTET.51.1.G0010003 0.99998
PTET.51.1.G0010004 NA
PTET.51.1.G0010005 0.99998
PTET.51.1.G0010006 0.99998
>

Question 1
resB_1 <- results(ddsB, name="times1.conditionB")
How are the results resB_1 interpreted*?
Is it the differential expression from time 0 to time 1 specific to conditionB ??

Question 2
How is the «*log2 fold change (MLE): times1.conditionB*» interpreted at the results resB_1? How is it calculated (what are the 2 parts of comparison)*?
Is it something like*: times1.conditionB vs (Controls+times1.conditionA)*?

> # COUNTS PLOTS #
>
> data1 <- plotCounts(ddsB, which.min(resB_1\$padj), intgroup=c("times","condition"), returnData=TRUE)
> pdf("counts_over_time.pdf")
> ggplot(data1, aes(x=times, y=count, color=condition, group=condition)) + geom_point() + stat_smooth(se=FALSE,method="loess") + scale_y_log10()
Error in predLoess(y, x, newx, s, weights, pars\$robust, pars\$span, pars\$degree, :
NA/NaN/Inf dans un appel à une fonction externe (argument 5)
Il y a eu 16 avis (utilisez warnings() pour les visionner)

> dev.off()
null device
1
> The plot is Attached

Question 3
Why error*? Is there a problem with the loess method*?

> resB_1 <- results(ddsB, name="times1.conditionB")
> resB_2 <- results(ddsB, name="times2.conditionB")
>
> min_fold_change=2
>
> ####
> resB_1 <- resB_1[resB_1\$log2FoldChange >= log2(min_fold_change) | resB_1\$log2FoldChange <= log2(1/min_fold_change),]
>
> resB_2 <- resB_2[resB_2\$log2FoldChange >= log2(min_fold_change) | resB_2\$log2FoldChange <= log2(1/min_fold_change),]
>
> list_of_differentially_expressed_genes_for_B_vs_Controls<-unique(c(rownames(resB_1),rownames(resB_2)))
> length(list_of_differentially_expressed_genes_for_B_vs_Controls)
[1] 33

Question 3
Is this the right way*:
list_of_differentially_expressed_genes_for_B_vs_Controls<-unique(c(rownames(resB_1),rownames(resB_2)))
to obtain the list of all the differentially expressed genes for treatment B vs Control*?
Attached Files
 counts_over_time.pdf (4.9 KB, 9 views)

 07-01-2015, 08:09 AM #4 Michael Love Senior Member   Location: Boston Join Date: Jul 2013 Posts: 333 Your code looks correct. One comment: the outlier detection is useful when there are a few genes with outliers. Here there are many genes being flagged, which either means: there is a lot of variability, which the outlier detection method is not picking up on or there is a sample with consistent outlier counts. So I'd recommend setting minReplicatesForReplace=Inf to turn off filtering, and to use plotPCA on rlog or VST data to identify if there is an obvious outlier sample which should be removed. Question 1: you don't need to focus on the log2FoldChange. When you perform a likelihood ratio test, you are testing multiple terms. Here you are testing whether there are treatment specific differences over time. There are multiple terms involved in this test: the additional effect of treatmentA at time 1, the additional effect of treatmentA at time 2 (compared to the base level in controls). The p-values and adjusted p-values are the important columns. This is described in the help page for results, under "On p-values": ?results Question 2: the meaning of each of the terms in an interaction model is quite complex, and I often recommend that users who have complex experimental designs who are not familiar with analyzing these speak to a local statistician who can help explain all the terms. There is nothing special about DESeq2 here, these are the same terms that would appear in a linear regression time course analysis, so any person with statistical training should be able to explain these to you in person. Question 3: this is a ggplot2 error, so I'm not sure. it's likely related to 0's in the count column and the log10 scaling. Your last question: you are testing for treatment-specific differences across multiple time points, so it doesn't make sense to do LFC thresholding here. Or if you want to threshold, remember that there are multiple time points being tested, and each one has it's own interaction term.