SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Paired design versus unpaired design in DESeq2 KristenC RNA Sequencing 1 05-29-2014 11:05 AM
DESeq2: Continuos variable in the design sindrle RNA Sequencing 14 05-27-2014 07:40 AM
How do I analyze double nested design? (DESeq2) Sciurus Bioinformatics 5 04-15-2014 02:59 AM
non balanced design in DESeq2 libro.s Bioinformatics 1 01-10-2014 05:10 AM
EdgeR design-matrix design extended.wobble RNA Sequencing 3 07-11-2011 06:58 AM

Reply
 
Thread Tools
Old 09-03-2014, 09:52 AM   #1
Wei-HD
Member
 
Location: Germany

Join Date: Oct 2009
Posts: 59
Default Deseq2 design

Dear all,

I have one question about the design of the condition: I am doing gene silencing screening with 2 different shRNA (targeted to the same gene). Before I do the RNA-seq, I know both of these two shRNA bring my gene of interest down 50% compare with control (bench data), and I want to see the differential expression genes in the knock down condition. So I have 9 samples in total with 3 biological replicates per condition:

untreated1 control
untreated2 control
untreated3 control
treatedA1 shRNA_A
treatedA2 shRNA_A
treatedA3 shRNA_A
treatedB1 shRNA_B
treatedB2 shRNA_B
treatedB3 shRNA_B

First, I did the Deseq2 analysis with two groups comparison: control vs shRNA_A and control vs shRNA_B.
design as follows:
condition
untreated1 control
untreated2 control
untreated3 control
treatedA1 shRNA_A
treatedA2 shRNA_A
treatedA3 shRNA_A

condition
untreated1 control
untreated2 control
untreated3 control
treatedA1 shRNA_B
treatedA2 shRNA_B
treatedA3 shRNA_B


From the two dataset, I can see my gene goes down 50% (consistent with my bench data), but there are different number of regulated genes (with padj < 0.1) in the two dataset:
control vs shRNA_A 167 genes
control vs shRNA_B 431 genes

there are 50 common regulated genes in the two lists, but the non-overlapped genes have similar expression trend in the two shRNA condition (just because of the padj, those genes are not choose in the list).

Is my analysis design correct? or shall I do another type of design like this?

condition type
untreated1 control untreated
untreated2 control untreated
untreated3 control untreated
treatedA1 shRNA_A treated
treatedA2 shRNA_A treated
treatedA3 shRNA_A treated
treatedB1 shRNA_B treated
treatedB2 shRNA_B treated
treatedB3 shRNA_B treated

ideally I would like to see the genes in two shRNA treated condition together (like anova test), but I do not know how to design it?

Thanks in advance for any advice or comments!

Best,
Wei
Wei-HD is offline   Reply With Quote
Old 09-03-2014, 11:42 PM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I'd personally go with the combined design (i.e., the last one you showed). Especially since you only have 3 samples per group, you'll likely get a bit better variance estimates with that. If you wanted to do a classic 1-way anova like test, then you'd just use nbinomLRT() and compare that against a design of ~1.
dpryan is offline   Reply With Quote
Old 09-04-2014, 04:16 AM   #3
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Yes, as Devon states, if you want to test (control vs shRNA_A) and (control vs shRNA_B) at once you can use the LRT,

with a condition vector like so: c("control", ..., "shRNA_A", ..., "shRNA_B", ... )

you can use:

Code:
design(dds) = ~ condition
dds = DESeq(dds, test="LRT", reduced = ~ 1)
results(dds) gives the LRT p-values and adjusted p-values (note that the LFC column is only for one of the comparisons, see ?results for more information).

If you want to make the two comparisons individually, it's best to run all the samples together and produce results tables like so:

Code:
dds = DESeq(dds)
results(dds, contrast=c("condition","shRNA_A","control"))
results(dds, contrast=c("condition","shRNA_B","control"))
Michael Love is offline   Reply With Quote
Old 09-04-2014, 07:56 AM   #4
Wei-HD
Member
 
Location: Germany

Join Date: Oct 2009
Posts: 59
Default

Thanks a lot Devon and Michael!

Like you recommended I run all the samples together to get better variance estimates. I tried to my design as ~ condition+type. I had one error"the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others", I guess this is a wrong design.

Here is my code:

> countdata = as.matrix(read.table("data.txt", header=T, row.names=1))
> head (countdata)
untreated1 untreated2 untreated3 treatedA1 treatedA2 treatedA3 treatedB1 treatedB2 treatedB3
ENSMUSG00000000001 1233 1091 1259 1234 1118 1082 1454 1086 1143
ENSMUSG00000000003 0 0 0 0 0 0 0 0 0
ENSMUSG00000000028 76 68 61 75 81 82 93 74 88
ENSMUSG00000000031 2 3 5 2 2 2 12 5 2
ENSMUSG00000000037 24 31 38 27 26 14 47 43 45
ENSMUSG00000000049 6 7 10 11 19 16 20 11 15
> coldata= read.table ("design.txt", header = T, row.names=1)
>
> coldata
condition type
untreated1 control untreated
untreated2 control untreated
untreated3 control untreated
treatedA1 shRNA_A treated
treatedA2 shRNA_A treated
treatedA3 shRNA_A treated
treatedB1 shRNA_B treated
treatedB2 shRNA_B treated
treatedB3 shRNA_B treated
> dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ condition + type)
Error in DESeqDataSet(se, design = design, ignoreRank) :
the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others



Then I do it like Michael's code, I delete the type column in coldata file.

> coldata
condition
untreated1 control
untreated2 control
untreated3 control
treatedA1 shRNA_A
treatedA2 shRNA_A
treatedA3 shRNA_A
treatedB1 shRNA_B
treatedB2 shRNA_B
treatedB3 shRNA_B

> dds <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ condition)
> design(dds) = ~ condition
> dds = DESeq(dds, test="LRT", reduced = ~ 1)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> res <- results(dds)
> head(res)
log2 fold change: condition shRNA B vs control
LRT p-value: '~ condition' vs '~ 1'
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000000001 1183.648139 -0.1000936 0.09353996 1.191909 0.55103647
ENSMUSG00000000003 0.000000 NA NA NA NA
ENSMUSG00000000028 77.301980 0.1806468 0.21423655 1.095024 0.57838715
ENSMUSG00000000031 3.688155 0.7430438 0.92477711 2.323728 0.31290243
ENSMUSG00000000037 32.345099 0.4137095 0.32571884 6.944625 0.03104515
ENSMUSG00000000049 12.603642 0.8593232 0.53167437 3.999843 0.13534587
padj
<numeric>
ENSMUSG00000000001 0.9509148
ENSMUSG00000000003 NA
ENSMUSG00000000028 0.9573557
ENSMUSG00000000031 NA
ENSMUSG00000000037 0.3663522
ENSMUSG00000000049 NA

######question1: res only include one comparison "condition shRNA B vs control", how can I get the result "condition shRNA A vs control"? (sorry I checked the ?results, but did not manage to get it)
######question2: LRT p-value: '~ condition' vs '~ 1', I do not quite understand '~ 1'?

when I do two comparisons individually, then this is nbinomWaldTest test:

dds = DESeq(dds)
resA <- results(dds, contrast=c("condition","shRNA_A","control"))
resB <- results(dds, contrast=c("condition","shRNA_B","control"))

The result resB is slightly different from the res (with LRT test) in both number of genes and fold change number.
######question3: which method shall I use?

Thanks again for your time and help
Wei-HD is offline   Reply With Quote
Old 09-04-2014, 08:11 AM   #5
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Q1:

In my previous response I gave the answer to this one:

Quote:
results(dds) gives the LRT p-values and adjusted p-values (note that the LFC column is only for one of the comparisons, see ?results for more information).
When you print 'res', it says:

Code:
LRT p-value: '~ condition' vs '~ 1'
So the p-values are for the test of the counts depending on all levels of condition (control, shRNA_A, shRNA_B), while the LFC shown in this table is only for one of the fold changes. Please read the help for ?results which has a paragraph explaining this.


Q2:

The likelihood ratio test compares the explanatory power of two models.

See http://en.wikipedia.org/wiki/Likelihood-ratio_test for background on this statistical test.

the full model (~ condition) includes the intercept and the information in 'condition' (note that in R formula, ~ condition is equivalent to ~ 1 + condition)

the reduced model (~ 1) is just an intercept term

resB is a comparison of one level vs the control, whereas the LRT tests all levels of condition.

Q3:

Which method you use depends on the biological question you want to ask.

Use the LRT if you want to describe the set of genes changing from any level of condition.

Use the Wald tests if you want to describe the set of genes from a single pair of levels (e.g. shRNA_B vs control).
Michael Love is offline   Reply With Quote
Old 09-04-2014, 08:57 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Michael: Quick (largely) unrelated question for you. Given an experiment with design ~condition, where condition just has two levels, is there much of a benefit to using a Wald test over an LRT other than the ease of computation? My understanding had always been that the Wald test approximated an LRT and was often preferred since you don't have to fit the alternate model.
dpryan is offline   Reply With Quote
Old 09-04-2014, 09:08 AM   #7
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

I haven't seen any big difference in the ranking or set of genes {adj. p < alpha} between the Wald and the LRT. Some differences are:

The Wald statistic appeared more compatible with the LFC shrinkage, producing uniform p-values under the null (the numerator and denominator shrink at nearly the same rate), while after LFC shrinkage the LRT statistic under the null piles up a bit near 0 (p-values piling up at 1).

The Wald allow for easy testing of |beta| > theta, for theta > 0.

The standard Wald pipeline has LFC shrinkage, for which we have to estimate the MLE betas to estimate the prior, so I'm not sure if it's even faster one way or the other.
Michael Love is offline   Reply With Quote
Old 09-04-2014, 09:21 AM   #8
Wei-HD
Member
 
Location: Germany

Join Date: Oct 2009
Posts: 59
Default

Thanks a lot Michael for the detail and explanation!

I think LRT test makes more sense for me (compare all levels). Sorry for the question before to get the comparison with shRNA_A vs control. I got it with the "name".

> resultsNames(dds)
[1] "Intercept" "condition_shRNAA_vs_control" "condition_shRNAB_vs_control"
> resA <- results(dds, name="condition_shRNA_A_vs_control")

But with "contrast", I had one error:

> resA <- results(dds, contrast=c("condition","shRNA_A","control"))

Error in normalizeSingleBracketSubscript(j, x) :
subscript contains invalid names
Wei-HD is offline   Reply With Quote
Old 09-04-2014, 09:27 AM   #9
Wei-HD
Member
 
Location: Germany

Join Date: Oct 2009
Posts: 59
Default

Quote:
Originally Posted by dpryan View Post
I'd personally go with the combined design (i.e., the last one you showed). Especially since you only have 3 samples per group, you'll likely get a bit better variance estimates with that. If you wanted to do a classic 1-way anova like test, then you'd just use nbinomLRT() and compare that against a design of ~1.

Thanks a lot Devon! just one question about "design of ~1", you mentioned. you mean: dds <- nbinomLRT(dds, reduced = ~ 1) ?
Wei-HD is offline   Reply With Quote
Old 09-04-2014, 10:16 AM   #10
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi Wei,

Yes, there is a bug in v1.4, that if you've run the LRT pipeline, then you can't use the contrast argument if the LFC is already contained in resultsNames, but you can get it with the 'name' argument. And it's the same result table using 'name'. I've fixed this in the next release (v1.6 released in October).

( and yes to your question for Devon )
Michael Love is offline   Reply With Quote
Old 09-04-2014, 10:22 AM   #11
Wei-HD
Member
 
Location: Germany

Join Date: Oct 2009
Posts: 59
Default

Thanks again Michael for the swift reply and great work
I need to read more to understand the use of "~1" in different function
Wei-HD is offline   Reply With Quote
Old 09-04-2014, 11:13 AM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by Michael Love View Post
I haven't seen any big difference in the ranking or set of genes {adj. p < alpha} between the Wald and the LRT. Some differences are:

The Wald statistic appeared more compatible with the LFC shrinkage, producing uniform p-values under the null (the numerator and denominator shrink at nearly the same rate), while after LFC shrinkage the LRT statistic under the null piles up a bit near 0 (p-values piling up at 1).

The Wald allow for easy testing of |beta| > theta, for theta > 0.

The standard Wald pipeline has LFC shrinkage, for which we have to estimate the MLE betas to estimate the prior, so I'm not sure if it's even faster one way or the other.
Thanks, that makes sense.
dpryan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:19 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO