SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DESeq2 Contrasts cacti Bioinformatics 4 08-04-2015 11:20 AM
DESeq2 - using contrasts vs. multiple runs for DE analysis rutgers2015 Bioinformatics 2 07-14-2015 10:34 AM
Contrasts with DESeq2 mistrm RNA Sequencing 3 07-24-2014 09:03 AM
Paired design versus unpaired design in DESeq2 KristenC RNA Sequencing 1 05-29-2014 11:05 AM
Differential expression analysis on RPKMs - Contrasts and contrasts of contrasts giorgifm Bioinformatics 6 08-16-2013 10:50 AM

Reply
 
Thread Tools
Old 10-25-2014, 08:51 AM   #1
gamb
Member
 
Location: USA

Join Date: Oct 2014
Posts: 11
Default DESeq2 model design + contrasts

I've skimmed through previous threads about contrasts and I'm still getting hung up with my data.

My data are 16s rDNA count data from a full factorial design with 3 tissue types (A,B,C) that were either exposed to Chemical1, Chemical2, or both. Each condition has three replicates.

Edit: Each replicate has tissue A, B, C samples from the same animal. So, 3 animals per treatment. How should I account for this?

For each tissue (just showing A for example)

TissueA, Chem1+, Chem2+ x 3 replicates
TissueA, Chem1-, Chem2+ x 3 replicates
TissueA, Chem1+, Chem2- x 3 replicates
TissueA, Chem1-, Chem2- x 3 replicates [aka control]

The questions I want to test are:
  • Which OTUs are different from Tissue A vs B, B vs C, and A vs C?
  • Effects of Chem1 on each Tissue
  • Effects of Chem2 on each Tissue
  • The effects of Chem1 and Chem2 on each tissue

Is it worth taking the chem1 and chem2 treatments and merge them as one condition? i.e. condition1= chem1+chem2+, condition2=chem1-chem2+ etc for simplicity and setup my model as ~tissue + condition?

Last edited by gamb; 10-25-2014 at 12:04 PM.
gamb is offline   Reply With Quote
Old 10-25-2014, 04:24 PM   #2
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi gamb,

Firstly, as you are just starting to analyze, best to update to Bioconductor 3.0 and DESeq2 version 1.6 which was just released.

As the analysis is a bit complicated, I'd recommend using DESeq(dds, betaPrior=FALSE) as this simplifies the building of contrasts with interactions.

The design should be

Code:
~ tissue*chem1*chem2
...because you're interested in both the interactions of chem1 and chem2, and in the specific effects for each tissue.

The list of effects is given by

Code:
resultsNames(dds)
You can build up the contrasts using a list like below. I'll give some examples and see if you can extend to the others. Comparing the tissues alone is simple by specifying the name of the factor and the two levels to compare, e.g.:

Code:
results(dds, contrast=c("tissue","C","B"))
Then some basics about interaction terms: the effect of chem 1 in tissue B will be:

Code:
results(dds, contrast=list(c("chem1","chem1.tissueB")))
...because you need to add the main chem 1 effect (the effect for tissue A, the base level of tissue) and the extra chem 1 effect for tissue B.

If you wanted to test if chem 1 has an effect in tissue B different than tissue A you would only test the interaction term "chem1.tissueB", because this is that difference.

Tissue A is the base level, so the effect of chem 1 in tissue A is simply

Code:
results(dds, name="chem1")
The effect of chem 1 and chem 2 in tissue B is a bit lengthy, you need to add main effects and all relevant interaction terms:

Code:
results(dds, contrast=list(c("chem1","chem2","chem1.tissueB","chem2.tissueB","chem1.chem2","chem1.chem2.tissueB")))
It's simpler for tissue A because this is the base level:

Code:
results(dds, contrast=list(c("chem1","chem2","chem1.chem2")))
And if you want to test only the interaction, answering if the effect of the two chems is different than each separately, you remove some of these effects. For tissue B this contrast would be:

Code:
results(dds, contrast=list(c("chem1.chem2","chem1.chem2.tissueB")))
For tissue A it is again simpler:

Code:
results(dds, name="chem1.chem2")
Note that another way to compare individual groups would be to create a new variable which describes all groups:

Code:
dds$group = factor(paste(dds$chem1, dds$chem2, dds$tissue))
design(dds) = ~ group
Then use the following to compare to specific groups:

Code:
results(dds, contrast=c("group","...","..."))

Last edited by Michael Love; 10-26-2014 at 09:20 AM.
Michael Love is offline   Reply With Quote
Old 10-26-2014, 07:42 AM   #3
gamb
Member
 
Location: USA

Join Date: Oct 2014
Posts: 11
Default

Thanks Michael-

I updated everything and now I have problems!

Following your code,

Code:
results(dds, contrast=c("tissue","C","B"))
Works for me but this doesn't:

Code:
results(dds, contrast=list("chem1","chem1.tissueB"))
I get the following:

Code:
results(dds, contrast=list("Chem1", "Chem1.tissueB"))
Error in cleanContrast(object, contrast, expanded = isExpanded, listValues = listValues) : 
  all elements of the contrast as a list of length 2 should be elements of 'resultsNames(object)'
My code:

Code:
dds=phyloseq_to_deseq2(data, ~tissue*Chem1*Chem2)
dds=DESeq(dds, test = "Wald", fitType = "local", betaPrior=FALSE)
resultsNames(dds)

 [1] "Intercept"                       "tissue_tissueC_vs_tissueA"      
 [3] "tissue_tissueB_vs_tissueA"       "Chem1_Yes_vs_No"                
 [5] "Chem2_Yes_vs_No"                 "tissuetissueC.Chem1Yes"         
 [7] "tissuetissueB.Chem1Yes"          "tissuetissueC.Chem2Yes"         
 [9] "tissuetissueB.Chem2Yes"          "Chem1Yes.Chem2Yes"              
[11] "tissuetissueC.Chem1Yes.Chem2Yes" "tissuetissueB.Chem1Yes.Chem2Yes"

Also, the tissues are paired by the source animal (i.e. Chem1+Chem2+ has tissue samples A,B, and C, from animals 1,2 and 3). How do I account for this?
gamb is offline   Reply With Quote
Old 10-26-2014, 08:22 AM   #4
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Can you post the output of:

sessionInfo()
Michael Love is offline   Reply With Quote
Old 10-26-2014, 08:38 AM   #5
gamb
Member
 
Location: USA

Join Date: Oct 2014
Posts: 11
Default

Code:
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[10] base     

other attached packages:
 [1] fossil_0.3.7              shapefiles_0.7            foreign_0.8-61           
 [4] maps_2.3-9                sp_1.0-15                 phyloseq_1.10.0          
 [7] gridExtra_0.9.1           plyr_1.8.1                reshape2_1.4             
[10] ggdendro_0.1-15           gdata_2.13.3              plotrix_3.5-7            
[13] ellipse_0.3-8             ggplot2_1.0.0             vegan_2.0-10             
[16] lattice_0.20-29           permute_0.8-3             RColorBrewer_1.0-5       
[19] biom_0.3.12               ape_3.1-4                 DESeq2_1.6.1             
[22] RcppArmadillo_0.4.450.1.0 Rcpp_0.11.3               GenomicRanges_1.18.1     
[25] GenomeInfoDb_1.2.0        IRanges_2.0.0             S4Vectors_0.4.0          
[28] BiocGenerics_0.12.0      

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3      ade4_1.6-2           annotate_1.44.0      AnnotationDbi_1.28.0
 [5] base64enc_0.1-2      BatchJobs_1.4        BBmisc_1.7           Biobase_2.26.0      
 [9] BiocParallel_1.0.0   Biostrings_2.34.0    brew_1.0-6           checkmate_1.5.0     
[13] chron_2.3-45         cluster_1.15.3       codetools_0.2-9      colorspace_1.2-4    
[17] data.table_1.9.4     DBI_0.3.1            digest_0.6.4         fail_1.2            
[21] foreach_1.4.2        Formula_1.1-2        genefilter_1.48.1    geneplotter_1.44.0  
[25] gtable_0.1.2         gtools_3.4.1         Hmisc_3.14-5         igraph_0.7.1        
[29] iterators_1.0.7      latticeExtra_0.6-26  locfit_1.5-9.1       MASS_7.3-35         
[33] Matrix_1.1-4         multtest_2.22.0      munsell_0.4.2        nlme_3.1-118        
[37] nnet_7.3-8           proto_0.3-10         RJSONIO_1.3-0        rpart_4.1-8         
[41] RSQLite_1.0.0        scales_0.2.4         sendmailR_1.2-1      splines_3.1.1       
[45] stringr_0.6.2        survival_2.37-7      tools_3.1.1          XML_3.98-1.1        
[49] xtable_1.7-4         XVector_0.6.0        zlibbioc_1.12.0
Edit: Closed out of my session and re-ran my code in a new session. Now getting this readout:

Code:
> dds=phyloseq_to_deseq2(data, ~tissue*Chem1*Chem2)
converting counts to integer mode
> dds=DESeq(dds, test = "Wald", fitType = "local", betaPrior=FALSE)
estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  : 
  every gene contains at least one zero, cannot compute log geometric means

Last edited by gamb; 10-26-2014 at 08:52 AM.
gamb is offline   Reply With Quote
Old 10-26-2014, 09:59 AM   #6
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi,

sorry, I had a typo in my previous posts: I forgot to wrap the names in c(). I've corrected this above.

To adapt to your level names, where I wrote "chem1", substitute "Chem1_Yes_vs_No", where I wrote "tissueC.chem1" substitute "tissuetissueC.Chem1Yes", etc.

To add a fixed effect which controls for animal, you should add this variable to the design:

Code:
~ animal + tissue*chem1*chem2
The "every gene contains at least one zero" can be solved by using an alternate size factor estimation. I just answered this question on the Bioconductor support site this week. If you search there you should be able to find a solution.
Michael Love is offline   Reply With Quote
Old 10-26-2014, 10:31 AM   #7
gamb
Member
 
Location: USA

Join Date: Oct 2014
Posts: 11
Default

Thank you! It seems like everything is now working and makes sense!
gamb is offline   Reply With Quote
Old 10-26-2014, 11:41 AM   #8
gamb
Member
 
Location: USA

Join Date: Oct 2014
Posts: 11
Default

Spoke too soon. It appears that either "animal" variable is forcing this error or the tissue*chem1*chem2


Code:
dds=phyloseq_to_deseq2(data, ~ animal + tissue*Chem1*Chem2)
converting counts to integer mode
Error in DESeqDataSet(se, design = design, ignoreRank) : 
  the model matrix is not full rank, so the model cannot be fit as specified.
  one or more variables or interaction terms in the design formula
  are linear combinations of the others and must be removed

Last edited by gamb; 10-26-2014 at 11:45 AM.
gamb is offline   Reply With Quote
Old 10-26-2014, 12:41 PM   #9
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Can you show as.data.frame(colData(dds))?

When I build a data frame as you described the samples, I get a full rank matrix with this design:

Code:
> d = data.frame(t=factor(rep(1:3,each=12)),
  c1=factor(rep(rep(1:2,each=6),3)),
  c2=factor(rep(rep(1:2,each=3),6)),
  a=factor(rep(1:3,12)))
> m = model.matrix(~ a + t*c1*c2, data=d)
> qr(m)$rank
[1] 14
> ncol(m)
[1] 14
Michael Love is offline   Reply With Quote
Old 10-26-2014, 03:13 PM   #10
gamb
Member
 
Location: USA

Join Date: Oct 2014
Posts: 11
Default

Data is structured like this:

Code:
   Animal Chem1 Chem2  Tissue
1       F   Yes    No tissueA
2       F   Yes    No tissueB
3       F   Yes    No tissueC
4       G    No   Yes tissueA
5       G    No   Yes tissueB
6       G    No   Yes tissueC
7       H    No   Yes tissueA
8       H    No   Yes tissueB
9       H    No   Yes tissueC
10      I    No   Yes tissueA
11      I    No   Yes tissueB
12      I    No   Yes tissueC
13      J   Yes   Yes tissueA
14      J   Yes   Yes tissueB
15      J   Yes   Yes tissueC
16      K   Yes   Yes tissueA
17      K   Yes   Yes tissueB
18      K   Yes   Yes tissueC
19      L   Yes   Yes tissueA
20      L   Yes   Yes tissueB
21      L   Yes   Yes tissueC
22      A    No    No tissueA
23      A    No    No tissueB
24      A    No    No tissueC
25      B    No    No tissueA
26      B    No    No tissueB
27      B    No    No tissueC
28      C    No    No tissueA
29      C    No    No tissueB
30      C    No    No tissueC
31      D   Yes    No tissueA
32      D   Yes    No tissueB
33      D   Yes    No tissueC
34      E   Yes    No tissueA
35      E   Yes    No tissueB
36      E   Yes    No tissueC
gamb is offline   Reply With Quote
Old 10-26-2014, 03:35 PM   #11
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

aha, I thought you meant that it was only three animals total. You can't just add a term for animal because it forms linear combinations with chem1*chem2...

Last edited by Michael Love; 10-26-2014 at 04:40 PM. Reason: misread the previous dataframe
Michael Love is offline   Reply With Quote
Old 10-26-2014, 04:41 PM   #12
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

I'll add more tomorrow, you can control for animal, but you have to reformulate the terms.
Michael Love is offline   Reply With Quote
Old 10-27-2014, 09:08 AM   #13
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

There is a way to control for the animal effect in this experiment, by recoding the animals within blocks of chem1*chem2 and adding an interaction term. It gets a bit trickier to form contrasts, but I will show how to do this.

It helps to clean up the colData a bit, I'll show on a clean version, and you can reorder your samples correspondingly. Note that you should reorder the columns of the count matrix so that they match. If you have a SummarizedExperiment, the colData and matrix are tied together, so simply reordering the columns of the SummarizedExperiment does both for you.

I made a sample data.frame like your data:

Code:
d <- data.frame(anim=factor(rep(1:12,each=3)),
  chem1=factor(rep(rep(c("N","Y"),each=6),3)),
  chem2=factor(rep(rep(c("N","Y"),each=3),6)),
  tiss=factor(rep(1:3,12)))
this looks like

Code:
    anim chem1 chem2 tiss
 1     1     N     N    1
 2     1     N     N    2
 3     1     N     N    3
 4     2     N     Y    1
 5     2     N     Y    2
 6     2     N     Y    3
 7     3     Y     N    1
 8     3     Y     N    2
 9     3     Y     N    3
 10    4     Y     Y    1
 11    4     Y     Y    2
 12    4     Y     Y    3
 13    5     N     N    1
 14    5     N     N    2
 15    5     N     N    3
 16    6     N     Y    1
 17    6     N     Y    2
 18    6     N     Y    3
 19    7     Y     N    1
 20    7     Y     N    2
 21    7     Y     N    3
 22    8     Y     Y    1
 23    8     Y     Y    2
 24    8     Y     Y    3
 25    9     N     N    1
 26    9     N     N    2
 27    9     N     N    3
 28   10     N     Y    1
 29   10     N     Y    2
 30   10     N     Y    3
 31   11     Y     N    1
 32   11     Y     N    2
 33   11     Y     N    3
 34   12     Y     Y    1
 35   12     Y     Y    2
 36   12     Y     Y    3
Now reorder by chem1, chem2 and tiss. Then add a new variable which will be used to control animal specific effects within each block of chem1*chem2.

Code:
idx <- with(d, order(chem1, chem2, tiss))
d <- d[idx,]
d$anim.nested <- factor(rep(1:3,12))
You can examine d to see how this works.

The model matrix is then the chem1 and chem2 terms and interaction, as well as the interactions of these with animal, and the interaction of these with tissue:

Code:
m <- model.matrix(~ chem1 + chem2 + chem1:chem2 + chem1:chem2:anim.nested + chem1:chem2:tiss, data=d)
The column names of this are:

Code:
> colnames(m)
  [1] "(Intercept)"                "chem1Y"
  [3] "chem2Y"                     "chem1Y:chem2Y"
  [5] "chem1N:chem2N:anim.nested2" "chem1Y:chem2N:anim.nested2"
  [7] "chem1N:chem2Y:anim.nested2" "chem1Y:chem2Y:anim.nested2"
  [9] "chem1N:chem2N:anim.nested3" "chem1Y:chem2N:anim.nested3"
 [11] "chem1N:chem2Y:anim.nested3" "chem1Y:chem2Y:anim.nested3"
 [13] "chem1N:chem2N:tiss2"        "chem1Y:chem2N:tiss2"
 [15] "chem1N:chem2Y:tiss2"        "chem1Y:chem2Y:tiss2"
 [17] "chem1N:chem2N:tiss3"        "chem1Y:chem2N:tiss3"
 [19] "chem1N:chem2Y:tiss3"        "chem1Y:chem2Y:tiss3"

Many of the contrasts become hard to form, though possible with numeric contrasts. A trick for getting the numeric contrasts of the other comparisons, is to take the column means of the model matrix, for the groups of samples you want to compare. For instance, if you want to find the main chem1 effect in tissue 2, the contrast would be:

Code:
ctrst = colMeans(m[d$tiss == "2" & d$chem1 == "Y" & d$chem2 == "N",]) -
        colMeans(m[d$tiss == "2" & d$chem1 == "N" & d$chem2 == "N",])
This ends up being the chem1 base effect, the chem1Y effect in tissue 2 minus the chem1N effect in tissue 2 and a few 1/3 and -1/3 to the animal interaction effects.

Edit: note that interaction terms can be tested using the same logic, but you need to subtract the groups with only one chem and then add the base group. So it looks like: chem1Y chem2Y - chem1N chem2Y - chem1Y chem2N + chem1N chem2N.

Also, I had to update that custom line of size factor code for when all rows have a zero. If you were using that line of code to avoid the issue that all rows of your count matrix have one or more zeros, you should get the edited version:

https://support.bioconductor.org/p/62246/

We're implementing a solution here which was on the list of todos for a while now, but it won't be out for 6 months. Typical RNA-Seq doesn't have this issue, but since phyloseq we are seeing more metagenomics datasets which often have no genes without a zero.

Last edited by Michael Love; 10-27-2014 at 02:24 PM. Reason: added info on interaction
Michael Love is offline   Reply With Quote
Old 10-27-2014, 02:17 PM   #14
gamb
Member
 
Location: USA

Join Date: Oct 2014
Posts: 11
Default

Ok! So, just checking in to make sure this is right... I reformatted the data as described named "d", "bbb" is a phyloseq data object, and "m" is the matrix model.

Code:
counts=as.data.frame(otu_table(bbb))
dds=DESeqDataSetFromMatrix(countData=counts, colData=d, design=~ chem1 + chem2 + chem1:chem2 + chem1:chem2:anim.nested + chem1:chem2:tiss)

cts = counts(dds)
geoMeans = apply(cts, 1, function(row) mean(log(row[row != 0])))
dds = estimateSizeFactors(dds, geoMeans=geoMeans)
bdds=DESeq(dds, test = "Wald", fitType = "local", betaPrior=FALSE)
resultsNames(bdds)
And for contrast example.....
Code:
ctrst = colMeans(m[d$Type == “2” & d$chem1 == "Y" & d$chem2 == "N",]) -
        colMeans(m[d$Type == “2” & d$chem1 == "N" & d$chem2 == "N",])
        

res=results(bdds, contrast=ctrst)

Does this seem right? Also, I've noticed I get quite a few segfaults within R when running the deseq2/phyloseq packages... is this just unique to my setup?
gamb is offline   Reply With Quote
Old 10-27-2014, 02:23 PM   #15
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

The geoMeans code you have is the old one, where I forgot to exponentiate the mean of log counts. See the link in my previous reply.

Otherwise seems right. Be very careful that the columns of your count matrix line up with colData.

I've never seen or heard of a segfault from our code. Do you have a normal release of R or did you build yourself?
Michael Love is offline   Reply With Quote
Old 10-27-2014, 02:37 PM   #16
gamb
Member
 
Location: USA

Join Date: Oct 2014
Posts: 11
Default

Dang it! Good catch, my columns did not match up.

When I get another error I'll screenshot it and post it here. They seem to pop-up at will. I have a normal release of R and the errors occur at my lab computer + personal computer.
gamb is offline   Reply With Quote
Old 10-27-2014, 03:15 PM   #17
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

A safer way to reorder columns would be to build the DESeqDataSet with a design of ~ 1, then rearrange the columns of the DESeqDataSet, then add a new column anim.nested, then add the desired design. What do you mean "at will"? After running a particular function? At what points in the analysis to you get these?
Michael Love is offline   Reply With Quote
Old 10-27-2014, 03:24 PM   #18
gamb
Member
 
Location: USA

Join Date: Oct 2014
Posts: 11
Default

Quote:
What do you mean "at will"? After running a particular function? At what points in the analysis to you get these?
It's been pretty random when they happen- usually when I'm not running code (often just editing script files prior to running them) so I haven't been able to track down what particular thing is triggering them. All I do know is that they happen in sessions when I've loaded those packages. I think it might be tied into the phyloseq data structures.
gamb is offline   Reply With Quote
Old 10-27-2014, 03:30 PM   #19
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Yeah, that won't be from DESeq2 then. And maybe ask the phyloseq maintainers if they have seen this on their github site, although their data structures are data.frame and matrix so I wouldn't know why there would be a problem there. If you are loading objects from memory at the beginning those could be corrupted.
Michael Love is offline   Reply With Quote
Old 10-27-2014, 04:33 PM   #20
gamb
Member
 
Location: USA

Join Date: Oct 2014
Posts: 11
Default

So I got R to throw up this error (not a segfault...these don't happen too often) right after resultsNames(dds)

Code:
2014-10-27 19:31:47.892 R[5511:707] *** RController: caught ObjC exception while processing system events. Update to the latest GUI version and consider reporting this properly (see FAQ) if it persists and is not known. 
*** reason: *** -[__NSArrayM objectAtIndex:]: index 18446744073709551615 beyond bounds [0 .. 85]
*** name: NSRangeException, info: (null)
*** Version: R 3.1.1 () R.app R 3.1.1 GUI 1.65 Mavericks build(null)
Consider saving your work soon in case this develops into a problem.
gamb is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:01 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO