SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DESeq2 error : the model matrix is not full rank NicoBxl Bioinformatics 9 03-14-2017 06:10 AM
model matrix design for multiple siRNAs targeting same gene Brandi Bioinformatics 2 02-27-2014 04:18 AM
EdgeR model matrix interpretation sindrle Bioinformatics 4 10-24-2013 03:28 AM
FPKM formula g781 RNA Sequencing 2 10-18-2013 01:41 AM
SNPdat: Easy annotation of novel and known SNPs from model and non-model organisms d1antho Bioinformatics 0 03-15-2013 08:58 AM

Reply
 
Thread Tools
Old 10-22-2014, 10:50 PM   #1
dmosk
Junior Member
 
Location: California

Join Date: Oct 2014
Posts: 8
Default DESeq2 model matrix formula

Hi all,

I am trying to figure out the correct formula to use for my model in an RNA-seq experiment I am analyzing with DESeq2.

The experimental design covers 10 patients, done in 4 batches. No single batch is comprised solely of a single patient. The patients are distributed equally between two groups. Group 1 contains 2 male and 3 female patients and Group 2 is 3 male/2 female. 1 G1 male and 1 G2 male are Asian; the remainder of the patients are Caucasian. From each patient, we have collected 3 different tissue types.

The primary goals of the experiment are to: (1) understand how each tissue differs between groups and (2) understand how the tissues differ from one another within groups. The gender, race, and batch effects are not of particular importance, but I'd like to be able to control for them. We have a moderate expectation that there will be differences between genders; whether race will have an effect is less clear.

I initially tried "~ patient + batch + ethnicity + gender + group + tissue + group:tissue" but, aside from the fact that I'm unsure whether this is correct, my design matrix is degenerate. No matter what variables I drop or how I add interaction terms, R always tells me that my design matrix has rank (at most) 14.

Eventually, I followed the example given in section 3.5 of the edgeR manual (even though I am using DESeq2), using only "~ group + group:patient + group:tissue" and blocking the patients within the groups. This design matrix does have 14 columns and is full rank, but I am not using the gender/batch/ethnicity information, and it is unclear to me whether I can accurately assess differences between two tissues within the same group: I know different groups within a tissue should be group1 + group1.tissue1 versus group2 + group2.tissue1, but I cannot likewise do tissue1 + group1.tissue1 versus tissue2 + group1.tissue2 because I am not fitting main effects for the tissues.

I feel like the complication here is that there are multiple blocking variables, and I have both within- and between-group comparisons. I thought about trying to use sva or RUVseq to try to remove the effects of the variables that are not of real interest, but using sva runs into the same problem of needing to have a full model with those variables included, and using RUVseq doesn't take advantage of the known relationships between the variables and the samples (plus I don't have any well defined control genes).

Here is my actual sample matrix, if it's helpful (note that one of the 30 samples was removed for QC reasons):
Code:
group	patient	gender	ethnicity	tissue	batch
G2	X30844	M	Asian	T2	Batch_1
G2	X30844	M	Asian	T3	Batch_1
G2	X30844	M	Asian	T1	Batch_1
G2	X30855	M	Caucasian	T2	Batch_1
G2	X30855	M	Caucasian	T3	Batch_1
G2	X30855	M	Caucasian	T1	Batch_1
G1	X30999	F	Caucasian	T2	Batch_3
G1	X30999	F	Caucasian	T3	Batch_3
G1	X30999	F	Caucasian	T1	Batch_3
G1	X31002	F	Caucasian	T2	Batch_3
G1	X31002	F	Caucasian	T3	Batch_3
G1	X31122	F	Caucasian	T2	Batch_4
G1	X31122	F	Caucasian	T3	Batch_4
G1	X31122	F	Caucasian	T1	Batch_4
G2	X31132	M	Caucasian	T2	Batch_4
G2	X31132	M	Caucasian	T3	Batch_4
G2	X31132	M	Caucasian	T1	Batch_4
G1	X31134	M	Asian	T2	Batch_4
G1	X31134	M	Asian	T3	Batch_4
G1	X31134	M	Asian	T1	Batch_4
G1	X31188	M	Caucasian	T2	Batch_5
G1	X31188	M	Caucasian	T3	Batch_5
G1	X31188	M	Caucasian	T1	Batch_5
G2	X31193	F	Caucasian	T2	Batch_5
G2	X31193	F	Caucasian	T3	Batch_5
G2	X31193	F	Caucasian	T1	Batch_5
G2	X31195	F	Caucasian	T2	Batch_5
G2	X31195	F	Caucasian	T3	Batch_5
G2	X31195	F	Caucasian	T1	Batch_5
Thanks so much for your help and sorry about the long post!
dmosk is offline   Reply With Quote
Old 10-23-2014, 07:08 AM   #2
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi,

The patient effect is more fine grained than a number of other effects: gender, batch, ethnicity. So by including the patient effect as a term in the design, you control for these differences as well. That is the reason you cannot build a full rank design matrix which includes patient, gender, batch and ethnicity.

I would recommend you use the design:

Code:
~ patient + tissue + group + tissue:group
Your questions of interest are:

(1) understand how each tissue differs between groups and (2) understand how the tissues differ from one another within groups.



The first question is found by investigating the interaction effects. For instance

Code:
results(dds, contrast=list("tissueT1.groupG2","tissueT1.groupG1"))
tests the interaction, i.e. it tests for genes which show a different G2 vs G1 effect in tissue T1 compared to the general G2 vs G1 effect.

The group G2 vs G1 effect in tissue T1 is this interaction effect above plus the main G2 vs G1 effect:

Code:
results(dds, contrast=list(c("tissueT1.groupG2","groupG2"),
c("tissueT1.groupG1","groupG1")))
so adding the interaction and the main effect tests for genes which show any G2 vs G1 effect in tissue T1, whether because there is only a main G2 vs G1 effect, or only a T1-specific G2 vs G1 effect, or there is a main effect and a specific effect which are added together. (If this is confusing, check a statistical reference on interaction terms).



The second question is found using the same logic. For instance

Code:
results(dds, contrast=list("tissueT2.groupG1","tissueT1.groupG1"))
tests for genes which show a group-G1-specific tissue T2 vs T1 effect beyond the main T2 vs T1 effect.

To build the T2 vs T1 effect in group G1, you need to add the interaction effect and the main T2 vs T1 effect:

Code:
results(dds, contrast=list(c("tissueT2.groupG1","tissueT2"),
c("tissueT1.groupG1","tissueT1")))
so this test for those genes which show a tissue T2 vs T1 effect in group G1.
Michael Love is offline   Reply With Quote
Old 10-23-2014, 07:44 AM   #3
dmosk
Junior Member
 
Location: California

Join Date: Oct 2014
Posts: 8
Default

Hi Michael,

Thanks for the reply.

I had already tried using the design you suggested (as I agree it is reasonable), but it gives me a degenerate matrix (and DESeq2 does report this):
Code:
samples <- read.table('samples.txt')
modelMatrix <- model.matrix(~ patient + tissue + group + tissue:group, data=as.data.frame(samples))
qr(modelMatrix)$rank
ncol(modelMatrix)
gives rank 14 on a 15-column matrix (where samples.txt is the sample matrix from my first post). It's unclear to me where the linear dependence is arising, but it's hard to argue with R.

In response to your other comments about the contrasts, I understand the distinction between the different testing schema. We are interested in effects across groups/tissues irrespective of whether they are unique to the interaction, which is why I was using the sum of the interaction and the main effect, and that is why I was concerned about not fitting a main effect for the tissue. Of course, interaction-specific effects are interesting, as well, just not solely so in this case.

Do you have any ideas as to the origin of the degeneracy in my model matrix and how I might rectify it? I've racked my brain trying to figure out the problem here but to no avail...

Thanks again for your help!
dmosk is offline   Reply With Quote
Old 10-23-2014, 08:07 AM   #4
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

You're right, I forgot that group is also a linear combination of patient. You can just remove the main group effect then, as this is already accounted for by patient. This gives you:

Code:
~ patient + tissue + tissue:group
Now the only unfortunate thing is that the main group G2 vs G1 effect has to be written in terms of patient. The interactions could be written using the code above, but to add the interaction to the main effect you need to build the main effect from the patients.

G2 vs G1 is:

(X30844 + X30855 + X31132 + X31193 + X31195)/5 - (X30999 + X31002 + X31122 + X31134 + X31188)/5

So you can't use the list() style contrast anymore, but instead you have to use the numeric argument to contrast. The numeric should correspond to the order of resultsNames(dds). Give the patients a 1/5 or -1/5 if they are in G2 or G1 respectively. Give the interaction of interest a 1 and -1 for the numerator term and the denominator term respectively. See ?results for more info on specifying numeric contrasts.
Michael Love is offline   Reply With Quote
Old 10-23-2014, 11:09 AM   #5
dmosk
Junior Member
 
Location: California

Join Date: Oct 2014
Posts: 8
Default

Thanks, Michael! You are completely right that the groups are linear combinations of the patients - that slipped right past me. Unfortunately, the model you suggested is still giving me a 15-column matrix (and it's thus degenerate).

Your comment about numeric contrasts gave me an idea, though. As I mentioned in my first post, the only full-rank model matrix I had successfully created was by blocking the patients in the groups (as two sets of X1 through X5, with one set for each group), and using formula "~ group + group:patient + group:tissue." I could then get at the main effects for the tissues by averaging over the group:tissue interactions for a given tissue type. Does that sound reasonable?

Also, one thing I didn't mention is that we do have a hypothesis that, ideally, we'd like to test, in that patients in one group are expected to be more variable than those in the other group. Without fitting a batch effect, I don't think this can be gotten at, since some batches are comprised of a single group. Is there any way I can solve this problem, or is it impossible to fully extricate the batch effect from those of the other variables?
dmosk is offline   Reply With Quote
Old 10-23-2014, 11:45 AM   #6
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

I think the problem with that design is that that model equates patient X1 in group 1 with patient X1 in group 2 which is not true (check if you see a patientX1.groupG2 column). Also tissue T1 in group 1 with tissue T1 in group 2 (check if you see a tissueT1.group2 column). I need to think more, but ATM I can't think of a model with interactions that has less parameters to fit than samples.
Michael Love is offline   Reply With Quote
Old 10-23-2014, 11:46 AM   #7
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

To explore the question of variability, I would look at PCA plots and color by group. Tests of differential variability exist, but that's not part of DESeq2.
Michael Love is offline   Reply With Quote
Old 10-23-2014, 12:57 PM   #8
dmosk
Junior Member
 
Location: California

Join Date: Oct 2014
Posts: 8
Default

There is a groupG1.patientX2 column, but my feeling was that it's OK to do it like this because I'm not fitting a main effect for the patient variable, so the patient effects are compartmentalized to being within-groups (since the patient:group interaction is the only place where the patient term appears in the model). I realize I can't, for instance, average over groupG1.patientX2 and groupG2.patientX2 to get a patientX2 effect, but I was under the impression that the groups' main effects and the group:tissue interaction effects were valid and disentangled from patient-specific variation. (There is a also a groupG2.tissueT1 column but I'm unclear as to why this is a problem; it is the same set of three tissues that are being extracted from each patient, regardless of group.) Are my assumptions incorrect?

Also, what's really escaping me right now is why the model you proposed doesn't give a full-rank design matrix/why the experimental design is insufficient to fully specify the variables of interest and their interactions, as in the "~ patient + tissue + tissue:group" formula you suggested.

I really appreciate all your advice and support. Needless to say, I have found this problem to be frustratingly challenging, and your guidance is very helpful!
dmosk is offline   Reply With Quote
Old 10-23-2014, 01:10 PM   #9
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Can you print the column names of the model matrix with your suggested design?

re: why the model doesn't give full rank. I think it is because you have the missing sample and that if you excluded the other samples from that patient it would be full rank.
Michael Love is offline   Reply With Quote
Old 10-23-2014, 02:17 PM   #10
dmosk
Junior Member
 
Location: California

Join Date: Oct 2014
Posts: 8
Default

Sure - this is "~ group + group:patient + group:tissue" after replacing the patient names with X1 through X5 for G1, and likewise for G2:
Code:
> colnames(modelMatrix)
 [1] "(Intercept)"       "groupG2"           "groupG1:patientX2"
 [4] "groupG2:patientX2" "groupG1:patientX3" "groupG2:patientX3"
 [7] "groupG1:patientX4" "groupG2:patientX4" "groupG1:patientX5"
[10] "groupG2:patientX5" "groupG1:tissueT2"  "groupG2:tissueT2" 
[13] "groupG1:tissueT3"  "groupG2:tissueT3"
I took this idea from section 3.5 of edgeR's vignette, which gave an example for having both between- and within-group comparisons, as there are here.

I had actually already tried adding back in that one patient's missing sample but the matrix with "~ patient + tissue + tissue:group" remained degenerate. Only by adding an additional patient to each group was I able to create a full-rank design with that formula. I could try to lobby to get two more samples sequenced if the data I have is definitely insufficient to appropriately model the variables of interest.

To get back to the group-level variability, I realize that I can't test for that directly within DESeq2. The reason I mentioned it is that I would want to base whatever metric I use on the normalized values given by the software, but the normalized values for e.g. X30844 and X30855 are intertwined with the effects of batch 2, which is comprised entirely of those two samples, if I don't model batch effects. Since I can't model batch effects directly in my formula, could I attempt to remove them first with sva/Combat, or would that violate the assumptions of DESeq2 (even though there wouldn't be a batch term in the formula)?
dmosk is offline   Reply With Quote
Old 10-24-2014, 06:04 AM   #11
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Do you see what I meant above though? The problem with your design is that there is no "patientX1:groupG2" effect or equivalently "groupG2:patientX1". This means that patient X1 is considered the same effect in the two groups (which is a problem because they are not even the same patient).

Similarly there is no "groupG2:tissueT1". So your model also assumes tissue T1 is the same effect in the two groups.

I can't think of a way to set up the model with interactions without equating the base levels of your factors across groups, which is not a good assumption to make.

Last edited by Michael Love; 10-24-2014 at 06:38 AM.
Michael Love is offline   Reply With Quote
Old 10-24-2014, 06:05 AM   #12
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Note that any model which contains "patient" as a main effect controls for batch, because the batches are all composed of the sums of patient columns.
Michael Love is offline   Reply With Quote
Old 10-24-2014, 10:32 AM   #13
dmosk
Junior Member
 
Location: California

Join Date: Oct 2014
Posts: 8
Default

You're right about the patient terms controlling for batch effects - I was thinking that I'd want to use the patients' coefficients to assess in-group variability but I realize now that I wouldn't need to.

I do see that the design I gave does not explicitly have a term for G2:X1, but isn't this implicit? The intercept column minus the sum of the group:patient columns gives G2:X1, and likewise for G2:T1 (although I remain confused as to why the latter would be an issue either way, given that it is the same tissue type in each group). Accordingly, if I were to look at resultsNames(.) of the associated DESeq object, I would see, in the expanded model, a G2:X1 term.

Going back to the model you suggested, I have two questions:
First, I must have made some error when I claimed that, by adding in an additional patient to each group, I was able to achieve a full-rank design. I have tried to replicate this but I have been unsuccessful - I just tried simulating the addition of four patients to each group and the matrix was still degenerate. Is there some inherent linear dependency that's still present there?
Second, with that model, building the main effect for the groups by averaging over the patients' coefficients concerns me. Wouldn't the group effects then be confounded by patient-specific effects?

Thanks again, Michael. I'd be completely stuck without your assistance!
dmosk is offline   Reply With Quote
Old 10-24-2014, 11:09 AM   #14
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Yes, you're right, sorry it took me a while to see what kind of matrix you were getting.

And my suggested design doesn't work because sum(patient 6-10) = T1:G2 + T2:G2 + T3:G2.

So then you should go ahead and use the ~ group + group:patient + group:tissue.

Given the complexity, I'd use betaPrior=FALSE to keep the coefficients the same as the standard ones returned by model.matrix.

You can use numeric contrasts then to answer your questions. One way that i've taught how to generate the numeric contrast is to form the column mean of each group. For example, if you want to compare tissue 1 across the two groups, the column mean of the tissue 1 samples in group 1 (assuming you had that extra sample) is

Code:
1.0 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.0 0.0 0.0 0.0
and for group 2 this is

Code:
1.0 1.0 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.0 0.0 0.0
the second minus the first gives the contrast:

Code:
0.0  1.0 -0.2  0.2 -0.2  0.2 -0.2  0.2 -0.2  0.2  0.0  0.0  0.0  0.0
I think this should correspond to your colnames, but check if you get the same. The numbers will be adjusted given you are missing one T1 sample.
Michael Love is offline   Reply With Quote
Old 10-24-2014, 11:14 AM   #15
dmosk
Junior Member
 
Location: California

Join Date: Oct 2014
Posts: 8
Default

OK I think I figured out the linear dependence in "~ patient + tissue + tissue:group" - it's between patient and tissue:group --
Because each patient is in one group and produces each tissue, the sum of the interaction terms over each tissue for a given group, minus all the patients in that group except one, equals the remaining patient. If I had two only two tissues and four patients (P1-P4; two in each group), then e.g. T1:G2 + T2:G2 - P4 = P3. For clarity, I simulated this:
Code:
> modelMatrix <- model.matrix(~ patient + tissue:group, data=as.data.frame(samples))
> modelMatrix
                (Intercept) patientP2 patientP3 patientP4 tissueT1:groupG1
30844_CD8_CM              1         0         0         0                0
30844_CD8_Naive           1         0         0         0                1
30855_CD8_CM              1         1         0         0                0
30855_CD8_Naive           1         1         0         0                1
30999_CD8_CM              1         0         1         0                0
30999_CD8_Naive           1         0         1         0                0
31002_CD8_CM              1         0         0         1                0
31002_CD8_Naive           1         0         0         1                0
                tissueT2:groupG1 tissueT1:groupG2 tissueT2:groupG2
30844_CD8_CM                   1                0                0
30844_CD8_Naive                0                0                0
30855_CD8_CM                   1                0                0
30855_CD8_Naive                0                0                0
30999_CD8_CM                   0                0                1
30999_CD8_Naive                0                1                0
31002_CD8_CM                   0                0                1
31002_CD8_Naive                0                1                0
Then, if I can't directly model both the individual patient effects and the tissue:group interaction terms at the same time, is there another model that would be appropriate? Are you definitely convinced that the model from the edgeR manual is incorrect?
dmosk is offline   Reply With Quote
Old 10-24-2014, 11:21 AM   #16
dmosk
Junior Member
 
Location: California

Join Date: Oct 2014
Posts: 8
Default

I see you figured it out while I was still typing my response

Thanks so much! I was actually also going to ask about whether betaPrior should be set to false in this case. I do have one more question, though: Is it valid to average over T1:G1 and T1:G2 to claim a main effect for T1, or is this inestimable given the coefficients in the model?

ETA: To clarify, I mean if I wanted to compare T1 to T2, could I average over the interactions similarly to how you showed in your example, or can I only compare T1 to T2 within G1 or G2?

Last edited by dmosk; 10-24-2014 at 11:26 AM. Reason: added clarification
dmosk is offline   Reply With Quote
Old 10-24-2014, 11:33 AM   #17
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Yes, this is a valid way to generate the T1 effect, and the fold change should end up being the average of the group2 common scale counts over the group1 counts for tissue T1. However, the dispersion estimates will be more accurate than with a model that doesn't include the patient effect, because in that case, patient specific differences would end up looking like higher dispersion. Another way this could be modeled is with mixed effects models, with the patient as a random effect. But this is how you should set up the contrast for modeling patient as fixed effect.
Michael Love is offline   Reply With Quote
Old 10-30-2014, 12:55 PM   #18
lmolokin
Member
 
Location: Beltsville, MD, USA

Join Date: Jul 2012
Posts: 24
Default design not of full rank

Hi,
I'm attempting to use the same formula as above (~treatment + treatment:subject + treatment:time) to get within treatment time effects while accounting for subject differences. The problem I've run into is that the formula creates an effect that doesn't exist due to one less subject in the treated group vs control. In edgeR, I can remove the non-existent effect by simply removing the column from the design matrix but I'm not sure how to do that in DESeq2.

Here's the code I use for creating the dds:

Code:
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = designframe,
                              formula(~tx+tx:subj+tx:time)
Is there a way to use model.matrix with the DESeqDataSetFromMatrix function so I can specify the truncated, full rank design?

Can I use ignoreRank for this situation?

Thanks.
lmolokin is offline   Reply With Quote
Old 10-30-2014, 01:00 PM   #19
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Don't use ignoreRank, that'll lead to trouble.

I answered this in another thread today, should work for you:

http://seqanswers.com/forums/showthread.php?t=47881

In this development cycle, I will bump up the modelMatrix argument to higher level functions to make things simpler. I didn't realize until after we'd designed the interface that model.matrix() will create a matrix with a column of 0's.
Michael Love is offline   Reply With Quote
Old 10-30-2014, 01:27 PM   #20
lmolokin
Member
 
Location: Beltsville, MD, USA

Join Date: Jul 2012
Posts: 24
Default

Thanks for the quick reply.

I see how you use modelMatrix after already having created the dds object. But how can I initially create the dds object if I can't specify the model within it?

When I input:
Code:
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = designframe,
                              formula(~tx+tx:subj+tx:time))
I get the error:
"Error in DESeqDataSet(se, design = design, ignoreRank) :
the model matrix is not full rank, so the model cannot be fit as specified.
one or more variables or interaction terms in the design formula
are linear combinations of the others and must be removed"

Thanks!
lmolokin is offline   Reply With Quote
Reply

Tags
deseq2, formula, model.matrix, rna-seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:08 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO