Seqanswers Leaderboard Ad

**Michael Love** · 10-23-2014, 07:08 AM

hi,

The patient effect is more fine grained than a number of other effects: gender, batch, ethnicity. So by including the patient effect as a term in the design, you control for these differences as well. That is the reason you cannot build a full rank design matrix which includes patient, gender, batch and ethnicity.

I would recommend you use the design:

Code:

~ patient + tissue + group + tissue:group

Your questions of interest are:

(1) understand how each tissue differs between groups and (2) understand how the tissues differ from one another within groups.

The first question is found by investigating the interaction effects. For instance

Code:

results(dds, contrast=list("tissueT1.groupG2","tissueT1.groupG1"))

tests the interaction, i.e. it tests for genes which show a different G2 vs G1 effect in tissue T1 compared to the general G2 vs G1 effect.

The group G2 vs G1 effect in tissue T1 is this interaction effect above plus the main G2 vs G1 effect:

Code:

results(dds, contrast=list(c("tissueT1.groupG2","groupG2"),
c("tissueT1.groupG1","groupG1")))

so adding the interaction and the main effect tests for genes which show any G2 vs G1 effect in tissue T1, whether because there is only a main G2 vs G1 effect, or only a T1-specific G2 vs G1 effect, or there is a main effect and a specific effect which are added together. (If this is confusing, check a statistical reference on interaction terms).

The second question is found using the same logic. For instance

Code:

results(dds, contrast=list("tissueT2.groupG1","tissueT1.groupG1"))

tests for genes which show a group-G1-specific tissue T2 vs T1 effect beyond the main T2 vs T1 effect.

To build the T2 vs T1 effect in group G1, you need to add the interaction effect and the main T2 vs T1 effect:

Code:

results(dds, contrast=list(c("tissueT2.groupG1","tissueT2"),
c("tissueT1.groupG1","tissueT1")))

so this test for those genes which show a tissue T2 vs T1 effect in group G1.

**dmosk** · 10-23-2014, 07:44 AM

Hi Michael,

Thanks for the reply.

I had already tried using the design you suggested (as I agree it is reasonable), but it gives me a degenerate matrix (and DESeq2 does report this):

Code:

samples <- read.table('samples.txt')
modelMatrix <- model.matrix(~ patient + tissue + group + tissue:group, data=as.data.frame(samples))
qr(modelMatrix)$rank
ncol(modelMatrix)

gives rank 14 on a 15-column matrix (where samples.txt is the sample matrix from my first post). It's unclear to me where the linear dependence is arising, but it's hard to argue with R.

In response to your other comments about the contrasts, I understand the distinction between the different testing schema. We are interested in effects across groups/tissues irrespective of whether they are unique to the interaction, which is why I was using the sum of the interaction and the main effect, and that is why I was concerned about not fitting a main effect for the tissue. Of course, interaction-specific effects are interesting, as well, just not solely so in this case.

Do you have any ideas as to the origin of the degeneracy in my model matrix and how I might rectify it? I've racked my brain trying to figure out the problem here but to no avail...

Thanks again for your help!

**Michael Love** · 10-23-2014, 08:07 AM

You're right, I forgot that group is also a linear combination of patient. You can just remove the main group effect then, as this is already accounted for by patient. This gives you:

Code:

~ patient + tissue + tissue:group

Now the only unfortunate thing is that the main group G2 vs G1 effect has to be written in terms of patient. The interactions could be written using the code above, but to add the interaction to the main effect you need to build the main effect from the patients.

G2 vs G1 is:

(X30844 + X30855 + X31132 + X31193 + X31195)/5 - (X30999 + X31002 + X31122 + X31134 + X31188)/5

So you can't use the list() style contrast anymore, but instead you have to use the numeric argument to contrast. The numeric should correspond to the order of resultsNames(dds). Give the patients a 1/5 or -1/5 if they are in G2 or G1 respectively. Give the interaction of interest a 1 and -1 for the numerator term and the denominator term respectively. See ?results for more info on specifying numeric contrasts.

**dmosk** · 10-23-2014, 11:09 AM

Thanks, Michael! You are completely right that the groups are linear combinations of the patients - that slipped right past me. Unfortunately, the model you suggested is still giving me a 15-column matrix (and it's thus degenerate).

Your comment about numeric contrasts gave me an idea, though. As I mentioned in my first post, the only full-rank model matrix I had successfully created was by blocking the patients in the groups (as two sets of X1 through X5, with one set for each group), and using formula "~ group + group

atient + group:tissue." I could then get at the main effects for the tissues by averaging over the group:tissue interactions for a given tissue type. Does that sound reasonable?

Also, one thing I didn't mention is that we do have a hypothesis that, ideally, we'd like to test, in that patients in one group are expected to be more variable than those in the other group. Without fitting a batch effect, I don't think this can be gotten at, since some batches are comprised of a single group. Is there any way I can solve this problem, or is it impossible to fully extricate the batch effect from those of the other variables?

**Michael Love** · 10-23-2014, 11:45 AM

I think the problem with that design is that that model equates patient X1 in group 1 with patient X1 in group 2 which is not true (check if you see a patientX1.groupG2 column). Also tissue T1 in group 1 with tissue T1 in group 2 (check if you see a tissueT1.group2 column). I need to think more, but ATM I can't think of a model with interactions that has less parameters to fit than samples.

**Michael Love** · 10-23-2014, 11:46 AM

To explore the question of variability, I would look at PCA plots and color by group. Tests of differential variability exist, but that's not part of DESeq2.

**dmosk** · 10-23-2014, 12:57 PM

There is a groupG1.patientX2 column, but my feeling was that it's OK to do it like this because I'm not fitting a main effect for the patient variable, so the patient effects are compartmentalized to being within-groups (since the patient:group interaction is the only place where the patient term appears in the model). I realize I can't, for instance, average over groupG1.patientX2 and groupG2.patientX2 to get a patientX2 effect, but I was under the impression that the groups' main effects and the group:tissue interaction effects were valid and disentangled from patient-specific variation. (There is a also a groupG2.tissueT1 column but I'm unclear as to why this is a problem; it is the same set of three tissues that are being extracted from each patient, regardless of group.) Are my assumptions incorrect?

Also, what's really escaping me right now is why the model you proposed doesn't give a full-rank design matrix/why the experimental design is insufficient to fully specify the variables of interest and their interactions, as in the "~ patient + tissue + tissue:group" formula you suggested.

I really appreciate all your advice and support. Needless to say, I have found this problem to be frustratingly challenging, and your guidance is very helpful!

**Michael Love** · 10-23-2014, 01:10 PM

Can you print the column names of the model matrix with your suggested design?

re: why the model doesn't give full rank. I think it is because you have the missing sample and that if you excluded the other samples from that patient it would be full rank.

**dmosk** · 10-23-2014, 02:17 PM

Sure - this is "~ group + group

atient + group:tissue" after replacing the patient names with X1 through X5 for G1, and likewise for G2:

Code:

> colnames(modelMatrix)
 [1] "(Intercept)"       "groupG2"           "groupG1:patientX2"
 [4] "groupG2:patientX2" "groupG1:patientX3" "groupG2:patientX3"
 [7] "groupG1:patientX4" "groupG2:patientX4" "groupG1:patientX5"
[10] "groupG2:patientX5" "groupG1:tissueT2"  "groupG2:tissueT2" 
[13] "groupG1:tissueT3"  "groupG2:tissueT3"

I took this idea from section 3.5 of edgeR's vignette, which gave an example for having both between- and within-group comparisons, as there are here.

I had actually already tried adding back in that one patient's missing sample but the matrix with "~ patient + tissue + tissue:group" remained degenerate. Only by adding an additional patient to each group was I able to create a full-rank design with that formula. I could try to lobby to get two more samples sequenced if the data I have is definitely insufficient to appropriately model the variables of interest.

To get back to the group-level variability, I realize that I can't test for that directly within DESeq2. The reason I mentioned it is that I would want to base whatever metric I use on the normalized values given by the software, but the normalized values for e.g. X30844 and X30855 are intertwined with the effects of batch 2, which is comprised entirely of those two samples, if I don't model batch effects. Since I can't model batch effects directly in my formula, could I attempt to remove them first with sva/Combat, or would that violate the assumptions of DESeq2 (even though there wouldn't be a batch term in the formula)?

**Michael Love** · 10-24-2014, 06:04 AM

Do you see what I meant above though? The problem with your design is that there is no "patientX1:groupG2" effect or equivalently "groupG2

atientX1". This means that patient X1 is considered the same effect in the two groups (which is a problem because they are not even the same patient).

Similarly there is no "groupG2:tissueT1". So your model also assumes tissue T1 is the same effect in the two groups.

I can't think of a way to set up the model with interactions without equating the base levels of your factors across groups, which is not a good assumption to make.

**Michael Love** · 10-24-2014, 06:05 AM

Note that any model which contains "patient" as a main effect controls for batch, because the batches are all composed of the sums of patient columns.

**dmosk** · 10-24-2014, 10:32 AM

You're right about the patient terms controlling for batch effects - I was thinking that I'd want to use the patients' coefficients to assess in-group variability but I realize now that I wouldn't need to.

I do see that the design I gave does not explicitly have a term for G2:X1, but isn't this implicit? The intercept column minus the sum of the group

atient columns gives G2:X1, and likewise for G2:T1 (although I remain confused as to why the latter would be an issue either way, given that it is the same tissue type in each group). Accordingly, if I were to look at resultsNames(.) of the associated DESeq object, I would see, in the expanded model, a G2:X1 term.

Going back to the model you suggested, I have two questions:
First, I must have made some error when I claimed that, by adding in an additional patient to each group, I was able to achieve a full-rank design. I have tried to replicate this but I have been unsuccessful - I just tried simulating the addition of four patients to each group and the matrix was still degenerate. Is there some inherent linear dependency that's still present there?
Second, with that model, building the main effect for the groups by averaging over the patients' coefficients concerns me. Wouldn't the group effects then be confounded by patient-specific effects?

Thanks again, Michael. I'd be completely stuck without your assistance!

**Michael Love** · 10-24-2014, 11:09 AM

Yes, you're right, sorry it took me a while to see what kind of matrix you were getting.

And my suggested design doesn't work because sum(patient 6-10) = T1:G2 + T2:G2 + T3:G2.

So then you should go ahead and use the ~ group + group

atient + group:tissue.

Given the complexity, I'd use betaPrior=FALSE to keep the coefficients the same as the standard ones returned by model.matrix.

You can use numeric contrasts then to answer your questions. One way that i've taught how to generate the numeric contrast is to form the column mean of each group. For example, if you want to compare tissue 1 across the two groups, the column mean of the tissue 1 samples in group 1 (assuming you had that extra sample) is

Code:

1.0 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.0 0.0 0.0 0.0

and for group 2 this is

Code:

1.0 1.0 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.0 0.0 0.0

the second minus the first gives the contrast:

Code:

0.0  1.0 -0.2  0.2 -0.2  0.2 -0.2  0.2 -0.2  0.2  0.0  0.0  0.0  0.0

I think this should correspond to your colnames, but check if you get the same. The numbers will be adjusted given you are missing one T1 sample.

**dmosk** · 10-24-2014, 11:14 AM

OK I think I figured out the linear dependence in "~ patient + tissue + tissue:group" - it's between patient and tissue:group --
Because each patient is in one group and produces each tissue, the sum of the interaction terms over each tissue for a given group, minus all the patients in that group except one, equals the remaining patient. If I had two only two tissues and four patients (P1-P4; two in each group), then e.g. T1:G2 + T2:G2 - P4 = P3. For clarity, I simulated this:

Code:

> modelMatrix <- model.matrix(~ patient + tissue:group, data=as.data.frame(samples))
> modelMatrix
                (Intercept) patientP2 patientP3 patientP4 tissueT1:groupG1
30844_CD8_CM              1         0         0         0                0
30844_CD8_Naive           1         0         0         0                1
30855_CD8_CM              1         1         0         0                0
30855_CD8_Naive           1         1         0         0                1
30999_CD8_CM              1         0         1         0                0
30999_CD8_Naive           1         0         1         0                0
31002_CD8_CM              1         0         0         1                0
31002_CD8_Naive           1         0         0         1                0
                tissueT2:groupG1 tissueT1:groupG2 tissueT2:groupG2
30844_CD8_CM                   1                0                0
30844_CD8_Naive                0                0                0
30855_CD8_CM                   1                0                0
30855_CD8_Naive                0                0                0
30999_CD8_CM                   0                0                1
30999_CD8_Naive                0                1                0
31002_CD8_CM                   0                0                1
31002_CD8_Naive                0                1                0

Then, if I can't directly model both the individual patient effects and the tissue:group interaction terms at the same time, is there another model that would be appropriate? Are you definitely convinced that the model from the edgeR manual is incorrect?

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 14 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

DESeq2 model matrix formula

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News