Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2 model matrix formula

    Hi all,

    I am trying to figure out the correct formula to use for my model in an RNA-seq experiment I am analyzing with DESeq2.

    The experimental design covers 10 patients, done in 4 batches. No single batch is comprised solely of a single patient. The patients are distributed equally between two groups. Group 1 contains 2 male and 3 female patients and Group 2 is 3 male/2 female. 1 G1 male and 1 G2 male are Asian; the remainder of the patients are Caucasian. From each patient, we have collected 3 different tissue types.

    The primary goals of the experiment are to: (1) understand how each tissue differs between groups and (2) understand how the tissues differ from one another within groups. The gender, race, and batch effects are not of particular importance, but I'd like to be able to control for them. We have a moderate expectation that there will be differences between genders; whether race will have an effect is less clear.

    I initially tried "~ patient + batch + ethnicity + gender + group + tissue + group:tissue" but, aside from the fact that I'm unsure whether this is correct, my design matrix is degenerate. No matter what variables I drop or how I add interaction terms, R always tells me that my design matrix has rank (at most) 14.

    Eventually, I followed the example given in section 3.5 of the edgeR manual (even though I am using DESeq2), using only "~ group + groupatient + group:tissue" and blocking the patients within the groups. This design matrix does have 14 columns and is full rank, but I am not using the gender/batch/ethnicity information, and it is unclear to me whether I can accurately assess differences between two tissues within the same group: I know different groups within a tissue should be group1 + group1.tissue1 versus group2 + group2.tissue1, but I cannot likewise do tissue1 + group1.tissue1 versus tissue2 + group1.tissue2 because I am not fitting main effects for the tissues.

    I feel like the complication here is that there are multiple blocking variables, and I have both within- and between-group comparisons. I thought about trying to use sva or RUVseq to try to remove the effects of the variables that are not of real interest, but using sva runs into the same problem of needing to have a full model with those variables included, and using RUVseq doesn't take advantage of the known relationships between the variables and the samples (plus I don't have any well defined control genes).

    Here is my actual sample matrix, if it's helpful (note that one of the 30 samples was removed for QC reasons):
    Code:
    group	patient	gender	ethnicity	tissue	batch
    G2	X30844	M	Asian	T2	Batch_1
    G2	X30844	M	Asian	T3	Batch_1
    G2	X30844	M	Asian	T1	Batch_1
    G2	X30855	M	Caucasian	T2	Batch_1
    G2	X30855	M	Caucasian	T3	Batch_1
    G2	X30855	M	Caucasian	T1	Batch_1
    G1	X30999	F	Caucasian	T2	Batch_3
    G1	X30999	F	Caucasian	T3	Batch_3
    G1	X30999	F	Caucasian	T1	Batch_3
    G1	X31002	F	Caucasian	T2	Batch_3
    G1	X31002	F	Caucasian	T3	Batch_3
    G1	X31122	F	Caucasian	T2	Batch_4
    G1	X31122	F	Caucasian	T3	Batch_4
    G1	X31122	F	Caucasian	T1	Batch_4
    G2	X31132	M	Caucasian	T2	Batch_4
    G2	X31132	M	Caucasian	T3	Batch_4
    G2	X31132	M	Caucasian	T1	Batch_4
    G1	X31134	M	Asian	T2	Batch_4
    G1	X31134	M	Asian	T3	Batch_4
    G1	X31134	M	Asian	T1	Batch_4
    G1	X31188	M	Caucasian	T2	Batch_5
    G1	X31188	M	Caucasian	T3	Batch_5
    G1	X31188	M	Caucasian	T1	Batch_5
    G2	X31193	F	Caucasian	T2	Batch_5
    G2	X31193	F	Caucasian	T3	Batch_5
    G2	X31193	F	Caucasian	T1	Batch_5
    G2	X31195	F	Caucasian	T2	Batch_5
    G2	X31195	F	Caucasian	T3	Batch_5
    G2	X31195	F	Caucasian	T1	Batch_5
    Thanks so much for your help and sorry about the long post!

  • #2
    hi,

    The patient effect is more fine grained than a number of other effects: gender, batch, ethnicity. So by including the patient effect as a term in the design, you control for these differences as well. That is the reason you cannot build a full rank design matrix which includes patient, gender, batch and ethnicity.

    I would recommend you use the design:

    Code:
    ~ patient + tissue + group + tissue:group
    Your questions of interest are:

    (1) understand how each tissue differs between groups and (2) understand how the tissues differ from one another within groups.



    The first question is found by investigating the interaction effects. For instance

    Code:
    results(dds, contrast=list("tissueT1.groupG2","tissueT1.groupG1"))
    tests the interaction, i.e. it tests for genes which show a different G2 vs G1 effect in tissue T1 compared to the general G2 vs G1 effect.

    The group G2 vs G1 effect in tissue T1 is this interaction effect above plus the main G2 vs G1 effect:

    Code:
    results(dds, contrast=list(c("tissueT1.groupG2","groupG2"),
    c("tissueT1.groupG1","groupG1")))
    so adding the interaction and the main effect tests for genes which show any G2 vs G1 effect in tissue T1, whether because there is only a main G2 vs G1 effect, or only a T1-specific G2 vs G1 effect, or there is a main effect and a specific effect which are added together. (If this is confusing, check a statistical reference on interaction terms).



    The second question is found using the same logic. For instance

    Code:
    results(dds, contrast=list("tissueT2.groupG1","tissueT1.groupG1"))
    tests for genes which show a group-G1-specific tissue T2 vs T1 effect beyond the main T2 vs T1 effect.

    To build the T2 vs T1 effect in group G1, you need to add the interaction effect and the main T2 vs T1 effect:

    Code:
    results(dds, contrast=list(c("tissueT2.groupG1","tissueT2"),
    c("tissueT1.groupG1","tissueT1")))
    so this test for those genes which show a tissue T2 vs T1 effect in group G1.

    Comment


    • #3
      Hi Michael,

      Thanks for the reply.

      I had already tried using the design you suggested (as I agree it is reasonable), but it gives me a degenerate matrix (and DESeq2 does report this):
      Code:
      samples <- read.table('samples.txt')
      modelMatrix <- model.matrix(~ patient + tissue + group + tissue:group, data=as.data.frame(samples))
      qr(modelMatrix)$rank
      ncol(modelMatrix)
      gives rank 14 on a 15-column matrix (where samples.txt is the sample matrix from my first post). It's unclear to me where the linear dependence is arising, but it's hard to argue with R.

      In response to your other comments about the contrasts, I understand the distinction between the different testing schema. We are interested in effects across groups/tissues irrespective of whether they are unique to the interaction, which is why I was using the sum of the interaction and the main effect, and that is why I was concerned about not fitting a main effect for the tissue. Of course, interaction-specific effects are interesting, as well, just not solely so in this case.

      Do you have any ideas as to the origin of the degeneracy in my model matrix and how I might rectify it? I've racked my brain trying to figure out the problem here but to no avail...

      Thanks again for your help!

      Comment


      • #4
        You're right, I forgot that group is also a linear combination of patient. You can just remove the main group effect then, as this is already accounted for by patient. This gives you:

        Code:
        ~ patient + tissue + tissue:group
        Now the only unfortunate thing is that the main group G2 vs G1 effect has to be written in terms of patient. The interactions could be written using the code above, but to add the interaction to the main effect you need to build the main effect from the patients.

        G2 vs G1 is:

        (X30844 + X30855 + X31132 + X31193 + X31195)/5 - (X30999 + X31002 + X31122 + X31134 + X31188)/5

        So you can't use the list() style contrast anymore, but instead you have to use the numeric argument to contrast. The numeric should correspond to the order of resultsNames(dds). Give the patients a 1/5 or -1/5 if they are in G2 or G1 respectively. Give the interaction of interest a 1 and -1 for the numerator term and the denominator term respectively. See ?results for more info on specifying numeric contrasts.

        Comment


        • #5
          Thanks, Michael! You are completely right that the groups are linear combinations of the patients - that slipped right past me. Unfortunately, the model you suggested is still giving me a 15-column matrix (and it's thus degenerate).

          Your comment about numeric contrasts gave me an idea, though. As I mentioned in my first post, the only full-rank model matrix I had successfully created was by blocking the patients in the groups (as two sets of X1 through X5, with one set for each group), and using formula "~ group + groupatient + group:tissue." I could then get at the main effects for the tissues by averaging over the group:tissue interactions for a given tissue type. Does that sound reasonable?

          Also, one thing I didn't mention is that we do have a hypothesis that, ideally, we'd like to test, in that patients in one group are expected to be more variable than those in the other group. Without fitting a batch effect, I don't think this can be gotten at, since some batches are comprised of a single group. Is there any way I can solve this problem, or is it impossible to fully extricate the batch effect from those of the other variables?

          Comment


          • #6
            I think the problem with that design is that that model equates patient X1 in group 1 with patient X1 in group 2 which is not true (check if you see a patientX1.groupG2 column). Also tissue T1 in group 1 with tissue T1 in group 2 (check if you see a tissueT1.group2 column). I need to think more, but ATM I can't think of a model with interactions that has less parameters to fit than samples.

            Comment


            • #7
              To explore the question of variability, I would look at PCA plots and color by group. Tests of differential variability exist, but that's not part of DESeq2.

              Comment


              • #8
                There is a groupG1.patientX2 column, but my feeling was that it's OK to do it like this because I'm not fitting a main effect for the patient variable, so the patient effects are compartmentalized to being within-groups (since the patient:group interaction is the only place where the patient term appears in the model). I realize I can't, for instance, average over groupG1.patientX2 and groupG2.patientX2 to get a patientX2 effect, but I was under the impression that the groups' main effects and the group:tissue interaction effects were valid and disentangled from patient-specific variation. (There is a also a groupG2.tissueT1 column but I'm unclear as to why this is a problem; it is the same set of three tissues that are being extracted from each patient, regardless of group.) Are my assumptions incorrect?

                Also, what's really escaping me right now is why the model you proposed doesn't give a full-rank design matrix/why the experimental design is insufficient to fully specify the variables of interest and their interactions, as in the "~ patient + tissue + tissue:group" formula you suggested.

                I really appreciate all your advice and support. Needless to say, I have found this problem to be frustratingly challenging, and your guidance is very helpful!

                Comment


                • #9
                  Can you print the column names of the model matrix with your suggested design?

                  re: why the model doesn't give full rank. I think it is because you have the missing sample and that if you excluded the other samples from that patient it would be full rank.

                  Comment


                  • #10
                    Sure - this is "~ group + groupatient + group:tissue" after replacing the patient names with X1 through X5 for G1, and likewise for G2:
                    Code:
                    > colnames(modelMatrix)
                     [1] "(Intercept)"       "groupG2"           "groupG1:patientX2"
                     [4] "groupG2:patientX2" "groupG1:patientX3" "groupG2:patientX3"
                     [7] "groupG1:patientX4" "groupG2:patientX4" "groupG1:patientX5"
                    [10] "groupG2:patientX5" "groupG1:tissueT2"  "groupG2:tissueT2" 
                    [13] "groupG1:tissueT3"  "groupG2:tissueT3"
                    I took this idea from section 3.5 of edgeR's vignette, which gave an example for having both between- and within-group comparisons, as there are here.

                    I had actually already tried adding back in that one patient's missing sample but the matrix with "~ patient + tissue + tissue:group" remained degenerate. Only by adding an additional patient to each group was I able to create a full-rank design with that formula. I could try to lobby to get two more samples sequenced if the data I have is definitely insufficient to appropriately model the variables of interest.

                    To get back to the group-level variability, I realize that I can't test for that directly within DESeq2. The reason I mentioned it is that I would want to base whatever metric I use on the normalized values given by the software, but the normalized values for e.g. X30844 and X30855 are intertwined with the effects of batch 2, which is comprised entirely of those two samples, if I don't model batch effects. Since I can't model batch effects directly in my formula, could I attempt to remove them first with sva/Combat, or would that violate the assumptions of DESeq2 (even though there wouldn't be a batch term in the formula)?

                    Comment


                    • #11
                      Do you see what I meant above though? The problem with your design is that there is no "patientX1:groupG2" effect or equivalently "groupG2atientX1". This means that patient X1 is considered the same effect in the two groups (which is a problem because they are not even the same patient).

                      Similarly there is no "groupG2:tissueT1". So your model also assumes tissue T1 is the same effect in the two groups.

                      I can't think of a way to set up the model with interactions without equating the base levels of your factors across groups, which is not a good assumption to make.
                      Last edited by Michael Love; 10-24-2014, 06:38 AM.

                      Comment


                      • #12
                        Note that any model which contains "patient" as a main effect controls for batch, because the batches are all composed of the sums of patient columns.

                        Comment


                        • #13
                          You're right about the patient terms controlling for batch effects - I was thinking that I'd want to use the patients' coefficients to assess in-group variability but I realize now that I wouldn't need to.

                          I do see that the design I gave does not explicitly have a term for G2:X1, but isn't this implicit? The intercept column minus the sum of the groupatient columns gives G2:X1, and likewise for G2:T1 (although I remain confused as to why the latter would be an issue either way, given that it is the same tissue type in each group). Accordingly, if I were to look at resultsNames(.) of the associated DESeq object, I would see, in the expanded model, a G2:X1 term.

                          Going back to the model you suggested, I have two questions:
                          First, I must have made some error when I claimed that, by adding in an additional patient to each group, I was able to achieve a full-rank design. I have tried to replicate this but I have been unsuccessful - I just tried simulating the addition of four patients to each group and the matrix was still degenerate. Is there some inherent linear dependency that's still present there?
                          Second, with that model, building the main effect for the groups by averaging over the patients' coefficients concerns me. Wouldn't the group effects then be confounded by patient-specific effects?

                          Thanks again, Michael. I'd be completely stuck without your assistance!

                          Comment


                          • #14
                            Yes, you're right, sorry it took me a while to see what kind of matrix you were getting.

                            And my suggested design doesn't work because sum(patient 6-10) = T1:G2 + T2:G2 + T3:G2.

                            So then you should go ahead and use the ~ group + groupatient + group:tissue.

                            Given the complexity, I'd use betaPrior=FALSE to keep the coefficients the same as the standard ones returned by model.matrix.

                            You can use numeric contrasts then to answer your questions. One way that i've taught how to generate the numeric contrast is to form the column mean of each group. For example, if you want to compare tissue 1 across the two groups, the column mean of the tissue 1 samples in group 1 (assuming you had that extra sample) is

                            Code:
                            1.0 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.0 0.0 0.0 0.0
                            and for group 2 this is

                            Code:
                            1.0 1.0 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.0 0.0 0.0
                            the second minus the first gives the contrast:

                            Code:
                            0.0  1.0 -0.2  0.2 -0.2  0.2 -0.2  0.2 -0.2  0.2  0.0  0.0  0.0  0.0
                            I think this should correspond to your colnames, but check if you get the same. The numbers will be adjusted given you are missing one T1 sample.

                            Comment


                            • #15
                              OK I think I figured out the linear dependence in "~ patient + tissue + tissue:group" - it's between patient and tissue:group --
                              Because each patient is in one group and produces each tissue, the sum of the interaction terms over each tissue for a given group, minus all the patients in that group except one, equals the remaining patient. If I had two only two tissues and four patients (P1-P4; two in each group), then e.g. T1:G2 + T2:G2 - P4 = P3. For clarity, I simulated this:
                              Code:
                              > modelMatrix <- model.matrix(~ patient + tissue:group, data=as.data.frame(samples))
                              > modelMatrix
                                              (Intercept) patientP2 patientP3 patientP4 tissueT1:groupG1
                              30844_CD8_CM              1         0         0         0                0
                              30844_CD8_Naive           1         0         0         0                1
                              30855_CD8_CM              1         1         0         0                0
                              30855_CD8_Naive           1         1         0         0                1
                              30999_CD8_CM              1         0         1         0                0
                              30999_CD8_Naive           1         0         1         0                0
                              31002_CD8_CM              1         0         0         1                0
                              31002_CD8_Naive           1         0         0         1                0
                                              tissueT2:groupG1 tissueT1:groupG2 tissueT2:groupG2
                              30844_CD8_CM                   1                0                0
                              30844_CD8_Naive                0                0                0
                              30855_CD8_CM                   1                0                0
                              30855_CD8_Naive                0                0                0
                              30999_CD8_CM                   0                0                1
                              30999_CD8_Naive                0                1                0
                              31002_CD8_CM                   0                0                1
                              31002_CD8_Naive                0                1                0
                              Then, if I can't directly model both the individual patient effects and the tissue:group interaction terms at the same time, is there another model that would be appropriate? Are you definitely convinced that the model from the edgeR manual is incorrect?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X