Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2: Continuos variable in the design

    Hi!
    Im going to try finding differential expressed genes "correlated" with a continuous variable using DESeq2.

    This is the design Im using, you can see that its only one variable and its continuous. I have 26 biological replicates.

    > des <- model.matrix(~GIR)
    > des
    (Intercept) GIR
    1 1 4.401988
    2 1 4.959263
    3 1 4.386282
    4 1 1.530456
    5 1 1.806429
    6 1 4.335553
    7 1 4.206755
    8 1 5.913078
    9 1 2.427184
    10 1 4.695216
    11 1 8.343508
    12 1 3.179318
    13 1 4.197272
    14 1 6.592866
    15 1 8.607653
    16 1 5.313869
    17 1 6.416293
    18 1 6.021505
    19 1 5.291005
    20 1 8.586633
    21 1 9.803922
    22 1 8.777473
    23 1 8.312925
    24 1 9.480470
    25 1 8.782435
    26 1 6.975749
    attr(,"assign")
    [1] 0 1


    I used the "Wald" test and here is 1 gene from the results:

    dds <- DESeq(ddsFullCountTable, test = c("Wald"))
    res <- results(dds)
    res = res[order(res$padj, na.last = NA), ]

    head(res,1)

    baseMean log2FoldChange lfcSE stat pvalue
    <numeric> <numeric> <numeric> <numeric> <numeric>
    NINJ1 286.41601 -0.12059350 0.02202829 -5.474483 4.387909e-08



    Now, have I done it correctly? What have I actually answered and how to interpret the fold change?


    Thank you very much!

  • #2
    I don't see anything obviously wrong at least. The fold change of a covariate is the change per unit increase/decrease.

    Comment


    • #3
      Thats great.

      The reason Im asking is because Im thinking of using cross-validation:


      But thats an area Im again is unfamiliar with..

      Comment


      • #4
        What have I actually answered and how to interpret the fold change?
        In the DESeq2 vignette, we have:

        If the variable of interest is continuous-valued, then the reported log2 fold change is per unit of change of that variable.

        When you test with a continuous covariate, x, you are supposing the that counts follow the following formula:

        counts proportional to 2^(a * x)

        Where a is the log2FoldChange.

        Comment


        • #5
          Hi again! I have now tried the same approach, but included one more time point (paired data). This is my design (I did not copy in the "patients" term, since its big):

          des <- model.matrix(~patients+timepoints+dGIR)

          > des
          (Intercept) timepoints2 dGIR
          1 1 0 4.401988
          2 1 0 4.959263
          3 1 0 4.386282
          4 1 0 1.530456
          5 1 0 1.806429
          6 1 0 4.335553
          7 1 0 4.206755
          8 1 0 5.913078
          9 1 0 2.427184
          10 1 0 4.695216
          11 1 0 8.343508
          12 1 0 3.179318
          13 1 0 4.197272
          14 1 1 7.033431
          15 1 1 4.395280
          16 1 1 4.867971
          17 1 1 4.651163
          18 1 1 3.273684
          19 1 1 4.918936
          20 1 1 4.308176
          21 1 1 7.803321
          22 1 1 4.884005
          23 1 1 5.275498
          24 1 1 9.759326
          25 1 1 4.972376
          26 1 1 4.102564
          27 1 0 6.592866
          28 1 0 8.607653
          29 1 0 5.313869
          30 1 0 6.416293
          31 1 0 6.021505
          32 1 0 5.291005
          33 1 0 8.586633
          34 1 0 9.803922
          35 1 0 8.777473
          36 1 0 8.312925
          37 1 0 9.480470
          38 1 0 8.782435
          39 1 0 6.975749
          40 1 1 9.494048
          41 1 1 8.817365
          42 1 1 6.958610
          43 1 1 8.303265
          44 1 1 10.467523
          45 1 1 14.250726
          46 1 1 11.707989
          47 1 1 10.816993
          48 1 1 10.332435
          49 1 1 12.841037
          50 1 1 14.824017
          51 1 1 6.165730
          attr(,"assign")
          [1] 0 1 2
          attr(,"contrasts")
          attr(,"contrasts")$timepoints
          [1] "contr.treatment"


          So, do I now get the genes that is related the change in GIR?

          Comment


          • #6
            Yes.

            I am writing more words so that i can post this post.

            Comment


            • #7
              Thank you for your time, if you don't mind I have two more questions:

              1. Why does the result from only one time point not overlap at all with the test with two time points?

              2. Why does the CPM values for a gene from the results not correlate with GIR using Spearman/Pearson correlation?

              Comment


              • #8
                I recommend that you check more examples by eye, by plotting log2(normalized counts) over the continuous variable. Note that you need to split by group. The test is of linearity of log scale counts with your continuous variable, with the same slope in the two groups.

                Here's a simple example with an artificial count matrix showing that the genes with smallest p-value for a test of the continuous variable 'x' are those the genes with high correlation of log2(norm. counts + 1) and that continuous variable 'x'.

                Code:
                > set.seed(1)
                > m <- matrix(round(runif(1000,1,100)), ncol=10)
                > dds <- DESeqDataSetFromMatrix(m, DataFrame(x=1:10), ~x)
                > dds <- DESeq(dds, fitType="mean") # fitType=mean specified bc counts are very artificial
                > res <- results(dds)
                > order(res$pvalue)[1:5]
                [1] 76 65 77 94 86
                > order(-abs(cor(t(log2(counts(dds,normalized=TRUE) + 1)), dds$x)))[1:5]
                [1]  1 65 76 77 94

                Comment


                • #9
                  Thank you very much, most helpful.

                  Checking the test with 1 time point gave exactly what you said, but I did not check p-values of the correlations.

                  When testing with two time points I got only 15/20 when based on p-value of Pearson correlation:

                  #Getting counts
                  m <- log2(counts(dds,normalized=TRUE) + 1)

                  #Keeping only FDR < 10%
                  mt <- m[order(res$pvalue)[1:20],]

                  #Adding GIR values
                  mgir <- cbind(t(mt),GIR)

                  #Making delta values (time point 2 - time points 1)
                  delta <- (mgir[c(14:26, 39:50),]-(mgir[c(1:13, 27:38),]))

                  #Checking correlations, returns a matrix with gene vs GIR, rho and p-value
                  cor <- flattenSquareMatrix(cor.prob(delta))

                  #Of the 20 genes with FDR < 0.1, these have p < 0.05 running Pearson:
                  sum(cor[191:210,]$p < 0.05)
                  #[1] 15

                  That maybe ok, sad thing is the non existing overlap between the test with 1 time point and the test with 2 time points.

                  Im thinking of not testing log2(counts)+1 linearity against GIR, but rather CPM/FPKM vs GIR. Would maybe make more sense in regards to normalisation and linearity. Agree?

                  Comment


                  • #10
                    hi,

                    Just to be clear, I am not saying the negative binomial test is the same as Pearson correlation, or a test of Pearson correlation. I think you might be missing what is being tested for with DESeq2 with the multi-factor design. If you have ~ group + x, where x is continuous, you are looking for genes which have consistent slope in both groups. The naive correlation-based approach would be to subtract, for each group, the mean value of the log normalized counts. Then to look at plot(x, log-normalized-counts-centered-per-group). But you can also just look at plot(x, log-normalized-counts) and color the points by group. You should be able to see by eye what makes a gene get a low p-value.

                    Here is an example with Normal data:

                    Code:
                    > x <- c(1:10, 1:10)
                    > y <- rnorm(20, c(1:10, -(1:10)), 1)
                    > g <- rep(1:2,each=10)
                    > summary(lm(y ~ x)) # not small p-value for both groups
                    ...
                    Coefficients:
                                Estimate Std. Error t value Pr(>|t|)
                    (Intercept)   0.5693     3.0898   0.184    0.856
                    x            -0.1494     0.4980  -0.300    0.768
                    
                    > summary(lm(y[1:10] ~ x[1:10])) # small p-value for one group
                    ...
                    Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
                    (Intercept)  0.39201    0.61561   0.637    0.542    
                    x[1:10]      0.84588    0.09921   8.526 2.75e-05 ***
                    
                    > plot(x,y,col=g) # this explains why
                    There is not a way to perform analysis of counts ~ variables with DESeq2. There are a couple issues with linear models on raw counts but here is an important one: suppose you observe data of (x, count) like the following: (2,100), (3,50). Now you want to predict what the count will be for x=5. The linear model prediction would be a count of -50. Performing analysis with the log link assures that expected counts based on the model are always non-negative.

                    Comment


                    • #11
                      One addition, if you use the following transformed variable:

                      log2x = log2(x)

                      Then you will get the following relation at the level of raw counts:

                      log2(counts) ~ a log2x
                      counts ~ 2^(a log2(x))
                      counts ~ x^a

                      So then the coefficient from DESeq2 for log2x would be the exponent for x on the raw count scale.

                      Comment


                      • #12
                        Thank you very much for all your help. For many reasons I started over using FeatureCount.
                        But when I try to run DESeq2 now, I get an error:

                        DESeq2 error: inv(): matrix appears to be singular

                        > sessionInfo()
                        R version 3.1.0 (2014-04-10)
                        Platform: x86_64-apple-darwin13.1.0 (64-bit)

                        locale:
                        [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

                        attached base packages:
                        [1] parallel stats graphics grDevices utils datasets methods base

                        other attached packages:
                        [1] edgeR_3.6.2 limma_3.20.4 DESeq2_1.4.5 RcppArmadillo_0.4.300.0 Rcpp_0.11.1
                        [6] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.7 BiocGenerics_0.10.0

                        loaded via a namespace (and not attached):
                        [1] annotate_1.42.0 AnnotationDbi_1.26.0 Biobase_2.24.0 DBI_0.2-7 genefilter_1.46.1
                        [6] geneplotter_1.42.0 grid_3.1.0 lattice_0.20-29 locfit_1.5-9.1 RColorBrewer_1.0-5
                        [11] RSQLite_0.11.4 splines_3.1.0 stats4_3.1.0 survival_2.37-7 tools_3.1.0
                        [16] XML_3.98-1.1 xtable_1.7-3 XVector_0.4.0

                        It should be the same analysis with one time point as I showed earlier (not included continuous variable yet):

                        > dds
                        class: DESeqDataSet
                        dim: 23710 26
                        exptData(0):
                        assays(1): counts
                        rownames(23710): DDX11L1 WASH7P ... GOLGA2P3Y GOLGA2P2Y
                        rowData metadata column names(0):
                        colnames(26): D102-A1 D104-A1 ... N187-A1 N188-A1
                        colData names(4): Targets Timepoint Group Patients

                        > as.data.frame( colData(dds) )

                        Timepoint Group Patients
                        1 pT2D 2
                        1 pT2D 4
                        1 pT2D 17
                        1 pT2D 21
                        1 pT2D 53
                        1 pT2D 55
                        1 pT2D 61
                        1 pT2D 62
                        1 pT2D 67
                        1 pT2D 73
                        1 pT2D 76
                        1 pT2D 77
                        1 pT2D 79
                        1 Control 1
                        1 Control 8
                        1 Control 13
                        1 Control 70
                        1 Control 71
                        1 Control 72
                        1 Control 75
                        1 Control 81
                        1 Control 82
                        1 Control 83
                        1 Control 86
                        1 Control 87
                        1 Control 88

                        Sorry for begin "off-topic", but I can't continue with your suggestions when getting this error.

                        Thanks!

                        Comment


                        • #13
                          [new post below]
                          Last edited by Michael Love; 05-27-2014, 07:14 AM. Reason: irrelevant response

                          Comment


                          • #14
                            sorry, I just saw your note that you didn't include the continuous covariate. Can you please include all code which produced this error?

                            If you are only analyzing one timepoint and not including the covariate, then your design is just:

                            ~ patients ?

                            Fitting this model is problematic, because patients is not a grouping factor. It appears there are no repeated patients in this dataset. It should have prompted an error about not having any replicates. Is this a factor or an integer variable in your colData?

                            I would recommend a design of ~ timepoint + GIR, and leave out the patient term.

                            Comment


                            • #15
                              Originally posted by Michael Love View Post

                              ~ patients ?
                              Haha!
                              Yes, I was trying to sort everything nicely, since Im going to try a lot of things later. But of course I made a stupid mistake.

                              It works now, re-doing the analysis with one time point:

                              ddsA1GIR <- DESeqDataSetFromMatrix(
                              countData = assay(dds),
                              colData = colData(dds),
                              design = ~GIR)

                              ddsA1GIR <- DESeq(ddsA1GIR, test="Wald")

                              I will continue this test later.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X