Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESEQ2: How to use dispersion estimated from one dataset in another dataset

    Hi all,

    I have recently started using DeSeq2 and I would like to know if it is possible to do something like this:

    I have 3 time course experiments (1 control and 2 treated (A and B)) but I do not have biological replicates for B treated experiment.

    So I calculated the dispersion for experiment A versus control and I wonder if there is a way to use the same dispersion for the comparison of B versus control.
    How to do that ? Is it biologically, scientifically "legal" to do that ?

    By the way, there is evidence that they have similar dispersion and I'd like to avoid using the "blind" method.

    I would appreciate it to let me know your ideas about it.

  • #2
    The best approach would be to combine both analyses into one, by using a DESeqDataSet with all the samples. The design would still be something like: ~ time + condition + condition:time. In order to test the two treatments separately, you can use the likelihood ratio test with different design matrices presented to the reduced argument (here you have to build the matrices outside of the DESeq() function, and pass them in, as is done with limma for example). Something like:

    full.mm <- model.matrix(~ time + condition + condition:time, data=colData(dds))

    # now contruct 2 reduced matrices for treatment A and treatment B, by removing the columns for the interaction terms which contain treatment A and B respectively.

    The for each reduced matrix, you run these lines to get a results table:

    ddsA <- DESeq(dds, full=full.mm, reduced=reduced.mmA, test="LRT")
    resA <- results(ddsA)

    Comment


    • #3
      Hi Michael,

      First of all many thanks for your prompt answer.
      I tried to apply your suggestions.
      As I am not sure for my understanding and the interpretations I give to my results, I expose (in bold) some of my considerations and questions arising through the steps.
      I would appreciate it a lot , if you could answer some of them.

      # times*: 0,1,2 represent stages*: vegetative, early, late respectively (early and late in a sexual cycle)
      > times=factor(c(rep(1,2), rep(2,3), 1, 0, 1, rep(2,3), 1, 2, 0, 1, rep(2,2), 1, rep(2,2), 1, rep(0,2), 1, rep(2,5), 0), levels=c(0,1,2))

      > condition = factor(c(rep("A", 14), rep("Controls",9), rep("B",7)), levels=c("Controls","A","B"))

      Consideration 1
      I consider that the log2 Fold Change results will be*: treatment A over Controls and B over Controls* (the first element of the vector levels=c("Controls","A","B") is always the denominator)...


      > colData = data.frame( condition=condition, times=times)

      # quarteto is my data frame containing the counts
      > rownames(colData)=colnames(quarteto)
      > dds<-DESeqDataSetFromMatrix(countData= as.matrix(quarteto), colData= colData ,design=formula(~times+condition+condition:times))
      > colData
      condition times
      ND7.T0 A 1
      ND7.T10 A 1
      ND7.T20 A 2
      ND7.T30 A 2
      ND7.T40 A 2
      ND7.T5 A 1
      ND7.V A 0
      ICL7.T0 A 1
      ICL7.T10 A 2
      ICL7.T20 A 2
      ICL7.T35 A 2
      ICL7.T5 A 1
      ICL7.T50 A 2
      ICL7.veg A 0
      T0.2 Controls 1
      T20.2 Controls 2
      T20.3 Controls 2
      T3 Controls 1
      T30 Controls 2
      T45 Controls 2
      T5.2 Controls 1
      V1.2 Controls 0
      VK1 Controls 0
      EZL1.T0 B 1
      EZL1.T10 B 2
      EZL1.T20 B 2
      EZL1.T35 B 2
      EZL1.T5 B 2
      EZL1.T50 B 2
      EZL1.veg B 0
      >
      >
      > full.mm <- model.matrix(~ times + condition + condition:times, data=colData(dds))
      > head(full.mm)
      (Intercept) times1 times2 conditionA conditionB times1:conditionA
      ND7.T0 1 1 0 1 0 1
      ND7.T10 1 1 0 1 0 1
      ND7.T20 1 0 1 1 0 0
      ND7.T30 1 0 1 1 0 0
      ND7.T40 1 0 1 1 0 0
      ND7.T5 1 1 0 1 0 1
      times2:conditionA times1:conditionB times2:conditionB
      ND7.T0 0 0 0
      ND7.T10 0 0 0
      ND7.T20 1 0 0
      ND7.T30 1 0 0
      ND7.T40 1 0 0
      ND7.T5 0 0 0

      > # contruction of 2 reduced matrices for treatment A and treatment B, by removing the columns for the interaction terms which contain treatment A and B respectively.

      Consideration 2
      Do I consider right that the only columns concerning the interaction terms of conditionA are*:
      times1:conditionA and times2:conditionA*?


      > reducedA.mm <- full.mm[ ,!colnames(full.mm) %in% c("times1:conditionA", "times2:conditionA")]

      > reducedB.mm <- full.mm[ ,!colnames(full.mm) %in% c("times1:conditionB", "times2:conditionB")]

      > head(reducedA.mm)
      (Intercept) times1 times2 conditionA conditionB times1:conditionB
      ND7.T0 1 1 0 1 0 0
      ND7.T10 1 1 0 1 0 0
      ND7.T20 1 0 1 1 0 0
      ND7.T30 1 0 1 1 0 0
      ND7.T40 1 0 1 1 0 0
      ND7.T5 1 1 0 1 0 0
      times2:conditionB
      ND7.T0 0
      ND7.T10 0
      ND7.T20 0
      ND7.T30 0
      ND7.T40 0
      ND7.T5 0

      > # treated_B vs Controls
      > ddsB <- DESeq(dds, full=full.mm, reduced=reducedB.mm, test="LRT")
      estimating size factors
      estimating dispersions
      gene-wise dispersion estimates
      mean-dispersion relationship
      final dispersion estimates
      fitting model and testing
      -- replacing outliers and refitting for 4444 genes
      -- DESeq argument 'minReplicatesForReplace' = 7
      -- original counts are preserved in counts(dds)
      estimating dispersions
      fitting model and testing
      >
      > resultsNames(ddsB)
      [1] "Intercept" "times1" "times2"
      [4] "conditionA" "conditionB" "times1.conditionA"
      [7] "times2.conditionA" "times1.conditionB" "times2.conditionB"
      >
      > resB_1 <- results(ddsB, name="times1.conditionB")
      > head(resB_1)
      log2 fold change (MLE): times1.conditionB
      LRT p-value: full vs reduced
      DataFrame with 6 rows and 6 columns
      baseMean log2FoldChange lfcSE stat pvalue
      <numeric> <numeric> <numeric> <numeric> <numeric>
      PTET.51.1.G0010001 146.5136783 -0.62466508 0.7898300 0.6319728 0.7290694
      PTET.51.1.G0010002 122.5083339 -0.68161198 0.8059814 1.1202526 0.5711369
      PTET.51.1.G0010003 100.0616624 -0.53628223 0.5763633 2.2757097 0.3205058
      PTET.51.1.G0010004 0.2590626 -5.89587066 10.5174621 0.4181308 0.8113422
      PTET.51.1.G0010005 111.5651704 -0.13174595 0.6950362 1.1823154 0.5536859
      PTET.51.1.G0010006 115.1214038 0.07501774 0.9922750 0.9200433 0.6312700
      padj
      <numeric>
      PTET.51.1.G0010001 0.99998
      PTET.51.1.G0010002 0.99998
      PTET.51.1.G0010003 0.99998
      PTET.51.1.G0010004 NA
      PTET.51.1.G0010005 0.99998
      PTET.51.1.G0010006 0.99998
      >

      Question 1
      resB_1 <- results(ddsB, name="times1.conditionB")
      How are the results resB_1 interpreted*?
      Is it the differential expression from time 0 to time 1 specific to conditionB ??

      Question 2
      How is the «*log2 fold change (MLE): times1.conditionB*» interpreted at the results resB_1? How is it calculated (what are the 2 parts of comparison)*?
      Is it something like*: times1.conditionB vs (Controls+times1.conditionA)*?


      > # COUNTS PLOTS #
      >
      > data1 <- plotCounts(ddsB, which.min(resB_1$padj), intgroup=c("times","condition"), returnData=TRUE)
      > pdf("counts_over_time.pdf")
      > ggplot(data1, aes(x=times, y=count, color=condition, group=condition)) + geom_point() + stat_smooth(se=FALSE,method="loess") + scale_y_log10()
      Error in predLoess(y, x, newx, s, weights, pars$robust, pars$span, pars$degree, :
      NA/NaN/Inf dans un appel à une fonction externe (argument 5)
      Il y a eu 16 avis (utilisez warnings() pour les visionner)

      > dev.off()
      null device
      1
      > The plot is Attached

      Question 3
      Why error*? Is there a problem with the loess method*?


      > resB_1 <- results(ddsB, name="times1.conditionB")
      > resB_2 <- results(ddsB, name="times2.conditionB")
      >
      > min_fold_change=2
      >
      > ####
      > resB_1 <- resB_1[!is.na(resB_1$padj) & resB_1$padj <=0.05,]
      > resB_1 <- resB_1[resB_1$log2FoldChange >= log2(min_fold_change) | resB_1$log2FoldChange <= log2(1/min_fold_change),]
      >
      > resB_2 <- resB_2[!is.na(resB_2$padj) & resB_2$padj <=0.05,]
      > resB_2 <- resB_2[resB_2$log2FoldChange >= log2(min_fold_change) | resB_2$log2FoldChange <= log2(1/min_fold_change),]
      >
      > list_of_differentially_expressed_genes_for_B_vs_Controls<-unique(c(rownames(resB_1),rownames(resB_2)))
      > length(list_of_differentially_expressed_genes_for_B_vs_Controls)
      [1] 33

      Question 3
      Is this the right way*:
      list_of_differentially_expressed_genes_for_B_vs_Controls<-unique(c(rownames(resB_1),rownames(resB_2)))
      to obtain the list of all the differentially expressed genes for treatment B vs Control*?
      Attached Files

      Comment


      • #4
        Your code looks correct.

        One comment: the outlier detection is useful when there are a few genes with outliers. Here there are many genes being flagged, which either means: there is a lot of variability, which the outlier detection method is not picking up on or there is a sample with consistent outlier counts. So I'd recommend setting minReplicatesForReplace=Inf to turn off filtering, and to use plotPCA on rlog or VST data to identify if there is an obvious outlier sample which should be removed.

        Question 1: you don't need to focus on the log2FoldChange. When you perform a likelihood ratio test, you are testing multiple terms. Here you are testing whether there are treatment specific differences over time. There are multiple terms involved in this test: the additional effect of treatmentA at time 1, the additional effect of treatmentA at time 2 (compared to the base level in controls). The p-values and adjusted p-values are the important columns. This is described in the help page for results, under "On p-values":

        ?results

        Question 2: the meaning of each of the terms in an interaction model is quite complex, and I often recommend that users who have complex experimental designs who are not familiar with analyzing these speak to a local statistician who can help explain all the terms. There is nothing special about DESeq2 here, these are the same terms that would appear in a linear regression time course analysis, so any person with statistical training should be able to explain these to you in person.

        Question 3: this is a ggplot2 error, so I'm not sure. it's likely related to 0's in the count column and the log10 scaling.

        Your last question: you are testing for treatment-specific differences across multiple time points, so it doesn't make sense to do LFC thresholding here. Or if you want to threshold, remember that there are multiple time points being tested, and each one has it's own interaction term.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM
        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        30 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        32 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        28 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        53 views
        0 likes
        Last Post seqadmin  
        Working...
        X