Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • aggp11
    Member
    • Jun 2011
    • 87

    DESeq2 log2 fold change discrepancy

    Hi,

    I am working with an RNA-Seq dataset and used DESeq2 to perform differential gene expression analysis using a multi-factor design.

    In my results I noticed that the log2FC values were different compared to if I would just take the log ratio of the condition means (as suggested in the following post):



    I checked my design matrices and all the data and I just can't understand why this is different.

    Below is the code I used:

    library(DESeq2)
    coldata <- read.table("Run1_run2_deseq2_design.txt", sep="\t", header=T, row.names=1)
    coldata <- DataFrame(coldata)
    countdata <- read.table("Run1_run2_raw_counts.counts", sep="\t", header=T, row.names=1)
    countdata.run1.run2 <- cbind(countdata[,c(2:4,6:8,10:12,14:16)])
    colnames(countdata.run1.run2) <- rownames(coldata)
    ddsFromCounts <- DESeqDataSetFromMatrix(countData=countdata.run1.run2, colData=coldata, design=~Type+Condition+Type:Condition)
    ddsFromCounts$Condition <- factor(ddsFromCounts$Condition, levels=c("U","E"))
    dds.run1.run2 <- DESeq(ddsFromCounts)



    > colData(dds.run1.run2)
    DataFrame with 12 rows and 3 columns
    Type Condition sizeFactor
    <factor> <factor> <numeric>
    U_U_1 U U 0.8218577
    M_U_1 M U 1.1845552
    M_E_1 M E 1.2826040
    U_E_1 U E 1.1260139
    E_U_1 E U 0.9145150
    ... ... ... ...
    M_U_2 M U 1.4016941
    M_E_2 M E 0.7907137
    U_E_2 U E 1.4520507
    E_U_2 E U 0.4845640
    E_E_2 E E 0.9343742


    These are the design groups that I can use (resultsNames):

    "TypeE" "TypeM" "TypeU" "ConditionU" "ConditionE" "TypeE.ConditionU" "TypeM.ConditionU" "TypeU.ConditionU" "TypeE.ConditionE" "TypeM.ConditionE" "TypeU.ConditionE"


    Here are the first six rows of my results for a particular contrast:

    > head(results(dds.run1.run2, contrast=list("TypeM.ConditionU","TypeE.ConditionU")))

    log2 fold change (MAP): TypeM.ConditionU vs TypeE.ConditionU
    Wald test p-value: TypeM.ConditionU vs TypeE.ConditionU
    DataFrame with 6 rows and 6 columns
    baseMean log2FoldChange lfcSE stat pvalue padj
    <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
    ##### 6.83635425 -0.29831487 0.25660567 -1.1625420 0.245015380 NA
    ##### 0.77856172 0.02932029 0.12641235 0.2319417 0.816583317 NA
    ##### 0.10539002 -0.03229292 0.06085041 -0.5306935 0.595631173 NA
    ##### 2.52917312 0.08576176 0.19506918 0.4396479 0.660192139 NA
    ##### 320.61150606 0.30625330 0.11398239 2.6868476 0.007212986 0.1360898
    ##### 0.07568229 0.01855059 0.05945254 0.3120235 0.755022638 NA


    Following is some diagnostic code to compare the "results" fold change with the log of condition means for the 5th feature from the results table (rowMeans = 320):

    > nc.test <- round(counts(dds.run1.run2, normalized=TRUE)[5,])
    > data.frame(type=colData(dds.run1.run2)$Type, condition=colData(dds.run1.run2)$Condition, normcounts=nc.test)
    type condition normcounts
    U_U_1 U U 301
    M_U_1 M U 365
    M_E_1 M E 297
    U_E_1 U E 346
    E_U_1 E U 272
    E_E_1 E E 361
    U_U_2 U U 337
    M_U_2 M U 327
    M_E_2 M E 287
    U_E_2 U E 345
    E_U_2 E U 262
    E_E_2 E E 347

    > log2(mean(nc.test[colData(dds.run1.run2)$Type == "M" & colData(dds.run1.run2)$Condition == "U"])/mean(nc.test[colData(dds.run1.run2)$Type == "E" & colData(dds.run1.run2)$Condition == "U"]))

    [1] 0.3739323


    As you'll notice that the log2FC value from results is 0.30625330 whereas it is 0.3739323 from the log of condition mean ratio. I am just not sure what might be affecting this.

    I will appreciate any help on this.

    Thank you!

    > sessionInfo()
    R version 3.1.2 (2014-10-31)
    Platform: i386-w64-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252

    attached base packages:
    [1] parallel stats4 stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] DESeq2_1.6.2 RcppArmadillo_0.4.550.1.0 Rcpp_0.11.3 GenomicRanges_1.18.3 GenomeInfoDb_1.2.3 IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1

    loaded via a namespace (and not attached):
    [1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 Biobase_2.26.0 BiocParallel_1.0.0 brew_1.0-6 checkmate_1.5.1 cluster_1.15.3
    [12] codetools_0.2-9 colorspace_1.2-4 DBI_0.3.1 digest_0.6.6 fail_1.2 foreach_1.4.2 foreign_0.8-61 Formula_1.1-2 genefilter_1.48.1 geneplotter_1.44.0 ggplot2_1.0.0
    [23] grid_3.1.2 gtable_0.1.2 Hmisc_3.14-6 iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 locfit_1.5-9.1 MASS_7.3-35 munsell_0.4.2 nnet_7.3-8 plyr_1.8.1
    [34] proto_0.3-10 RColorBrewer_1.1-2 reshape2_1.4.1 rpart_4.1-8 RSQLite_1.0.0 scales_0.2.4 sendmailR_1.2-1 splines_3.1.2 stringr_0.6.2 survival_2.37-7 tools_3.1.2
    [45] XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0
    Last edited by aggp11; 12-19-2014, 09:39 AM.
  • Michael Love
    Senior Member
    • Jul 2013
    • 333

    #2
    Now is a great chance to read the just-publised manuscript ;-)

    http://genomebiology.com/2014/15/12/550/abstract (see Fig 2)

    or see Fig 1 in the vignette:

    vignette("DESeq2")

    Relevant is the betaPrior argument of ?DESeq and the addMLE argument of ?results

    Comment

    • lmolokin
      Member
      • Jul 2012
      • 24

      #3
      high count outlier

      Michael,
      I have a question related to LFC shrinkage but involving a high count gene instead (the gene with the highest mean in my dataset to be exact). From what I can tell, DESeq2 has deemed this gene a dispersion outlier based on gene's data from the DESeqDataSet object:

      baseMean 8.65E+05
      baseVar 2.16E+12
      allZero FALSE
      dispGeneEst 2.888
      dispFit 0.030
      dispersion 2.888
      dispIter 7
      dispOutlier TRUE
      dispMAP 8.31E-02

      Granted its dispersion is high, this gene is the target of my experiment's treatment and is expected to be heavily down regulated in the treated group vs control and according to edgeR by more than a hundred fold. However, the results of the DESeq2 Wald test on this gene yield:

      baseMean 864665.543
      log2FoldChange -0.127
      lfcSE 0.074
      stat -1.704
      pvalue 0.088
      padj 0.533

      I realize that outlier LFC's are shruken based on several factors but is it possible to have a massive fold change be shruken to such a low and insignificant value? Is there an alternate approach here that would reduce the severity of the shrinking?

      Thanks!

      Comment

      • Michael Love
        Senior Member
        • Jul 2013
        • 333

        #4
        You can turn off the LFC shrinkage, if you want to go with the MLE LFC estimate by setting betaPrior=FALSE in DESeq(). You will then get an MLE estimate of fold change which is simple (average in group2)/(average in group1). The average can be thrown off when you have high within-group variance, as reflected by the very large dispersion estimate for this gene. What are the counts like for this gene?

        Another way to understand the variance of the MLE LFC estimate is that if you removed the sample with the largest count, how much would the estimate change?

        Code:
        k <- counts(dds, normalized=TRUE)[ gene.idx, ]
        idx <- which.max(k)
        cond <- dds$condition[-idx]
        k <- k[-idx]
        log2(mean(k[cond == "B"])/mean(k[cond == "A"]))

        Comment

        • lmolokin
          Member
          • Jul 2012
          • 24

          #5
          Here are the individual sample counts for this gene:

          No code has to be inserted here.

          Comment

          • Michael Love
            Senior Member
            • Jul 2013
            • 333

            #6
            Note that the range within each group is more than x10. So while there is a consistent difference (treatment always less than ~1/8 of the control), there is very large within-group variance, and this results in more shrinkage for the LFC of this particular gene. Are those normalized counts? Also is the model ~ tx or ~ subj + tx?

            Comment

            • lmolokin
              Member
              • Jul 2012
              • 24

              #7
              These are just raw counts. Is there a way pull the normalized counts?

              Model includes subj to account for subj-to-subj variability.

              The high degree of shrinkage makes more sense to me after looking at these counts again.

              Comment

              • Michael Love
                Senior Member
                • Jul 2013
                • 333

                #8
                counts(dds, normalized=TRUE) will give normalized counts.

                But then I might recommend using betaPrior=FALSE here, justified by the reasoning that there are very large expected fold changes for individual genes, but not so many large fold changes that the width of the prior adjusts to allow such large fold changes. The shrinking of fold changes requires that the software can estimate the range of reasonable values for LFC by looking at the distribution of LFCs (particularly the upper quantile of the absolute LFC). But there might be precise fold changes which are above this upper quantile, and so the prior is too narrow for the targeted genes. The prior might then be a bad assumption for this dataset, so I think it's reasonable to turn it off here. The inference works just as well without the prior included.

                Comment

                • lmolokin
                  Member
                  • Jul 2012
                  • 24

                  #9
                  Makes sense. Thanks Michael, always a huge help!

                  Comment

                  Latest Articles

                  Collapse

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by SEQadmin2, Yesterday, 10:09 AM
                  0 responses
                  10 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-04-2026, 08:59 AM
                  0 responses
                  20 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-02-2026, 12:03 PM
                  0 responses
                  27 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-02-2026, 11:40 AM
                  0 responses
                  22 views
                  0 reactions
                  Last Post SEQadmin2  
                  Working...