Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2 log2 fold change discrepancy

    Hi,

    I am working with an RNA-Seq dataset and used DESeq2 to perform differential gene expression analysis using a multi-factor design.

    In my results I noticed that the log2FC values were different compared to if I would just take the log ratio of the condition means (as suggested in the following post):



    I checked my design matrices and all the data and I just can't understand why this is different.

    Below is the code I used:

    library(DESeq2)
    coldata <- read.table("Run1_run2_deseq2_design.txt", sep="\t", header=T, row.names=1)
    coldata <- DataFrame(coldata)
    countdata <- read.table("Run1_run2_raw_counts.counts", sep="\t", header=T, row.names=1)
    countdata.run1.run2 <- cbind(countdata[,c(2:4,6:8,10:12,14:16)])
    colnames(countdata.run1.run2) <- rownames(coldata)
    ddsFromCounts <- DESeqDataSetFromMatrix(countData=countdata.run1.run2, colData=coldata, design=~Type+Condition+Type:Condition)
    ddsFromCounts$Condition <- factor(ddsFromCounts$Condition, levels=c("U","E"))
    dds.run1.run2 <- DESeq(ddsFromCounts)



    > colData(dds.run1.run2)
    DataFrame with 12 rows and 3 columns
    Type Condition sizeFactor
    <factor> <factor> <numeric>
    U_U_1 U U 0.8218577
    M_U_1 M U 1.1845552
    M_E_1 M E 1.2826040
    U_E_1 U E 1.1260139
    E_U_1 E U 0.9145150
    ... ... ... ...
    M_U_2 M U 1.4016941
    M_E_2 M E 0.7907137
    U_E_2 U E 1.4520507
    E_U_2 E U 0.4845640
    E_E_2 E E 0.9343742


    These are the design groups that I can use (resultsNames):

    "TypeE" "TypeM" "TypeU" "ConditionU" "ConditionE" "TypeE.ConditionU" "TypeM.ConditionU" "TypeU.ConditionU" "TypeE.ConditionE" "TypeM.ConditionE" "TypeU.ConditionE"


    Here are the first six rows of my results for a particular contrast:

    > head(results(dds.run1.run2, contrast=list("TypeM.ConditionU","TypeE.ConditionU")))

    log2 fold change (MAP): TypeM.ConditionU vs TypeE.ConditionU
    Wald test p-value: TypeM.ConditionU vs TypeE.ConditionU
    DataFrame with 6 rows and 6 columns
    baseMean log2FoldChange lfcSE stat pvalue padj
    <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
    ##### 6.83635425 -0.29831487 0.25660567 -1.1625420 0.245015380 NA
    ##### 0.77856172 0.02932029 0.12641235 0.2319417 0.816583317 NA
    ##### 0.10539002 -0.03229292 0.06085041 -0.5306935 0.595631173 NA
    ##### 2.52917312 0.08576176 0.19506918 0.4396479 0.660192139 NA
    ##### 320.61150606 0.30625330 0.11398239 2.6868476 0.007212986 0.1360898
    ##### 0.07568229 0.01855059 0.05945254 0.3120235 0.755022638 NA


    Following is some diagnostic code to compare the "results" fold change with the log of condition means for the 5th feature from the results table (rowMeans = 320):

    > nc.test <- round(counts(dds.run1.run2, normalized=TRUE)[5,])
    > data.frame(type=colData(dds.run1.run2)$Type, condition=colData(dds.run1.run2)$Condition, normcounts=nc.test)
    type condition normcounts
    U_U_1 U U 301
    M_U_1 M U 365
    M_E_1 M E 297
    U_E_1 U E 346
    E_U_1 E U 272
    E_E_1 E E 361
    U_U_2 U U 337
    M_U_2 M U 327
    M_E_2 M E 287
    U_E_2 U E 345
    E_U_2 E U 262
    E_E_2 E E 347

    > log2(mean(nc.test[colData(dds.run1.run2)$Type == "M" & colData(dds.run1.run2)$Condition == "U"])/mean(nc.test[colData(dds.run1.run2)$Type == "E" & colData(dds.run1.run2)$Condition == "U"]))

    [1] 0.3739323


    As you'll notice that the log2FC value from results is 0.30625330 whereas it is 0.3739323 from the log of condition mean ratio. I am just not sure what might be affecting this.

    I will appreciate any help on this.

    Thank you!

    > sessionInfo()
    R version 3.1.2 (2014-10-31)
    Platform: i386-w64-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252

    attached base packages:
    [1] parallel stats4 stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] DESeq2_1.6.2 RcppArmadillo_0.4.550.1.0 Rcpp_0.11.3 GenomicRanges_1.18.3 GenomeInfoDb_1.2.3 IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1

    loaded via a namespace (and not attached):
    [1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 Biobase_2.26.0 BiocParallel_1.0.0 brew_1.0-6 checkmate_1.5.1 cluster_1.15.3
    [12] codetools_0.2-9 colorspace_1.2-4 DBI_0.3.1 digest_0.6.6 fail_1.2 foreach_1.4.2 foreign_0.8-61 Formula_1.1-2 genefilter_1.48.1 geneplotter_1.44.0 ggplot2_1.0.0
    [23] grid_3.1.2 gtable_0.1.2 Hmisc_3.14-6 iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 locfit_1.5-9.1 MASS_7.3-35 munsell_0.4.2 nnet_7.3-8 plyr_1.8.1
    [34] proto_0.3-10 RColorBrewer_1.1-2 reshape2_1.4.1 rpart_4.1-8 RSQLite_1.0.0 scales_0.2.4 sendmailR_1.2-1 splines_3.1.2 stringr_0.6.2 survival_2.37-7 tools_3.1.2
    [45] XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0
    Last edited by aggp11; 12-19-2014, 09:39 AM.

  • #2
    Now is a great chance to read the just-publised manuscript ;-)

    http://genomebiology.com/2014/15/12/550/abstract (see Fig 2)

    or see Fig 1 in the vignette:

    vignette("DESeq2")

    Relevant is the betaPrior argument of ?DESeq and the addMLE argument of ?results

    Comment


    • #3
      high count outlier

      Michael,
      I have a question related to LFC shrinkage but involving a high count gene instead (the gene with the highest mean in my dataset to be exact). From what I can tell, DESeq2 has deemed this gene a dispersion outlier based on gene's data from the DESeqDataSet object:

      baseMean 8.65E+05
      baseVar 2.16E+12
      allZero FALSE
      dispGeneEst 2.888
      dispFit 0.030
      dispersion 2.888
      dispIter 7
      dispOutlier TRUE
      dispMAP 8.31E-02

      Granted its dispersion is high, this gene is the target of my experiment's treatment and is expected to be heavily down regulated in the treated group vs control and according to edgeR by more than a hundred fold. However, the results of the DESeq2 Wald test on this gene yield:

      baseMean 864665.543
      log2FoldChange -0.127
      lfcSE 0.074
      stat -1.704
      pvalue 0.088
      padj 0.533

      I realize that outlier LFC's are shruken based on several factors but is it possible to have a massive fold change be shruken to such a low and insignificant value? Is there an alternate approach here that would reduce the severity of the shrinking?

      Thanks!

      Comment


      • #4
        You can turn off the LFC shrinkage, if you want to go with the MLE LFC estimate by setting betaPrior=FALSE in DESeq(). You will then get an MLE estimate of fold change which is simple (average in group2)/(average in group1). The average can be thrown off when you have high within-group variance, as reflected by the very large dispersion estimate for this gene. What are the counts like for this gene?

        Another way to understand the variance of the MLE LFC estimate is that if you removed the sample with the largest count, how much would the estimate change?

        Code:
        k <- counts(dds, normalized=TRUE)[ gene.idx, ]
        idx <- which.max(k)
        cond <- dds$condition[-idx]
        k <- k[-idx]
        log2(mean(k[cond == "B"])/mean(k[cond == "A"]))

        Comment


        • #5
          Here are the individual sample counts for this gene:

          No code has to be inserted here.

          Comment


          • #6
            Note that the range within each group is more than x10. So while there is a consistent difference (treatment always less than ~1/8 of the control), there is very large within-group variance, and this results in more shrinkage for the LFC of this particular gene. Are those normalized counts? Also is the model ~ tx or ~ subj + tx?

            Comment


            • #7
              These are just raw counts. Is there a way pull the normalized counts?

              Model includes subj to account for subj-to-subj variability.

              The high degree of shrinkage makes more sense to me after looking at these counts again.

              Comment


              • #8
                counts(dds, normalized=TRUE) will give normalized counts.

                But then I might recommend using betaPrior=FALSE here, justified by the reasoning that there are very large expected fold changes for individual genes, but not so many large fold changes that the width of the prior adjusts to allow such large fold changes. The shrinking of fold changes requires that the software can estimate the range of reasonable values for LFC by looking at the distribution of LFCs (particularly the upper quantile of the absolute LFC). But there might be precise fold changes which are above this upper quantile, and so the prior is too narrow for the targeted genes. The prior might then be a bad assumption for this dataset, so I think it's reasonable to turn it off here. The inference works just as well without the prior included.

                Comment


                • #9
                  Makes sense. Thanks Michael, always a huge help!

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM
                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 03-27-2024, 06:37 PM
                  0 responses
                  15 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-27-2024, 06:07 PM
                  0 responses
                  13 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-22-2024, 10:03 AM
                  0 responses
                  55 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-21-2024, 07:32 AM
                  0 responses
                  70 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X