SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DEXSeq log2 fold change error agne Bioinformatics 16 07-23-2014 06:15 AM
RPKM vs Fold Change ppm Bioinformatics 4 05-06-2014 08:49 AM
Cuffdiff output shows log2(fold change) as infinite nr23 Bioinformatics 3 06-07-2013 10:01 AM
How DegSeq calculated log2(Fold change) when one of the read counts is zero... mhjung General 0 03-01-2013 11:22 AM
DEXSeq log2 fold changes yfang01 Bioinformatics 1 02-22-2013 12:36 AM

Reply
 
Thread Tools
Old 12-19-2014, 09:35 AM   #1
aggp11
Member
 
Location: Wisconsin

Join Date: Jun 2011
Posts: 87
Default DESeq2 log2 fold change discrepancy

Hi,

I am working with an RNA-Seq dataset and used DESeq2 to perform differential gene expression analysis using a multi-factor design.

In my results I noticed that the log2FC values were different compared to if I would just take the log ratio of the condition means (as suggested in the following post):

http://grokbase.com/t/r/bioconductor...th-zero-counts

I checked my design matrices and all the data and I just can't understand why this is different.

Below is the code I used:

library(DESeq2)
coldata <- read.table("Run1_run2_deseq2_design.txt", sep="\t", header=T, row.names=1)
coldata <- DataFrame(coldata)
countdata <- read.table("Run1_run2_raw_counts.counts", sep="\t", header=T, row.names=1)
countdata.run1.run2 <- cbind(countdata[,c(2:4,6:8,10:12,14:16)])
colnames(countdata.run1.run2) <- rownames(coldata)
ddsFromCounts <- DESeqDataSetFromMatrix(countData=countdata.run1.run2, colData=coldata, design=~Type+Condition+Type:Condition)
ddsFromCounts$Condition <- factor(ddsFromCounts$Condition, levels=c("U","E"))
dds.run1.run2 <- DESeq(ddsFromCounts)



> colData(dds.run1.run2)
DataFrame with 12 rows and 3 columns
Type Condition sizeFactor
<factor> <factor> <numeric>
U_U_1 U U 0.8218577
M_U_1 M U 1.1845552
M_E_1 M E 1.2826040
U_E_1 U E 1.1260139
E_U_1 E U 0.9145150
... ... ... ...
M_U_2 M U 1.4016941
M_E_2 M E 0.7907137
U_E_2 U E 1.4520507
E_U_2 E U 0.4845640
E_E_2 E E 0.9343742


These are the design groups that I can use (resultsNames):

"TypeE" "TypeM" "TypeU" "ConditionU" "ConditionE" "TypeE.ConditionU" "TypeM.ConditionU" "TypeU.ConditionU" "TypeE.ConditionE" "TypeM.ConditionE" "TypeU.ConditionE"


Here are the first six rows of my results for a particular contrast:

> head(results(dds.run1.run2, contrast=list("TypeM.ConditionU","TypeE.ConditionU")))

log2 fold change (MAP): TypeM.ConditionU vs TypeE.ConditionU
Wald test p-value: TypeM.ConditionU vs TypeE.ConditionU
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##### 6.83635425 -0.29831487 0.25660567 -1.1625420 0.245015380 NA
##### 0.77856172 0.02932029 0.12641235 0.2319417 0.816583317 NA
##### 0.10539002 -0.03229292 0.06085041 -0.5306935 0.595631173 NA
##### 2.52917312 0.08576176 0.19506918 0.4396479 0.660192139 NA
##### 320.61150606 0.30625330 0.11398239 2.6868476 0.007212986 0.1360898
##### 0.07568229 0.01855059 0.05945254 0.3120235 0.755022638 NA


Following is some diagnostic code to compare the "results" fold change with the log of condition means for the 5th feature from the results table (rowMeans = 320):

> nc.test <- round(counts(dds.run1.run2, normalized=TRUE)[5,])
> data.frame(type=colData(dds.run1.run2)$Type, condition=colData(dds.run1.run2)$Condition, normcounts=nc.test)
type condition normcounts
U_U_1 U U 301
M_U_1 M U 365
M_E_1 M E 297
U_E_1 U E 346
E_U_1 E U 272
E_E_1 E E 361
U_U_2 U U 337
M_U_2 M U 327
M_E_2 M E 287
U_E_2 U E 345
E_U_2 E U 262
E_E_2 E E 347

> log2(mean(nc.test[colData(dds.run1.run2)$Type == "M" & colData(dds.run1.run2)$Condition == "U"])/mean(nc.test[colData(dds.run1.run2)$Type == "E" & colData(dds.run1.run2)$Condition == "U"]))

[1] 0.3739323


As you'll notice that the log2FC value from results is 0.30625330 whereas it is 0.3739323 from the log of condition mean ratio. I am just not sure what might be affecting this.

I will appreciate any help on this.

Thank you!

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.6.2 RcppArmadillo_0.4.550.1.0 Rcpp_0.11.3 GenomicRanges_1.18.3 GenomeInfoDb_1.2.3 IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 Biobase_2.26.0 BiocParallel_1.0.0 brew_1.0-6 checkmate_1.5.1 cluster_1.15.3
[12] codetools_0.2-9 colorspace_1.2-4 DBI_0.3.1 digest_0.6.6 fail_1.2 foreach_1.4.2 foreign_0.8-61 Formula_1.1-2 genefilter_1.48.1 geneplotter_1.44.0 ggplot2_1.0.0
[23] grid_3.1.2 gtable_0.1.2 Hmisc_3.14-6 iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 locfit_1.5-9.1 MASS_7.3-35 munsell_0.4.2 nnet_7.3-8 plyr_1.8.1
[34] proto_0.3-10 RColorBrewer_1.1-2 reshape2_1.4.1 rpart_4.1-8 RSQLite_1.0.0 scales_0.2.4 sendmailR_1.2-1 splines_3.1.2 stringr_0.6.2 survival_2.37-7 tools_3.1.2
[45] XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0

Last edited by aggp11; 12-19-2014 at 09:39 AM.
aggp11 is offline   Reply With Quote
Old 12-19-2014, 10:24 AM   #2
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Now is a great chance to read the just-publised manuscript ;-)

http://genomebiology.com/2014/15/12/550/abstract (see Fig 2)

or see Fig 1 in the vignette:

vignette("DESeq2")

Relevant is the betaPrior argument of ?DESeq and the addMLE argument of ?results
Michael Love is offline   Reply With Quote
Old 02-19-2015, 12:52 PM   #3
lmolokin
Member
 
Location: Beltsville, MD, USA

Join Date: Jul 2012
Posts: 24
Default high count outlier

Michael,
I have a question related to LFC shrinkage but involving a high count gene instead (the gene with the highest mean in my dataset to be exact). From what I can tell, DESeq2 has deemed this gene a dispersion outlier based on gene's data from the DESeqDataSet object:

baseMean 8.65E+05
baseVar 2.16E+12
allZero FALSE
dispGeneEst 2.888
dispFit 0.030
dispersion 2.888
dispIter 7
dispOutlier TRUE
dispMAP 8.31E-02

Granted its dispersion is high, this gene is the target of my experiment's treatment and is expected to be heavily down regulated in the treated group vs control and according to edgeR by more than a hundred fold. However, the results of the DESeq2 Wald test on this gene yield:

baseMean 864665.543
log2FoldChange -0.127
lfcSE 0.074
stat -1.704
pvalue 0.088
padj 0.533

I realize that outlier LFC's are shruken based on several factors but is it possible to have a massive fold change be shruken to such a low and insignificant value? Is there an alternate approach here that would reduce the severity of the shrinking?

Thanks!
lmolokin is offline   Reply With Quote
Old 02-19-2015, 01:03 PM   #4
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

You can turn off the LFC shrinkage, if you want to go with the MLE LFC estimate by setting betaPrior=FALSE in DESeq(). You will then get an MLE estimate of fold change which is simple (average in group2)/(average in group1). The average can be thrown off when you have high within-group variance, as reflected by the very large dispersion estimate for this gene. What are the counts like for this gene?

Another way to understand the variance of the MLE LFC estimate is that if you removed the sample with the largest count, how much would the estimate change?

Code:
k <- counts(dds, normalized=TRUE)[ gene.idx, ]
idx <- which.max(k)
cond <- dds$condition[-idx]
k <- k[-idx]
log2(mean(k[cond == "B"])/mean(k[cond == "A"]))
Michael Love is offline   Reply With Quote
Old 02-19-2015, 01:28 PM   #5
lmolokin
Member
 
Location: Beltsville, MD, USA

Join Date: Jul 2012
Posts: 24
Default

Here are the individual sample counts for this gene:

tx
subj
counts
control1
1,713,852
control2
494,568
control3
5,675,537
control4
419,712
treated1
3,016
treated2
16,474
treated3
56,382
treated4
6,851
lmolokin is offline   Reply With Quote
Old 02-19-2015, 01:44 PM   #6
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Note that the range within each group is more than x10. So while there is a consistent difference (treatment always less than ~1/8 of the control), there is very large within-group variance, and this results in more shrinkage for the LFC of this particular gene. Are those normalized counts? Also is the model ~ tx or ~ subj + tx?
Michael Love is offline   Reply With Quote
Old 02-19-2015, 01:50 PM   #7
lmolokin
Member
 
Location: Beltsville, MD, USA

Join Date: Jul 2012
Posts: 24
Default

These are just raw counts. Is there a way pull the normalized counts?

Model includes subj to account for subj-to-subj variability.

The high degree of shrinkage makes more sense to me after looking at these counts again.
lmolokin is offline   Reply With Quote
Old 02-19-2015, 02:09 PM   #8
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

counts(dds, normalized=TRUE) will give normalized counts.

But then I might recommend using betaPrior=FALSE here, justified by the reasoning that there are very large expected fold changes for individual genes, but not so many large fold changes that the width of the prior adjusts to allow such large fold changes. The shrinking of fold changes requires that the software can estimate the range of reasonable values for LFC by looking at the distribution of LFCs (particularly the upper quantile of the absolute LFC). But there might be precise fold changes which are above this upper quantile, and so the prior is too narrow for the targeted genes. The prior might then be a bad assumption for this dataset, so I think it's reasonable to turn it off here. The inference works just as well without the prior included.
Michael Love is offline   Reply With Quote
Old 02-20-2015, 09:32 AM   #9
lmolokin
Member
 
Location: Beltsville, MD, USA

Join Date: Jul 2012
Posts: 24
Default

Makes sense. Thanks Michael, always a huge help!
lmolokin is offline   Reply With Quote
Reply

Tags
deseq2, differential expression, rna sequencing

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:45 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO