PDA

View Full Version : DESeq2


Simon Anders
03-13-2013, 09:19 AM
Hi,

An announcement of interest to users of DESeq:

Mike Love, Wolfgang Huber and I have been updating the DESeq package. This resulted in the package DESeq2, which is already now available (http://www.bioconductor.org/packages/devel/bioc/html/DESeq2.html) from the Bioconductor development branch, and scheduled to be included in the next Bioconductor release.

For several release cycles, the original package (DESeq) will be maintained at its current functionality, in order to not disrupt the workflows of DESeq users. For new projects, we recommend using DESeq2. Major innovations are:

* Base class: SummarizedExperiment (from the GenomicRanges package) is used as the superclass for storing the data, rather than eSet. This allows closer integration with upstream workflows involving GenomicRanges features, such as summarizeOverlaps, and facilitates downstream analyses of the genomic regions of interest.

* Simplified workflow: the wrapper function DESeq() performs all steps for a differential expression analysis. The individual steps are of course also accessible.

* More powerful statistics: incorporation of prior distributions into the estimation of dispersions and fold changes (empirical-Bayes shrinkage). The dispersion shrinkage improves power compared to the old DESeq. The fold changes shrinkage help moderate the otherwise large spread in log fold changes for genes with low counts, while it has negligible effect on genes with high counts; it may be particularly useful for visualisation, clustering, classification, ordination (PCA, MDS), similar to the variance-stabilizing transformation in the old DESeq. A Wald test for significance is provided as the default inference method, with the chi-squared test of the previous version is also available. A manuscript is in preparation.

* Normalization: it is possible to provide a matrix of sample- and gene-specific normalization factors, which allows the use of normalisation factors from Bioconductor packages such as cqn and EDASeq.

Examples of usage are provided in the vignette, and more details are available in the manual pages (specifically, the DESeq function and estimateDispersions function).

Enjoy -

Mike, Simon, Wolfgang.

chadn737
03-13-2013, 10:33 AM
Exciting news, thanks Simon. You guys have created some of the best tools out there and I am excited to see what this offers.

PS. I notice that the vignette is as well written as your last and puts the details on a level that people like me can easily grasp. Thanks.

turnersd
03-13-2013, 10:38 AM
Great news, thanks Simon, Mike, Wolfgang. Looking forward to going through the vignette. Are you publishing any comparisons with the original DESeq and/or other tools?

EGrassi
03-27-2013, 01:09 AM
Thanks! Looking forward to the new options.

Is it just me or the bioconductor linked vignette is a 16Mb pdf with 10701 pages and a lot of repetitions starting from page 6?

Simon Anders
03-27-2013, 01:15 AM
Is it just me or the bioconductor linked vignette is a 16Mb pdf with 10701 pages and a lot of repetitions starting from page 6?

Yes, that's a bug that Mike has already fixed. The corrected vignette should become available today or tomorrow.

FLYINGDOLPHIN
05-29-2013, 06:27 PM
Hi,

I'd love to install deseq2. But I have a problem in installing in.


> source("http://bioconductor.org/biocLite.R")
BioC_mirror = http://www.bioconductor.org
Change using chooseBioCmirror().
> biocLite("DESeq2")
Using R version 2.11.1, biocinstall version 2.6.10.
Installing Bioconductor version 2.6 packages:
[1] "DESeq2"
Please wait...

Warning message:
In getDependencies(pkgs, dependencies, available, lib) :
package ‘DESeq2’ is not available

I have no problem in installing deseq though.

Thanks a lot!

Q

chadn737
05-29-2013, 06:29 PM
I would first try making sure R is up to date.

FLYINGDOLPHIN
05-29-2013, 06:35 PM
oh, it is a R package from the core installation.
R version 2.11.1, biocinstall version 2.6.10.

Thanks.
Q

Simon Anders
05-29-2013, 10:51 PM
If you have an old R, the BiocLite script will pull matching old Bioconductor packages. DESeq2 is new, and while you could install it manually on an old R, the recommended way is to use R 3.0.0, and then biocLite will find and install it automatically.

FLYINGDOLPHIN
05-30-2013, 03:29 AM
Thanks!
Q

Heisman
05-30-2013, 06:39 AM
As said above, you guys do great things. Question; do you believe that this package would be good for determining differentially methylated regions from enrichment data (ie, MBD-seq or MeDIP-seq)? I know there are some papers using DESeq for this purpose, so I imagine so, but I was curious if you have any specific thoughts or caveats in mind.

FLYINGDOLPHIN
05-30-2013, 09:57 AM
Hi,

Does deseq development version better than the release version? The manual that i found on Bioconductor seems to aim for the development version since the "normalize=True" does not work. I am just wondering whether it is worthwhile to switch to a development version and where to find it?

Thanks.
Q

trpc
07-25-2013, 10:19 AM
Hi
This is my DESeqDataSet object

> Genes
class: DESeqDataSet
dim: 21937 4
exptData(0):
assays(1): counts
rownames(21937): 0610007C21Rik 0610007L01Rik ... Zzef1 Zzz3
rowData metadata column names(0):
colnames(4): 13-5.bam 13-5-5.bam E15-2.sorted.bam E15-5.sorted.bam
colData names(2): fileName E13E15

Since I want to test only "13-5.bam" and "13-5-5.bam", so I set colData(Genes)$E13E15 like this:

>colData(Genes)$E13E15 <- factor(colData(Genes)$E13E15,levels=c("E13WT","E13KO"))
> colData(Genes)
DataFrame with 4 rows and 2 columns
fileName E13E15
<BamFileList> <factor>
13-5.bam ######## E13WT
13-5-5.bam ######## E13KO
E15-2.sorted.bam ######## NA
E15-5.sorted.bam ######## NA

However, when I change the DESeqDataSet object into Deseq object, it reports error:

> Genes<-DESeq(Genes)
estimating size factors
estimating dispersions
same number of samples and coefficients to fit, estimating dispersion by treating samples as replicates
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting generalized linear model

error: element-wise multiplication: incompatible matrix dimensions: 4x1 and 2x1

Error:

zhang51
08-13-2013, 07:24 AM
Dear Simon and everyone,

I have downloaded DESeq2 package with no trouble. But when I tried to run your example R code, I got trouble even with the first line "dds <- DESeqDataSet(se = se, design = ~ condition)". The error message is "error in evaluating the argument 'x' in selecting a method for function 'assays': Error: object 'se' not found".

I have run "library(DESeq2)" at the very beginning. I wonder what else I should do to make it work?

I used R-3.0.1 for the installation.

Thanks in advance.

Jenny

Simon Anders
08-13-2013, 07:27 AM
Which example code are your running, i.e., what documentation are you reading?

dpryan
08-13-2013, 07:33 AM
Dear Simon and everyone,

I have downloaded DESeq2 package with no trouble. But when I tried to run your example R code, I got trouble even with the first line "dds <- DESeqDataSet(se = se, design = ~ condition)". The error message is "error in evaluating the argument 'x' in selecting a method for function 'assays': Error: object 'se' not found".

I have run "library(DESeq2)" at the very beginning. I wonder what else I should do to make it work?


Assuming you're following the vignette, you missed a few lines, which would have setup "se".

library("parathyroidSE")
data("parathyroidGenesSE")
se <- parathyroidGenesSE
colnames(se) <- colData(se)$run

Edit: Just go to the next page of the vignette, I expect what you're looking at is the "Quick Start" section.

zhang51
08-13-2013, 10:11 AM
To Simon,

I'm running code in the file http://www.bioconductor.org/packages/2.13/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf

I encountered the same problem when running meanSdPlot function. It works only when normalized=FALSE. There is an error message when using normalized=TRUE. Please see below:

In is.na(sizeFactors(object)) :
is.na() applied to non-(list or vector) of type 'NULL'
Error in meanSdPlot(log2(counts(dds, normalized = TRUE)[notAllZero, ] + :
error in evaluating the argument 'x' in selecting a method for function 'meanSdPlot': Error in .local(object, ...) :
first calculate size factors, add normalizationFactors, or set normalized=FALSE

What would be the solution to this problem?

Thanks

zhang51
08-13-2013, 10:20 AM
I figured it out. I did not run DESeq function first before making this plot. Now it works.

Thanks

haggardd
08-22-2013, 11:38 AM
Hello,

DESeq was great in that the nbinomTest() function allowed for me to specify which conditions I wanted to run the test on. However, DESeq2 appears to not have this functionality. I have data that was run through HTSeq and so am using the DESeqDataSetFromHTSeqCount() function. I am analyzing data that underwent two different exposure concentrations at different percentages of vehicle (i.e. 1% vehicle control, 10uM chemical in 1% vehicle, 0.1% vehicle control, and 1uM chemical in 0.1% vehicle). There are four total conditions for this experiment and I want to compare the expression between pairs with the same percentage of vehicle (i.e. 1% vehicle control versus 10uM chemical in 1% vehicle). However, when I run DESeq() the resultsNames() I get to choose from are only from four comparisons and not the pairwise comparisons I need. Is this possible in DESeq2 as it was in the original DESeq to specify which conditions I want to test? Or would it just be easier to initially have the comparisons I want to analyze already separated instead of how I am doing it currently? Thoughts?

Thanks!

Michael Love
08-22-2013, 10:25 PM
For making comparisons of multiple conditions (not only against the base level of a condition), we have recently implemented contrasts in the development branch. This allows one to fit a single model, then generate log2 fold change estimates, standard errors and tests of null hypotheses for other comparisons.

The functionality is described in section 3.2 of the vignette, 'Contrasts' and in the man page for ?results, for DESeq2 version 1.1.x which is paired with Bioc 2.13.

http://bioconductor.org/packages/2.13/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf

You can either try using the devel branch of Bioconductor...

http://bioconductor.org/developers/how-to/useDevel/

or this version will be released in Oct 15, 2013:

http://bioconductor.org/developers/release-schedule/

NateP
08-27-2013, 07:40 AM
Glad to see contrasts appear in DESeq2, thanks!

One minor thing I noticed when running contrasts:
When you use the Cooks Outlier filtering on the base analysis, the outlier genes (with "NA" pvalues in the regular analysis) are no longer filtered when you do a contrast. I'm not sure if that is intentional or not?

Michael Love
08-27-2013, 08:18 AM
You're right. Thanks for pointing this out.

I will work on a fix, probably moving the cooksCutoff argument to results() function, so it can be modified without rerunning the GLM fitting.

Michael Love
08-29-2013, 06:18 AM
I committed this fix in version 1.1.31 of the devel branch. It made sense to move the outlier detection by Cook's distance as well as all p-value adjustment to the results() function.

KHubbard
10-16-2013, 08:36 AM
Hello,

In DESeq, it was possible to view/export the baseMean values for each condition (baseMeanA and baseMeanB) in addition to the overall baseMean. In DESeq2, I have only been able to view/export the overall baseMean. Is it possible to acquire the baseMean values for each condition as well? Thanks!

Michael Love
10-16-2013, 08:53 AM
We generalized the code, and as these columns are only appropriate for a 2 condition analysis, they were dropped.

You can get these with some custom code though:

baseMeanA <- rowMeans(counts(dds,normalized=TRUE)[,colData(dds)$condition == "A"])

More general:

baseMeanPerLvl <- sapply(levels(colData(dds)$condition), function(lvl) rowMeans(counts(dds,normalized=TRUE)[,colData(dds)$condition == lvl]))

alisrpp
10-16-2013, 09:57 AM
Hi,

I'm new in DESeq2 and i'm having problems to create my dds.

I have a matrix with eXpress results (read counts) with 2 conditions (SP, EB) and 2 replicates/condition.

This is what i'm doing:

Cele_SPvsEB = read.csv (file.choose(), header=TRUE, row.names=1)
CeleDesign <- data.frame(
row.names = colnames(Cele_SPvsEB),
condition = factor(c("SP", "SP", "EB", "EB")))
dds <- DESeqDataSetFromMatrix(countData = Cele_SPvsEB,
colData = CeleDesign,
design = ~ condition)
dds

And i'm getting this error message:

Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object '.setDummyField' not found

Any help?
Thanks.

Michael Love
10-16-2013, 10:04 AM
Can you show the output of traceback()

and also can you show the output of head(Cele_SPvsEB)

alisrpp
10-16-2013, 10:13 AM
Sure!

> traceback()
14: get(name, envir = asNamespace(pkg), inherits = FALSE)
13: methods:::.setDummyField
12: (function (value)
{
if (missing(value))
`.->data`
else {
methods:::.setDummyField(.self, ".->data", "SimpleList",
"data", FALSE, value)
value
}
})(quote(<S4 object of class "SimpleList">))
11: assign(field, value, envir = env)
10: envRefSetField(.Object, field, classDef, selfEnv, elements[[field]])
9: methods::initRefFields(.Object, classDef, selfEnv, list(...))
8: initialize(value, ...)
7: initialize(value, ...)
6: methods::new(def, ...)
5: .ShallowSimpleListAssays$new(data = assays)
4: .local(assays, ...)
3: SummarizedExperiment(assays = SimpleList(counts = countData),
colData = colData, ...)
2: SummarizedExperiment(assays = SimpleList(counts = countData),
colData = colData, ...)
1: DESeqDataSetFromMatrix(countData = Cele_SPvsEB, colData = CeleDesign,
design = ~condition)

> head(Cele_SPvsEB)
C4_APR10_spermatic_cysts CRL_2APR10_spermatic_cysts CRL_1_15JUL11_embryos
comp100_c0_seq1 21 24 0
comp100004_c0_seq1 48 0 494
comp1000074_c0_seq1 42 0 55
comp1000109_c0_seq1 32 18 0
comp100011_c0_seq1 40 3 298
comp100013_c0_seq1 92 0 62
CRL_2_15JUL11_embryos
comp100_c0_seq1 0
comp100004_c0_seq1 0
comp1000074_c0_seq1 0
comp1000109_c0_seq1 0
comp100011_c0_seq1 0
comp100013_c0_seq1 0

Michael Love
10-16-2013, 10:29 AM
Googling the setDummyField error I came to this thread

http://permalink.gmane.org/gmane.science.biology.informatics.conductor/48611

Where they suggest there might be a problem with using R 3.0.0 with packages built with R 3.0.1. Could this be the case? Can you paste your sessionInfo()

alisrpp
10-16-2013, 01:04 PM
Done! I updated R and now is working perfecly.
Thanks Michael!

alisrpp
10-16-2013, 01:35 PM
Now i'm getting this error:

Warning messages:
1: glm.fit: algorithm did not converge
2: In estimateDispersionsFit(object, fitType = fitType, quiet = quiet) :
parametric fit failed, trying local fit. use plotDispEsts() to check the quality of the local fit

This is my code:

#Count matrix input
Cele_SPvsEB = read.csv (file.choose(), header=TRUE, row.names=1)
CeleDesign <- data.frame(
row.names = colnames(Cele_SPvsEB$target_id),
condition = factor(c("SP", "SP", "EB", "EB")))
dds <- DESeqDataSetFromMatrix(countData = Cele_SPvsEB,
colData = CeleDesign,
design = ~ condition)
dds

I obtained this:

class: DESeqDataSet
dim: 198821 4
exptData(0):
assays(1): counts
rownames(198821): comp100_c0_seq1 comp100004_c0_seq1 ... comp99994_c0_seq1 comp99999_c0_seq1
rowData metadata column names(0):
colnames(4): 1 2 3 4
colData names(1): condition

And after running:

#Differential expression analysis
dds <- DESeq(dds)

(...) the error warning is appearing.

Any idea how to fix that?

I also want to change the colnames, how can i do that?

Thanks.

Michael Love
10-17-2013, 05:11 AM
You don't need to fix anything.

There are 3 levels of messages to the user in R: message, warning and stop/error. A warning is a non-fatal problem. In this case, I am warning the user that the default method for fitting the dispersion trend line didn't work so the software is doing something else instead.

The way to read the warning is that "algorithm did not converge" within the function estimateDispersionsFit. Then it says that the parametric fit failed, and that a local fit will be tried instead.

If you look up ?estimateDispersionsFit it will link over to ?estimateDispersions, here I describe the 3 options for the fitType argument, and the default is parametric. The local fit usually does well as a substitute for the parametric fit, but I also warn the user that they should manually examine the plot generated by plotDispEsts(dds). You should check this plot to make sure the local fit is capturing the trend in the black points, as it should be. If the local fit does not seem to capture the trend in the black points, you can use the last option fitType="mean" instead. Note that the fitType arguments can be provided to DESeq() as well as to estimateDispersions().

For changing columns names, check ?colnames.

alisrpp
10-17-2013, 04:25 PM
The local fit usually does well as a substitute for the parametric fit, but I also warn the user that they should manually examine the plot generated by plotDispEsts(dds). You should check this plot to make sure the local fit is capturing the trend in the black points, as it should be.

This is my dispersion plot, is not exactly like in the manual... so i don't know what to think.
Can you give me your opinion, please?

Thanks,

Michael Love
10-17-2013, 07:19 PM
As there is not a clear dependence of the dispersion estimates on the mean of normalized counts, I would recommend to use fitType="mean".

sindrle
10-19-2013, 06:06 AM
This is what I got with default setting (parametric), does it look ok or should I try mean or local fit?

One question:
What happens when I increase the "maxit" value? From 100 to 200, 159 was reduced to 110 rows that did not converge in beta. I dont know what this means.

Thanks!

2572

Michael Love
10-19-2013, 06:28 AM
This dispersion plot looks good/typical.

re:maxit: Unlike with linear regression, there is not a closed form solution for generalized linear models (the model used by DESeq2). Instead there is an iterative algorithm to find the optimal betas (the log fold changes). 'maxit' is the maximum number of iterations to run, and if the convergence criteria has not been met then the software prints a warnings and tells you how to find these genes for which this was the case.

However, I believe most datasets would have all genes converge if you used DESeq2 v1.2 which was just released.

This involves upgrading R to 3.0.2 and Bioconductor to v2.13. Information here:

http://www.bioconductor.org/install/

sindrle
10-19-2013, 05:45 PM
I upgraded all as you said, but still 159 not fitted at maxit 100.

Michael Love
10-20-2013, 01:49 PM
Then these might just be difficult rows for some reason.

You can examine them with:

head(counts(dds[!mcols(dds)$betaConv,]))

You can exclude them from the results table with:

res <- results(dds[mcols(dds)$betaConv,])

alisrpp
10-21-2013, 08:04 AM
As there is not a clear dependence of the dispersion estimates on the mean of normalized counts, I would recommend to use fitType="mean".

Hi,

By doing what you suggested:

dds<- estimateSizeFactors(dds)
ddsMean <- estimateDispersions(dds, fitType="mean")
ddsMean <- nbinomWaldTest(ddsMean)
plotDispEsts(ddsMean)

I'm obtaining the subsequent plot.
It's really weird, right?
Should i continue with fitType="mean"?

Thanks.

Michael Love
10-21-2013, 01:50 PM
There is a lot of shrinkage of dispersion estimates (the blue points are close to the red line) presumably because you have few degrees of freedom (number of samples minus number of parameters to estimate). Yes, I would use fitType="mean" for this dataset as there is not a clear trend.

alisrpp
10-21-2013, 08:06 PM
There is a lot of shrinkage of dispersion estimates (the blue points are close to the red line) presumably because you have few degrees of freedom (number of samples minus number of parameters to estimate). Yes, I would use fitType="mean" for this dataset as there is not a clear trend.

Hi Michael,

First of all, thank you for all the help.

I tried fitType="mean" with the code:

#Count matrix input
Cele_SPvsLR = read.csv (file.choose(), header=TRUE, row.names=1)
CeleDesign <- data.frame(
row.names = colnames(Cele_SPvsLR),
condition = factor(c("SP", "SP", "LR", "LR")))
dds <- DESeqDataSetFromMatrix(countData = Cele_SPvsLR,
colData = CeleDesign,
design = ~ condition)
dds


#Est size factor = normalize for library size
dds<- estimateSizeFactors(dds)
ddsMean <- estimateDispersions(dds, fitType="mean")
ddsMean <- nbinomWaldTest(ddsMean)
#For a parametric dispersion
dds <- DESeq(dds)
#plot dispersion
plotDispEsts(ddsMean)
plotDispEsts(dds)

#Differential expression analysis
resultsNames(ddsMean)
res <- results(ddsMean, name= "condition_SP_vs_LR")
res <- res[order(res$padj),]
head(res)
plotMA(ddsMean)
sum(res$padj < .1, na.rm=TRUE)

This resulted in:

> sum(res$padj < .1, na.rm=TRUE)
[1] 0

But when I tried fitType="local" I get:

> sum(res$padj < .1, na.rm=TRUE)
[1] 337

(*see attached plots)

The other "weird" (at least for me) thing is that by filtering by overall count to my fitType="local" results I get less genes:

> plot(res$baseMean, pmin(-log10(res$pvalue), 50),
+ log="x", xlab="mean of normalized counts",
+ ylab=expression(-log[10](pvalue)))
Warning message:
In xy.coords(x, y, xlabel, ylabel, log) :
19407 x values <= 0 omitted from logarithmic plot
> abline(v=10, col="red", lwd=1)
> use <- res$baseMean >= 10 & !is.na(res$pvalue)
> table(use)
use
FALSE TRUE
93445 105376
> resFilt <- res [use, ]
> resFilt$padj <- p.adjust(resFilt$pvalue, method="BH")
> sum(resFilt$padj < .1, na.rm=TRUE)
[1] 152

Just as a refresher, I'm working with 4 samples (2 replicates/reproductive stages).

Do you think that it is a bad idea to use fitType="local" given that it is the fitType that gives me results?

Can you help me understand why by filtering my "local" results I am obtaining less genes?

Thanks, thanks, and thanks.

Michael Love
10-22-2013, 04:46 AM
The local fit helps you out by dipping down at the high counts, giving you lower estimates of dispersion there than by taking the mean overall. The large circles are genes whose dispersion estimates are not shrunk because they deviate too far from the fitted line. The local fit is useable as well I would say.

I presume -- because you haven't posted the output of sessionInfo() -- that you could be using a version of DESeq2 which performs automatic independent filtering to optimize the number of rejections. Check the Details section of ?results and the argument 'independentFiltering'.

sindrle
10-22-2013, 07:37 AM
Hi!
Quick question; what is the default normalisation method in DESeq2?
Cant find anything in the manual.

Thanks!

Michael Love
10-22-2013, 07:45 AM
estimateSizeFactors() is the same function used for normalization as in DESeq, so the method is the same median ratio method as in DESeq. I can try to make this more clear in the man page for DESeq().

sindrle
10-22-2013, 07:59 AM
Ok, but typing in "estimateSizeFactors()" to look for options gives no info just (object, ...)

But what is the default? Im curious because Im running edgeR with "upper quartile".

Thanks!

dpryan
10-22-2013, 08:06 AM
If you're trying to do a comparison, the edgeR equivalent would be the "RLE" method in calcNormFactors(). estimateSizeFactors() doesn't take an option in this regard (see help(estimateSizeFactors) for what options it does have).

sindrle
10-22-2013, 08:08 AM
Ah!
Always a good answer from you. Thanks!

alisrpp
10-22-2013, 12:38 PM
The local fit helps you out by dipping down at the high counts, giving you lower estimates of dispersion there than by taking the mean overall. The large circles are genes whose dispersion estimates are not shrunk because they deviate too far from the fitted line. The local fit is useable as well I would say.

I presume -- because you haven't posted the output of sessionInfo() -- that you could be using a version of DESeq2 which performs automatic independent filtering to optimize the number of rejections. Check the Details section of ?results and the argument 'independentFiltering'.

Here is my sessionInfo()

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] gplots_2.11.3 MASS_7.3-29 KernSmooth_2.23-10 caTools_1.14
[5] gdata_2.13.2 gtools_3.1.0 RColorBrewer_1.0-5 DESeq2_1.2.0
[9] RcppArmadillo_0.3.920.1 Rcpp_0.10.5 GenomicRanges_1.14.1 XVector_0.2.0
[13] IRanges_1.20.0 BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 bitops_1.0-6
[5] DBI_0.2-7 genefilter_1.44.0 lattice_0.20-24 locfit_1.5-9.1
[9] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 survival_2.37-4
[13] tools_3.0.2 XML_3.95-0.2 xtable_1.7-1

I'm still having some questions about the independent filtering. I am understanding that I have to perform it right after the:

dds<- estimateSizeFactors(dds)
ddsLocal <- estimateDispersions(dds, fitType="local")
ddsLocal <- nbinomWaldTest(ddsLocal)
res <- results(ddsLocal, name= "condition_MR_vs_BR")
res <- res[order(res$padj),]

With the code:

use <- res$baseMean >= 10 & !is.na(res$pvalue)
resFilt <- res [use, ]
resFilt$padj <- p.adjust(resFilt$pvalue, method="BH")

And then continue with (for example):

#filter for upregulated and downregulated genes
resSig <- resFilt[ resFilt$padj < .1, ]
write.table(resSig[ order( resSig$log2FoldChange, -resSig$baseMean ), ]
,"~\\DESeq2\\BRvsMR_old_DownRegulated.txt")
write.table(resSig[ order( -resSig$log2FoldChange, -resSig$baseMean ),
],"~\\DESeq2\\BRvsMR_old_UpRegulated.txt")

But i also found this code:

filterThreshold <- 2.0
keep <- rowMeans ( counts(ddsLocal, normalized=TRUE)) > filterThreshold

And now i'm really confused. What is this for? (the last code) When i have to use it? Am I doing something wrong?

Thank you again for the help.

Michael Love
10-22-2013, 12:51 PM
You are using DESeq2 v1.2 and reading the documentation off the Bioc website for version v1.0.

This is why I recommended you to look up the man page for ?results.

And it's always safer to use:

browseVignettes("DESeq2")

to ensure you are reading the documentation for the version you are using.

alisrpp
10-22-2013, 02:33 PM
You are using DESeq2 v1.2 and reading the documentation off the Bioc website for version v1.0.

This is why I recommended you to look up the man page for ?results.

And it's always safer to use:

browseVignettes("DESeq2")

to ensure you are reading the documentation for the version you are using.

Ups! Thanks!

By using:

browseVignettes("DESeq2")

i am obtaining

/library/DESeq2/doc/DESeq2.pdf not found

Any idea about why this is happening?

i4u412
10-23-2013, 02:56 AM
Hi

I just started using DEseq2, so please correct if i am wrong.
My sample description has five time points (with 2 replicates).
(t0,t0,t1,t1,t2,t2,t3,t3,t4,t4,t5,t5)

if i set the timepoint design, i always get the differential expression through initial time point (control).

In addition to this, i also want to have differential expression with respect to the immediate time points (like t1 to t0, t2 to t1, t3 to t2, t4 to t3 and t5 to t4). Can you please tell me how to set the factor levels to get this type of comparison.

thank you

alisrpp
10-24-2013, 09:12 AM
Hi,

I'm using DESeq2_1.2.0 and I am having this error:

> resSig <- res[ res$padj < .1, ]
Error in normalizeSingleBracketSubscript(i, x, byrow = TRUE, exact = FALSE) :
subscript contains NAs

It's really weird because this error started today and I have been using the same datasets and the same Rscript for the last 3 weeks without this error coming up.

Just in case, here is my code:

#Count matrix input
Cele_SPvsLR_old = read.csv (file.choose(), header=TRUE, row.names=1)
CeleDesign <- data.frame(
row.names = colnames(Cele_SPvsLR_old),
condition = factor(c("SP", "SP", "LR", "LR")))
dds <- DESeqDataSetFromMatrix(countData = Cele_SPvsLR_old,
colData = CeleDesign,
design = ~ condition)
dds

#Est size factor = normalize for library size
dds<- estimateSizeFactors(dds)
ddsLocal <- estimateDispersions(dds, fitType="local")
ddsLocal <- nbinomWaldTest(ddsLocal)
plotDispEsts(ddsLocal)

#Differential expression analysis
resultsNames(ddsLocal)
res <- results(ddsLocal, name= "condition_SP_vs_LR")
res <- res[order(res$padj),]
head(res)
plotMA(ddsLocal, ylim=c(-2,2), main="DESeq2")
sum(res$padj < .1, na.rm=TRUE)

#filter for significant genes
resSig <- res[ res$padj < 0.1, ]

And my sessionInfo()

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.2.0 RcppArmadillo_0.3.920.1 Rcpp_0.10.5 GenomicRanges_1.14.1
[5] XVector_0.2.0 IRanges_1.20.0 BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 DBI_0.2-7
[5] genefilter_1.44.0 grid_3.0.2 lattice_0.20-24 locfit_1.5-9.1
[9] RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2
[13] survival_2.37-4 tools_3.0.2 XML_3.95-0.2 xtable_1.7-1

Can anyone give me an explanation about why am I having this error now but not in the past using exactly the same Rscript and datasets? Does anyone know how to fix it?

Thanks!

Simon Anders
10-27-2013, 01:52 AM
Please use
resSig <- res[ which( res$padj < .1 ), ]
to get rid of the NAs.

sindrle
10-27-2013, 09:24 AM
Can I ask why you set it to .1 ?

fpesce
11-12-2013, 01:56 AM
Hi,
I am using DESeq2 version 1.2.5 and when I use this design:

> design(dds) <- formula(~ AGE + SEX + CONDITION)

Defining the variables:
> colData(dds)$CONDITION <- relevel(colData(dds)$CONDITION, "R")
> colData(dds)$SEX <- relevel(colData(dds)$SEX, "M")
Whereas AGE is a continuous variable

I get this error:
> dds <- DESeq(dds)
using pre-existing size factors
estimating dispersions
you had estimated dispersions, replacing these
gene-wise dispersion estimates
error: inv(): matrix appears to be singular

On the other hand, there are no problems with these combinations:
> design(dds) <- formula(~ AGE)
> design(dds) <- formula(~ CONDITION)
> design(dds) <- formula(~ SEX + CONDITION)

While I get the same error with:
> design(dds) <- formula(~ AGE + CONDITION)

Any clue what's going on here?
Thank you so much in advance for your help

dpryan
11-12-2013, 02:53 AM
Could "AGE" have been converted to a factor at some point?

fpesce
11-12-2013, 03:10 AM
Could "AGE" have been converted to a factor at some point?

No it has not. I checked:

> is(colData(dds)$AGE)
[1] "numeric" "vector" "atomic" "vectorORfactor"

Thanks for the suggestion

Michael Love
11-13-2013, 11:13 AM
Can you please post colData(dds) (or some scrubbed version of it)? This helps me figure out what the design matrix looks like, and therefore why some designs are causing errors.

fpesce
11-14-2013, 01:20 AM
Can you please post colData(dds) (or some scrubbed version of it)? This helps me figure out what the design matrix looks like, and therefore why some designs are causing errors.

Here it is:
> colData(dds)
DataFrame with 122 rows and 4 columns
CONDITION KEEP AGE SEX
<factor> <integer> <numeric> <factor>
ID1 A 1 47.06913 F
ID2 A 1 45.33333 M
ID3 A 1 59.63039 F
ID4 A 1 49.00753 M
ID5 A 1 34.94319 M
... ... ... ... ...
ID118 B 1 30.00684 M
ID119 B 1 41.90828 F
ID120 B 1 39.04449 F
ID121 B 1 39.64682 M
ID122 B 1 16.70363 M

IsBeth
11-14-2013, 04:49 AM
Hello! =)

I'm a beginner (Blutiger Anfänger) at bioinformatics and so I've got a beginners question, sorry! :o Nowadays I'm trying to run DESeq2 with a dataset without replicates. It's a dataset with two columns (2 conditions, "control" and "infected").
Which method do you recommend me to estimate the dispersion of my data? ( "pooled", "pooled-CR", "per-condition", "blind"?). I was reading about these methods but it's difficult for me to say which one I should use.:confused:

If I just use the command "estimateDispersions" without specifying the method to make use of, which one is applied?

What should I be careful with?


Thanks in advance!

Michael Love
11-14-2013, 08:05 AM
fpesce,

sorry I can't replicate any bug with mixed numeric and categorical variables for >100 samples. If you can email me a small reproducible example (maybe subsetting the rows of dds) I can look into it. My email is listed in help(package="DESeq2")

But I would recommend turning age into a categorical variable. You can split it into biologically meaningful groups: <18 y/o, 19-35, etc. Then the average effect of each group is controlled for, regardless of the trend over years. This often makes more sense rather than assuming a constant log2 fold change per year.

You can turn numerics into factors using cut():

> age <- 40:50
> cut(age, breaks=c(0,30,45,60,200))
[1] (30,45] (30,45] (30,45] (30,45] (30,45] (30,45] (45,60] (45,60] (45,60]
[10] (45,60] (45,60]
Levels: (0,30] (30,45] (45,60] (60,200]

Michael Love
11-14-2013, 08:12 AM
IsBeth,

There is information about the methods used by estimateDispersions in the ?estimateDispersions man page.

In DESeq2, the Cox-Reid estimator (a penalized maximum likelihood estimator) is the only method provided, as it covers all experimental designs.

it is also discussed in this man page what happens in the case of a 1 vs 1 comparison.

fpesce
11-15-2013, 12:48 AM
fpesce,

sorry I can't replicate any bug with mixed numeric and categorical variables for >100 samples. If you can email me a small reproducible example (maybe subsetting the rows of dds) I can look into it. My email is listed in help(package="DESeq2")

But I would recommend turning age into a categorical variable. You can split it into biologically meaningful groups: <18 y/o, 19-35, etc. Then the average effect of each group is controlled for, regardless of the trend over years. This often makes more sense rather than assuming a constant log2 fold change per year.

You can turn numerics into factors using cut():

> age <- 40:50
> cut(age, breaks=c(0,30,45,60,200))
[1] (30,45] (30,45] (30,45] (30,45] (30,45] (30,45] (45,60] (45,60] (45,60]
[10] (45,60] (45,60]
Levels: (0,30] (30,45] (45,60] (60,200]

Thank you so much Mike, this completely makes sense and actually solved this issue.

IsBeth
11-20-2013, 09:57 AM
Thanks Mike for your answer =)

I should have said before that I have read already the brief explanations about the different methods (sorry), yet my problem is that I have not enough statistic skills to understand why I'm using a certain dispersion method and not another in DEseq (not DEseq2). Is there a way to get specific math and statistic skills to understand R properly?

mbahin
12-02-2013, 06:45 AM
Hi all,

I am trying to use DESeq2 just after HTSeq-count.

I am following the tuto in the DESeq2 vignette but when I try to do the:
res <- results(dds)

Although I think I just followed the tuto properly, I get this error message
Error in is.numeric(cooksCutoff) : 'cooksCutoff' is missing

I can't find any answer about this error...

Cheers

mbahin
12-05-2013, 05:40 AM
Hi all,

I just found the answser, I was trying to use DESeq2 with only two count files (waiting for the other one to come).
Now that I am using it with 6 files it's ok. It looks like it was the issue.

Cheers

Michael Love
12-09-2013, 04:27 AM
hi,

Sorry, this was a simple bug for when number of samples equaled number of coefficients, which also came up on the Bioconductor mailing list. I pushed a fix to the release branch.

http://comments.gmane.org/gmane.science.biology.informatics.conductor/51865

Mike

kmcarr
02-27-2014, 12:18 PM
Hello,

In DESeq, it was possible to view/export the baseMean values for each condition (baseMeanA and baseMeanB) in addition to the overall baseMean. In DESeq2, I have only been able to view/export the overall baseMean. Is it possible to acquire the baseMean values for each condition as well? Thanks!

I'm bumping up KHubbard's question because I have the same one and it seems it didn't get answered the first time around. I have looked at the structure of the DESeqDataSet object (best as an R neophyte can) and don't see them in there. Is there any way to get these values in DESeq2?

Thanks.

dpryan
02-27-2014, 01:55 PM
I'm bumping up KHubbard's question because I have the same one and it seems it didn't get answered the first time around. I have looked at the structure of the DESeqDataSet object (best as an R neophyte can) and don't see them in there. Is there any way to get these values in DESeq2?

Thanks.

Wasn't Michael's reply (http://seqanswers.com/forums/showpost.php?p=118965&postcount=25) (the next post) in regard to that?

kmcarr
02-27-2014, 02:23 PM
Wasn't Michael's reply (http://seqanswers.com/forums/showpost.php?p=118965&postcount=25) (the next post) in regard to that?

In the words of Homer Simpson, "DOH!!"

Thank Devon, and apologies to everyone.

IsBeth
06-20-2014, 06:01 AM
Hello,

I'm using DESeq2 and so far it seems to work well. But why are the log fold changes so low? They have values between 0 and 1, even when the genes are differentially expressed.

Simon Anders
06-20-2014, 06:06 AM
Maybe there is not much happening in your data?

Can you post an MA plot?

Also, make sure you understand the conspt of shrunken fold changes. (See our paper (http://biorxiv.org/content/early/2014/05/27/002832) for details.)


Edit: Values between 0 and 1? No negative values? Sounds like you are looking at the p values, not the log fold changes.

IsBeth
06-20-2014, 06:16 AM
In my MAplot are differential expressed genes (in red). Yes, there are also negative values. I think I don't understand the concept of shrunken fold changes, still a newbie :o

It just seemed strange to me as I compared the lfc values after using edgeR and DESeq2

Simon Anders
06-20-2014, 06:27 AM
If you don't show us the MA plot, we can't tell you whether it looks right. That's why I suggested you post it. Also tell us about the biology of your experiment.

EdgeR's log fold-changes are not shrunken, DESeq2's are. They are a new feature in DESeq2, and explained in our paper.

Hence, sorry if the following question sounds a bit direct but: By "I don't understand the concept of shrunken fold changes", do you mean that you have read our preprint (http://biorxiv.org/content/early/2014/05/27/002832) but found the explanation unclear (in which case we should improve it) or that you have not yet looked at our paper (in which case we would be disappointed that nobody reads what we wrote)? ;-) You don't need to read the whole thing; just the section on shrunken fold changes.

IsBeth
06-20-2014, 06:32 AM
I'm reading your paper now. I will post my MA plot and ask more as soon I've read it. Thank you for your reply and for the link of your paper! =)

chris_s
10-28-2014, 11:13 AM
I'm new in DESeq2 and i'm having problems to create my dds.

I have a matrix with eXpress results (read counts) with 2 conditions (SP, EB) and 2 replicates/condition.

This is what i'm doing:

Cele_SPvsEB = read.csv (file.choose(), header=TRUE, row.names=1)
CeleDesign <- data.frame(
row.names = colnames(Cele_SPvsEB),
condition = factor(c("SP", "SP", "EB", "EB")))
dds <- DESeqDataSetFromMatrix(countData = Cele_SPvsEB,
colData = CeleDesign,
design = ~ condition)
dds

And i'm getting this error message:

Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object '.setDummyField' not found

Any help?
Thanks.

Hi,

I am in a fairly similar situation. I have used eXpress to generate read counts and they specifically recommend using the effective counts output for differential expression analysis. However, when I try to create the DESeqDataSet from the count matrix I get an error saying that :

Error in DESeqDataSet(se, design = design, ignoreRank) :
some values in assay are not integers

I know that DESeq2 normally takes in raw read counts, which probably explains why it is looking for integers, but since eXpress advises to use effective read counts which are not integers (see below): "The estimated number of fragments generated from this target in the sequencing experiment, adjusted for fragment and length biases. In other words, this is the expected number of reads from the experiment if these biases did not exist. This is the value recommended for input to count-biased differential expression tools."

I am keen to use the effective reads as this is applies a correction to an important potential bias.

Here is a subset from my countTable:

A1_effcounts A3_effcounts A10_effcounts
evgtrinCL81615Contig1_79790 2.583273 0.000000 0.000000
evgtrinCL11213Contig1_47858 14.839541 9.894940 19.304148
evgtrinCL34839Contig1_73815 18.764835 0.000000 0.000000
evgtrinCL90968Contig1_58906 0.000000 0.000000 6.883716
evgtrinCL83431Contig1_102122 1.358742 1.354169 1.150552
evgtrinCL34506Contig1_139871 4.719385 2.416909 0.000000

and my design file:

treatment time
A1_effcounts SED 24h
A3_effcounts SED Week2
A10_effcounts CON 0
A11_effcounts CON 24h
A13_effcounts CON Week2
H1_effcounts SED 24h
H3_effcounts SED Week2
H10_effcounts CON 0
H11_effcounts CON 24h



So do you use the total counts, the unique counts or is there a work around to use the effective counts from the eXpress output (e.g. rounding them to the nearest integer)?

Thanks for the help!

Cheers,

Chris

dpryan
10-28-2014, 11:24 AM
Unique counts. If you want to use the effective counts then put them through voom() and use limma.

TomHarrop
10-30-2014, 01:22 AM
Hi,

Sorry to dredge up this thread again but I have a question about the r-log transformation in the current version of DESeq2 (1.6.1).

I have a set of 8 libraries. Two of them were produced from degraded RNA. Using the previous version(s) of DESeq2, when I made density plots of r-log expression values, the two libraries that were produced from "bad" RNA had distributions that were obviously different from the libraries made from "good" RNA. This was accompanied by clear separation between the "bad" libraries and their matching biological replicates on a PCA plot made from r-log expression values.

With the current version, in density plots produced with the exact same code, all the distributions look the same as each other (i.e. the "bad" libraries now look the same as the "good" ones). The differentiation on the PCA plot is still there, though.

I am wondering if this observation can be explained by changes in the r-log transformation in the most recent version(s) of DESeq2? Unfortunately I can't say exactly which version I was using before, but it would have been the current Bioconductor release around the start of September.

Please let me know if any more information is required.

Thanks a lot for reading!

Tom

Michael Love
10-30-2014, 07:50 AM
The estimator for the amount of shrinkage changed (this and all changes are noted in NEWS). Note that rlog is just one of the possible transformations, which we have found useful in many situations, but certainly it can shrink too much or too little given the diversity of datasets.

I'd try out the varianceStabilizingTranformation, or even log2(counts(dds, normalized=TRUE) + pc) for varying pseudocount.

raphael123
11-03-2014, 01:44 PM
Hi !

I update my DEseq2 package to 1.4.5. Now my replaceOutliersWithTrimmedMean function is not replacing anything ! I have a very clear outlier that was remove previously and now any way I call replaceOutliersWithTrimmedMean, my counts(ddsClean)["LINC00624",] and my counts(dds)["LINC00624",] are axactly the same (with a basemean around 8 and an outlier around 240 ...

I didn t change the way I call the cooks outlier replacement funciton ... Any idea ?

TomHarrop
11-03-2014, 01:51 PM
The estimator for the amount of shrinkage changed (this and all changes are noted in NEWS). Note that rlog is just one of the possible transformations, which we have found useful in many situations, but certainly it can shrink too much or too little given the diversity of datasets.

I'd try out the varianceStabilizingTranformation, or even log2(counts(dds, normalized=TRUE) + pc) for varying pseudocount.

Hi Michael,

Forgot to thank you for your suggestions. Simple log-transformation with pseudocount reveals the differences in distribution of expression values in the degraded libraries.

Cheers,

Tom

Michael Love
11-10-2014, 04:47 AM
hi Raphael,

In version 1.4, DESeq() replaces outliers for certain sample sizes as described in the help file ?DESeq, and that we therefore don't recommend using the replaceOutliers function manually anymore. Also see ?DESeq for description of how the original and replacement counts are stored. If you have further questions, maybe best to start a new thread, and post the code you are using. Also, you might as well update to the current Bioconductor release 3.0 / DESeq2 version 1.6.

harlequin
01-06-2015, 04:41 PM
Hi,

Can someone briefly explain how the "filterThreshold" attribute -which as far as I can tell is a threshold for the mean of normalized counts- of the results object is calculated in DESEq2? I've read the function manual but it's not entirely clear to me. Is it picked by an optimization algorithm so that the number of genes with adjusted P-values < 0.1 is maximum? Thanks!

Simon Anders
01-07-2015, 02:52 PM
Is it picked by an optimization algorithm so that the number of genes with adjusted P-values < 0.1 is maximum?

Yes, exactly so.

zhuya5607
01-12-2015, 07:36 AM
:confused:

I can successfully follow all the commands in DESeq2 manual (updated version on December 16, 2014) except plotDispEsts(dds).

plotDispEsts(dds) gives the error message:
no slot of name "fitInfo" for this object of class "DESeqDataSet"

the following were the basic steps i ran

countData <- read.table("count_table_123vs789.txt",header=TRUE,row.names=1)
countData <- as.matrix(countData)
condition <- factor(c(rep("mirA",3),rep("mirB",3)))

coldata <- data.frame(row.names=colnames(countData),condition)
dds <- DESeqDataSetFromMatrix(countData=countData, colData=coldata, design=~condition)
dds <- DESeq(dds)
res <- results(dds)
plotDispEsts(dds)

Everything ran smoothly but the last step plotDispEsts(dds).
Can anyone please help me?

Michael Love
01-12-2015, 10:26 AM
You have an older version of the older package "DESeq" masking the DESeq2 version.

We fixed the dispatch of plotDispEsts for DESeq/DESeq2/DEXSeq a year ago by creating a generic, so that all these functions can work with all libraries loaded, but if you have an outdated version of DESeq, you still experience the problem.

biocLite()

and update all. You might need to update your version of R as well. What's your

sessionInfo()

zhuya5607
01-13-2015, 02:53 AM
You have an older version of the older package "DESeq" masking the DESeq2 version.

We fixed the dispatch of plotDispEsts for DESeq/DESeq2/DEXSeq a year ago by creating a generic, so that all these functions can work with all libraries loaded, but if you have an outdated version of DESeq, you still experience the problem.

biocLite()

and update all. You might need to update your version of R as well. What's your

sessionInfo()

Thanks for your reply! You are probably right about this, I did have a old version of DESeq installed before. However, i used biocLite("DESeq") to update,
the same problem is still there. Is my R version too old?

Here is my sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.4.5 RcppArmadillo_0.4.600.0 Rcpp_0.11.3 GenomicRanges_1.16.4 GenomeInfoDb_1.0.2
[6] IRanges_1.22.10 BiocGenerics_0.10.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.26.1 Biobase_2.24.0 DBI_0.3.1 DESeq_1.16.0 RColorBrewer_1.1-2 RSQLite_1.0.0
[7] XML_3.98-1.1 XVector_0.4.0 annotate_1.42.1 genefilter_1.46.1 geneplotter_1.42.0 grid_3.1.0
[13] lattice_0.20-29 locfit_1.5-9.1 splines_3.1.0 stats4_3.1.0 survival_2.37-7 tools_3.1.0
[19] xtable_1.7-4

Michael Love
01-13-2015, 06:02 AM
I don't see why the problem should exist if you have updated BiocGenerics, DESeq, DESeq2 and DEXSeq to the same version of Bioconductor.

I recommend users stay with the current release of Bioconductor, which would be one version ahead of the one you have, by using the current release of R.

But if you don't have time to upgrade, or are in the middle of an analysis and need to save the state, a workaround is to use double colon to specify the library: DESeq2::plotDispEsts

zhuya5607
01-13-2015, 06:43 AM
I don't see why the problem should exist if you have updated BiocGenerics, DESeq, DESeq2 and DEXSeq to the same version of Bioconductor.

I recommend users stay with the current release of Bioconductor, which would be one version ahead of the one you have, by using the current release of R.

But if you don't have time to upgrade, or are in the middle of an analysis and need to save the state, a workaround is to use double colon to specify the library: DESeq2::plotDispEsts

OK, I updated R this time, install DESeq, DESeq2 and DEXSeq again using biocLite().
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.4.5 RcppArmadillo_0.4.600.0 Rcpp_0.11.3 GenomicRanges_1.16.4 GenomeInfoDb_1.0.2
[6] IRanges_1.22.10 BiocGenerics_0.10.0 BiocInstaller_1.14.3

loaded via a namespace (and not attached):
[1] annotate_1.42.1 AnnotationDbi_1.26.1 Biobase_2.24.0 DBI_0.3.1 DESeq_1.16.0 genefilter_1.46.1
[7] geneplotter_1.42.0 grid_3.1.2 lattice_0.20-29 locfit_1.5-9.1 RColorBrewer_1.1-2 RSQLite_1.0.0
[13] splines_3.1.2 stats4_3.1.2 survival_2.37-7 tools_3.1.2 XML_3.98-1.1 xtable_1.7-4
[19] XVector_0.4.0


However, i noticed a warning message when i load DESeq2

> library(DESeq2)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following object is masked _by_ ‘.GlobalEnv’:

plotDispEsts

The following objects are masked from ‘package parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply,
parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, as.vector, cbind, colnames, do.call, duplicated, eval, evalq, Filter, Find,
get, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rep.int, rownames, sapply, setdiff, sort, table, tapply, union, unique, unlist

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Rcpp
Loading required package: RcppArmadillo


OK, even I tried DESeq2::plotDispEsts, still not working
standardGeneric for "plotDispEsts" defined from package "BiocGenerics"

function (object, ...)
standardGeneric("plotDispEsts")
<environment: 0x7fa45c1e1148>
Methods may be defined for arguments: object
Use showMethods("plotDispEsts") for currently available ones.

> showMethods("plotDispEsts")
Function: plotDispEsts (package BiocGenerics)
object="DESeqDataSet"


########Problem solved now###
Because i was loading DESeq2 in the folder where old DESeq and DEXseq were installed, the old plotDispEsts always mask the new plotDispEsts object.
I just changed a new working directory and reload DESeq2, then problem solved. Thanks for you help, Michael! :D

Neuromancer
01-22-2015, 10:59 AM
Well, obviously it can, but the changes in the output I see are rather dramatic.

Here's in brief my situtation:

Data from an RNAseq experiment (mouse) was
mapped to GRCm38p2 in Feb. 2014, using STAR,
counted with HTSeq (latest version),
DEG estimations done with DESeq2 (latest version at that time)
and Ensemble release e75 GTF file.
Results were interesting and made sense, biologically.

Now we remapped everything to GRCm38p3
using Ensemble release e78 GTF and latest versions of those programms (which have not changed much).
Results are radically different, with some important changes gone, (which we validated by qPCR!)
(example: a gene with log2FC=-0,63, padj=0.0688 in the first mapping, and log2FC=-0.21, padj=0.22 in the second) - however, the counts for each sample and on average (baseMean) are very very similar, so I guess it's not the mapping that makes the difference.

Notably the number of annotated genes found in e78 is much higher compared to e75 so I expect this has an influence on the adjusted p-value - but how can it affect the log2FC!? Could it be that the number of genes tested affects the gene models / dispersions that DESeq2 estimates?

I would be happy about any comments.

Thanks.

dpryan
01-22-2015, 11:14 AM
Presumably e78 added a bunch of overlapping genes that are resulting in heavily altered counts for those genes.

Neuromancer
01-22-2015, 11:36 PM
Presumably e78 added a bunch of overlapping genes that are resulting in heavily altered counts for those genes.

Well, at least the raw counts are very(!) similar, not to say: almost exactly the same for most genes.
Also, with more/better genome information, I think, the DEG estimations should be getting better, not worse, right? We confirmed a couple of genes by qPCR and in many cases they are showing surprisingly accurate matches of the RNAseq results, but then again, we do find some genes where this nice match is completely lost using e78...

dpryan
01-22-2015, 11:48 PM
I wonder if the additional genes in the analysis are skewing the prior in either the fits or the p-value adjustment (just adding a background uniform p-value distribution in that case). That'd be the other likely possibility. What happens if you simply exclude the added genes from the analysis?

Michael Love
01-23-2015, 06:03 AM
re: changing the gene model from e75 to e78: Although it might seem at first that having the same read count for a given gene should mean that the results are identical, consider that all the parameter estimation steps involve looking at all the genes: including the library size normalization, the dispersion estimation, the posterior log fold changes, and the p-value adjustment (as Devon points out). If you want less dependence between genes, you could try using the setting betaPrior=FALSE with the DESeq() function, and then the inference is on maximum likelihood estimates of log fold change. This is a choice for the investigator, if they prefer slightly noisier LFC estimates, but an analysis such that individual genes will have less effect on each other's fold changes. But there will still be dependence between genes on dispersion estimation and p-value adjustment, so you shouldn't expect the p-values to be identical after changing gene models. The inference is nearly identical in terms of hypothesis testing, although genes on the margin of a significance boundary can have p-values move slightly one way or the other.

Neuromancer
01-23-2015, 07:49 AM
Dear Michael and Devon,

thank you for taking the time to think about this and write an answer.
These are importnat consideradtions and we will take them into account now.

The other thing that has changed between the two analyses is of course the DESeq2 version (v1.4.1 vs 1.6.3), which might also influence results slightly, I guess.

Removing the new IDs is not really an option because it not only new ones, but also substitutes and other changes that are updated with new ensembl releases.

We'll try the option beta-prior=false to see what effect this has on our data set.

Thanks again,

Roman

Gonza
01-29-2015, 07:26 AM
Hi all,

I am new to DEseq2. Maybe this is silly question, but I do not understand why meanSdPlot fails.... any ideas to solve this problem?


library("vsn")
par(mfrow=c(1,3))
notAllZero <- (rowSums(counts(dds))>0)
meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1))
meanSdPlot(assay(rld[notAllZero,]))
meanSdPlot(assay(vsd[notAllZero,]))

Error: could not find function "meanSdPlot"


Thanks

IsBeth
02-03-2015, 03:49 AM
Hello!

I'm trying DESeq2 with synthetic RNA-seq count data without replicates (40% of the synthetic genes are differentially expressed (DE), with a up - and downregulation ratio of 50%). Although DESeq2 detects these ratios correctly, it doesn't detect the DE genes. The result table is empty, and my MAplot doesn't show differential expression either. Is there any solution to this? Do I have to change any parameter?

Michael Love
02-03-2015, 04:35 AM
When you ran DESeq(), you should have seen the following message printed to console:

estimating dispersion by treating samples as replicates.
read the ?DESeq section on 'Experiments without replicates'

Did you read this section?

IsBeth
02-03-2015, 06:34 AM
Yes, I did. And I'm sorry, since I've seen that this problem is treated in other posts (such as "DESeq2 without biol replicates"). I tried the following code, but it doesn't work by now (I need to obtain a result table with statistics like the adjusted p-values and lfc):

design_vector <- c("C", "T")

coldat=DataFrame(grp=factor(design_vector), each=1)
dds <- DESeqDataSetFromMatrix(raw_filter, colData=coldat, design =~grp)
dds <- DESeq(dds)

rld <- rlogTransformation( dds )
NOTE: fitType='parametric', but the dispersion trend was not well captured by the
function: y = a/x + b, and a local regression fit was automatically substituted.
specify fitType='local' or 'mean' to avoid this message next time.

deseq2.res <- results(dds)


What do I have to change? Cause I'm sure I'm doing it completely wrong:confused:

Michael Love
02-03-2015, 06:50 AM
I guess the point I want to emphasize is the quote from the original DESeq paper, regarding the without-replicates analysis:

"Some overestimation of the variance may be expected, which will make that approach conservative."
Furthermore, "while one may not want to draw strong conclusions from such an analysis, it
may still be useful for exploration and hypothesis generation."

The analysis without replicates is very conservative, and only recommended for exploration (for example, you can look at which log fold changes are largest in the MA plot).

So to answer your question from before:

"it doesn't detect the DE genes. The result table is empty, and my MAplot doesn't show differential expression either"

The without-replicates analysis is not expected to produce small p-values or adjusted p-values. We have overestimated the variance by treating the samples from different conditions as replicates for the purpose of estimating variance.

IsBeth
02-03-2015, 07:07 AM
The without-replicates analysis is not expected to produce small p-values or adjusted p-values. We have overestimated the variance by treating the samples from different conditions as replicates for the purpose of estimating variance

Ok, so when estimating the variance this way, it is not sufficient to detect DE genes?:eek: I know that to work with no replicates is not appropiate, but in these cases I need to provide some information of the data for the end user :o. Isn't there some way to obtain a DE genes list (even if the results are to be taken with care)?
(I think that it was possible with DESeq)

Michael Love
02-03-2015, 07:26 AM
DESeq() in DESeq2 also estimates p-values, and adjusted p-values (for those genes which pass independent filtering step).

I would guess that you are filtering with a p-value or adjusted p-value threshold which is less than all the p-values or adjusted p-values in the object returned by results(), and generating a 0-row dataframe.

IsBeth
02-04-2015, 02:45 AM
Hi Michael,

You were absolutely right :) I had filtered the object returned from the results() function with a minimum log2foldChange of 1.5 and an adjusted p-value of 0.01. It was definitively too much. The lfc of the result data is less than 1 in most cases, and the adj p-value is superior than one...

With no filtering, I get the results table with the statistics I need, but no DE genes are detected and so none are plotted (with red points). Is there some way tu increase the statistical power (maybe in the results() function)?Or to change some parameters in order to mark genes as differentially expressed (even if their values are not magnific)?

Thank you for your time =)

apredeus
02-18-2015, 07:48 PM
Here's a bit random and fascinating observation I've made when using DESeq2.

I've had three groups of experiments - with conditions labelled as "B", "12", and "24". I've ran pairwise comparison with donor-sensitive design on all three groups, getting differential expression for them in the same way:

B vs 12 (cond1 = B, cond2 = 12)
B vs 24 (cond1 = B, cond2 = 24)
12 vs 24 (cond1 = 12, cond2 = 24)

so, in first two cases, log2FC was reported for cond1/cond2 (i.e. positive numbers were for genes down-regulated at 12 or 24). In the latter case, log2FC was reported for cond2/cond1.

The differences were subtle enough that it took me a long long while to catch this error. I figured it's probably due to some sorting - either internal or in the condition file. Maybe it would make sense to write explicitly in the logs which one is cond1 and which one is cond2?

Cheers

Simon Anders
02-18-2015, 10:46 PM
How exactly have you called the "results" function?

In any case, the first line of the output of the results function indicates the exact comparison that is reported. Can you check whether this line was correct in your case?

apredeus
02-18-2015, 11:06 PM
How exactly have you called the "results" function? In any case, the first line of the output of the results function indicates the exact comparison that is reported. Can you check whether this line was correct in your case?

Oh, I see what the problem was. I have an R function that does the following:

res <- results(ddsMatrix)
resOrdered <- res[order(res$padj),]

Thus I did not see the output you are talking about. Still though, maybe it's a good idea to print it along with messages "estimating size factors" etc?

At any rate, thank you for your reply!

Michael Love
02-19-2015, 06:45 AM
We cannot print it during model fitting, because we fit a model without knowing what comparison will be made. Will it be B vs A? Or C vs B? Or all the interaction terms together as a likelihood ratio test? This is why it is printed when you print the results object to the console, and the information is also stored as metadata columns:

mcols(res)

Check the DESeq2 vignette section on factor levels, which addresses your question.

R's factor function chooses a level ordering alphabetically unless you specify a meaningful order.

apredeus
02-19-2015, 12:51 PM
I see. I'll just add that call to my script then, should be enough. Thank you!

TomHarrop
03-26-2015, 03:13 AM
Hi again,

I'm using the LRT with DESeq2 as follows.

> design(dds)
~batch + stage
> dds < - DESeq(dds, test = "LRT", reduced = ~ batch)

I see that I can access the calculated deviance for each gene:

> mcols(mcols(dds), use.names=TRUE)[c(20,25),]
DataFrame with 2 rows and 2 columns
type description
<character> <character>
LRTStatistic results LRT statistic: '~ batch + stage' vs '~ batch'
deviance results deviance of the full model

Just so I can illustrate to some colleagues what I'm doing using some actual data, I'm wondering, for a given gene, if it's possible to retrieve the expression values predicted by the full and reduced models from dds after running the LRT?

Thanks,

Tom

Michael Love
03-26-2015, 05:07 AM
The fitted means, mu_ij in the paper, for the full model are in assays(dds)[["mu"]]. Not that these include the size factor scaling, so you would have to divide those out to get the q_ij in the paper. See the formula in the vignette or paper for more details.

We don't store the reduced design fitted means, but you can generate these easily by running the GLM without a log fold change prior:

dds.reduced <- dds
design(dds.reduced) <- ~ batch
dds.reduced <- nbinomWaldTest(dds.reduced, betaPrior=FALSE)
assays(dds.reduced)[["mu"]]

TomHarrop
03-26-2015, 08:34 AM
Perfect, thanks.

rozitaa
03-26-2015, 10:57 AM
Hi,

I have recently have changed from DESeq to DESeq2. So I need some help in designing my contrast for the binomial tests. This is my design table:

sample tissue gut_microbiota
1 5231 Si5 GF
2 5231 PC GF
3 5231 liver GF
4 5232 Si5 GF
5 5232 PC GF
6 5232 liver GF
7 5233 Si5 GF
8 5233 PC GF
9 5233 liver GF
10 5234 Si5 GF
11 5234 PC GF
12 5234 liver GF
13 5161 Si5 mono
14 5161 PC mono
15 5161 liver mono
16 5162 Si5 mono
17 5162 PC mono
18 5162 liver mono
19 5163 Si5 mono
20 5163 PC mono
21 5163 liver mono
22 5164 Si5 prevExci
23 5164 PC prevExci
24 5164 liver prevExci
25 5164 liver prevExci
26 5165 Si5 prevExci
27 5165 PC prevExci
28 5165 liver prevExci
29 5166 Si5 prevExci
30 5166 PC prevExci
31 5166 liver prevExci
32 5167 Si5 prev
33 5167 PC prev
34 5167 liver prev
35 5168 Si5 prev
36 5168 PC prev
37 5168 liver prev
38 5169 Si5 prev
39 5169 PC prev
40 5169 liver prev
41 5170 Si5 prev
42 5170 PC prev
43 5170 liver prev
44 5171 Si5 mono
45 5171 PC mono
46 5171 liver mono
47 5172 Si5 mono
48 5172 PC mono
49 5172 liver mono
50 5173 Si5 mono
51 5173 PC mono
52 5173 liver mono
53 5174 Si5 prev
54 5174 PC prev
55 5174 liver prev
56 5175 Si5 prev
57 5175 PC prev
58 5175 liver prev
59 5176 Si5 prev
60 5176 PC prev
61 5176 liver prev
62 5177 Si5 prevMono
63 5177 PC prevMono
64 5177 liver prevMono
65 5178 Si5 prevMono
66 5178 PC prevMono
67 5178 liver prevMono
68 5179 Si5 prevMono
69 5179 PC prevMono

and this is how I have designed my count data:
dds = DESeqDataSetFromHTSeqCount(sampleTable=sample_table, directory='~/htseq/', design= ~ tissue + gut_microbiota)

Now I would like to test if any gut_microbiota level has significant effects on any of my genes at each separate tissue; (i.e. running the test for each tissue separately). How should I design my contrast formula? and How I should specify specific tissue at each test?

Thanks!

rozitaa
03-26-2015, 11:13 AM
I ran a code like this (hope it is the correct one?!):

dds_LRT_liver <- nbinomLRT(dds[, dds$tissue=='liver'], full=~gut_microbiota, reduced=~1)
dds_LRT_PC <- nbinomLRT(dds[, dds$tissue=='PC'], full=~gut_microbiota, reduced=~1)
dds_LRT_Si5 <- nbinomLRT(dds[, dds$tissue=='Si5'], full=~gut_microbiota, reduced=~1)

enrico16
06-02-2015, 05:16 AM
See: https://support.bioconductor.org/p/68277/

--

Hi Michael,

Is the Cook's distance-based flagging of p-values performed when using a continuous variable?

I don't see how the condition:

"At least 3 replicates are required for flagging, as it is difficult to judge which sample
might be an outlier with only 2 replicates."

can realistically be met for continuous variables.

If that's the case, how can I remove genes that display such a behavior?
Please let me know if this needs clarification.

Thanks!

Joe Dever
06-10-2015, 05:42 AM
Hi,

I am not familiar with R but I used DESeq in the past following its vignette.
I can't understand how to use DESeq2. I tried to follow the section for HT-Seq input but then it switches to loading the Pasilla dataset.

Is there any tutorial anywhere showing clearly how to use DESeq2 with HTseq-count, like the DESeq vignette did? Please advise.

Michael Love
06-10-2015, 06:39 AM
The first step is the only step which matters how you generate the counts. This is the step when you build the DESeqDataSet. So you only need to follow the htseq steps up until DESeqDataSetFromHTSeqCount(). After that, you have the DESeqDataSet, which you can save as "dds", and then you can follow the rest of the steps below. You can either follow this instruction in the vignette, or you can also read the help file for this particular function by typing ? and the function name into R and press enter:

?DESeqDataSetFromHTSeqCount

Note that you can name the object however you like. To follow the rest of the workflow you can choose 'dds':

dds <- DESeqDataSetFromHTSeqCount(...)

where you will fill in ... with the appropriate code from the vignette and help.

Joe Dever
06-11-2015, 01:30 AM
The first step is the only step which matters how you generate the counts. This is the step when you build the DESeqDataSet. So you only need to follow the htseq steps up until DESeqDataSetFromHTSeqCount(). After that, you have the DESeqDataSet, which you can save as "dds", and then you can follow the rest of the steps below. You can either follow this instruction in the vignette, or you can also read the help file for this particular function by typing ? and the function name into R and press enter:

?DESeqDataSetFromHTSeqCount

Note that you can name the object however you like. To follow the rest of the workflow you can choose 'dds':

dds <- DESeqDataSetFromHTSeqCount(...)

where you will fill in ... with the appropriate code from the vignette and help.

Thanks for your reply, I'll try again.

amolinaro
06-23-2015, 12:02 PM
Hi all,

I'm doing single cell RNAseq on a heterogeneous population of cells & I would like to do DE analysis to identify different subpopulations. I plan on using the tophat/cufflinks/monocle pipeline since it was specifically designed for single cell data, but I would also like to apply a raw count method to verify the results. Can DESeq2 be used for DE analysis of single cell data? If so, how would I go about doing that and what would the output look like? I'm very new to bioinformatics so apologies if these are silly questions...

Thanks in advance!

apredeus
06-23-2015, 09:16 PM
Hello

single cell data is very peculiar, and as a rule I'd say, no - you can not use regular methods for DE analysis of single-cell data. Look at SCDE by P. Kharchenko et. al:

http://pklab.med.harvard.edu/scde/index.html

Overall, people more often use PCA than DE with single-cell data. As you work with them some more, you probably will see why :)

Caitriona McEvoy
06-29-2015, 03:50 AM
Hi all,
I am very new to r & DEseq2 but have been following the excellent vignette. I have encountered a problem though. I can create the object rld as per the command in the vignette, but when I try to run PCA as suggested I repeatedly encounter the following error:

plotPCA(rld, intgroup=c("Fibrosis", "Sample.ID"))

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘plotPCA’ for signature ‘"SummarizedExperiment"’
I have also tried:
plotPCA( DESeqTransform(rld ) )
and get the following:
Error in plotPCA(DESeqTransform(rld)) :
error in evaluating the argument 'x' in selecting a method for function 'plotPCA': Error: could not find function "DESeqTransform"
Can anyone advise?
Caitriona

dpryan
06-29-2015, 09:07 AM
How did you make "rld" and what version of DESeq2 are you using?

Caitriona McEvoy
06-29-2015, 09:14 AM
Thanks for replying. I am using DESeq2 1.6.3.
First I made dds:
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ Organ + Fibrosis)
then:
rld = rlog (dds)
I went on to examine rld using: head(assay(rld)), and it is listed in my environment as a large SummarisedExperiment.
Caitriona

Michael Love
07-01-2015, 07:44 AM
hi Caitriona

It looks like you have a mix of out-of-date and new packages. These often generate conflicts. This can occur if you install Bioconductor packages using install.packages() rather than with biocLite().

Try this:

source("http://bioconductor.org/biocLite.R")
biocValid()

Caitriona McEvoy
07-06-2015, 01:45 AM
Hi Michael, Thank you for your reply. I installed packages as suggested, and it's working. Thanks! Caitriona