SEQanswers > Applications Forums > RNA Sequencing
Old 07-23-2014, 10:19 PM   #21
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994

chocicingwan is right: DESeq1 was too strict with its dispersion estimates, but the standard way of estimating log fold changes (as also used in DESeq1) partly counters this.

Better to use DESeq2's new functionality to test directly against a threshold:

res <- results( dds, lfcThreshold=1 )
subset( res, padj < .1 )

will give you all genes for which the data provide significant evidence (at 10% FDR) that their actual log2 fold change is at least +/-1, while the "traditional" approach of using

res <- results( dds )
subset( res, padj < .1 & abs( log2FoldChange ) > 1 )

will give you genes for which there is significant evidence that their change is non-zero and for which the apparent log2 fold change is stronger than +/-1.


Without fold-change shrinkage, weakly expressed genes in particular show exaggerated fold changes, which makes it too easy for them to get over a post-hoc threshold. This is why testing for a non-zero fold change and then thresholding on the estimated LFC was never a good idea: you get too many of the weaker genes.

DESeq2 addresses this issue in two ways: (i) shrinkage ensures that the fold changes of the weak genes are no longer exaggerated; (ii) with the new thresholded test, the false discovery rate (FDR) really fits your question: a gene is counted as a false positive if it seems to pass the LFC threshold, but only due to noise.

As you see, it was too easy to pass the threshold before. Hence, if you want to get a result list similar to what you got earlier, you should use a smaller threshold with the newer, more stringent analysis.
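To see the two approaches side by side, here is a minimal sketch on simulated data (makeExampleDESeqDataSet ships with DESeq2; the parameter values are purely illustrative):

```r
library(DESeq2)

# simulate a small data set: 1000 genes, 6 samples, true LFCs drawn with SD 1
dds <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 1)
dds <- DESeq(dds)

# thresholded test: significant evidence that |log2 fold change| exceeds 1
resThr <- subset(results(dds, lfcThreshold = 1), padj < 0.1)

# "traditional" post-hoc filter: non-zero test, then cut on the estimated LFC
resPostHoc <- subset(results(dds), padj < 0.1 & abs(log2FoldChange) > 1)

# the post-hoc list is typically longer, and enriched for weaker genes
nrow(resThr)
nrow(resPostHoc)
```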
Old 07-29-2014, 04:02 PM   #22
apredeus
Senior Member
Location: Bioinformatics Institute, SPb
Join Date: Jul 2012
Posts: 151


Hello all,

I'm trying to compare DESeq2 with DSS. Basically, I'm getting a lot more differentially expressed genes with DESeq2 than with DSS (on the order of 2-2.5 thousand vs. 5 to 6 hundred). We have 2 replicates for each condition. I counted all genes with FDR (in the case of DSS) or adjusted p-value < 0.01 as significant.

What could be the reason for such a dramatic difference? Correct me if I'm wrong, but the adjusted p-value in DESeq2 is basically an FDR, right? The adjusted p-values in the DESeq2 output go very low (~10^-250), while the FDR values in DSS only go as low as 10^-10. Considering this, it's probably not that surprising that the numbers of DE genes are so different. But what is the true reason for this, and which result is closer to the "truth"?

I'd be grateful for any pointers.
Old 07-29-2014, 09:46 PM   #23
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994


It is very hard to answer your question without more information. What is the biology of your experiment? What are you comparing? What do the diagnostic plots (MA plot and dispersion plot) look like?
Old 07-31-2014, 07:59 AM   #24
Tatsiana_by
Junior Member
Location: Innsbruck
Join Date: Jul 2014
Posts: 3


Hello,

I have a question regarding the replaceOutliers function. More precisely, it looks like after using replaceOutliers, there are cases where a non-outlier becomes an outlier.

I have 9 samples: 3 replicates for 3 conditions (6 days, 12 days and 28 days, as factors).

Here is what I do and what I get:
> des2.dataset <- DESeqDataSetFromMatrix(countData = countDF, colData = pheno[colnames(countDF),], design = ~ days)

countDF is just a data frame of raw counts. pheno is a data frame with information for each sample, including column "days".

> des2.mf <- DESeq(des2.dataset)

> des2.replace <- DESeq(replaceOutliers(des2.mf,minReplicates = 3))

Now I have 160 genes that were not flagged as outliers by the default detection method, but became outliers after outlier replacement.

Here is an example of one such gene:

> assays(des2.mf)[["cooks"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186
0.1072452 6.6523800 2.3861998 0.1974414 0.1226845 3.0418715 0.2286905
RME187 RMN013
3.1039154 11.1255294

> assays(des2.replace)[["cooks"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185
0.07305022 7.57479726 2.35945901 0.12521243 0.00511494 3.05473959
RME186 RME187 RMN013
0.53951600 3.10378440 13.54871243


The Cook's distances have changed, and sample RMN013 has now become an outlier... but the counts did not:

> assays(des2.mf)[["counts"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
18078 28989 1894 28078 4692 27 2373 0 1954
> assays(des2.replace)[["counts"]]["12484",]
RME180 RME181 RME182 RME183 RME184 RME185 RME186 RME187 RMN013
18078 28989 1894 28078 4692 27 2373 0 1954

So my question is exactly that: why did the Cook's distances change?
Another point is merely a concern over interpretation (given that this is intended behaviour): if I used replaceOutliers, it's hard to justify still having outliers in the data... Furthermore, 5 out of these 160 cases were actually significant before outlier replacement, and I would definitely like to keep those 5 in my hit list.

Any thoughts are greatly appreciated.
Old 08-01-2014, 07:03 PM   #25
apredeus
Senior Member
Location: Bioinformatics Institute, SPb
Join Date: Jul 2012
Posts: 151


Quote:
Originally Posted by Simon Anders View Post
It is very hard to answer your question without more information. What is the biology of your experiment? What are you comparing? What do the diagnostic plots (MA plot and dispersion plot) look like?
Hello Simon! Thank you for the prompt reply. Here's more information. This is a strand-specific paired-end RNA-seq experiment. I'm only counting read R2, for reasons I won't go into (we are working on some alterations of experimental RNA-seq methodology). The cells are murine DCs stimulated with LPS+IFNg, and the conditions being compared are 2h and 4h post-treatment. There are 2 replicates for each condition.

Each replicate/condition is 8.5 Mbases deep, of which 70-75% align uniquely. I used HTSeq to quantify expression and obtained 5-6 million overall "useful" counts (i.e., from uniquely mapped reads that aligned to a feature in a strand-specific way).

So, here are some diagnostic plots.

MA plot of the shrunken data: [image not preserved]

MA plot of the unshrunken data at two different scales (obtained from the original via ddsNP <- nbinomWaldTest(ddsMatrix, betaPrior=FALSE)): [image not preserved]

Dispersion plot of "40 equally spaced genes": [image not preserved]

Dispersion estimates: [image not preserved]

Overall, DESeq2 finds about 2600 differentially expressed genes, while DSS is limited to 500-600. I considered genes to be differentially expressed if padj < 0.01 (FDR in the case of DSS).
Old 08-02-2014, 06:10 AM   #26
Michael Love
Senior Member
Location: Boston
Join Date: Jul 2013
Posts: 333


hi Tatsiana,

you might consider posting follow-up questions in a new thread, so it's easier to track questions and answers.

In the vignette we say "we suggest 7 or more replicates in order to consider replacement for large Cook’s distances"

I would not use the replace outliers procedure for 3 replicates. Also, we have moved the replacement procedure inside DESeq. If you examine your data and think that the samples called outlier are not, you can just turn off Cook's filtering with: results(dds, cooksCutoff=FALSE).
Old 08-02-2014, 06:17 AM   #27
Michael Love
Senior Member
Location: Boston
Join Date: Jul 2013
Posts: 333


apredeus, if I am not mistaken, DSS uses a local FDR procedure on the test statistics to generate the FDR column. So if you have a wide distribution of log fold changes (and a wide distribution of Wald statistics), the locfdr software will match a wide Gaussian distribution and consider many of these genes part of the null distribution. To paraphrase, this procedure expects that a central bulk of the genes are not DE, and uses these genes to estimate the FDR. But in your experiment it looks like you might have many genes changing.

Note that if you want to home in on larger fold changes across conditions, you can use the lfcThreshold argument of results. See ?results for more information.
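A minimal sketch of that argument (assuming a DESeqDataSet dds that has already been run through DESeq()):

```r
# test the null hypothesis |log2 fold change| <= 1
# against the alternative |log2 fold change| > 1
# ("greaterAbs" is the default altHypothesis)
res <- results(dds, lfcThreshold = 1, altHypothesis = "greaterAbs")

# genes with significant evidence of a change beyond the threshold, at 10% FDR
sig <- subset(res, padj < 0.1)
```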
Old 08-06-2014, 05:53 AM   #28
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994


To add to Mike's answer: Your FDR of 0.01 is very stringent. With 0.1 you will get even more genes. But you should not be surprised about that. These seem to be cells cultured in a very well-controlled setting: the asymptote of the dispersion plot is around 10^-2, i.e., the coefficient of variation between replicates is less than 10%.

Hence, you are powered to see even rather weak effects. If your treatment is now harsh enough to disturb the cells strongly, most genes will feel this, though maybe only indirectly.

What you are interested in is the genes that react _strongly_. Hence, forget about p values and look at the genes with strong shrunken fold changes. Or use DESeq2's new "thresholded" test to get only those genes whose response to treatment exceeds a threshold that reasonably reflects what you would consider biologically relevant.
Old 08-09-2014, 02:05 PM   #29
apredeus
Senior Member
Location: Bioinformatics Institute, SPb
Join Date: Jul 2012
Posts: 151


Simon, Michael - thank you for your comments!

Quote:
Originally Posted by Simon Anders View Post
To add to Mike's answer: Your FDR of 0.01 is very stringent. With 0.1 you will get even more genes. But you should not be surprised about that. These seem to be cells cultured in a very well-controlled setting: the asymptote of the dispersion plot is around 10^-2, i.e., the coefficient of variation between replicates is less than 10%.
Yes, I understand, but you know - when you say that between 2h and 4h post-treatment there are about 5 thousand differentially expressed genes, people are going to freak out. However, even when I filter the genes, leaving only those with abs(log2FC) > 1, I still get more than a thousand, so it's certainly real.

However, DSS was only giving me ~500 genes, of which ~350 had log2FC above 1 or below -1. So I got very suspicious, because in papers DSS and DESeq2 give very comparable results.

What is the right way to compare the results PRE-FDR? Correlation coefficients of the ranked lists of the top 10k genes, for example? Or distributions of un-adjusted p-values?

Quote:
Hence, you are powered to see even rather weak effects. If your treatment is now harsh enough to disturb the cells strongly, most genes will feel this, though maybe only indirectly.

What you are interested in is the genes that react _strongly_. Hence, forget about p values and look at the genes with strong shrunken fold changes. Or use DESeq2's new "thresholded" test to get only those genes whose response to treatment exceeds a threshold that reasonably reflects what you would consider biologically relevant.
Regarding this, I had a few questions. First of all, the options listed in the manual (lfcThreshold, altHypothesis) seem to be absent from the version of DESeq2 I actually have installed; here is the signature of my results function:

Code:
function (object, name, contrast, cooksCutoff, independentFiltering = TRUE, alpha = 0.1, filter, theta = seq(0, 0.95, by = 0.05), pAdjustMethod = "BH")
Second, I wanted to ask: when you print the log2 fold change in the final table (results), is it shrunken or not?

Thank you again for the help, I appreciate it.
Old 08-09-2014, 07:54 PM   #30
Michael Love
Senior Member
Location: Boston
Join Date: Jul 2013
Posts: 333


If you search for documentation online, you may find the documentation for the release or development version even though you have an older version on your computer.

You can read the correct documentation (the one paired with the version of Bioconductor you have installed) from within R with, e.g., ?DESeq and vignette("DESeq2") or browseVignettes("DESeq2").

You can find out the version you have with sessionInfo(). We print the version number of our software on the first and last page of the vignette.
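For example (standard R/Bioconductor commands; the output will of course depend on your installation):

```r
packageVersion("DESeq2")    # version of the installed DESeq2 package
sessionInfo()               # R version plus versions of all loaded packages
browseVignettes("DESeq2")   # opens the vignettes matching the installed version
```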

The LFC are shrunken in the results table (in the next release, v1.6 in Oct 2014, there is an option to print the unshrunken LFC alongside).
Old 08-10-2014, 11:53 AM   #31
apredeus
Senior Member
Location: Bioinformatics Institute, SPb
Join Date: Jul 2012
Posts: 151


Oh, so I guess you must have R 3.1 or later to get the newest DESeq2? I installed the package very recently, but I had R 3.0.1.
Old 08-13-2014, 01:28 AM   #32
apredeus
Senior Member
Location: Bioinformatics Institute, SPb
Join Date: Jul 2012
Posts: 151


So when I upgraded from DESeq2 1.2.10 (R 3.0.1) to v1.4.5 (R 3.1.1), the number of genes considered "differentially expressed" went from ~2500 to ~1300.

Every other setting is very much the same. How is this even possible?
Old 08-13-2014, 04:32 AM   #33
Michael Love
Senior Member
Location: Boston
Join Date: Jul 2013
Posts: 333


hi apredeus, I'm not exactly sure why it would have changed so much between 1.2 and 1.4, as there were only slight changes to the gene-wise dispersion estimation routine, but I can say that the code base has stabilized since last fall. Bioconductor users are always 6 months behind the development cycle, which means that bug fixes/changes that users request in October have to wait until March to be released.

The other thing is that p-values are tail probabilities, which means they are very sensitive to slight changes in model parameters.

For example, say we have some Z scores:

> set.seed(1)
> z <- rnorm(10000,0,7)

Then we make and adjust p-value according to two models of the data:

> table(p.adjust(2*pnorm(abs(z),0,5,lower=FALSE),method="BH") < .1)

FALSE TRUE
9507 493

> table(p.adjust(2*pnorm(abs(z),0,5.25,lower=FALSE),method="BH") < .1)

FALSE TRUE
9741 259

So a 5% change in a model parameter translates to a nearly halving of adjusted p-values less than 0.1.
Old 10-15-2014, 01:23 PM   #34
apredeus
Senior Member
Location: Bioinformatics Institute, SPb
Join Date: Jul 2012
Posts: 151


Michael, Simon:

Would it be possible to output two mean expression columns (for conditions 1 and 2) instead of only one in the results table? It would be a lot more informative, I think: you could simply glance at it and tell whether there are dramatic changes.
Old 10-16-2014, 05:08 AM   #35
kmcarr
Senior Member
Location: USA, Midwest
Join Date: May 2008
Posts: 1,177


Quote:
Originally Posted by apredeus View Post
Michael, Simon:

Would it be possible to output two mean expression columns (for conditions 1 and 2) instead of only one in the results table? It would be a lot more informative, I think: you could simply glance at it and tell whether there are dramatic changes.
This post by Michael described how to extract the baseMean per level (condition) from the DESeqDataSet object.
Old 10-16-2014, 06:04 AM   #36
Michael Love
Senior Member
Location: Boston
Join Date: Jul 2013
Posts: 333


I can make it even more concise:

Code:
baseMeanPerLvl <- sapply(levels(dds$condition), function(lvl)
    rowMeans(counts(dds, normalized=TRUE)[, dds$condition == lvl]))
I didn't know until this year that dds$variable is the same as colData(dds)$variable for SummarizedExperiments.
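A sketch of one way to put these per-level means next to the results table (assuming, as above, that the design factor is named condition):

```r
# per-level normalized mean counts, one column per condition level
baseMeanPerLvl <- sapply(levels(dds$condition), function(lvl)
    rowMeans(counts(dds, normalized = TRUE)[, dds$condition == lvl]))

# bind them onto the results table for easy eyeballing
resWithMeans <- cbind(as.data.frame(results(dds)), baseMeanPerLvl)
head(resWithMeans)
```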