Old 06-12-2013, 03:41 AM   #1
sisterdot
Junior Member
 
Location: Europe

Join Date: Apr 2013
Posts: 6
DESeq2 without biological replicates

Hey all,

I wanted to try DESeq2 in a pairwise comparison setting without replicates.

Unfortunately, using the code below on a pasilla test case I can only find differentially expressed genes with DESeq, not with DESeq2.
(DESeq2 says "same number of samples and coefficients to fit, estimating dispersion by treating samples as replicates", so it does seem to recognise the situation.)

I know it must be me, but where am I going wrong?

thanks a lot!!


#-------------------------- deseq2

Code:
library(DESeq2)
library(pasilla)
data("pasillaGenes")
countData <- counts(pasillaGenes)
countData<-countData[,c("treated1fb","untreated1fb")]
colData <- pData(pasillaGenes)[c("treated1fb","untreated1fb"),c("condition","type")]
dds <- DESeqDataSetFromMatrix(
       countData = countData,
       colData = colData,
       design = ~ condition)
colData(dds)$condition <- factor(colData(dds)$condition,levels=c("treated","untreated"))
dds
dds <- DESeq(dds)
res <- results(dds)
res <- res[order(res$padj),]
head(res)
DataFrame with 6 rows and 5 columns
              baseMean log2FoldChange      lfcSE    pvalue      padj
             <numeric>      <numeric>  <numeric> <numeric> <numeric>
FBgn0000008 60.7056967    -0.09993564 0.36063169 0.7816935 0.9973207
FBgn0000014  0.8344279    -0.03920709 0.08405437 0.6408940 0.9973207
FBgn0000015  0.4172140    -0.02034815 0.06095374 0.7385084 0.9973207

#-------------------------- deseq

Code:
library(DESeq)
library(pasilla)
data("pasillaGenes")
countData <- counts(pasillaGenes)
countData<-countData[,c("treated1fb","untreated1fb")]
condition <- factor( c( "treated", "untreated") )
cds <- newCountDataSet( countData, condition )
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds, method="blind", sharingMode="fit-only" )
res <- nbinomTest( cds, "treated", "untreated") 
res <- res[order(res$padj),]
head(res)
              id   baseMean  baseMeanA   baseMeanB  foldChange log2FoldChange         pval         padj
9696 FBgn0039155   513.8001   31.70826   995.89187 31.40796187       4.973058 9.669445e-23 1.062188e-18
2605 FBgn0026562 20645.8226 6151.40275 35140.24243  5.71255758       2.514137 1.477349e-12 8.114339e-09
29   FBgn0000071   241.5499  444.75009    38.34963  0.08622736      -3.535710 3.097019e-10 1.134025e-06
Old 06-12-2013, 05:08 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

To be honest, we couldn't yet be bothered to explain how to analyse such data in DESeq2. It's tricky to write up because too many people will misinterpret whatever I write as if it were actually possible to conduct a meaningful statistical analysis when comparing just two samples.

So, if you promise to not use any such comparisons for actual science, here is how you do it:

Start as above:

Code:
library(DESeq2)
library(pasilla)
data("pasillaGenes")
countData <- counts(pasillaGenes)
countData<-countData[,c("treated1fb","untreated1fb")]
colData <- pData(pasillaGenes)[c("treated1fb","untreated1fb"),c("condition","type")]
dds <- DESeqDataSetFromMatrix(
       countData = countData,
       colData = colData,
       design = ~ condition)
Now use DESeq2's new "rlog transformation". This replaced the VST we had before. It transforms the counts to the log2 scale, but "pulls in" those genes for which the evidence for strong fold changes is weak due to low counts.

Code:
rld <- rlogTransformation( dds )
As this is a logarithm-like scale, the differences between the samples can be considered as a "regularized log2 fold change". Let's make a result data frame:
Code:
res <- data.frame(
   assay(rld), 
   avgLogExpr = ( assay(rld)[,2] + assay(rld)[,1] ) / 2,
   rLogFC = assay(rld)[,2] - assay(rld)[,1] )
And now we have a ranking of genes by regularized fold changes:

Code:
> head( res[ order(res$rLogFC), ] )
            treated1fb untreated1fb avgLogExpr    rLogFC
FBgn0011260   7.830359     6.627326   7.228842 -1.203033
FBgn0001226  10.128636     8.929985   9.529311 -1.198652
FBgn0034718   8.503006     7.315640   7.909323 -1.187366
FBgn0003501   7.927864     6.743974   7.335919 -1.183889
FBgn0033635  11.126300     9.973979  10.550139 -1.152321
FBgn0033367  13.411814    12.269436  12.840625 -1.142378
This ranking puts genes at the top that are strongly downregulated in the second sample compared to the first. If you do this with ordinary log expression values, the weakly expressed genes will appear at the top, because they are the noisiest and hence tend to have exaggerated fold changes.

The advantage of this procedure is that it does not produce any p values (which would be misleading anyway).
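For comparison, this is roughly what the naive version looks like, using ordinary log2 fold changes on size-factor-normalized counts with a pseudocount (only a sketch to illustrate the point above, not something to use):

Code:
dds <- estimateSizeFactors( dds )
normCounts <- counts( dds, normalized=TRUE )
naiveLFC <- log2( (normCounts[,2] + 1) / (normCounts[,1] + 1) )
head( sort( naiveLFC ) )   # low-count genes tend to dominate the extremes of this ranking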

Last edited by Simon Anders; 06-12-2013 at 05:10 AM.
Old 06-12-2013, 09:58 AM   #3
sisterdot
Junior Member
 
Location: Europe

Join Date: Apr 2013
Posts: 6
Default

thanks for your help Simon- you are the best!

> So, if you promise to not use any such comparisons for actual science
hihihi...
you are totally right of course!
Old 06-12-2013, 11:53 AM   #4
sisterdot
Junior Member
 
Location: Europe

Join Date: Apr 2013
Posts: 6
Default

hey,

Could you please help again? I am a bit confused, as it seems that the DESeq and DESeq2 regularized logFC calculations differ strongly even if I stick to the recommended procedure.

I looked at a potentially differentially expressed gene from pasilla (FBgn0039155, counts 38 vs 831), wanted to see what the DESeq2 regularized logFC would be, and tried to compare it to the DESeq rLogFC as shown below.
DESeq says the rLogFC is 3.095034, DESeq2 says 0.1595547 (and the latter doesn't change much if I try a blind rlogTransformation, estimateSizeFactors, or estimateDispersions). Somehow I believe the DESeq2 procedure I used isn't doing what I want.

thanks!

----------------- deseq2

Code:
library(DESeq2)
library(pasilla)
data("pasillaGenes")
countData <- counts(pasillaGenes)
countData<-countData[,c("treated1fb","untreated1fb")]
head(countData)
colData <- pData(pasillaGenes)[c("treated1fb","untreated1fb"),c("condition","type")]
dds <- DESeqDataSetFromMatrix(
       countData = countData,
       colData = colData,
       design = ~ condition)
rld <- rlogTransformation( dds)#,blind = TRUE)
res <- data.frame(
   assay(rld), 
   avgLogExpr = ( assay(rld)[,2] + assay(rld)[,1] ) / 2,
   rLogFC = assay(rld)[,2] - assay(rld)[,1] )
res[c("FBgn0039155"),]
            treated1fb untreated1fb avgLogExpr    rLogFC
FBgn0039155   8.850547     9.010102   8.930324 0.1595547

----------------- deseq

Code:
library(DESeq)
library(pasilla)
data("pasillaGenes")
countData <- counts(pasillaGenes)
countData<-countData[,c("treated1fb","untreated1fb")]
condition <- factor( c( "treated", "untreated") )
cds <- newCountDataSet( countData, condition )
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds, method="blind", sharingMode="fit-only" )
vsd <- varianceStabilizingTransformation( cds )
res <- nbinomTest( cds, "treated", "untreated") 
rlfc <- (exprs(vsd)[, c("untreated1fb")]  -exprs(vsd)[, c("treated1fb")] )
res$rLogFC<-rlfc
res[res$id=="FBgn0039155",]
              id baseMean baseMeanA baseMeanB foldChange log2FoldChange         pval         padj   rLogFC
9696 FBgn0039155 513.8001  31.70826  995.8919   31.40796       4.973058 9.669445e-23 1.062188e-18 3.095034

Last edited by sisterdot; 06-12-2013 at 11:10 PM.
Old 07-15-2013, 05:23 AM   #5
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi,

I made a change to rlogTransformation in the development branch of DESeq2 (v 1.1.21), which might be relevant for your analysis.

Previously, the rlog transformation by default would use a blind design formula and the MAP dispersion estimates (the blue points in a plot produced by plotDispEsts). True large log fold changes would therefore have high MAP dispersions and be shrunk toward 0 (especially with no replicates). Instead we now (in the devel branch) use the fitted dispersion estimates, which shrinks log fold changes based on the general mean-dispersion trend line and preserves true large log fold changes.

It is described here:

http://www.bioconductor.org/packages...ws/DESeq2/NEWS
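As a rough sketch of what this means for the example earlier in this thread (same code as above; the exact result depends on which DESeq2 version you have installed):

Code:
library(DESeq2)
library(pasilla)
data("pasillaGenes")
countData <- counts(pasillaGenes)[, c("treated1fb","untreated1fb")]
colData <- pData(pasillaGenes)[c("treated1fb","untreated1fb"), c("condition","type")]
dds <- DESeqDataSetFromMatrix(countData, colData, design = ~ condition)
rld <- rlogTransformation(dds)           # devel branch: based on the fitted dispersion trend
rLogFC <- assay(rld)[,2] - assay(rld)[,1]
rLogFC["FBgn0039155"]                    # expected to be shrunk much less than before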
Old 07-25-2013, 02:58 AM   #6
visserm
Junior Member
 
Location: South Africa

Join Date: Jun 2011
Posts: 7
Default

Hi Michael,
Is the shrunken log fold change also applied in DESeq2 v 1.0.17?
Old 07-25-2013, 03:43 AM   #7
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Yes, shrunken log fold changes have been part of the Wald test since the beginning of DESeq2.

You can always check the methods of the current version with ?DESeq, vignette("DESeq2"), or for more specific updates: news(package="DESeq2").
Old 07-25-2013, 03:48 AM   #8
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

You are comparing DESeq(1) and DESeq2. DESeq(1) does not have, and never had, any shrinkage of fold changes.
Old 10-18-2013, 06:19 AM   #9
bioBob
Member
 
Location: Virginia

Join Date: Mar 2011
Posts: 72
Default

Is there an implicit normalization across the samples in this?
Old 11-12-2013, 08:12 PM   #10
whataBamBam
Member
 
Location: Italy

Join Date: May 2013
Posts: 27
Default

There are loads of threads on here about DESeq and experiments with no replicates, and Simon Anders has been very patient coming on here and repeatedly explaining why they are a bad idea.

So I have come along after an experiment has been done to have a second look at the data (some bioinformatics was already done by the sequencing provider) and... yes... they didn't have any replicates!

The bioinformatics people who came before used DESeq and they DID find differentially expressed transcripts, even with adjusted p-values.

So just to be clear: I should be telling these guys they can't trust the p-values anyway, since they have no replicates, and to just look at fold change? We're going for taking the top 200 by absolute fold change (for now) and confirming with qPCR.
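For what it's worth, the ranking we have in mind is roughly this (a sketch building on the rlog approach from earlier in the thread; object names are placeholders):

Code:
rld <- rlogTransformation( dds )        # dds: DESeqDataSet for the unreplicated comparison
rLogFC <- assay( rld )[,2] - assay( rld )[,1]
top200 <- head( names( sort( abs( rLogFC ), decreasing=TRUE ) ), 200 )
top200                                  # candidates to confirm by qPCR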
Old 11-13-2013, 05:22 AM   #11
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

The section of the original DESeq paper might shed some light:
http://genomebiology.com/2010/11/10/r106

"Working without replicates
DESeq allows analysis of experiments with no biological replicates in one or even both of the conditions. While one may not want to draw strong conclusions from such an analysis, it may still be useful for exploration and hypothesis generation. If replicates are available only for one of the conditions, one might choose to assume that the variance-mean dependence estimated from the data for that condition holds as well for the unreplicated one. If neither condition has replicates, one can still perform an analysis based on the assumption that for most genes, there is no true differential expression, and that a valid mean-variance relationship can be estimated from treating the two samples as if they were replicates. A minority of differentially abundant genes will act as outliers; however, they will not have a severe impact on the gamma-family GLM fit, as the gamma distribution for low values of the shape parameter has a heavy right-hand tail. Some overestimation of the variance may be expected, which will make that approach conservative."
Old 11-13-2013, 06:21 AM   #12
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

Quote:
Originally Posted by whataBamBam View Post
There are loads of threads on here about DESeq and experiments with no replicates, and Simon Anders has been very patient coming on here and repeatedly explaining why they are a bad idea.

So I have come along after an experiment has been done to have a second look at the data (some bioinformatics was already done by the sequencing provider) and... yes... they didn't have any replicates!

The bioinformatics people who came before used DESeq and they DID find differentially expressed transcripts, even with adjusted p-values.

So just to be clear: I should be telling these guys they can't trust the p-values anyway, since they have no replicates, and to just look at fold change? We're going for taking the top 200 by absolute fold change (for now) and confirming with qPCR.
I think it is fairly obvious that Simon has a particular interpretation or hypothesis in mind when he makes statements about p-values (i.e. the hypothesis that it is actually possible to test with DESeq), which isn't necessarily "valid science" either. A p-value is just the probability of observations as extreme or more extreme than what was seen, under the null hypothesis, but you have to be aware of which hypothesis was tested (were there genes expressed at a different level in one sample and not the other? That could be answered quite scientifically). Though it may seem suspect that DESeq would return significant p-values when the p-values couldn't possibly be significant, is it really answering the question that we think it is testing?
Old 11-13-2013, 07:38 AM   #13
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

I wouldn't say that it is suspect to return low p-values. The point of using most software specifically designed for microarray or sequencing data, rather than just running linear or generalized linear models row-by-row, is that information can be pooled across genes.

For example, if you observe two random Normal variables and want to know whether they have different means, you're out of luck. But if I provide you with the population variance, then you can compute the probability of seeing such a difference. Here we have an in-between case: we can learn something about the variance over all genes, but we are not given the population variance for any single gene. The question is how much we can learn about the variance of a single gene by looking across all genes. Are most of the genes not differentially expressed, so that one can learn about the mean-variance relationship from the majority of genes? The paragraph I quote above points out that the behavior will be conservative, in that the variance of all genes is overestimated by including some number of differentially expressed genes.
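A toy illustration of that point in plain R (nothing DESeq2-specific; the numbers are made up):

Code:
set.seed(1)
x <- rnorm(1, mean=0, sd=1)   # one observation per group
y <- rnorm(1, mean=2, sd=1)
# with a single observation per group the variance cannot be estimated,
# but if the population variance is known (here sd=1), a z-test is possible:
z <- (y - x) / sqrt(1^2 + 1^2)
2 * pnorm(-abs(z))            # p-value for a difference in means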
Old 11-13-2013, 10:57 AM   #14
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Hello,

I deviate a bit from the discussion, but my question fits this topic well.
I am analyzing RNA-seq data with 2 conditions and 7 replicates per condition. I performed the analysis with DESeq2; that part is fine.

In one condition, two samples come from the same patient, who has leukemia. At the second time point, the patient is in relapse.
I would like to find out what changed between these 2 time points. Of course, changes could be due to additional somatic mutations, but I want to study the differences at the RNA level as well.

I am thinking about performing a DE analysis between these 2 "conditions" (remission and relapse). Do you think it is meaningful to perform this analysis, knowing there is only one patient?

How do you usually handle such cases? It is very patient-dependent, I guess; I don't know whether we would see general changes using small groups of patients in remission and relapse.
Old 11-14-2013, 09:59 PM   #15
whataBamBam
Member
 
Location: Italy

Join Date: May 2013
Posts: 27
Default

Quote:
Originally Posted by Michael Love View Post
The section of the original DESeq paper might shed some light:
http://genomebiology.com/2010/11/10/r106

"Working without replicates
DESeq allows analysis of experiments with no biological replicates in one or even both of the conditions. While one may not want to draw strong conclusions from such an analysis, it may still be useful for exploration and hypothesis generation. If replicates are available only for one of the conditions, one might choose to assume that the variance-mean dependence estimated from the data for that condition holds as well for the unreplicated one. If neither condition has replicates, one can still perform an analysis based on the assumption that for most genes, there is no true differential expression, and that a valid mean-variance relationship can be estimated from treating the two samples as if they were replicates. A minority of differentially abundant genes will act as outliers; however, they will not have a severe impact on the gamma-family GLM fit, as the gamma distribution for low values of the shape parameter has a heavy right-hand tail. Some overestimation of the variance may be expected, which will make that approach conservative."
Great. Actually my original interpretation (before I posted this) was correct then: the p-values are perfectly valid (in fact conservative), and the problem with no replicates is really low statistical power.

So basically you are saying that you have less statistical power because you have overestimated the variance, and if you see significant differences DESPITE this low statistical power, then go for it.

To be fair, it says in the vignette (or the paper, I can't remember which) that you simply have low statistical power if you have no replicates.
Old 11-15-2013, 09:19 AM   #16
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

Quote:
Originally Posted by whataBamBam View Post
Great. Actually my original interpretation (before I posted this) was correct then: the p-values are perfectly valid (in fact conservative), and the problem with no replicates is really low statistical power.

So basically you are saying that you have less statistical power because you have overestimated the variance, and if you see significant differences DESPITE this low statistical power, then go for it.

To be fair, it says in the vignette (or the paper, I can't remember which) that you simply have low statistical power if you have no replicates.
If the hypothesis test was significant, this would indicate that there isn't a problem with power, since a lower-powered test should make it more difficult to get significant results, provided the test is actually answering the question you are asking. In theory this seems a little confusing, because getting a hypothesis to fit thousands of samples should be harder than fitting just a few; however, with thousands of samples the population means can be known very accurately, so even trivial differences, like the color of two placebos, can become significant.
http://en.wikipedia.org/wiki/Lindley's_paradox
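A quick simulation of the "trivial differences become significant" point (plain R, made-up numbers):

Code:
set.seed(1)
n <- 100000
a <- rnorm(n, mean=0,    sd=1)
b <- rnorm(n, mean=0.02, sd=1)   # a trivially small true difference
t.test(a, b)$p.value             # typically well below 0.05 at this sample size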
Old 11-17-2013, 04:25 PM   #17
whataBamBam
Member
 
Location: Italy

Join Date: May 2013
Posts: 27
Default

Quote:
Originally Posted by rskr View Post
If the hypothesis test was significant, this would indicate that there isn't a problem with power, since a lower-powered test should make it more difficult to get significant results, provided the test is actually answering the question you are asking. In theory this seems a little confusing, because getting a hypothesis to fit thousands of samples should be harder than fitting just a few; however, with thousands of samples the population means can be known very accurately, so even trivial differences, like the color of two placebos, can become significant.
http://en.wikipedia.org/wiki/Lindley's_paradox
Yeah, the first part is what I meant - well, kind of. Yes, a lower-powered test makes it more difficult to observe significant results - but we do observe them. So the test had enough power to detect the differences it detected, but there could be other differences it did not detect because it did not have enough power. This is what I mean by saying it's conservative.

The next part I'm less sure about... but I think this paradox, Lindley's paradox, would only apply if there were a very large number of replicates, which we aren't ever likely to see.
Old 11-18-2013, 05:04 AM   #18
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

Quote:
Originally Posted by whataBamBam View Post
Yeah, the first part is what I meant - well, kind of. Yes, a lower-powered test makes it more difficult to observe significant results - but we do observe them. So the test had enough power to detect the differences it detected, but there could be other differences it did not detect because it did not have enough power. This is what I mean by saying it's conservative.

The next part I'm less sure about... but I think this paradox, Lindley's paradox, would only apply if there were a very large number of replicates, which we aren't ever likely to see.
I don't know; I've seen some impressive microarray datasets, and I see no reason why, once RNA-seq drops a little in price, there won't be some large datasets as well.
Old 12-02-2015, 11:26 AM   #19
tompoes
Junior Member
 
Location: Belgium

Join Date: Jul 2014
Posts: 2
Default

Hi

I want to use the DESeq package to compare a control (3 biological replicates) with a treatment (1 biological replicate).

In DESeq I therefore used the following code, and got 266 genes with padj < 0.05:

library(DESeq)
table <- read.delim("test.txt")
row.names(table) <- table$Feature_ID
count_table <- table[, -1]
conds <- c("ctrl", "ctrl", "ctrl", "treatment")
cds <- newCountDataSet(count_table, conds)
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds, method="blind", sharingMode="fit-only")
results <- nbinomTest(cds, "ctrl", "treatment")

In DESeq2 I used the following commands, but got > 10000 genes with padj < 0.05:

library(DESeq2)
table <- read.delim("test.txt")
row.names(table) <- table$Feature_ID
count_table <- table[, -1]
colData <- DataFrame(condition=factor(c("ctrl", "ctrl", "ctrl", "treatment")))
dds <- DESeqDataSetFromMatrix(count_table, colData, formula(~ condition))
dds <- DESeq(dds, minReplicatesForReplace=Inf)
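The padj counts come from the standard results extraction (just a sketch of how I counted, no extra filtering):

res <- results(dds)
sum(res$padj < 0.05, na.rm=TRUE)   # > 10000 here, versus 266 with DESeq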

So probably I need to add extra parameters to the DESeq2 analysis, but for now I can't figure out which?

Thank you for helping

Wannes

Last edited by tompoes; 12-02-2015 at 11:31 AM.
Old 02-23-2016, 05:03 PM   #20
Sow
Member
 
Location: Washington

Join Date: Feb 2016
Posts: 16
Default

Hi,
Can anyone advise me whether it's okay to normalize my dataset before mapping my reads to the genome?

Thanks