SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DESeq Multi-factor designs: determining the significance of model terms DJParker Bioinformatics 8 07-21-2014 02:21 PM
DESeq2 vs EdgeR asuav RNA Sequencing 5 05-20-2014 10:25 AM
DESeq2 - understanding multi-factor DE analysis kajot RNA Sequencing 6 04-07-2014 04:45 AM
DESeq2 multi-factor designs id0 Bioinformatics 3 01-10-2014 06:25 AM
DESeq2: Multi-factor designs sindrle Bioinformatics 10 10-21-2013 07:47 AM

Reply
 
Thread Tools
Old 12-22-2014, 05:37 AM   #1
keysoon
Junior Member
 
Location: finland

Join Date: Nov 2012
Posts: 7
Default DESeq2 vs EdgeR for multi-factor designs

Hi,

I have 31 samples from three different breed groups where I am studying the effect of diet on breeds. I constructed multi-factor designs for DESeq2 and EdgeR as follows:
Quote:
countdata<-read.csv("htseq-merged_table.csv", row.names=1)
countdata<-countdata[1: (nrow(countdata)-5),]
sampleinfo<-read.csv("SampleTable_mod.csv")
sample<-sampleinfo$Sample
breed<-sampleinfo$Breed
diet<-sampleinfo$Diet
sampletable<-data.frame(Sample=sample, breed=breed, diet=diet)
DESeq2:
Quote:
colnames(countdata)<-sampletable$Sample
ddsMat<-DESeqDataSetFromMatrix(countData=countdata, colData = sampletable, design = ~breed+diet+breed:diet )
dds<-DESeq(ddsMat)
resultsNames(dds)
#[1] "Intercept" "breedFIN" "breedTEX" "breedFXT" "dietFLU"
#[6] "dietCON" "breedFIN.dietFLU" "breedTEX.dietFLU" "breedFXT.dietFLU" "breedFIN.dietCON"
#[11] "breedTEX.dietCON" "breedFXT.dietCON"

FINTEXFlu<-results(dds, contrast=list("breedFIN.dietFLU","breedTEX.dietFLU"))
FINTEXFluSig<-subset(FINTEXFlu, padj < 0.1) #(36)
FINTEXFluSig<-FINTEXFluSig[abs(FINTEXFluSig$log2FoldChange) >=1,]
nrow(FINTEXFluSig)
25

EdgeR:
Quote:
group<-factor(paste(breed, diet, sep="."))
ED<-DGEList(counts=countdata, group=group)
Y<-calcNormFactors(ED)
design<-model.matrix(~0+group)
colnames(design)<-levels(group)
Y<-estimateGLMCommonDisp(Y, verbose=TRUE)
Y<-estimateGLMTrendedDisp(Y, design)
Y<-estimateGLMTagwiseDisp(Y, design)
fit<-glmFit(Y, design)
mycontrast<-makeContrasts(FINF.TEXF=FIN.FLU-TEX.FLU, FINF.FXTF=FIN.FLU-FXT.FLU, TEXF.FXTF=TEX.FLU-FXT.FLU, levels=design)
FINTEXF<-(glmLRT(fit, contrast=mycontrast[,"FINF.TEXF"]))
summary(Y<-decideTestsDGE(FINTEXF, p=0.1)) # (-98,+480)
ttFINTEXF<-topTags(FINTEXF, n=sum(98+480))
sigFINTEXF<-ttFINTEXF[abs(ttFINTEXF$table$logFC) >=1.0,]
nrow(sigFINTEXF)
519
Could anyone please tell me what made such a big difference in results from two different programs? I would appreciate your help.

Thank you very much!

Last edited by keysoon; 12-22-2014 at 07:06 AM.
keysoon is offline   Reply With Quote
Old 12-22-2014, 06:01 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

DESeq2 is doing some additional things, such as looking for samples with too much leverage and then excluding them/those genes. If you look at some of the genes called DE by edgeR but not by DESeq2, I suspect that some of them will be flagged due to cook's cutoff distance.
dpryan is offline   Reply With Quote
Old 12-23-2014, 12:56 AM   #3
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 109
Default

Dear Keysoon

I suggest investigating the results and the input data for those genes that (i) only edgeR finds, (ii) only DESeq2 finds, (iii) both find.

Devon's hypothesis is pertinent. In addition, it could be a threshold effect. I tend to find it useful to plot statistics (such as estimated fold changes, p-values) from the two methods against each other to see whether the difference is large or is just a numerical wiggle.

In addition, DESeq2 offers banded hypothesis testing, i.e you can test directly against the null hypothesis |beta|<=1, rather than (the more traditional) beta=0, and then you don't need to apply the post hoc logFC filter. This is statistically more efficient. See DESeq2's vignette.

Kind regards
Wolfgang
__________________
Wolfgang Huber
EMBL

Last edited by Wolfgang Huber; 12-23-2014 at 01:00 AM.
Wolfgang Huber is offline   Reply With Quote
Old 01-11-2015, 09:34 PM   #4
tirohia
Member
 
Location: Auckland, NZ

Join Date: Nov 2011
Posts: 46
Default

Hi.

I have a similar situation where I am getting large (2 orders of magnitude) difference between DESeq D.E. estimates and EdgeR D.E. estimates. Not even going to start talking about the huge numbers limma/voom are giving me despite being told by multiple people that I should be using that as well.

I've got 3 samples from each two positions on an inoculated plant stem, one close to the inoculation site, one distant. Repeated for treatment and control, so 12 samples all up.

So the DESeq analysis, I filtered on the log2FC. When I run it with banded hypothesis testing, I'm getting different results - I would have thought they would have been similar/the same? As an example, when I'm comparing two of the samples, I'm getting 2 vs 6 for up regulated genes, but 162 vs 348 down, banded testing vs filtering on the log2FC respectively. How do/can I figure out which is better? I assuming I'm doing the banded hypothesis correctly :

Code:
res <- results(dds, lfcThreshold=2, altHypothesis="greaterAbs")
res <- na.omit(res)
res=res[res$padj < 0.1,]
plotMA(res, ylim=yl)
res
As for comparing DESeq and EdgeR, there's the suggestion to plot the various statistics from each method. I'm doing this on the overlap - the genes identified by both DESeq and EdgeR.
The adjusted p values from each method for all of them are low. Very low. The log2FC though differ quite a bit though. I'm assuming this would due to differences in the way the two methods model and adjust the read counts. What sort of magnitude of difference would constitute something to worry about? I'm looking at DESeq averaging values around -2 and EdgeR averaging values around -4/-5. Which seems quite big to me.

I had a quick look at the base means - a number of them are quite low - between 100 -500. Which seems low, but from what I gather, isn't to horrendously low.

I think what I'm trying to ask (apologies for the rambling statement-like question) is, given the different ways that DESeq and EdgeR work, what constitutes a numerical wiggle and what constitutes a large and therefore possibly due to something else, difference.

Is it worth looking through dds at the cooks value of all the genes that EdgeR picks up but DESeq discards?

Cheers
Ben.
tirohia is offline   Reply With Quote
Old 01-12-2015, 04:11 AM   #5
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

keysoon,

You are not comparing the same test for edgeR and DESeq2.

For edgeR, you built a group variable which combines breed and diet information, and are then testing the difference between two groups of samples. This is testing, is there a difference between the group of samples in FIN.FLU and in TEX.FLU. As far as I can tell, this test is not involving the diet effect, because it does not involve the CON samples.

For DESeq2, you are testing interaction terms, which test whether the breed effect is *different* in the flu group.

Maybe you can specify the exact hypothesis you want to test.
Michael Love is offline   Reply With Quote
Old 01-12-2015, 04:17 AM   #6
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi tirohia,

We recommend plotting the unfiltered res object with plotMA, so you can see the genes with small padj relative to the rest of genes.

We recommend banded testing over filtering on the log2FC.

You can use the summary(res) function to get an idea how many outliers are being detected, assuming you are using an up-to-date version of DESeq2.
Michael Love is offline   Reply With Quote
Old 01-12-2015, 04:18 AM   #7
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi tirohia,

Maybe you can start a new post, as the two questions here concern different issues. This will make the thread easier to follow.
Michael Love is offline   Reply With Quote
Old 01-12-2015, 05:05 AM   #8
keysoon
Junior Member
 
Location: finland

Join Date: Nov 2012
Posts: 7
Default

Quote:
Originally Posted by Michael Love View Post
keysoon,

You are not comparing the same test for edgeR and DESeq2.

For edgeR, you built a group variable which combines breed and diet information, and are then testing the difference between two groups of samples. This is testing, is there a difference between the group of samples in FIN.FLU and in TEX.FLU. As far as I can tell, this test is not involving the diet effect, because it does not involve the CON samples.

For DESeq2, you are testing interaction terms, which test whether the breed effect is *different* in the flu group.

Maybe you can specify the exact hypothesis you want to test.
Hi Michael,

Thank you for noticing the actual problem. I was not quite confident with the EdgeR design. In the experiment, I am interested to see the effect of diet (FLU and CON) within and between breeds (FIN, TEX, FXT).

Best Regards,
Keysoon

Last edited by keysoon; 01-12-2015 at 05:17 AM.
keysoon is offline   Reply With Quote
Old 01-12-2015, 06:26 AM   #9
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

For simplicity, let's use the group variable for both softwares. In edgeR you are correct in including ~0+group, but in DESeq2 you should just use ~group, which allows moderation of the fold changes.

In DESeq2, the within-breed effect of diet FLU vs CON will be, for example for the FIN breed:

Code:
results(dds, contrast=c("group","FIN.FLU","FIN.CON"))
To test if the diet effect is different between two breeds, you are testing if the following contrast is zero:

(FIN.FLU - FIN.CON) - (TEX.FLU - TEX.CON)

By rearranging you get

(FIN.FLU + TEX.CON) - (FIN.CON + TEX.FLU)

This contrast can be tested in DESeq2 with:

Code:
results(dds, contrast=list(c("groupFIN.FLU","groupTEX.CON"),c("groupFIN.CON","groupTEX.FLU")))
There is another way to build contrasts with an interaction term, but to keep things simple and comparable, I'd recommend the above. You should be able to see how to form the edgeR contrasts this way.
Michael Love is offline   Reply With Quote
Old 01-13-2015, 04:04 AM   #10
keysoon
Junior Member
 
Location: finland

Join Date: Nov 2012
Posts: 7
Default

Hi Michael,

I modified the design according to your suggestion. I hope I followed it well. I have attached a part of the code below:
Code:
group<-factor(paste(breed, diet, sep="."))
sampletable<-cbind(sampletable, group=group)

#DESeq2
ddsMat<-DESeqDataSetFromMatrix(countData=countdata, colData = sampletable, design = ~group )
dds<-DESeq(ddsMat)
dTEX_Diet<-results(dds, contrast=c("group","TEX.FLU", "TEX.CON"))
dTEX_Diet_sig<-subset(dTEX_Diet, padj < 0.1)
dTEX_Diet_sig<-dTEX_Diet_sig[abs(dTEX_Diet_sig$log2FoldChange) >=1,]
nrow(dTEX_Diet_sig) 
148

dTEX_FXT_Diet<-results(dds, contrast=list(c("groupTEX.FLU","groupFXT.CON"), c("groupTEX.CON","groupFXT.FLU")))
dTEX_FXT_Diet_sig<-subset(dTEX_FXT_Diet, padj < 0.1)
dTEX_FXT_Diet_sig<-dTEX_FXT_Diet_sig[abs(dTEX_FXT_Diet_sig$log2FoldChange) >=1,]
nrow(dTEX_FXT_Diet_sig)
159

#EdgeR
cpm<-cpm(countdata)
keep<-rowSums(cpm>1)>=5
ecountdata<-countdata[keep,]
ED<-DGEList(counts=ecountdata, group=group)
Y<-calcNormFactors(ED)
design<-model.matrix(~0+group)
Y<-estimateGLMCommonDisp(Y, verbose=TRUE)
Y<-estimateGLMTrendedDisp(Y, design)
Y<-estimateGLMTagwiseDisp(Y, design)
fit<-glmFit(Y, design)

mycontrast<-makeContrasts(FIN_Diet=FIN.FLU-FIN.CON, TEX_Diet=TEX.FLU-TEX.CON, FXT_Diet=FXT.FLU-FXT.CON, FIN_TEX_Diet=(FIN.FLU-FIN.CCON)-(TEX.FLU-TEX.CON), 
       FIN_FXT_Diet=(FIN.FLU-FIN.CON)-(FXT.FLU-FXT.CON),TEX_FXT_Diet=(TEX.FLU-TEX.CON)-(FXT.FLU-FXT.CON), levels=design)

eTEX_Diet<-glmLRT(fit, contrast=mycontrast[,"TEX_Diet"])
summary(Y<-decideTestsDGE(eTEX_Diet, p=0.1))
eTEX_Diet_tt<-topTags(eTEX_Diet, n=sum(46+37))
eTEX_Diet_sig<-eTEX_Diet_tt[abs(eTEX_Diet_tt$table$logFC) >=1.0,]
nrow(eTEX_Diet_sig) 
76

eTEX_FXT_Diet<-glmLRT(fit, contrast=mycontrast[,"TEX_FXT_Diet"])
summary(Y<-decideTestsDGE(eTEX_FXT_Diet, p=0.1))
eTEX_FXT_Diet_tt<-topTags(eTEX_FXT_Diet, n=sum(52+191))
eTEX_FXT_Diet_sig<-eTEX_FXT_Diet_tt[abs(eTEX_FXT_Diet_tt$table$logFC) >=1.0,]
nrow(eTEX_FXT_Diet_sig)
243
Between two programs, 64 genes are common in TEX_Diet and 97 genes are common in TEX_FXT_Diet groups. Do you think it's normal to get that much of differences?

Thank you very much.

Best Regards,
Keysoon

Last edited by keysoon; 01-13-2015 at 04:15 AM.
keysoon is offline   Reply With Quote
Old 01-13-2015, 05:04 AM   #11
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

Yes that's normal.
Michael Love is offline   Reply With Quote
Reply

Tags
deseq2, edger, multi-factor design

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:51 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO