I recently posted a reply in another thread because I thought my question was similar; apparently it wasn't, so apologies for any confusion.
I have a couple of questions about the size of the difference in the number of genes that DESeq2 and edgeR call differentially expressed.
I have 12 samples: four sets of triplicates. Two sets are from around a wound site on a cut plant stem, one treated with a bacterial pathogen and one not; the other two are from a point distant from the wound site, again one treated with the pathogen and one not. All samples were taken two days after inoculation. This is a small exploratory experiment that will help in designing further experiments mapping out the initial course of the infection.
When I compare any two sites using DESeq2 I get a certain number of D.E. genes; when I make the same comparison with edgeR, the number identified is two to three orders of magnitude greater.
On the off chance that I've done something horrendously wrong, the code I'm using for DESeq2 is as follows:
Code:
design <- data.frame(
    row.names = colnames(counts),
    condition = c("Psa", "Psa", "Psa", "PsaD", "PsaD",
                  "Control", "Control", "Control", "ControlD", "ControlD"),
    libtype   = rep("paired-end", 10)
)
colData <- data.frame(row.names = colnames(counts),
                      condition = as.factor(design$condition))
dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData,
                              design = ~ condition)
dds <- DESeq(dds)
# threshold test: null hypothesis is |log2FC| <= 2
results <- results(dds, lfcThreshold = 2, altHypothesis = "greaterAbs")
res <- na.omit(results)
res <- res[res$padj < 0.05, ]
This gives me 48 D.E. genes, mostly down-regulated. As advised, I've used the threshold-style hypothesis test (lfcThreshold = 2 with altHypothesis = "greaterAbs") and, unless I'm reading it wrong, this should give me genes with a log2FC beyond +/-2.

For edgeR,
Code:
group <- factor(c("Psa", "Psa", "Psa", "PsaD", "PsaD",
                  "Sdw", "Sdw", "Sdw", "SdwD", "SdwD"))
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)
design <- model.matrix(~ 0 + group, data = y$samples)
colnames(design) <- levels(y$samples$group)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, contrast = c(-1, 0, 1, 0))   # Sdw - Psa; tests logFC = 0
tags <- as.data.frame(topTags(lrt, n = 39039, sort.by = "logFC"))
tags <- tags[tags$FDR < 0.05, ]
tagsUp   <- tags[tags$logFC > 2, ]   # logFC above +2
tagsDown <- tags[tags$logFC < 2, ]   # logFC below +2, i.e. everything under the up cutoff
This gives me 97 D.E. genes up-regulated and 3134 down-regulated.

In the previous thread I commented in, it was suggested that I plot various statistics from the two methods and compare them, to see whether there were large differences or just small numerical variation. The adjusted p-values from the two were essentially the same, i.e. very, very low. There was some difference in the estimated fold changes, with edgeR's logFC coming out almost twice as high as DESeq2's (~4-5 vs ~2). I don't know whether that constitutes a large difference, though it seems quite high to me, given that it's a logFC.
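(Writing this up, I notice one asymmetry in my own code: the DESeq2 call tests against the 2-fold threshold directly via lfcThreshold, whereas the edgeR code tests logFC = 0 and only filters on logFC afterwards. If I'm reading the edgeR docs right, the closer equivalent would be glmTreat; a sketch, assuming the fit object from above:)

```r
# Sketch of a threshold test in edgeR, analogous to DESeq2's lfcThreshold = 2.
# glmTreat tests H0: |logFC| <= lfc, instead of filtering topTags post hoc.
trt <- glmTreat(fit, contrast = c(-1, 0, 1, 0), lfc = 2)
tt  <- as.data.frame(topTags(trt, n = Inf))
tt  <- tt[tt$FDR < 0.05, ]
```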
It was also suggested that a large discrepancy could be due to the Cook's distance cutoff that DESeq2 uses, so I got the unfiltered DESeq2 results with:
Code:
rresults <- results(dds, contrast = c("condition", "Control", "Psa"),
                    lfcThreshold = 2, altHypothesis = "greaterAbs",
                    independentFiltering = FALSE, cooksCutoff = FALSE)
This didn't really change the number of genes DESeq2 found (an extra 10), which I assumed would rule out the Cook's cutoff as the cause of the discrepancy. However, if I collect the list of genes that edgeR identifies in a variable called edgerDown and plot the corresponding Cook's distances from DESeq2,

Code:
cooks <- assays(dds)[["cooks"]]   # per-gene, per-sample Cook's distances
plot(factor(edgerDown), cooks[edgerDown, 1:5], ylim = c(0, 10))
there does look to be a large number of genes with at least one sample resembling an outlier. I don't know whether those outliers are enough for DESeq2 to set their p-values to NA, because I can't work out from the DESeq2 documentation how to find the actual Cook's cutoff value (so that I can see where the 99th percentile falls).

So.
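For what it's worth, my reading of the DESeq2 vignette is that the default Cook's cutoff is the 0.99 quantile of the F(p, m - p) distribution, where p is the number of model coefficients and m the number of samples. If that's right, a sketch of the calculation (the values of m and p here are just my assumption from the design above):

```r
# Default Cook's distance cutoff per the DESeq2 vignette:
# the 99th percentile of F(p, m - p), p = model coefficients, m = samples.
m <- 10  # samples in this comparison (assumed from the design above)
p <- 4   # coefficients for ~ condition with four levels (assumed)
cutoff <- qf(0.99, df1 = p, df2 = m - p)
cutoff   # genes where any sample's Cook's distance exceeds this get padj = NA
```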
Any thoughts on where the huge discrepancy might be coming from, or on how I could go about tracking it down? Obviously I'm not expecting the two methods to agree exactly, but I would have thought they'd be in the same general ballpark. Incidentally, if I use limma/voom, the order-of-magnitude difference increases again :-/
Also, what constitutes a large difference in the fold changes calculated by the two methods?
Cheers
Ben.