SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
comparing results by cuffdiff, edgeR, DESeq PFS Bioinformatics 5 03-12-2014 03:01 AM
DESeq: "NA" generated in the resulted differentially expressed genes idyll_ty RNA Sequencing 8 05-02-2012 03:28 PM
comparing Bowtie/DESeq and Top-Hat/Cufflinks results maryb Bioinformatics 5 03-13-2012 07:46 AM
DESeq results give extremely small p-values? chris Bioinformatics 11 08-29-2011 06:33 AM
Different results generated by using the same parameters sulicon Bioinformatics 2 09-20-2010 11:59 PM

Reply
 
Thread Tools
Old 04-08-2012, 08:11 PM   #1
kentnf
Member
 
Location: Ithaca

Join Date: Jan 2009
Posts: 26
Default Question about different results generated by DESeq.

Hi,

I have problem on DESeq, it generates different results using same dataset (without replicate), and almost same code.


Here is code for the 1st test:

Code:
conds <- factor( c( "c0","c6","c12","c24","c48","c72","c96","c120","t0","t6","t12","t24","t48","t72","t96","t120") )
cds<-newCountDataSet(countsTable, conds)
cds <- estimateSizeFactors( cds )
sizeFactors( cds )
cds <- estimateDispersions( cds, method="blind", sharingMode="fit-only" )
test0 <- nbinomTest( cds, "c24", "t24" )
write.table( test0, sep="\t", file="test0.txt" )
it generates 225 differentially expressed genes with adjust pvalue < 0.05.

But when i change the code for the 2nd test:

Code:
conds <- factor( c( "c0","c6","c12","c24","c48","c72","c96","c120","t0","t6","t12","t24","t48","t72","t96","t120") )
cds<-newCountDataSet(countsTable, conds)
cds <- estimateSizeFactors( cds )
sizeFactors( cds )
cds2 <- cds[ ,c( "c24", "t24" ) ]
cds2 <- estimateDispersions( cds2, method="blind", sharingMode="fit-only" )
test1 <- nbinomTest( cds2, "c24", "t24" )
write.table( test1, sep="\t", file="test1.txt" )
it just generates 84 differentially expressed genes.

The 2nd test just estimate the two samples related to the comparison, but the 1st test estimate all the samples.
I do not know why they are different, and I want to know which one is more reliable.

Thanks a lot.

By the way, before import dataset into DESeq, I usually to remove the genes without expression in each conditon.
Here is a example.

cond1 cond2 cond3 treat1 treat2 treat3
gene1 1 2 3 2 2 0
gene2 0 0 1 0 0 0
gene3 0 0 0 0 0 0 (removed)

For the dataset before, there are 225 DE genes identified at 1st test.
After removing none-expressed genes, 220 DE genes were identified with adjust pvalue < 0.05.

I want to know which method is correct.

Thanks again
kentnf is offline   Reply With Quote
Old 04-09-2012, 12:45 AM   #2
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 109
Default

Dear Kentnf,

in a nutshell, you are asking two different questions and so you get two different answers. In the first case, you estimate the dispersions from all samples, in the second, only from the two samples c24 and t24. The dispersions are parameters that DESeq estimates for each gene and that represent the variance that DESeq's error model expects between replicate measurements. It seems that in test0, the dispersions are estimated to be smaller than in test1 - thus slightly smaller fold changes are already deemed significant, leading to more genes.

Which method is correct? The first point to note is that you are comparing between just two single samples, without replicates, so it is expected that the results are highly volatile. With sufficient number of replicates and good-quality data the differences should be less drastic. So, don't overinterprete your result from just these two samples, and try it also out for other samples and more replicates.

Second, you are asking two different questions, and it is up to you to decide which is the one you care about. This is not a technical question, it is one that only you can decide.

Third, ideally it should not matter much. Check the scatterplot of the -log10 (p-values) from the two tests against each other. I'd expect that they broadly agree (on the high values), and that much of the difference in number of genes that you see would be an effect of arbitrary thresholding (i.e. genes that just make it below p_adj<0.05 in one case might be just above in the other).

About the filtering of genes with overall low counts (i.e. sum of counts across all samples is so small that the gene anyway could never be significantly differently expressed): this is good practice. See Section 5 of the DESeq vignette.
__________________
Wolfgang Huber
EMBL
Wolfgang Huber is offline   Reply With Quote
Old 04-09-2012, 08:14 AM   #3
kentnf
Member
 
Location: Ithaca

Join Date: Jan 2009
Posts: 26
Default

Hi, Wolfgang Huber

I make a comparison between test0 and test1. All the DE genes in test1 are included in test0. I think that is the differnet variances on different estimations you metioned. Because our experiment has problem on design. So we can not add more replicates. Base on your suggestions and the current results, I'd like to use test1 code to perform analysis for it show more reliable than test0. And I should remove the genes without expression.

If there is any problem, please give me some advice.

Thanks again for your help.
kentnf is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:33 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO