Dear all,
I have applied DESeq with independant filtering on a 454 dataset with 2 conditions, and no replicate. As the first aim was transcriptome annotation (sp without reference genome), we used 2 contrasted conditions in order to increase the range of expression. However, as additional objective, we would like to explore differential expression as a preliminary survey (which will be followed by a RNAseq on Illumina Hiseq2000 data, WITH replication).
I have 3 questions:
1. I think I do not understand clearly how to decide on the proportion of genes to be filtered (0.4 in the example). What is it based on exactly ?
2. I used two ways to do what I think is equivalent in terms of analysis, yet obtained (slightly) different results. First, I applied model comparison nbinomGLMTest (counts~factor condition vs counts~1) to get adjusted p-values corresponding to the difference between conditions. I did that for the whole dataset and for the filtered one, and compared results (in terms of padj<.1).
Second, as I only have this factor with two levels and no replicate, I did 2 separate nbinomTest analyses, i.e., on the whole dataset, and on the filtered one, respectively, and compared the corresponding adjusted p values. I was surprised to get different results. Is it simply because I did not use exactly the same size factors for filtered data ? (in the first case, I used filtered data, after pvalues were calculated on normalized data, whereas in the second case, size factors were estimated on the filtered dataset, before pvalues were calculated).
3. As I got only 3 extra DE genes thanks to filtering, I am wondering if dispersion was estimated correctly with:
cdstot<-estimateDispersions(cdstot, method="blind", sharingMode="fit-only", fitType="local")
Pvalues before BH adjustment were already much higher than with a Fisher exact test, and this surprised me a little.
Thanks in advance for any help on these topics,
Agnès
(see R listing below)
> fin<-read.table("lsdenext.txt", header=T, row.names=1)
> attach(fin)
> #26 avril 2012#
> conds<-factor(c("nb_readC", "nb_readD"))
> conds
[1] nb_readC nb_readD
Levels: nb_readC nb_readD
> cdstot<-newCountDataSet(fin, conds)
> cdstot<-estimateSizeFactors(cdstot, locfunc=shorth)
> sizeFactors(cdstot)
nb_readC nb_readD
1.1238156 0.8903734
> cdstot<-estimateDispersions(cdstot, method="blind", sharingMode="fit-only", fitType="local")
> restot<-nbinomTest(cdstot, "nb_readC", "nb_readD")
> head(restot)
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
1 AB070347.p.ls.1 0.4449128 0.8898257 0.000000 0.0000000 -Inf 1.0000000 1
2 AB083656.p.ls.1 8.5916778 11.5677341 5.615622 0.4854556 -1.0425887 0.5329552 1
3 AB159144.p.ls.1 31.3472863 43.6014592 19.093113 0.4379008 -1.1913240 0.2058511 1
4 AB159145.p.ls.1 2.6911615 0.8898257 4.492497 5.0487385 2.3359230 0.4856913 1
5 AB159147.p.ls.1 2.1295993 0.8898257 3.369373 3.7865539 1.9208855 0.6508927 1
6 AB159148.p.ls.1 22.0258009 20.4659910 23.585611 1.1524294 0.2046784 0.8848194 1
> rs<-rowSums(counts(cdstot))
> use<-(rs>quantile(rs, 0.4))
> table(use)
use
FALSE TRUE
6799 10194
> cdsfil<-cdstot[use,]
> fitfilcond<-fitNbinomGLMs(cdsfil,count~conds)
..........
> fitfil0<-fitNbinomGLMs(cdsfil,count~1)
..........
> pvalfilt<-nbinomGLMTest(fitfilcond, fitfil0)
> padjfilt<-p.adjust(pvalfilt, method="BH")
> fittotcond<-fitNbinomGLMs(cdstot,count~conds)
................
> fittot0<-fitNbinomGLMs(cdstot,count~1)
................
> pvaltot<-nbinomGLMTest(fittotcond, fittot0)
> padjtot<-p.adjust(pvaltot, method="BH")
>
> padjfiltForCompar=rep(+Inf, length(padjtot))
> padjfiltForCompar[use]=padjfilt
> tab=table(`nofi`=padjtot<.1, `fi`=padjfiltForCompar<.1)
> addmargins(tab)
fi
nofi FALSE TRUE Sum
FALSE 16987 3 16990
TRUE 0 3 3
Sum 16987 6 16993
> cdsfil<-estimateSizeFactors(cdsfil, locfunc=shorth)
> sizeFactors(cdsfil)
nb_readC nb_readD
1.1203877 0.8924126
> cdsfil<-estimateDispersions(cdsfil, method="blind", sharingMode="fit-only", fitType="local")
> resfil<-nbinomTest(cdsfil,"nb_readC", "nb_readD")
> head(resfil)
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
1 AB083656.p.ls.1 8.602958 11.6031267 5.602790 0.4828690 -1.0502964 0.5273125 1
2 AB159144.p.ls.1 31.392174 43.7348621 19.049485 0.4355675 -1.1990317 0.1938215 1
3 AB159145.p.ls.1 2.687390 0.8925482 4.482232 5.0218372 2.3282153 0.4790481 1
4 AB159147.p.ls.1 2.127111 0.8925482 3.361674 3.7663779 1.9131778 0.6544218 1
5 AB159148.p.ls.1 22.030163 20.5286087 23.531717 1.1462889 0.1969707 0.8889670 1
6 AB159149.p.ls.1 9.723516 11.6031267 7.843906 0.6760165 -0.5648695 0.7486689 1
> hist(resfil$padj)
> min(resfil$padj)
[1] 0.03376218
> comp3=rep(+Inf, length(restot$padj))
> comp3[use]=resfil$padj
> tab<-table(`nofi`=restot$padj<.1,`fi`=comp3<.1)
> addmargins(tab)
fi
nofi FALSE TRUE Sum
FALSE 16990 2 16992
TRUE 0 1 1
Sum 16990 3 16993
I have applied DESeq with independant filtering on a 454 dataset with 2 conditions, and no replicate. As the first aim was transcriptome annotation (sp without reference genome), we used 2 contrasted conditions in order to increase the range of expression. However, as additional objective, we would like to explore differential expression as a preliminary survey (which will be followed by a RNAseq on Illumina Hiseq2000 data, WITH replication).
I have 3 questions:
1. I think I do not understand clearly how to decide on the proportion of genes to be filtered (0.4 in the example). What is it based on exactly ?
2. I used two ways to do what I think is equivalent in terms of analysis, yet obtained (slightly) different results. First, I applied model comparison nbinomGLMTest (counts~factor condition vs counts~1) to get adjusted p-values corresponding to the difference between conditions. I did that for the whole dataset and for the filtered one, and compared results (in terms of padj<.1).
Second, as I only have this factor with two levels and no replicate, I did 2 separate nbinomTest analyses, i.e., on the whole dataset, and on the filtered one, respectively, and compared the corresponding adjusted p values. I was surprised to get different results. Is it simply because I did not use exactly the same size factors for filtered data ? (in the first case, I used filtered data, after pvalues were calculated on normalized data, whereas in the second case, size factors were estimated on the filtered dataset, before pvalues were calculated).
3. As I got only 3 extra DE genes thanks to filtering, I am wondering if dispersion was estimated correctly with:
cdstot<-estimateDispersions(cdstot, method="blind", sharingMode="fit-only", fitType="local")
Pvalues before BH adjustment were already much higher than with a Fisher exact test, and this surprised me a little.
Thanks in advance for any help on these topics,
Agnès
(see R listing below)
> fin<-read.table("lsdenext.txt", header=T, row.names=1)
> attach(fin)
> #26 avril 2012#
> conds<-factor(c("nb_readC", "nb_readD"))
> conds
[1] nb_readC nb_readD
Levels: nb_readC nb_readD
> cdstot<-newCountDataSet(fin, conds)
> cdstot<-estimateSizeFactors(cdstot, locfunc=shorth)
> sizeFactors(cdstot)
nb_readC nb_readD
1.1238156 0.8903734
> cdstot<-estimateDispersions(cdstot, method="blind", sharingMode="fit-only", fitType="local")
> restot<-nbinomTest(cdstot, "nb_readC", "nb_readD")
> head(restot)
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
1 AB070347.p.ls.1 0.4449128 0.8898257 0.000000 0.0000000 -Inf 1.0000000 1
2 AB083656.p.ls.1 8.5916778 11.5677341 5.615622 0.4854556 -1.0425887 0.5329552 1
3 AB159144.p.ls.1 31.3472863 43.6014592 19.093113 0.4379008 -1.1913240 0.2058511 1
4 AB159145.p.ls.1 2.6911615 0.8898257 4.492497 5.0487385 2.3359230 0.4856913 1
5 AB159147.p.ls.1 2.1295993 0.8898257 3.369373 3.7865539 1.9208855 0.6508927 1
6 AB159148.p.ls.1 22.0258009 20.4659910 23.585611 1.1524294 0.2046784 0.8848194 1
> rs<-rowSums(counts(cdstot))
> use<-(rs>quantile(rs, 0.4))
> table(use)
use
FALSE TRUE
6799 10194
> cdsfil<-cdstot[use,]
> fitfilcond<-fitNbinomGLMs(cdsfil,count~conds)
..........
> fitfil0<-fitNbinomGLMs(cdsfil,count~1)
..........
> pvalfilt<-nbinomGLMTest(fitfilcond, fitfil0)
> padjfilt<-p.adjust(pvalfilt, method="BH")
> fittotcond<-fitNbinomGLMs(cdstot,count~conds)
................
> fittot0<-fitNbinomGLMs(cdstot,count~1)
................
> pvaltot<-nbinomGLMTest(fittotcond, fittot0)
> padjtot<-p.adjust(pvaltot, method="BH")
>
> padjfiltForCompar=rep(+Inf, length(padjtot))
> padjfiltForCompar[use]=padjfilt
> tab=table(`nofi`=padjtot<.1, `fi`=padjfiltForCompar<.1)
> addmargins(tab)
fi
nofi FALSE TRUE Sum
FALSE 16987 3 16990
TRUE 0 3 3
Sum 16987 6 16993
> cdsfil<-estimateSizeFactors(cdsfil, locfunc=shorth)
> sizeFactors(cdsfil)
nb_readC nb_readD
1.1203877 0.8924126
> cdsfil<-estimateDispersions(cdsfil, method="blind", sharingMode="fit-only", fitType="local")
> resfil<-nbinomTest(cdsfil,"nb_readC", "nb_readD")
> head(resfil)
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
1 AB083656.p.ls.1 8.602958 11.6031267 5.602790 0.4828690 -1.0502964 0.5273125 1
2 AB159144.p.ls.1 31.392174 43.7348621 19.049485 0.4355675 -1.1990317 0.1938215 1
3 AB159145.p.ls.1 2.687390 0.8925482 4.482232 5.0218372 2.3282153 0.4790481 1
4 AB159147.p.ls.1 2.127111 0.8925482 3.361674 3.7663779 1.9131778 0.6544218 1
5 AB159148.p.ls.1 22.030163 20.5286087 23.531717 1.1462889 0.1969707 0.8889670 1
6 AB159149.p.ls.1 9.723516 11.6031267 7.843906 0.6760165 -0.5648695 0.7486689 1
> hist(resfil$padj)
> min(resfil$padj)
[1] 0.03376218
> comp3=rep(+Inf, length(restot$padj))
> comp3[use]=resfil$padj
> tab<-table(`nofi`=restot$padj<.1,`fi`=comp3<.1)
> addmargins(tab)
fi
nofi FALSE TRUE Sum
FALSE 16990 2 16992
TRUE 0 1 1
Sum 16990 3 16993
Comment