Hi,
I am having an issue with analyzing a next-gen sequencing data set for differential exon usage using DEXSeq. I have two main questions:
First, I have a partially unreplicated data set. I have created a reproducible example below where I have 3 cell lines each in two conditions, but only two of the 3 cell lines were completed in biological duplicate. The example is on random data, but my real data set is biologically derived and much larger; 42 cell lines with 37 unreplicated.
My first issue is that estimateDispersions produces different dispersion estimates when the unreplicated cell line is included or not included in the data set. I was under the impression that the the dispersion estimates should not be effected by unreplicated data, but this does not appear to be the case. I hope that I am missing something here; i.e. are my formulas correct? is there a reason that the dispersion estimates are different? are the unreplicated samples included in the baseMean calculation and that is changing the dispersion fitting?
Here is my code to produce this outcome:
Second, once I get the first part to work how can I subset the DEXSeqDataSet in order to testForDEU across conditions, but only within a single cell line? This used to be possible in the old version of DEXSeq (https://stat.ethz.ch/pipermail/bioco...ay/052479.html), but I do not see it in the manual or in any forums for the current version of DEXSeq.
Here is my sessionInfo:
Answers to either part would be great! (I hope it made sense to combine these posts as I think they are related) Thank you to anyone for the help in advance!
I am having an issue with analyzing a next-gen sequencing data set for differential exon usage using DEXSeq. I have two main questions:
First, I have a partially unreplicated data set. I have created a reproducible example below where I have 3 cell lines each in two conditions, but only two of the 3 cell lines were completed in biological duplicate. The example is on random data, but my real data set is biologically derived and much larger; 42 cell lines with 37 unreplicated.
My first issue is that estimateDispersions produces different dispersion estimates when the unreplicated cell line is included or not included in the data set. I was under the impression that the the dispersion estimates should not be effected by unreplicated data, but this does not appear to be the case. I hope that I am missing something here; i.e. are my formulas correct? is there a reason that the dispersion estimates are different? are the unreplicated samples included in the baseMean calculation and that is changing the dispersion fitting?
Here is my code to produce this outcome:
Code:
suppressMessages( library(DEXSeq) ) set.seed(110011) sampNames <- c( 'a.untreated.1','a.untreated.2', 'b.untreated.1','b.untreated.2', 'c.untreated.1', 'a.treated.1', 'a.treated.2', 'b.treated.1', 'b.treated.2', 'c.treated.1') ## generate random count data set nGenes <- 100 nExons <- 10 nSamps <- 10 sampleData <- data.frame( row.names=sampNames, condition=rep(c("untreated", "treated"), each=5), cellLine=rep(c('a','a','b', 'b', 'c'), 2)) design <- formula( ~ sample + exon + cellLine + cellLine:exon + condition:exon ) subDesign <- formula( ~ sample + exon + cellLine + cellLine:exon ) groupID <- unlist(lapply(1:nGenes, function(i) rep(paste('g', i), nExons))) featureID <- rep(sapply(1:nExons, function(i) paste('e', i)), nGenes) countData <- matrix(rpois(nGenes * nExons * nSamps, 100), nrow=nGenes * nExons, dimnames=list(groupID, sampNames)) ## run DEXSeq with all samples dxd <- DEXSeqDataSet(countData, sampleData, design, featureID, groupID) dxd <- estimateSizeFactors(dxd) ## this throws a warning which I am assuming is b/c ## the data is not realistic for the model dxd <- estimateDispersions(dxd) dxd <- testForDEU(dxd, reducedModel = subDesign) dxr1 <- DEXSeqResults(dxd) ## now use only replicated samples sampleData2 <- sampleData[sampleData[,'cellLine'] != 'c',] countData2 <- countData[,sampleData[,'cellLine'] != 'c'] dxd2 <- DEXSeqDataSet(countData2, sampleData2, design, featureID, groupID) dxd2 <- estimateSizeFactors(dxd2) dxd2 <- estimateDispersions(dxd2) dxd2 <- testForDEU(dxd2, reducedModel = subDesign) dxr2 <- DEXSeqResults(dxd2) ## why is this not true if I have only removed an unreplicated sample all.equal(dxr1$dispersion, dxr2$dispersion)
Second, once I get the first part to work how can I subset the DEXSeqDataSet in order to testForDEU across conditions, but only within a single cell line? This used to be possible in the old version of DEXSeq (https://stat.ethz.ch/pipermail/bioco...ay/052479.html), but I do not see it in the manual or in any forums for the current version of DEXSeq.
Code:
## then can I subset to testForDEU in only one cellLine? ## like this, but this throws an error R> dxdSubA <- dxd R> colData(dxdSubA) <- colData(dxdSubA)[colData(dxdSubA)$cellLine == 'a',] Error in `colData<-`(`*tmp*`, value = <S4 object of class "DataFrame">) : 'colData' nrow differs from 'assays' ncol
Code:
R version 3.1.1 (2014-07-10) Platform: x86_64-apple-darwin13.3.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DEXSeq_1.10.8 BiocParallel_0.6.1 DESeq2_1.4.5 [4] RcppArmadillo_0.4.400.0 Rcpp_0.11.2 GenomicRanges_1.16.4 [7] GenomeInfoDb_1.0.2 IRanges_1.22.10 Biobase_2.24.0 [10] BiocGenerics_0.10.0 loaded via a namespace (and not attached): [1] annotate_1.42.1 AnnotationDbi_1.26.0 BatchJobs_1.3 [4] BBmisc_1.7 biomaRt_2.20.0 Biostrings_2.32.1 [7] bitops_1.0-6 brew_1.0-6 checkmate_1.4 [10] codetools_0.2-9 DBI_0.3.0 digest_0.6.4 [13] fail_1.2 foreach_1.4.2 genefilter_1.46.1 [16] geneplotter_1.42.0 grid_3.1.1 hwriter_1.3.2 [19] iterators_1.0.7 lattice_0.20-29 locfit_1.5-9.1 [22] RColorBrewer_1.0-5 RCurl_1.95-4.3 Rsamtools_1.16.1 [25] RSQLite_0.11.4 sendmailR_1.1-2 splines_3.1.1 [28] statmod_1.4.20 stats4_3.1.1 stringr_0.6.2 [31] survival_2.37-7 tools_3.1.1 XML_3.98-1.1 [34] xtable_1.7-3 XVector_0.4.0 zlibbioc_1.10.0
Comment