Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
DEXseq estimateDispersions error dingdian110 Bioinformatics 9 01-07-2015 06:22 AM
DEXSeq results: an weird gene Jerry_Zhao RNA Sequencing 0 07-30-2014 07:32 AM
Weird MACS behavior feralBiologist Bioinformatics 0 11-28-2013 12:39 PM
DEXSeq error in estimateDispersions: match.arg(start.method, c("log(y)", "mean")) fpadilla Bioinformatics 14 07-03-2013 03:11 PM
DEXSeq estimateDispersions error dadada4ever Bioinformatics 14 11-20-2012 10:10 AM

Thread Tools
Old 09-11-2014, 01:10 PM   #1
Junior Member
Location: Berkeley, CA

Join Date: Sep 2014
Posts: 3
Default DEXSeq Partially Unreplicated Dataset: Weird EstimateDispersions behavior


I am having an issue with analyzing a next-gen sequencing data set for differential exon usage using DEXSeq. I have two main questions:

First, I have a partially unreplicated data set. I have created a reproducible example below where I have 3 cell lines each in two conditions, but only two of the 3 cell lines were completed in biological duplicate. The example is on random data, but my real data set is biologically derived and much larger; 42 cell lines with 37 unreplicated.

My first issue is that estimateDispersions produces different dispersion estimates when the unreplicated cell line is included or not included in the data set. I was under the impression that the the dispersion estimates should not be effected by unreplicated data, but this does not appear to be the case. I hope that I am missing something here; i.e. are my formulas correct? is there a reason that the dispersion estimates are different? are the unreplicated samples included in the baseMean calculation and that is changing the dispersion fitting?

Here is my code to produce this outcome:

suppressMessages( library(DEXSeq) )


sampNames <- c(
  'c.untreated.1', 'a.treated.1',
  'a.treated.2', 'b.treated.1',
  'b.treated.2', 'c.treated.1')

## generate random count data set
nGenes <- 100
nExons <- 10
nSamps <- 10
sampleData <- data.frame(
  condition=rep(c("untreated", "treated"), each=5),
  cellLine=rep(c('a','a','b', 'b', 'c'), 2))
design <- formula( ~ sample + exon + cellLine + cellLine:exon + condition:exon )
subDesign <- formula( ~ sample + exon + cellLine + cellLine:exon )
groupID <- unlist(lapply(1:nGenes, function(i) rep(paste('g', i), nExons)))
featureID <- rep(sapply(1:nExons, function(i) paste('e', i)), nGenes)
countData <- matrix(rpois(nGenes * nExons * nSamps, 100),
                    nrow=nGenes * nExons,
                    dimnames=list(groupID, sampNames))

## run DEXSeq with all samples
dxd <- DEXSeqDataSet(countData, sampleData, design, featureID, groupID)
dxd <- estimateSizeFactors(dxd)
## this throws a warning which I am assuming is b/c
## the data is not realistic for the model
dxd <- estimateDispersions(dxd)
dxd <- testForDEU(dxd, reducedModel = subDesign)
dxr1 <- DEXSeqResults(dxd)

## now use only replicated samples
sampleData2 <- sampleData[sampleData[,'cellLine'] != 'c',]
countData2 <- countData[,sampleData[,'cellLine'] != 'c']
dxd2 <- DEXSeqDataSet(countData2, sampleData2, design, featureID, groupID)
dxd2 <- estimateSizeFactors(dxd2)
dxd2 <- estimateDispersions(dxd2)
dxd2 <- testForDEU(dxd2, reducedModel =  subDesign)
dxr2 <- DEXSeqResults(dxd2)

## why is this not true if I have only removed an unreplicated sample
all.equal(dxr1$dispersion, dxr2$dispersion)

Second, once I get the first part to work how can I subset the DEXSeqDataSet in order to testForDEU across conditions, but only within a single cell line? This used to be possible in the old version of DEXSeq (, but I do not see it in the manual or in any forums for the current version of DEXSeq.

## then can I subset to testForDEU in only one cellLine?
## like this, but this throws an error
R> dxdSubA <- dxd
R> colData(dxdSubA) <- colData(dxdSubA)[colData(dxdSubA)$cellLine == 'a',]
Error in `colData<-`(`*tmp*`, value = <S4 object of class "DataFrame">) :
   'colData' nrow differs from 'assays' ncol
Here is my sessionInfo:
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.3.0 (64-bit)

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] DEXSeq_1.10.8           BiocParallel_0.6.1      DESeq2_1.4.5           
 [4] RcppArmadillo_0.4.400.0 Rcpp_0.11.2             GenomicRanges_1.16.4   
 [7] GenomeInfoDb_1.0.2      IRanges_1.22.10         Biobase_2.24.0         
[10] BiocGenerics_0.10.0    

loaded via a namespace (and not attached):
 [1] annotate_1.42.1      AnnotationDbi_1.26.0 BatchJobs_1.3       
 [4] BBmisc_1.7           biomaRt_2.20.0       Biostrings_2.32.1   
 [7] bitops_1.0-6         brew_1.0-6           checkmate_1.4       
[10] codetools_0.2-9      DBI_0.3.0            digest_0.6.4        
[13] fail_1.2             foreach_1.4.2        genefilter_1.46.1   
[16] geneplotter_1.42.0   grid_3.1.1           hwriter_1.3.2       
[19] iterators_1.0.7      lattice_0.20-29      locfit_1.5-9.1      
[22] RColorBrewer_1.0-5   RCurl_1.95-4.3       Rsamtools_1.16.1    
[25] RSQLite_0.11.4       sendmailR_1.1-2      splines_3.1.1       
[28] statmod_1.4.20       stats4_3.1.1         stringr_0.6.2       
[31] survival_2.37-7      tools_3.1.1          XML_3.98-1.1        
[34] xtable_1.7-3         XVector_0.4.0        zlibbioc_1.10.0
Answers to either part would be great! (I hope it made sense to combine these posts as I think they are related) Thank you to anyone for the help in advance!
marcus1487 is offline   Reply With Quote
Old 09-11-2014, 05:25 PM   #2
Junior Member
Location: Berkeley, CA

Join Date: Sep 2014
Posts: 3

So I was able to get a work-able solution to the second half of my problem, but I am still not sure if it is producing accurate results. Here is the function I have produced to testForDEU on each cellLine independently using globally fit dispersions. The dxd object is taken from the code in my previous post.

fitOneCellLine <- function(cellLine){
  sampleData1 <- sampleData[sampleData$cellLine == cellLine,]
  countData1 <- countData[,sampleData$cellLine == cellLine]
  dxd1 <- DEXSeqDataSet(
    countData1, sampleData1, formula( ~ sample + exon + condition:exon),
    featureID, groupID)
  dxd1 <- estimateSizeFactors(dxd1)

  ## add dispersions from rep samples
  dxdDims <- dim(mcols(rowData(dxd)))[2]
  mcols(rowData(dxd1))[5:dxdDims] <- mcols(rowData(dxd))[5:dxdDims]
  mcols(mcols(dxd1)) <- mcols(mcols(dxd))
  dxd1@dispersionFunction <- dxd@dispersionFunction
  dxd1 <- testForDEU(dxd1, reducedModel =  ~ sample + exon)

aRes <- fitOneCellLine('a')
bRes <- fitOneCellLine('b')
cRes <- fitOneCellLine('c')
I note that the three results objects have different p-values, but I am still not sure if I have transferred all of the appropriate information into the cellLine specific object before testing for DEU. I am mostly concerned that this workflow is not producing accurate results. I am not sure what to test against either as this is the only way I could get cellLine specific results.

Also I was not able to produce the fold change values here, but in my data set generated using the DEXSeqDataSetFromHTSeq function I am able to estimate fold changes. I am not sure what else is different between this randomly generated data set and my biological data set aside from the generator function, but in any case that is not a huge concern of mine.
marcus1487 is offline   Reply With Quote
Old 09-23-2014, 07:39 AM   #3
Senior Member
Location: Heidelberg

Join Date: Aug 2010
Posts: 165

Dear Marcus,

Of course the optimal is that you would have replicates for the third cell line. But what you could also try is to estimate the dispersions using only the cell lines for which you have replicates, and pass this dispersion estimates to an object with all the data. This would assume that the variability between each of the celllines is similar... although I don't know how true this is.

With regards to the implementation, testForDEU requires the columns "allZero", "dispGeneEst", "dispFit", "dispersion","dispMAP", "dispOutlier". You could pass these columns to the object used for the testing independently of how you estimate the dispersions before this step and this should work.
areyes is offline   Reply With Quote
Old 09-23-2014, 03:45 PM   #4
Junior Member
Location: Berkeley, CA

Join Date: Sep 2014
Posts: 3

Hi areyes,

Thank you for your response.

In terms of the assumption that variability between each of the cell lines is similar, isn't this just the same assumption that is made when variability is estimated across all samples even when they are all replicated?

To spell it out a little more, if we are testing for the effect of a condition across cell lines What is the difference between these two situations:

1) Estimating variability across a set of replicated cell line - condition combinations and then testing for differential expression between conditions within a single replicated cell line.
2) Estimating variability across a set of replicated cell line - condition combinations and then testing for differential expression between conditions within a separate UN-replicated cell line.

I may be wrong, but I think the only difference is the additional power from an additional experiment in the replicated experiment (assuming common read coverage). I guess the question is: Is there reason to share dispersion estimates between different conditions when comparing differential expression across another condition.

In terms of the implementation details this is incredibly helpful! Thank you so much for confirming that this works correctly assuming the above assumptions are valid.
marcus1487 is offline   Reply With Quote
Old 09-24-2014, 01:15 AM   #5
Senior Member
Location: Heidelberg

Join Date: Aug 2010
Posts: 165

Hi Marcus,

You are right, so far one dispersion estimate is estimated for all the levels of the factorial designs.

I think that if you include the un-replicated cell-line when estimating the dispersions, the final dispersion estimates might be underestimated so it might be better to use only the replicated samples for this step.

Yes, including more samples should give you additional power. But the think that I would keep in mind is that it would be not possible to estimate the variability in the un-replicated cell-line, so its hard to know if the dispersion estimation based on the replicated cell-lines models well the variability of the un-replicated cell-line. I would just try and see how much additional power to get by including the un-replicated cell-line.

On other topics, I just read the formulas of your code again:

design <- formula( ~ sample + exon + cellLine + cellLine:exon + condition:exon )
subDesign <- formula( ~ sample + exon + cellLine + cellLine:exon )

In this formulas, you don't have to include the cellLine effect alone, you can just do:

design <- formula( ~ sample + exon + cellLine:exon + condition:exon )
subDesign <- formula( ~ sample + exon + cellLine:exon )

to consider the variability in exon usage due to the different cell-lines.
areyes is offline   Reply With Quote

dexseq, estimatedispersions, rna-seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 09:53 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO