Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DEXSeq Partially Unreplicated Dataset: Weird EstimateDispersions behavior

    Hi,

    I am having an issue with analyzing a next-gen sequencing data set for differential exon usage using DEXSeq. I have two main questions:

    First, I have a partially unreplicated data set. I have created a reproducible example below where I have 3 cell lines each in two conditions, but only two of the 3 cell lines were completed in biological duplicate. The example is on random data, but my real data set is biologically derived and much larger; 42 cell lines with 37 unreplicated.

    My first issue is that estimateDispersions produces different dispersion estimates when the unreplicated cell line is included or not included in the data set. I was under the impression that the the dispersion estimates should not be effected by unreplicated data, but this does not appear to be the case. I hope that I am missing something here; i.e. are my formulas correct? is there a reason that the dispersion estimates are different? are the unreplicated samples included in the baseMean calculation and that is changing the dispersion fitting?

    Here is my code to produce this outcome:

    Code:
    suppressMessages( library(DEXSeq) )
    
    set.seed(110011)
    
    sampNames <- c(
      'a.untreated.1','a.untreated.2',
      'b.untreated.1','b.untreated.2',
      'c.untreated.1', 'a.treated.1',
      'a.treated.2', 'b.treated.1',
      'b.treated.2', 'c.treated.1')
    
    ## generate random count data set
    nGenes <- 100
    nExons <- 10
    nSamps <- 10
    sampleData <- data.frame(
      row.names=sampNames,
      condition=rep(c("untreated", "treated"), each=5),
      cellLine=rep(c('a','a','b', 'b', 'c'), 2))
    design <- formula( ~ sample + exon + cellLine + cellLine:exon + condition:exon )
    subDesign <- formula( ~ sample + exon + cellLine + cellLine:exon )
    groupID <- unlist(lapply(1:nGenes, function(i) rep(paste('g', i), nExons)))
    featureID <- rep(sapply(1:nExons, function(i) paste('e', i)), nGenes)
    countData <- matrix(rpois(nGenes * nExons * nSamps, 100),
                        nrow=nGenes * nExons,
                        dimnames=list(groupID, sampNames))
    
    ## run DEXSeq with all samples
    dxd <- DEXSeqDataSet(countData, sampleData, design, featureID, groupID)
    dxd <- estimateSizeFactors(dxd)
    ## this throws a warning which I am assuming is b/c
    ## the data is not realistic for the model
    dxd <- estimateDispersions(dxd)
    dxd <- testForDEU(dxd, reducedModel = subDesign)
    dxr1 <- DEXSeqResults(dxd)
    
    ## now use only replicated samples
    sampleData2 <- sampleData[sampleData[,'cellLine'] != 'c',]
    countData2 <- countData[,sampleData[,'cellLine'] != 'c']
    dxd2 <- DEXSeqDataSet(countData2, sampleData2, design, featureID, groupID)
    dxd2 <- estimateSizeFactors(dxd2)
    dxd2 <- estimateDispersions(dxd2)
    dxd2 <- testForDEU(dxd2, reducedModel =  subDesign)
    dxr2 <- DEXSeqResults(dxd2)
    
    ## why is this not true if I have only removed an unreplicated sample
    all.equal(dxr1$dispersion, dxr2$dispersion)

    Second, once I get the first part to work how can I subset the DEXSeqDataSet in order to testForDEU across conditions, but only within a single cell line? This used to be possible in the old version of DEXSeq (https://stat.ethz.ch/pipermail/bioco...ay/052479.html), but I do not see it in the manual or in any forums for the current version of DEXSeq.

    Code:
    ## then can I subset to testForDEU in only one cellLine?
    ## like this, but this throws an error
    R> dxdSubA <- dxd
    R> colData(dxdSubA) <- colData(dxdSubA)[colData(dxdSubA)$cellLine == 'a',]
    Error in `colData<-`(`*tmp*`, value = <S4 object of class "DataFrame">) :
       'colData' nrow differs from 'assays' ncol
    Here is my sessionInfo:
    Code:
    R version 3.1.1 (2014-07-10)
    Platform: x86_64-apple-darwin13.3.0 (64-bit)
    
    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    
    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods  
    [8] base     
    
    other attached packages:
     [1] DEXSeq_1.10.8           BiocParallel_0.6.1      DESeq2_1.4.5           
     [4] RcppArmadillo_0.4.400.0 Rcpp_0.11.2             GenomicRanges_1.16.4   
     [7] GenomeInfoDb_1.0.2      IRanges_1.22.10         Biobase_2.24.0         
    [10] BiocGenerics_0.10.0    
    
    loaded via a namespace (and not attached):
     [1] annotate_1.42.1      AnnotationDbi_1.26.0 BatchJobs_1.3       
     [4] BBmisc_1.7           biomaRt_2.20.0       Biostrings_2.32.1   
     [7] bitops_1.0-6         brew_1.0-6           checkmate_1.4       
    [10] codetools_0.2-9      DBI_0.3.0            digest_0.6.4        
    [13] fail_1.2             foreach_1.4.2        genefilter_1.46.1   
    [16] geneplotter_1.42.0   grid_3.1.1           hwriter_1.3.2       
    [19] iterators_1.0.7      lattice_0.20-29      locfit_1.5-9.1      
    [22] RColorBrewer_1.0-5   RCurl_1.95-4.3       Rsamtools_1.16.1    
    [25] RSQLite_0.11.4       sendmailR_1.1-2      splines_3.1.1       
    [28] statmod_1.4.20       stats4_3.1.1         stringr_0.6.2       
    [31] survival_2.37-7      tools_3.1.1          XML_3.98-1.1        
    [34] xtable_1.7-3         XVector_0.4.0        zlibbioc_1.10.0
    Answers to either part would be great! (I hope it made sense to combine these posts as I think they are related) Thank you to anyone for the help in advance!

  • #2
    So I was able to get a work-able solution to the second half of my problem, but I am still not sure if it is producing accurate results. Here is the function I have produced to testForDEU on each cellLine independently using globally fit dispersions. The dxd object is taken from the code in my previous post.

    Code:
    fitOneCellLine <- function(cellLine){
      sampleData1 <- sampleData[sampleData$cellLine == cellLine,]
      countData1 <- countData[,sampleData$cellLine == cellLine]
      dxd1 <- DEXSeqDataSet(
        countData1, sampleData1, formula( ~ sample + exon + condition:exon),
        featureID, groupID)
      dxd1 <- estimateSizeFactors(dxd1)
    
      ## add dispersions from rep samples
      dxdDims <- dim(mcols(rowData(dxd)))[2]
      mcols(rowData(dxd1))[5:dxdDims] <- mcols(rowData(dxd))[5:dxdDims]
      mcols(mcols(dxd1)) <- mcols(mcols(dxd))
      dxd1@dispersionFunction <- dxd@dispersionFunction
      
      dxd1 <- testForDEU(dxd1, reducedModel =  ~ sample + exon)
      return(DEXSeqResults(dxd1))
    }
    
    aRes <- fitOneCellLine('a')
    bRes <- fitOneCellLine('b')
    cRes <- fitOneCellLine('c')
    I note that the three results objects have different p-values, but I am still not sure if I have transferred all of the appropriate information into the cellLine specific object before testing for DEU. I am mostly concerned that this workflow is not producing accurate results. I am not sure what to test against either as this is the only way I could get cellLine specific results.

    Also I was not able to produce the fold change values here, but in my data set generated using the DEXSeqDataSetFromHTSeq function I am able to estimate fold changes. I am not sure what else is different between this randomly generated data set and my biological data set aside from the generator function, but in any case that is not a huge concern of mine.

    Comment


    • #3
      Dear Marcus,

      Of course the optimal is that you would have replicates for the third cell line. But what you could also try is to estimate the dispersions using only the cell lines for which you have replicates, and pass this dispersion estimates to an object with all the data. This would assume that the variability between each of the celllines is similar... although I don't know how true this is.

      With regards to the implementation, testForDEU requires the columns "allZero", "dispGeneEst", "dispFit", "dispersion","dispMAP", "dispOutlier". You could pass these columns to the object used for the testing independently of how you estimate the dispersions before this step and this should work.

      Comment


      • #4
        Hi areyes,

        Thank you for your response.

        In terms of the assumption that variability between each of the cell lines is similar, isn't this just the same assumption that is made when variability is estimated across all samples even when they are all replicated?

        To spell it out a little more, if we are testing for the effect of a condition across cell lines What is the difference between these two situations:

        1) Estimating variability across a set of replicated cell line - condition combinations and then testing for differential expression between conditions within a single replicated cell line.
        2) Estimating variability across a set of replicated cell line - condition combinations and then testing for differential expression between conditions within a separate UN-replicated cell line.

        I may be wrong, but I think the only difference is the additional power from an additional experiment in the replicated experiment (assuming common read coverage). I guess the question is: Is there reason to share dispersion estimates between different conditions when comparing differential expression across another condition.

        In terms of the implementation details this is incredibly helpful! Thank you so much for confirming that this works correctly assuming the above assumptions are valid.

        Comment


        • #5
          Hi Marcus,

          You are right, so far one dispersion estimate is estimated for all the levels of the factorial designs.

          I think that if you include the un-replicated cell-line when estimating the dispersions, the final dispersion estimates might be underestimated so it might be better to use only the replicated samples for this step.

          Yes, including more samples should give you additional power. But the think that I would keep in mind is that it would be not possible to estimate the variability in the un-replicated cell-line, so its hard to know if the dispersion estimation based on the replicated cell-lines models well the variability of the un-replicated cell-line. I would just try and see how much additional power to get by including the un-replicated cell-line.


          On other topics, I just read the formulas of your code again:

          design <- formula( ~ sample + exon + cellLine + cellLine:exon + condition:exon )
          subDesign <- formula( ~ sample + exon + cellLine + cellLine:exon )

          In this formulas, you don't have to include the cellLine effect alone, you can just do:

          design <- formula( ~ sample + exon + cellLine:exon + condition:exon )
          subDesign <- formula( ~ sample + exon + cellLine:exon )

          to consider the variability in exon usage due to the different cell-lines.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Essential Discoveries and Tools in Epitranscriptomics
            by seqadmin


            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
            Yesterday, 07:01 AM
          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          39 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          41 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          35 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          55 views
          0 likes
          Last Post seqadmin  
          Working...
          X