Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Paired design versus unpaired design in DESeq2

    Dear all,
    I am assessing differential expression between tumour and normal gland samples from the same subjects using DESeq2 and have run into a problem using the paired design.
    The dataset consists of 19 matched tumour and gland samples from subjects and 2 gland samples from 1 healthy subject all run on several miseq runs.
    For the unpaired design, the conditions “Tumour” and “Gland” were used for the differential expression analysis which resulted in 19 replicates for the Tumour condition and 21 replicates for the Gland condition. Running DESeq2 on this dataset caused refitting of outliers detected using Cook’s distance in 7 genes. The R command output is listed below.

    Code:
    > Panel2Design
           condition3  condition
    S26_T         S26     Tumour
    S26_a         S26      Gland
    S01_T         S01     Tumour
    S28_a         S28      Gland
    S02_T         S02     Tumour
    S02_a         S02      Gland
    S09_T         S09     Tumour
    S09_a         S09      Gland
    S19_T         S19     Tumour
    S19_a         S19      Gland
    S20_T         S20     Tumour
    S20_a         S20      Gland
    S21_T         S21     Tumour
    S21_a         S21      Gland
    S23_T         S23     Tumour
    S23_a         S23      Gland
    S24_T         S24     Tumour
    S24_a         S24      Gland
    S32_T         S32     Tumour
    S32_a         S32      Gland
    S33_T         S33     Tumour
    S33_a         S33      Gland
    S34_T         S34     Tumour
    S34_a         S34      Gland
    S35_T         S35     Tumour
    S35_a         S35      Gland
    S36_T         S36     Tumour
    S36_a         S36      Gland
    S37_T         S37     Tumour
    S37_a         S37      Gland
    S38_T         S38     Tumour
    S38_a         S38      Gland
    S39_T         S39     Tumour
    S39_a         S39      Gland
    S40_T         S40     Tumour
    S40_a         S40      Gland
    S41_T         S41     Tumour
    S41_a         S41      Gland
    S22_a         S22      Gland
    S22_a         S22      Gland
    
    > Panel2CDS=DESeqDataSetFromMatrix(countData=Panel2,colData=Panel2Design,design=~condition)
    > Panel2CDS_1=DESeq(Panel2CDS)
    estimating size factors
    estimating dispersions
    gene-wise dispersion estimates
    mean-dispersion relationship
    final dispersion estimates
    fitting model and testing
    -- replacing outliers and refitting for 7 genes
    -- DESeq argument ‘minReplicatesForReplace’ = 7
    -- original counts are preserved in counts(dds))
    estimating dispersions
    fitting model and testing
    For the paired design, an additional condition column (“condition3”) was created in the colData file to indicate which samples were from the same subject. DESeq2 was then run using the second condition in addition to the first condition, as per the R command output listed below. At this point, no outliers were removed/refitted and the GLM fit did not converge.

    Code:
    > Panel2pCDS=DESeqDataSetFromMatrix(countData=Panel2,colData=Panel2Design,design=~condition3+condition)
    > Panel2pCDS2_1=DESeq(Panel2pCDS)
    estimating size factors
    estimating dispersions
    gene-wise dispersion estimates
    mean-dispersion relationship
    final dispersion estimates
    fitting model and testing
    Warning messages:
    1: glm.fit: algorithm did not converge
    2: glm.fit: algorithm did not converge
    3: glm.fit: algorithm did not converge
    I assume the Cook’s distance outlier detection is not working because under the second condition for each subject there is only 1 replicate for the tumour and 1 replicate for the gland and Cook’s only kicks in from at least 3 replicates. But this now causes the GLM not to fit anymore, therefore should I be worried about the dispersion estimates? Or should I just opt for a different fit-type (local or mean)? How do I know which one is better?
    I thought the paired design would improve the power for the differential expression testing. How do I know that it did? I am detecting slightly more differentially expressed genes, but could this be due to the fact that some outliers were not removed via Cook’s Cutoff and a worse dispersion estimation?

    FYI Settings used:
    Code:
    > sessionInfo()
    R version 3.1.0 (2014-04-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    
    locale:
    [1] LC_COLLATE=English_New Zealand.1252  LC_CTYPE=English_New Zealand.1252   
    [3] LC_MONETARY=English_New Zealand.1252 LC_NUMERIC=C                        
    [5] LC_TIME=English_New Zealand.1252    
    
    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
     [1] DESeq2_1.4.5            RcppArmadillo_0.4.300.0 Rcpp_0.11.1             GenomicRanges_1.16.3   
     [5] GenomeInfoDb_1.0.2      IRanges_1.22.6          DESeq_1.16.0            lattice_0.20-29        
     [9] locfit_1.5-9.1          Biobase_2.24.0          BiocGenerics_0.10.0    
    
    loaded via a namespace (and not attached):
     [1] annotate_1.42.0      AnnotationDbi_1.26.0 DBI_0.2-7            genefilter_1.46.1   
     [5] geneplotter_1.42.0   grid_3.1.0           RColorBrewer_1.0-5   RSQLite_0.11.4      
     [9] splines_3.1.0        stats4_3.1.0         survival_2.37-7      tools_3.1.0         
    [13] XML_3.98-1.1         xtable_1.7-3         XVector_0.4.0

  • #2
    hi Kristen,

    Yes, you're right about the reason for the outlier replacement not taking place, because with the design including the subject no longer has replicates for each combination of the variable levels.

    The glm.fit warning is not very informative here, and I should catch this warning internally to not confuse users.

    This warning is not from the main GLM of the model, but from the parametric fit of the curve through dispersion estimates (the red line in Figure 7 in the main vignette). You can use fitType="local". This is typically very similar to using fitType="parametric". Good to check

    plotDispEsts(dds)

    to make sure the curve is a good fit to the "cloud" of dispersion estimates.

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Strategies for Sequencing Challenging Samples
      by seqadmin


      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
      03-22-2024, 06:39 AM
    • seqadmin
      Techniques and Challenges in Conservation Genomics
      by seqadmin



      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

      Avian Conservation
      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
      03-08-2024, 10:41 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, Yesterday, 06:37 PM
    0 responses
    8 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, Yesterday, 06:07 PM
    0 responses
    8 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 03-22-2024, 10:03 AM
    0 responses
    49 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 03-21-2024, 07:32 AM
    0 responses
    67 views
    0 likes
    Last Post seqadmin  
    Working...
    X