Paired design versus unpaired design in DESeq2

KristenC

Junior Member

Join Date: Nov 2013
Posts: 7

Paired design versus unpaired design in DESeq2

05-26-2014, 07:01 PM

Dear all,
I am assessing differential expression between tumour and normal gland samples from the same subjects using DESeq2 and have run into a problem using the paired design.
The dataset consists of 19 matched tumour and gland samples from subjects and 2 gland samples from 1 healthy subject all run on several miseq runs.
For the unpaired design, the conditions “Tumour” and “Gland” were used for the differential expression analysis which resulted in 19 replicates for the Tumour condition and 21 replicates for the Gland condition. Running DESeq2 on this dataset caused refitting of outliers detected using Cook’s distance in 7 genes. The R command output is listed below.

Code:

> Panel2Design
       condition3  condition
S26_T         S26     Tumour
S26_a         S26      Gland
S01_T         S01     Tumour
S28_a         S28      Gland
S02_T         S02     Tumour
S02_a         S02      Gland
S09_T         S09     Tumour
S09_a         S09      Gland
S19_T         S19     Tumour
S19_a         S19      Gland
S20_T         S20     Tumour
S20_a         S20      Gland
S21_T         S21     Tumour
S21_a         S21      Gland
S23_T         S23     Tumour
S23_a         S23      Gland
S24_T         S24     Tumour
S24_a         S24      Gland
S32_T         S32     Tumour
S32_a         S32      Gland
S33_T         S33     Tumour
S33_a         S33      Gland
S34_T         S34     Tumour
S34_a         S34      Gland
S35_T         S35     Tumour
S35_a         S35      Gland
S36_T         S36     Tumour
S36_a         S36      Gland
S37_T         S37     Tumour
S37_a         S37      Gland
S38_T         S38     Tumour
S38_a         S38      Gland
S39_T         S39     Tumour
S39_a         S39      Gland
S40_T         S40     Tumour
S40_a         S40      Gland
S41_T         S41     Tumour
S41_a         S41      Gland
S22_a         S22      Gland
S22_a         S22      Gland

> Panel2CDS=DESeqDataSetFromMatrix(countData=Panel2,colData=Panel2Design,design=~condition)
> Panel2CDS_1=DESeq(Panel2CDS)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 7 genes
-- DESeq argument ‘minReplicatesForReplace’ = 7
-- original counts are preserved in counts(dds))
estimating dispersions
fitting model and testing

For the paired design, an additional condition column (“condition3”) was created in the colData file to indicate which samples were from the same subject. DESeq2 was then run using the second condition in addition to the first condition, as per the R command output listed below. At this point, no outliers were removed/refitted and the GLM fit did not converge.

Code:

> Panel2pCDS=DESeqDataSetFromMatrix(countData=Panel2,colData=Panel2Design,design=~condition3+condition)
> Panel2pCDS2_1=DESeq(Panel2pCDS)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: algorithm did not converge
3: glm.fit: algorithm did not converge

I assume the Cook’s distance outlier detection is not working because under the second condition for each subject there is only 1 replicate for the tumour and 1 replicate for the gland and Cook’s only kicks in from at least 3 replicates. But this now causes the GLM not to fit anymore, therefore should I be worried about the dispersion estimates? Or should I just opt for a different fit-type (local or mean)? How do I know which one is better?
I thought the paired design would improve the power for the differential expression testing. How do I know that it did? I am detecting slightly more differentially expressed genes, but could this be due to the fact that some outliers were not removed via Cook’s Cutoff and a worse dispersion estimation?

FYI Settings used:

Code:

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_New Zealand.1252  LC_CTYPE=English_New Zealand.1252   
[3] LC_MONETARY=English_New Zealand.1252 LC_NUMERIC=C                        
[5] LC_TIME=English_New Zealand.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.4.5            RcppArmadillo_0.4.300.0 Rcpp_0.11.1             GenomicRanges_1.16.3   
 [5] GenomeInfoDb_1.0.2      IRanges_1.22.6          DESeq_1.16.0            lattice_0.20-29        
 [9] locfit_1.5-9.1          Biobase_2.24.0          BiocGenerics_0.10.0    

loaded via a namespace (and not attached):
 [1] annotate_1.42.0      AnnotationDbi_1.26.0 DBI_0.2-7            genefilter_1.46.1   
 [5] geneplotter_1.42.0   grid_3.1.0           RColorBrewer_1.0-5   RSQLite_0.11.4      
 [9] splines_3.1.0        stats4_3.1.0         survival_2.37-7      tools_3.1.0         
[13] XML_3.98-1.1         xtable_1.7-3         XVector_0.4.0

Tags: None

Michael Love

Senior Member

Join Date: Jul 2013

Posts: 333
- Share
- Tweet
#2

05-29-2014, 11:05 AM

hi Kristen,

Yes, you're right about the reason for the outlier replacement not taking place, because with the design including the subject no longer has replicates for each combination of the variable levels.

The glm.fit warning is not very informative here, and I should catch this warning internally to not confuse users.

This warning is not from the main GLM of the model, but from the parametric fit of the curve through dispersion estimates (the red line in Figure 7 in the main vignette). You can use fitType="local". This is typically very similar to using fitType="parametric". Good to check

plotDispEsts(dds)

to make sure the curve is a good fit to the "cloud" of dispersion estimates.
Comment

Previous template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 57 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 51 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 56 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Paired design versus unpaired design in DESeq2

Comment

Latest Articles

ad_right_rmr

News