Dear all,
I am assessing differential expression between tumour and normal gland samples from the same subjects using DESeq2 and have run into a problem using the paired design.
The dataset consists of 19 matched tumour and gland samples from subjects and 2 gland samples from 1 healthy subject all run on several miseq runs.
For the unpaired design, the conditions “Tumour” and “Gland” were used for the differential expression analysis which resulted in 19 replicates for the Tumour condition and 21 replicates for the Gland condition. Running DESeq2 on this dataset caused refitting of outliers detected using Cook’s distance in 7 genes. The R command output is listed below.
For the paired design, an additional condition column (“condition3”) was created in the colData file to indicate which samples were from the same subject. DESeq2 was then run using the second condition in addition to the first condition, as per the R command output listed below. At this point, no outliers were removed/refitted and the GLM fit did not converge.
I assume the Cook’s distance outlier detection is not working because under the second condition for each subject there is only 1 replicate for the tumour and 1 replicate for the gland and Cook’s only kicks in from at least 3 replicates. But this now causes the GLM not to fit anymore, therefore should I be worried about the dispersion estimates? Or should I just opt for a different fit-type (local or mean)? How do I know which one is better?
I thought the paired design would improve the power for the differential expression testing. How do I know that it did? I am detecting slightly more differentially expressed genes, but could this be due to the fact that some outliers were not removed via Cook’s Cutoff and a worse dispersion estimation?
FYI Settings used:
I am assessing differential expression between tumour and normal gland samples from the same subjects using DESeq2 and have run into a problem using the paired design.
The dataset consists of 19 matched tumour and gland samples from subjects and 2 gland samples from 1 healthy subject all run on several miseq runs.
For the unpaired design, the conditions “Tumour” and “Gland” were used for the differential expression analysis which resulted in 19 replicates for the Tumour condition and 21 replicates for the Gland condition. Running DESeq2 on this dataset caused refitting of outliers detected using Cook’s distance in 7 genes. The R command output is listed below.
Code:
> Panel2Design condition3 condition S26_T S26 Tumour S26_a S26 Gland S01_T S01 Tumour S28_a S28 Gland S02_T S02 Tumour S02_a S02 Gland S09_T S09 Tumour S09_a S09 Gland S19_T S19 Tumour S19_a S19 Gland S20_T S20 Tumour S20_a S20 Gland S21_T S21 Tumour S21_a S21 Gland S23_T S23 Tumour S23_a S23 Gland S24_T S24 Tumour S24_a S24 Gland S32_T S32 Tumour S32_a S32 Gland S33_T S33 Tumour S33_a S33 Gland S34_T S34 Tumour S34_a S34 Gland S35_T S35 Tumour S35_a S35 Gland S36_T S36 Tumour S36_a S36 Gland S37_T S37 Tumour S37_a S37 Gland S38_T S38 Tumour S38_a S38 Gland S39_T S39 Tumour S39_a S39 Gland S40_T S40 Tumour S40_a S40 Gland S41_T S41 Tumour S41_a S41 Gland S22_a S22 Gland S22_a S22 Gland > Panel2CDS=DESeqDataSetFromMatrix(countData=Panel2,colData=Panel2Design,design=~condition) > Panel2CDS_1=DESeq(Panel2CDS) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing -- replacing outliers and refitting for 7 genes -- DESeq argument ‘minReplicatesForReplace’ = 7 -- original counts are preserved in counts(dds)) estimating dispersions fitting model and testing
Code:
> Panel2pCDS=DESeqDataSetFromMatrix(countData=Panel2,colData=Panel2Design,design=~condition3+condition) > Panel2pCDS2_1=DESeq(Panel2pCDS) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing Warning messages: 1: glm.fit: algorithm did not converge 2: glm.fit: algorithm did not converge 3: glm.fit: algorithm did not converge
I thought the paired design would improve the power for the differential expression testing. How do I know that it did? I am detecting slightly more differentially expressed genes, but could this be due to the fact that some outliers were not removed via Cook’s Cutoff and a worse dispersion estimation?
FYI Settings used:
Code:
> sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_New Zealand.1252 LC_CTYPE=English_New Zealand.1252 [3] LC_MONETARY=English_New Zealand.1252 LC_NUMERIC=C [5] LC_TIME=English_New Zealand.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq2_1.4.5 RcppArmadillo_0.4.300.0 Rcpp_0.11.1 GenomicRanges_1.16.3 [5] GenomeInfoDb_1.0.2 IRanges_1.22.6 DESeq_1.16.0 lattice_0.20-29 [9] locfit_1.5-9.1 Biobase_2.24.0 BiocGenerics_0.10.0 loaded via a namespace (and not attached): [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7 genefilter_1.46.1 [5] geneplotter_1.42.0 grid_3.1.0 RColorBrewer_1.0-5 RSQLite_0.11.4 [9] splines_3.1.0 stats4_3.1.0 survival_2.37-7 tools_3.1.0 [13] XML_3.98-1.1 xtable_1.7-3 XVector_0.4.0
Comment