I am trying to analyze some metagenomic data with DESeq2 and have some questions about the experimental design and whether my analysis is meaningful.
For ten swine herds we gathered 30 feces samples, pooled them, extracted metagenomic DNA and sequenced it to approximately 70 mill. reads / dataset (2 x 100 bp, PE) on a HiSeq2500. We replicated the sampling, so we have two pools of 30 from each herd.
We then used a BWA-based pipeline to map the reads to various databases including one with 2130 resistance genes.
I've crated a resistance abundance matrix and tried to analyze the 20 samples with DeSeq2. My independent variables are the known antimicrobial consumption (separated by drug type) for the ten herds.
To analyze whether gene abundances are explained by tetracycline use, I would use a design formula like this:
formula(~ sample_site + tetracycline)
However, because the consumption covariates (continuous), are identical for replicates at each sample_site (herd), I am given an error message:
"Error in DESeqDataSet(se, design = design, ignoreRank) :
the model matrix is not full rank, so the model cannot be fit as specified.
one or more variables or interaction terms in the design formula
are linear combinations of the others and must be removed"
Is my design inherently flawed or am I just using DESeq2 in a way that wasn't anticipated? If my IV was factorial, I can understand DESeq2 would complain it was unable to separate the effects from the two, but since "tetracycline" is continuous, shouldn't this work? Is this different from having two replicates from each time point in an RNASeq experiment?
Adding a very small random number to each tetracycline number I can get it to work. I have a number of differentially abundant genes according to the Wald test... all of them increase in abundance as consumption increases.
A simplified version of my design matrix looks like this:
1) Does DESeq2 use the "sample_site" to estimate the dispersion within herds, as I'm hoping it does?
2) Are there any DESeq2 parameters I should tweak for my specific application?
3) If meaningful, can I tell DESeq2 to accept my design formula?
Best regards,
Patrick Munk
For ten swine herds we gathered 30 feces samples, pooled them, extracted metagenomic DNA and sequenced it to approximately 70 mill. reads / dataset (2 x 100 bp, PE) on a HiSeq2500. We replicated the sampling, so we have two pools of 30 from each herd.
We then used a BWA-based pipeline to map the reads to various databases including one with 2130 resistance genes.
I've crated a resistance abundance matrix and tried to analyze the 20 samples with DeSeq2. My independent variables are the known antimicrobial consumption (separated by drug type) for the ten herds.
To analyze whether gene abundances are explained by tetracycline use, I would use a design formula like this:
formula(~ sample_site + tetracycline)
However, because the consumption covariates (continuous), are identical for replicates at each sample_site (herd), I am given an error message:
"Error in DESeqDataSet(se, design = design, ignoreRank) :
the model matrix is not full rank, so the model cannot be fit as specified.
one or more variables or interaction terms in the design formula
are linear combinations of the others and must be removed"
Is my design inherently flawed or am I just using DESeq2 in a way that wasn't anticipated? If my IV was factorial, I can understand DESeq2 would complain it was unable to separate the effects from the two, but since "tetracycline" is continuous, shouldn't this work? Is this different from having two replicates from each time point in an RNASeq experiment?
Adding a very small random number to each tetracycline number I can get it to work. I have a number of differentially abundant genes according to the Wald test... all of them increase in abundance as consumption increases.
A simplified version of my design matrix looks like this:
sample_name sample_site aminoglycoside tetracycline lib_size
F10_IS_P_FIB 2014_10 0,029982363 1,280110685 60991062
F10_ST_FIB 2014_10 0,029982363 1,280110685 60231460
F1_IS_P_FIB 2014_1 0,15457294 0,171208676 62054282
F1_ST_FIB 2014_1 0,15457294 0,171208676 70823944
F2_IS_P_FIB 2014_2 0,588795386 0,892359485 59969860
F2_ST_FIB 2014_2 0,588795386 0,892359485 62137964
F3_IS_P_FIB 2014_3 0,902489538 0,101130753 59479678
F3_ST_FIB 2014_3 0,902489538 0,101130753 73284776
F4_IS_P_FIB 2014_4 0,633925771 2,051830423 60183764
F4_ST_FIB 2014_4 0,633925771 2,051830423 76241940
F5_IS_P_FIB 2014_5 0,076205716 1,02533471 60625410
F5_ST_FIB 2014_5 0,076205716 1,02533471 67639338
F6_IS_P_FIB 2014_6 0,573926613 2,683780065 61728500
F6_ST_FIB 2014_6 0,573926613 2,683780065 60676694
F7_IS_P_FIB 2014_7 0,077160494 0,346395483 59633136
F7_ST_FIB 2014_7 0,077160494 0,346395483 55578210
F8_IS_P_FIB 2014_8 0,142856031 0,242034969 60365050
F8_ST_FIB 2014_8 0,142856031 0,242034969 77182448
F9_IS_P_FIB 2014_9 0,290693702 2,488566048 65000142
F9_ST_FIB 2014_9 0,290693702 2,488566048 79458424
F10_IS_P_FIB 2014_10 0,029982363 1,280110685 60991062
F10_ST_FIB 2014_10 0,029982363 1,280110685 60231460
F1_IS_P_FIB 2014_1 0,15457294 0,171208676 62054282
F1_ST_FIB 2014_1 0,15457294 0,171208676 70823944
F2_IS_P_FIB 2014_2 0,588795386 0,892359485 59969860
F2_ST_FIB 2014_2 0,588795386 0,892359485 62137964
F3_IS_P_FIB 2014_3 0,902489538 0,101130753 59479678
F3_ST_FIB 2014_3 0,902489538 0,101130753 73284776
F4_IS_P_FIB 2014_4 0,633925771 2,051830423 60183764
F4_ST_FIB 2014_4 0,633925771 2,051830423 76241940
F5_IS_P_FIB 2014_5 0,076205716 1,02533471 60625410
F5_ST_FIB 2014_5 0,076205716 1,02533471 67639338
F6_IS_P_FIB 2014_6 0,573926613 2,683780065 61728500
F6_ST_FIB 2014_6 0,573926613 2,683780065 60676694
F7_IS_P_FIB 2014_7 0,077160494 0,346395483 59633136
F7_ST_FIB 2014_7 0,077160494 0,346395483 55578210
F8_IS_P_FIB 2014_8 0,142856031 0,242034969 60365050
F8_ST_FIB 2014_8 0,142856031 0,242034969 77182448
F9_IS_P_FIB 2014_9 0,290693702 2,488566048 65000142
F9_ST_FIB 2014_9 0,290693702 2,488566048 79458424
2) Are there any DESeq2 parameters I should tweak for my specific application?
3) If meaningful, can I tell DESeq2 to accept my design formula?
Best regards,
Patrick Munk
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C LC_TIME=Danish_Denmark.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.9 ez_4.2-2 DESeq2_1.6.1 corrplot_0.73
[5] RcppArmadillo_0.4.450.1.0 Rcpp_0.11.3 GenomicRanges_1.18.1 GenomeInfoDb_1.2.0
[9] IRanges_2.0.0 S4Vectors_0.4.0 BiocGenerics_0.12.0 phyloseq_1.9.15
[13] pheatmap_1.0.2 scales_0.2.4 plyr_1.8.1 data.table_1.9.4
[17] vegan_2.0-10 lattice_0.20-29 permute_0.8-3 gplots_2.14.2
[21] RColorBrewer_1.0-5 ggplot2_1.0.0 reshape2_1.4
loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 ade4_1.6-2 annotate_1.44.0 AnnotationDbi_1.28.0 ape_3.1-4
[6] base64enc_0.1-2 BatchJobs_1.4 BBmisc_1.7 Biobase_2.26.0 BiocParallel_1.0.0
[11] biom_0.3.12 Biostrings_2.34.0 bitops_1.0-6 brew_1.0-6 car_2.0-25
[16] caTools_1.17.1 checkmate_1.5.0 chron_2.3-45 cluster_1.15.3 codetools_0.2-9
[21] colorspace_1.2-4 compiler_3.1.1 DBI_0.3.1 digest_0.6.4 evaluate_0.5.5
[26] fail_1.2 foreach_1.4.2 foreign_0.8-61 formatR_1.1 Formula_1.1-2
[31] gdata_2.13.3 genefilter_1.48.1 geneplotter_1.44.0 grid_3.1.1 gtable_0.1.2
[36] gtools_3.4.1 Hmisc_3.14-5 igraph_0.7.1 iterators_1.0.7 KernSmooth_2.23-13
[41] labeling_0.3 latticeExtra_0.6-26 lme4_1.1-7 locfit_1.5-9.1 MASS_7.3-35
[46] Matrix_1.1-4 mgcv_1.8-3 minqa_1.2.4 multtest_2.22.0 munsell_0.4.2
[51] nlme_3.1-118 nloptr_1.0.4 nnet_7.3-8 pbkrtest_0.4-2 proto_0.3-10
[56] quantreg_5.11 RJSONIO_1.3-0 rpart_4.1-8 RSQLite_0.11.4 sendmailR_1.2-1
[61] SparseM_1.6 splines_3.1.1 stringr_0.6.2 survival_2.37-7 tools_3.1.1
[66] XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0 zlibbioc_1.12.0
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C LC_TIME=Danish_Denmark.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.9 ez_4.2-2 DESeq2_1.6.1 corrplot_0.73
[5] RcppArmadillo_0.4.450.1.0 Rcpp_0.11.3 GenomicRanges_1.18.1 GenomeInfoDb_1.2.0
[9] IRanges_2.0.0 S4Vectors_0.4.0 BiocGenerics_0.12.0 phyloseq_1.9.15
[13] pheatmap_1.0.2 scales_0.2.4 plyr_1.8.1 data.table_1.9.4
[17] vegan_2.0-10 lattice_0.20-29 permute_0.8-3 gplots_2.14.2
[21] RColorBrewer_1.0-5 ggplot2_1.0.0 reshape2_1.4
loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 ade4_1.6-2 annotate_1.44.0 AnnotationDbi_1.28.0 ape_3.1-4
[6] base64enc_0.1-2 BatchJobs_1.4 BBmisc_1.7 Biobase_2.26.0 BiocParallel_1.0.0
[11] biom_0.3.12 Biostrings_2.34.0 bitops_1.0-6 brew_1.0-6 car_2.0-25
[16] caTools_1.17.1 checkmate_1.5.0 chron_2.3-45 cluster_1.15.3 codetools_0.2-9
[21] colorspace_1.2-4 compiler_3.1.1 DBI_0.3.1 digest_0.6.4 evaluate_0.5.5
[26] fail_1.2 foreach_1.4.2 foreign_0.8-61 formatR_1.1 Formula_1.1-2
[31] gdata_2.13.3 genefilter_1.48.1 geneplotter_1.44.0 grid_3.1.1 gtable_0.1.2
[36] gtools_3.4.1 Hmisc_3.14-5 igraph_0.7.1 iterators_1.0.7 KernSmooth_2.23-13
[41] labeling_0.3 latticeExtra_0.6-26 lme4_1.1-7 locfit_1.5-9.1 MASS_7.3-35
[46] Matrix_1.1-4 mgcv_1.8-3 minqa_1.2.4 multtest_2.22.0 munsell_0.4.2
[51] nlme_3.1-118 nloptr_1.0.4 nnet_7.3-8 pbkrtest_0.4-2 proto_0.3-10
[56] quantreg_5.11 RJSONIO_1.3-0 rpart_4.1-8 RSQLite_0.11.4 sendmailR_1.2-1
[61] SparseM_1.6 splines_3.1.1 stringr_0.6.2 survival_2.37-7 tools_3.1.1
[66] XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0 zlibbioc_1.12.0
Comment