I have a question about defining and comparing models in DESeq2. My data are counts (used as an index of abundance) for each taxon identified from fungal ITS high-throughput amplicon sequencing of soils sampled seasonally under three different stand types.
Experimental design:
• randomized complete block - blocks=6; treatment=3 (standType = spruce, beech, oak) = 18 experimental units
• sampling times = 5 seasons (fall_1, winter, spring, summer, fall_2)
• total sample size = 90 samples
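To make the layout concrete, here is a minimal sketch of the sample table this design implies. The column names `block` and `site` are placeholders I made up for illustration (only `standType` and `Season` appear in my actual model formulas), and the block labels are arbitrary.

```r
# Sketch of the sample table: 6 blocks x 3 stand types x 5 seasons.
# `block` and `site` are hypothetical column names used only for illustration.
seasons <- c("fall_1", "winter", "spring", "summer", "fall_2")
stands  <- c("spruce", "beech", "oak")

coldata <- expand.grid(
  Season    = seasons,
  standType = stands,
  block     = paste0("block", 1:6),
  stringsAsFactors = FALSE
)

# Fix factor level order explicitly so the seasons stay in sampling order
coldata$Season    <- factor(coldata$Season, levels = seasons)
coldata$standType <- factor(coldata$standType, levels = stands)
coldata$block     <- factor(coldata$block)

# One site = one stand type within one block (the experimental unit)
coldata$site <- factor(paste(coldata$block, coldata$standType, sep = "_"))

nrow(coldata)                              # 90 samples
nlevels(coldata$site)                      # 18 experimental units
table(coldata$standType, coldata$Season)   # 6 samples per stand type x season cell
```

Each of the 18 sites appears five times in this table, once per season.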
I am interested in the response of species over time in my treatments, with the goal of identifying species whose counts change over time differently among stand types (i.e. significant stand x time interactions). I have been following the “RNA-Seq workflow: gene-level exploratory analysis and differential expression” vignette (http://www.bioconductor.org/help/workflows/rnaseqGene/), specifically the section on “Time series experiments”.
I used the two models below to test whether the standType:Season interaction was an important factor.
m1 (full) ~ standType + standType:Season
m2 (reduced) ~ standType
This is a similar setup to the vignette, and my understanding is that, since the only difference between the two models is the interaction term, species with low p-values in the likelihood ratio test (LRT) results table are the species that show stand-specific effects over time. Correct?
I can pull out the species that show different patterns over time among the three treatments by querying all the contrasts involving standType and comparing the species with significant padj values between treatments. From here I can visualize the pattern for any particular species by plotting its counts using code from the vignette.
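The workflow described so far can be sketched roughly as follows. This is only an illustration under stated assumptions: the count matrix is simulated placeholder data (not my real counts), the `block` column and taxon names are made up, and the 0.05 cutoff is arbitrary. The DESeq2 calls follow the pattern used in the vignette's time series section.

```r
library(DESeq2)

set.seed(1)
seasons <- c("fall_1", "winter", "spring", "summer", "fall_2")
stands  <- c("spruce", "beech", "oak")

# Placeholder sample table: 6 blocks x 3 stand types x 5 seasons = 90 samples
coldata <- expand.grid(Season = seasons, standType = stands,
                       block = paste0("block", 1:6), stringsAsFactors = FALSE)
coldata$Season    <- factor(coldata$Season, levels = seasons)
coldata$standType <- factor(coldata$standType, levels = stands)
rownames(coldata) <- paste(coldata$block, coldata$standType,
                           coldata$Season, sep = "_")

# Simulated counts standing in for the real taxon table (no true effects)
n_taxa <- 200
mu     <- runif(n_taxa, 50, 500)
counts <- matrix(rnbinom(n_taxa * nrow(coldata), mu = mu, size = 2),
                 nrow = n_taxa,
                 dimnames = list(paste0("taxon", seq_len(n_taxa)),
                                 rownames(coldata)))

# m1 (full) as the design, m2 (reduced) passed to the LRT
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ standType + standType:Season)
# The LRT compares full vs reduced; the reduced design here drops every Season term
dds <- DESeq(dds, test = "LRT", reduced = ~ standType)

res <- results(dds)                       # LRT p-values, one row per taxon
sig <- res[which(res$padj < 0.05), ]      # arbitrary cutoff for illustration

# Coefficients available for follow-up Wald tests on single terms
coefs <- resultsNames(dds)
res_one <- results(dds, name = coefs[length(coefs)], test = "Wald")

# Counts for one taxon across seasons and stand types, ready for plotting
top <- rownames(res)[which.min(res$pvalue)]
d <- plotCounts(dds, gene = top, intgroup = c("Season", "standType"),
                returnData = TRUE)
head(d)
```

Note that this sketch treats all 90 samples as independent, which is exactly the point I am unsure about below.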
This seems good so far, but after reading more on setting up models, I realize I have not properly defined my models. Since I have repeatedly sampled the same 18 sites five times, I wonder whether:
• m1 (full) should be defined as a repeated measures design that keeps all 90 samples, which would more properly account for the non-independence of samples collected at the same site; and
• m2 (reduced) should treat samples from the same site as replicates, collapsing the 90 samples to 18, to avoid pseudoreplication and the associated artificial inflation of statistical power.
It seems that the model statements should be structured so that the data are treated as repeated measures of 18 experimental units, not as 90 independent samples. Am I correct in being concerned about this? If so, should I be looking into how to specify repeated measures model statements for m1 and m2?