  • DESeq2 workflow problem

    Hi everyone,

    I'm investigating differentially expressed genes in my wild-type yeast vs a mutant strain. Each strain has been cultured in three different conditions (glucose, sucrose, glycerol). I've got two replicates of Illumina 100bp single end reads per condition and I'm analysing these with DESeq2 (v1.6.3).

    Taking the glucose replicates as an example, I've performed two analyses. The first used a counts table containing all replicates from all conditions. For the glucose comparison I specified a pairwise contrast as follows:
    Code:
    myresults <- results(ddsanalysis, contrast=c("condition","mutant_glucose","WT_glucose"))
    Comparing this output to the second analysis in which the counts table only contained my glucose replicates, I see ~100 fewer DE genes returned from the first analysis containing all conditions and replicates. The subsets of genes returned are also different. For example, the first analysis suggests a gene (which I know to be upregulated in the mutant on glucose via qRT-PCR) is not among the significant DE genes; yet this gene is the most significant DE gene in my second analysis.
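
    For readers who want to reproduce the comparison, the two set-ups described above could be sketched roughly as follows. This is only an illustration: `counts`, `coldata`, `subset_samples` and `run_contrast` are hypothetical names, and the design `~ condition` assumes a single factor whose levels combine strain and medium (e.g. "mutant_glucose"), as in the contrast shown earlier.

```r
# Sketch only. Assumed inputs (not from the original post):
#   counts  : genes x samples integer matrix
#   coldata : data.frame with a `condition` column, rows in the same
#             order as the columns of `counts`

# Keep only the samples belonging to the requested condition levels.
subset_samples <- function(counts, coldata, keep_levels) {
  keep <- coldata$condition %in% keep_levels
  sub_coldata <- coldata[keep, , drop = FALSE]
  # Drop unused factor levels so the design has no empty groups.
  sub_coldata$condition <- droplevels(factor(sub_coldata$condition))
  list(counts = counts[, keep, drop = FALSE], coldata = sub_coldata)
}

# Fit DESeq2 on whatever samples are supplied, then extract one contrast.
run_contrast <- function(counts, coldata, numerator, denominator) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData = coldata,
                                        design = ~ condition)
  dds <- DESeq2::DESeq(dds)
  DESeq2::results(dds, contrast = c("condition", numerator, denominator))
}

# Analysis 1: every condition contributes to dispersion estimation.
# res_all <- run_contrast(counts, coldata, "mutant_glucose", "WT_glucose")

# Analysis 2: only the glucose samples contribute.
# glu <- subset_samples(counts, coldata, c("mutant_glucose", "WT_glucose"))
# res_glu <- run_contrast(glu$counts, glu$coldata,
#                         "mutant_glucose", "WT_glucose")
```

    The only difference between the two calls is which samples reach `DESeq()`, which makes it easier to see that any change in the gene list comes from the extra samples rather than from the contrast itself.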

    While I don't fully understand the modelling that DESeq2 conducts during the analysis, it seems to me that information is drawn from all replicates supplied via the counts table, regardless of whether they are specified as part of a pairwise contrast or not, and that this is influencing the determination of DE genes in the first analysis.

    What would the community recommend as the best course of action for this analysis? Should I create separate counts tables for the pairwise comparisons I wish to perform, or continue using the contrasts method as I have described?

    Thanks in advance,

  • #2
    "information is drawn from all replicates supplied via the counts table regardless of whether they are specified as part of a pairwise contrast"

    You are correct: all samples in the DESeqDataSet are used to estimate the dispersion. Adding or removing samples will change the dispersion estimates, which will change the p-values. So it is not surprising to see more or fewer genes passing an FDR threshold after adding or removing samples. p-values are tail probabilities, which are highly sensitive to model parameters. Usually it's best to include all the samples for dispersion estimation. An exception to this is if you have plenty of biological replicates and the within-group variance of the groups is very different. Look at a PCA plot (see vignette) and see if the different groups have similar or very different within-group spread. If they are very different and you have plenty of biological replicates, then you might want to run DESeq() on each group separately.
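
    As a rough illustration of the PCA check suggested above, something like the following might help put a number on the within-group spread. `group_spread` is a hypothetical helper, not part of DESeq2, and the mean-distance-to-centroid measure is just one assumed way to summarise spread; with only two replicates per group it is very crude.

```r
# Hypothetical helper: mean distance of each group's samples from the
# group centroid in the space of the first two principal components.
#   mat    : genes x samples matrix of transformed values (e.g. rlog output)
#   groups : factor or character vector, one entry per sample (column)
group_spread <- function(mat, groups) {
  pcs <- prcomp(t(mat))$x[, 1:2, drop = FALSE]
  sapply(split(as.data.frame(pcs), groups), function(g) {
    centre <- colMeans(g)
    mean(sqrt(rowSums(sweep(as.matrix(g), 2, centre)^2)))
  })
}

# Assumed usage with an existing DESeqDataSet `dds` (not run here):
# library(DESeq2)
# rld <- rlog(dds)                       # regularised log transform
# plotPCA(rld, intgroup = "condition")   # visual check, as in the vignette
# group_spread(assay(rld), dds$condition)
```

    Note that `plotPCA()` uses only the most variable genes by default, whereas this sketch uses every row of the matrix, so the two views will not match exactly.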

    • #3
      Thanks for the reply Michael,

      After looking over a PCA plot containing all of my replicates, I can see that the glycerol replicates do not clearly cluster independently of the other conditions. It is possible that including these replicates in my initial analysis inflated the dispersion estimates and, in turn, affected the DE gene predictions. Removing the glycerol condition results in a set of predicted DE genes more similar to that obtained when analysing only the glucose condition. I will likely remove the glycerol replicates from my analysis and analyse them independently; however, there may be too much variability to resolve with only two biological replicates.
