  • asiangg
    Member
    • Dec 2008
    • 44

    Can DESeq and edgeR deal with unbalanced RNA-seq data?

    We did three biological replicates for our treatment and control using RNA-seq to find out which transcripts have differential expression. To make sure we are obtaining genuine changes, we did another batch of experiments several months later. Now, we have:

    Batch1: 3 treatments vs. 3 controls
    Batch2: 3 treatments vs. 3 controls

    The two batches were done under the same conditions (hopefully). However, there is a significant difference in total read count. The first batch contains ~10 million reads for each replicate but the second batch contains ~30 million reads for each. This is because Illumina has improved its chemistry and software.

    I applied several tools (including DESeq, edgeR and limma) to identify differential genes from the two batches of data. The 1st batch yields ~500 genes and the 2nd batch yields ~200 genes. To our disappointment, the two lists overlap very little.

    We suspect one set of treatments or controls was screwed up, so we decided to switch the treatment and control of the two batches to identify the bad ones.

    To our surprise, the two batches yield 10-fold more genes after switching! That means each batch now contains ~5000 differential genes and they overlap by 70%!! This cannot be biologically true and I suspect it is related to the unbalanced inputs of treatment vs. control.

    To my knowledge, both DESeq and edgeR try to normalize the library sizes internally before performing statistical tests. However, the question is how well is that done? Any input or suggestions?
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    I'm not quite sure what you mean by switching. Are you now comparing treatment from batch 1 with control from batch 2?

    But to answer your question: Both DESeq and edgeR adjust for library size. While edgeR uses the library sizes that you tell it, DESeq tries to estimate them from the data.

    To see whether this worked well, I'd suggest that you choose pairs of samples and divide all the counts from one sample by the size factor for this sample (for DESeq; for edgeR, take the total read count) and do likewise for the other. Then plot one against the other in a log-log scatter plot and mark the diagonal (with abline(a=0,b=1) ). Check that the points scatter symmetrically around the diagonal. Do this for a couple of sample pairs.
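The check described above can also be done numerically rather than by eye. What follows is only a minimal sketch in plain Python, not DESeq or edgeR code: the counts are made-up toy values, and `log_ratios` is a hypothetical helper. It divides each sample's counts by its size factor and looks at the log2 ratios between the two samples; points scattering symmetrically around the diagonal correspond to a median log ratio near zero.

```python
import math
from statistics import median


def log_ratios(counts_a, counts_b, size_a, size_b):
    """log2 ratio of size-normalized counts, for genes seen in both samples."""
    ratios = []
    for a, b in zip(counts_a, counts_b):
        if a > 0 and b > 0:  # zeros cannot be shown on a log-log plot
            ratios.append(math.log2((a / size_a) / (b / size_b)))
    return ratios


# Hypothetical toy data: sample_b was sequenced three times deeper than sample_a.
sample_a = [10, 200, 35, 0, 1200, 48]
sample_b = [30, 600, 105, 3, 3600, 144]

# With suitable size factors the ratios centre on zero (the diagonal).
ratios = log_ratios(sample_a, sample_b, size_a=1.0, size_b=3.0)
center = median(ratios)

# Without any adjustment the whole cloud sits off the diagonal.
unadjusted_center = median(log_ratios(sample_a, sample_b, 1.0, 1.0))
```

A median far from zero for some sample pair would suggest the size factors are off for one of the two samples; a median near zero with a very wide spread points to something other than library size.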

    In my experience, however, the library size normalisation works well and is unlikely to be the culprit.

    A good idea might be to check sample distances: With DESeq, make a CountDataSet containing all 12 of your samples. Then perform a variance stabilizing transformation, get a distance matrix for the variance-stabilized matrix and plot it as a heatmap. I have described this procedure in the DESeq vignette. If all is well, the replicates should cluster together. If a sample does not cluster with its replicates, you might want to exclude it from the analysis.
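To illustrate the idea behind the sample-distance check, here is a rough sketch in plain Python. It is not the procedure from the DESeq vignette: as a loudly labelled assumption it uses log2(count + 1) as a crude stand-in for the variance stabilizing transformation, and the sample names and counts are invented toy values.

```python
import math


def log_transform(counts, size_factor):
    """Crude stand-in for a variance stabilizing transformation (assumption)."""
    return [math.log2(c / size_factor + 1.0) for c in counts]


def euclidean(u, v):
    """Euclidean distance between two transformed expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def distance_matrix(samples):
    """Nested dict of pairwise distances between all samples."""
    names = list(samples)
    return {a: {b: euclidean(samples[a], samples[b]) for b in names} for a in names}


def nearest_neighbour(dist, name):
    """Name of the sample closest to `name`, excluding itself."""
    others = {k: d for k, d in dist[name].items() if k != name}
    return min(others, key=others.get)


# Hypothetical toy counts: (counts per gene, size factor) for each sample.
raw = {
    "ctrl_1": ([100, 50, 10, 400], 1.0),
    "ctrl_2": ([110, 45, 12, 390], 1.0),
    "trt_1": ([100, 200, 80, 50], 1.0),
    "trt_2": ([95, 210, 75, 55], 1.0),
}

transformed = {name: log_transform(c, s) for name, (c, s) in raw.items()}
dist = distance_matrix(transformed)
```

If a sample's nearest neighbour belongs to the other condition or the other batch, that sample is a candidate for a closer look. The heatmap conveys the same information visually.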

    Lastly, have a look at the scvPlots in your four batch-condition combinations. What is the raw SCV value in the region of highest count density, i.e., at the peak of the black density curve? Is it maybe much larger in some cases than in others?

    Cheers
    Simon


    • markrobinsonca
      Junior Member
      • Mar 2010
      • 7

      #3
      A couple comments from the edgeR camp ...

      I agree with Simon that a pairs() plot of read counts is a useful initial diagnostic, especially if you think you might have sample switching (I didn't fully understand what was switched from your description). Also, M-vs-A plots (edgeR does 'smear' plots) would be quite useful.

      One clarification of what Simon said with respect to edgeR. While it's true that edgeR uses the library sizes "that you tell it", there is a function in there for calculating normalization factors from the data -- calcNormFactors() -- and a description in the manual of how to build that into your library sizes. I haven't compared directly, but it's roughly similar to the DESeq calculation for this. The normalization (which is beyond just accounting for library size) is described at:


      Another alternative to explore sample relations is the plotMDS.dge() function in edgeR. This is essentially a principal components plot, but specific to count data.

      Hope that helps.

      Cheers,
      Mark


      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        Hi Mark

        It seems I haven't looked into the edgeR vignette for a while and missed that you have added a size estimation by now.

        You are right, your scheme and ours are very much the same. We all had the same idea of looking at the quotient between individual gene counts and taking some robust location estimator of their distribution. The only difference is that you used a trimmed mean and we went all the way to maximal trimming, i.e., used the median. This definitely shouldn't make much of a difference.
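The shared idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not either package's actual algorithm: it compares one sample against a single reference sample, without the weighting and intensity trimming edgeR applies and without the geometric-mean reference DESeq uses. The counts are hypothetical toy values in which the sample is three times deeper than the reference and two genes are strongly up-regulated.

```python
import math
from statistics import median


def log_ratios(sample, reference):
    """Sorted per-gene log2 count ratios, skipping genes with a zero count."""
    return sorted(
        math.log2(s / r) for s, r in zip(sample, reference) if s > 0 and r > 0
    )


def trimmed_mean(sorted_values, trim):
    """Mean after dropping a fraction `trim` of the values from each tail."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    k = int(len(sorted_values) * trim)
    kept = sorted_values[k:len(sorted_values) - k]
    return sum(kept) / len(kept)


# Hypothetical toy counts: first and last gene are 30-fold up, the rest 3-fold.
reference = [10, 20, 50, 100, 200, 400, 800, 1000, 30, 60]
sample = [300, 60, 150, 300, 600, 1200, 2400, 3000, 90, 1800]

ratios = log_ratios(sample, reference)

factor_median = 2 ** median(ratios)               # maximal trimming
factor_trimmed = 2 ** trimmed_mean(ratios, 0.25)  # partial trimming
factor_plain = 2 ** trimmed_mean(ratios, 0.0)     # no trimming at all
```

On this toy data both robust estimators recover the threefold depth difference, while the untrimmed mean is pulled upward by the two differentially expressed genes. That is the reason for using a robust location estimator in the first place.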

        Simon
