Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • edgeR equalizeLibSizes(dge)$pseudo.counts changes wildly

    I previously analyzed a set of 24 samples and found that gene X had 5,591 counts for sample_Y1 and 50,113 counts for sample_Y2. When I looked at the pseudocounts for these values (using equalizeLibSizes(dge)$pseudo.counts), I had Y1=6900.57 and Y2=52776.9.

    Then I added 12 more samples for condition Z (totally unrelated to condition Y) and added a new column to my design matrix so it looked like this:

    Y Z
    sample1 1 0
    ...
    sample_Y1 1 0
    sample_Y2 1 0
    ...
    sample24 1 0
    sample_Z1 0 1
    ...
    sample_Z12 0 1

    But the problem is that now Y1=0.0 and Y2=23458.0 for gene X. I understand that this value would decrease because the library sizes for my first 24 samples was around 20 million reads, and for the last 12, it was around 2 million reads.

    What I don't understand is why would Y1 change so dramatically, while Y2 would only halve. Any help is very appreciated.

  • #2
    Have a look at dge$samples$lib.size and dge$samples$norm.factors in both cases. I wonder if something is going wonky with the size normalization (I've seen that happen sometimes when you have a 10x or more difference in library sizes).

    Comment


    • #3
      Thanks, Devon. I did, but I don't see anything glaringly wrong.

      The Y1 and Y2 samples have 21M and 33M reads respectively, with normalization factors 1.13 and .76 before adding the X samples, and .94 and .65 after adding the X samples (which have around 2M reads and a factor of around 1.2).

      Comment


      • #4
        nachocab, the results that you report sound impossible. It isn't possible for a real count of over 5000 to be reduced to a pseudo count of 0. Either you've made a mistake somewhere or there's a bug in edgeR. Is it possible that you have simply pulled out the wrong value for Gene X after adding the new data? If you think this is a bug in edgeR, then the way to proceed is to send a reproducible example to the authors. We certainly haven't seen anything like this.

        BTW, it is not usual for a user to call equalizeLibSizes() directly, and it is not usually correct to use it without setting the dispersion argument.

        Comment


        • #5
          Nacho, thank you for sending me your data offline. To my surprise, this turns out to be neither impossible nor a bug.

          First, let me point out that this would not have occured had you used edgeR in the usual way. Had you used a normal calling sequence like:

          dge <- estimateCommonDisp(dge)
          dge <- estimateTagwiseDisp(dge)

          then the pseudo count for gene X would have been set to 6438.7 rather than zero.

          The reason why you have got a pseudo count of 0 is that you have called equalizeLibSize with a dispersion setting that is inappropriate for your data. Your samples S11 and S12 belong to a small group of their own. Gene X has cpm values of 448.6 and 6162.8 for samples S11 and S12. You have called equalizeLibSizes() with dispersion=0, but it is completely unbelievable that Gene X could have had such different cpm values for the two replicate samples if the dispersion truly was 0. Hence, when transforming to a smaller library size, edgeR is forced to put the smaller of two counts to zero to try to fit the dispersion information you have given it. If you had called equalizeLibSizes with any reasonable dispersion value (even as small as 1e-6!) then the pseudo count would not have been zero. The estimated dispersion for Gene X is actually 0.48.

          As I said in my previous post, it is not correct to call equalizeLibSizes without setting the dispersion appropriately. In fact, we didn't intend for users to call this function directly at all.
          Last edited by Gordon Smyth; 09-23-2014, 04:31 PM. Reason: minor grammatical error

          Comment


          • #6
            Note that to access the pseudo counts, one uses

            dge$pseudo.counts

            not

            equalizeLibSizes(dge)$pseudo.counts

            The latter code will (in Bioc 2.14) compute pseudo counts using dispersion=0. Of course you will know this if you have read the help page for equalizeLibSizes.

            The help page for equalizeLibSizes also tells you that this function is intended for internal use, so you should only call it directly if you know what you are doing.

            Comment


            • #7
              Thanks for your answer. I'll use cpms instead of pseudocounts, but I don't understand why they remain the same with and without dispersion:

              # without estimating dispersion
              dge_no_dispersion <- DGEList(counts = counts_raw, group = group)
              dge_no_dispersion <- calcNormFactors(dge)

              # estimating dispersion
              dge_dispersion <- estimateGLMCommonDisp(dge_no_dispersion)
              dge_dispersion <- estimateGLMTrendedDisp(dge_dispersion)
              dge_dispersion <- estimateGLMTagwiseDisp(dge_dispersion)

              # cpm doesn't change
              cpm(dge_no_dispersion)[gene_id, c("S11", "S12")] # 448.648 6162.769
              cpm(dge_dispersion)[gene_id, c("S11", "S12")] # 448.648 6162.769

              Comment


              • #8
                Originally posted by nachocab View Post
                I'll use cpms instead of pseudocounts, but I don't understand why they remain the same with and without dispersion:
                cpm is a very simple quantity:

                count / normalized lib size * 1e6

                It doesn't depend on dispersion.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                49 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                66 views
                0 likes
                Last Post seqadmin  
                Working...
                X