SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
edgeR mean normalized reads counts for each gene antoza Introductions 2 02-11-2014 08:22 AM
The easiest way to produce the normalised counts in edgeR feralBiologist Bioinformatics 0 11-17-2013 03:50 PM
edgeR: number of DGE genes and FDR?? SpreeFu Bioinformatics 4 03-12-2013 08:54 AM
RNA seq and DGE using edgeR greggime RNA Sequencing 14 07-12-2012 02:01 AM
Multiple DGE libraries comparison. (EdgeR baySeq DESeq) Roman Bruno Bioinformatics 52 11-29-2011 06:39 AM

Reply
 
Thread Tools
Old 09-17-2014, 03:14 PM   #1
nachocab
Member
 
Location: cambridge, MA

Join Date: Dec 2012
Posts: 11
Default edgeR equalizeLibSizes(dge)$pseudo.counts changes wildly

I previously analyzed a set of 24 samples and found that gene X had 5,591 counts for sample_Y1 and 50,113 counts for sample_Y2. When I looked at the pseudocounts for these values (using equalizeLibSizes(dge)$pseudo.counts), I had Y1=6900.57 and Y2=52776.9.

Then I added 12 more samples for condition Z (totally unrelated to condition Y) and added a new column to my design matrix so it looked like this:

Y Z
sample1 1 0
...
sample_Y1 1 0
sample_Y2 1 0
...
sample24 1 0
sample_Z1 0 1
...
sample_Z12 0 1

But the problem is that now Y1=0.0 and Y2=23458.0 for gene X. I understand that this value would decrease because the library sizes for my first 24 samples was around 20 million reads, and for the last 12, it was around 2 million reads.

What I don't understand is why would Y1 change so dramatically, while Y2 would only halve. Any help is very appreciated.
nachocab is offline   Reply With Quote
Old 09-17-2014, 11:39 PM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Have a look at dge$samples$lib.size and dge$samples$norm.factors in both cases. I wonder if something is going wonky with the size normalization (I've seen that happen sometimes when you have a 10x or more difference in library sizes).
dpryan is offline   Reply With Quote
Old 09-18-2014, 04:57 AM   #3
nachocab
Member
 
Location: cambridge, MA

Join Date: Dec 2012
Posts: 11
Default

Thanks, Devon. I did, but I don't see anything glaringly wrong.

The Y1 and Y2 samples have 21M and 33M reads respectively, with normalization factors 1.13 and .76 before adding the X samples, and .94 and .65 after adding the X samples (which have around 2M reads and a factor of around 1.2).
nachocab is offline   Reply With Quote
Old 09-22-2014, 12:01 AM   #4
Gordon Smyth
Member
 
Location: Melbourne, Australia

Join Date: Apr 2011
Posts: 91
Default

nachocab, the results that you report sound impossible. It isn't possible for a real count of over 5000 to be reduced to a pseudo count of 0. Either you've made a mistake somewhere or there's a bug in edgeR. Is it possible that you have simply pulled out the wrong value for Gene X after adding the new data? If you think this is a bug in edgeR, then the way to proceed is to send a reproducible example to the authors. We certainly haven't seen anything like this.

BTW, it is not usual for a user to call equalizeLibSizes() directly, and it is not usually correct to use it without setting the dispersion argument.
Gordon Smyth is offline   Reply With Quote
Old 09-23-2014, 04:29 PM   #5
Gordon Smyth
Member
 
Location: Melbourne, Australia

Join Date: Apr 2011
Posts: 91
Default

Nacho, thank you for sending me your data offline. To my surprise, this turns out to be neither impossible nor a bug.

First, let me point out that this would not have occured had you used edgeR in the usual way. Had you used a normal calling sequence like:

dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)

then the pseudo count for gene X would have been set to 6438.7 rather than zero.

The reason why you have got a pseudo count of 0 is that you have called equalizeLibSize with a dispersion setting that is inappropriate for your data. Your samples S11 and S12 belong to a small group of their own. Gene X has cpm values of 448.6 and 6162.8 for samples S11 and S12. You have called equalizeLibSizes() with dispersion=0, but it is completely unbelievable that Gene X could have had such different cpm values for the two replicate samples if the dispersion truly was 0. Hence, when transforming to a smaller library size, edgeR is forced to put the smaller of two counts to zero to try to fit the dispersion information you have given it. If you had called equalizeLibSizes with any reasonable dispersion value (even as small as 1e-6!) then the pseudo count would not have been zero. The estimated dispersion for Gene X is actually 0.48.

As I said in my previous post, it is not correct to call equalizeLibSizes without setting the dispersion appropriately. In fact, we didn't intend for users to call this function directly at all.

Last edited by Gordon Smyth; 09-23-2014 at 04:31 PM. Reason: minor grammatical error
Gordon Smyth is offline   Reply With Quote
Old 10-09-2014, 02:20 PM   #6
Gordon Smyth
Member
 
Location: Melbourne, Australia

Join Date: Apr 2011
Posts: 91
Default

Note that to access the pseudo counts, one uses

dge$pseudo.counts

not

equalizeLibSizes(dge)$pseudo.counts

The latter code will (in Bioc 2.14) compute pseudo counts using dispersion=0. Of course you will know this if you have read the help page for equalizeLibSizes.

The help page for equalizeLibSizes also tells you that this function is intended for internal use, so you should only call it directly if you know what you are doing.
Gordon Smyth is offline   Reply With Quote
Old 10-09-2014, 02:59 PM   #7
nachocab
Member
 
Location: cambridge, MA

Join Date: Dec 2012
Posts: 11
Default

Thanks for your answer. I'll use cpms instead of pseudocounts, but I don't understand why they remain the same with and without dispersion:

# without estimating dispersion
dge_no_dispersion <- DGEList(counts = counts_raw, group = group)
dge_no_dispersion <- calcNormFactors(dge)

# estimating dispersion
dge_dispersion <- estimateGLMCommonDisp(dge_no_dispersion)
dge_dispersion <- estimateGLMTrendedDisp(dge_dispersion)
dge_dispersion <- estimateGLMTagwiseDisp(dge_dispersion)

# cpm doesn't change
cpm(dge_no_dispersion)[gene_id, c("S11", "S12")] # 448.648 6162.769
cpm(dge_dispersion)[gene_id, c("S11", "S12")] # 448.648 6162.769
nachocab is offline   Reply With Quote
Old 10-09-2014, 04:07 PM   #8
Gordon Smyth
Member
 
Location: Melbourne, Australia

Join Date: Apr 2011
Posts: 91
Default

Quote:
Originally Posted by nachocab View Post
I'll use cpms instead of pseudocounts, but I don't understand why they remain the same with and without dispersion:
cpm is a very simple quantity:

count / normalized lib size * 1e6

It doesn't depend on dispersion.
Gordon Smyth is offline   Reply With Quote
Reply

Tags
edger, normalization, rna-seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:18 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO