**dpryan** · 09-17-2014, 11:39 PM

Have a look at dge$samples$lib.size and dge$samples$norm.factors in both cases. I wonder if something is going wonky with the size normalization (I've seen that happen sometimes when you have a 10x or more difference in library sizes).

**nachocab** · 09-18-2014, 04:57 AM

Thanks, Devon. I did, but I don't see anything glaringly wrong.

The Y1 and Y2 samples have 21M and 33M reads respectively, with normalization factors 1.13 and .76 before adding the X samples, and .94 and .65 after adding the X samples (which have around 2M reads and a factor of around 1.2).
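For reference, the effective library sizes implied by these numbers (library size times normalization factor) can be worked out in a few lines of base R. This is only a sketch: the values are the rounded figures quoted in this post, a single representative X sample is assumed, and the variable names are made up for illustration.

```r
# Rounded figures quoted in the post; "X" stands for one representative X sample.
lib_size    <- c(Y1 = 21e6, Y2 = 33e6, X = 2e6)
norm_factor <- c(Y1 = 0.94, Y2 = 0.65, X = 1.2)  # factors after adding the X samples

# edgeR scales each library size by its normalization factor.
eff_lib_size <- lib_size * norm_factor
print(eff_lib_size)

# Spread between the largest and smallest library, raw and effective.
raw_ratio <- max(lib_size) / min(lib_size)
eff_ratio <- max(eff_lib_size) / min(eff_lib_size)
print(c(raw = raw_ratio, effective = eff_ratio))
```

With these rounded inputs the raw library sizes differ by a factor of 16.5, which is past the 10x range mentioned above, even though each individual normalization factor looks unremarkable.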

**Gordon Smyth** · 09-22-2014, 12:01 AM

nachocab, the results that you report sound impossible. It isn't possible for a real count of over 5000 to be reduced to a pseudo count of 0. Either you've made a mistake somewhere or there's a bug in edgeR. Is it possible that you have simply pulled out the wrong value for Gene X after adding the new data? If you think this is a bug in edgeR, then the way to proceed is to send a reproducible example to the authors. We certainly haven't seen anything like this.

BTW, it is not usual for a user to call equalizeLibSizes() directly, and it is not usually correct to use it without setting the dispersion argument.

**Gordon Smyth** · 09-23-2014, 04:29 PM

Nacho, thank you for sending me your data offline. To my surprise, this turns out to be neither impossible nor a bug.

First, let me point out that this would not have occurred had you used edgeR in the usual way. Had you used a normal calling sequence like:

dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)

then the pseudo count for gene X would have been set to 6438.7 rather than zero.

The reason why you have got a pseudo count of 0 is that you have called equalizeLibSizes() with a dispersion setting that is inappropriate for your data. Your samples S11 and S12 belong to a small group of their own. Gene X has cpm values of 448.6 and 6162.8 for samples S11 and S12. You have called equalizeLibSizes() with dispersion=0, but it is completely unbelievable that Gene X could have had such different cpm values for the two replicate samples if the dispersion truly was 0. Hence, when transforming to a smaller library size, edgeR is forced to put the smaller of the two counts to zero to try to fit the dispersion information you have given it. If you had called equalizeLibSizes() with any reasonable dispersion value (even as small as 1e-6!) then the pseudo count would not have been zero. The estimated dispersion for Gene X is actually 0.48.
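To get a feel for why dispersion=0 is untenable here, the following base R sketch compares how probable the lower replicate's count is under a Poisson model (dispersion 0) and under a negative binomial with the estimated dispersion of 0.48. It is not the algorithm inside equalizeLibSizes(); the effective library size of 2.4M is an assumption based on the approximate figures given earlier in the thread, and the counts are back-calculated from the quoted cpm values.

```r
# Hypothetical effective library size for both replicates (about 2M reads x factor 1.2).
eff_lib <- 2.4e6

# cpm values for Gene X quoted above, converted back to approximate counts.
cpm_x  <- c(S11 = 448.6, S12 = 6162.8)
counts <- round(cpm_x * eff_lib / 1e6)

# Treat the two replicates as sharing one mean.
mu <- mean(counts)

# Log-probability of a count as low as S11's under dispersion 0 (Poisson).
log_p_pois <- ppois(counts[["S11"]], lambda = mu, log.p = TRUE)

# Same tail probability under a negative binomial with dispersion 0.48.
# R parameterizes the negative binomial by size = 1 / dispersion.
p_nb <- pnbinom(counts[["S11"]], mu = mu, size = 1 / 0.48)

print(c(log_p_poisson = log_p_pois, p_negbin = p_nb))
```

Under these assumptions the Poisson log-probability is hugely negative, while the negative binomial tail probability is small but not negligible, which is consistent with the point that two replicates this far apart cannot be reconciled with a dispersion of zero.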

As I said in my previous post, it is not correct to call equalizeLibSizes without setting the dispersion appropriately. In fact, we didn't intend for users to call this function directly at all.

**Gordon Smyth** · 10-09-2014, 02:20 PM

Note that to access the pseudo counts, one uses

dge$pseudo.counts

not

equalizeLibSizes(dge)$pseudo.counts

The latter code will (in Bioc 2.14) compute pseudo counts using dispersion=0. Of course you will know this if you have read the help page for equalizeLibSizes.

The help page for equalizeLibSizes also tells you that this function is intended for internal use, so you should only call it directly if you know what you are doing.
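Putting the advice from this thread together, the usual route to pseudo counts might look like the sketch below. It assumes the Bioconductor package edgeR is installed; the count matrix is simulated and the names counts_sim and group_sim are invented for illustration, so the numbers mean nothing beyond showing the calling sequence.

```r
library(edgeR)  # assumes the Bioconductor package edgeR is installed

# Simulated counts, purely for illustration: 200 genes, 2 groups of 2 samples.
set.seed(1)
counts_sim <- matrix(rnbinom(200 * 4, mu = 100, size = 5), nrow = 200,
                     dimnames = list(paste0("gene", 1:200), paste0("S", 1:4)))
group_sim <- factor(c("A", "A", "B", "B"))

dge <- DGEList(counts = counts_sim, group = group_sim)
dge <- calcNormFactors(dge)

# Dispersion estimation computes the pseudo counts as a by-product.
dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)

# Read the pseudo counts from the object rather than calling equalizeLibSizes().
head(dge$pseudo.counts)
```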

**nachocab** · 10-09-2014, 02:59 PM

Thanks for your answer. I'll use cpms instead of pseudocounts, but I don't understand why they remain the same with and without dispersion:

# without estimating dispersion
dge_no_dispersion <- DGEList(counts = counts_raw, group = group)
dge_no_dispersion <- calcNormFactors(dge_no_dispersion)

# estimating dispersion
dge_dispersion <- estimateGLMCommonDisp(dge_no_dispersion)
dge_dispersion <- estimateGLMTrendedDisp(dge_dispersion)
dge_dispersion <- estimateGLMTagwiseDisp(dge_dispersion)

# cpm doesn't change
cpm(dge_no_dispersion)[gene_id, c("S11", "S12")] # 448.648 6162.769
cpm(dge_dispersion)[gene_id, c("S11", "S12")] # 448.648 6162.769

**Gordon Smyth** · 10-09-2014, 04:07 PM

Originally posted by nachocab:

I'll use cpms instead of pseudocounts, but I don't understand why they remain the same with and without dispersion:

cpm is a very simple quantity:

count / normalized lib size * 1e6

It doesn't depend on dispersion.
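As a minimal illustration of that formula, here is a base R sketch in which the normalized library size is taken to be the library size times the normalization factor. The helper name cpm_manual and the input numbers are hypothetical.

```r
# Hypothetical helper mirroring the formula above:
# count / (lib.size * norm.factor) * 1e6
cpm_manual <- function(count, lib_size, norm_factor = 1) {
  count / (lib_size * norm_factor) * 1e6
}

# Made-up example: 1000 reads in a 2M-read library with normalization factor 1.25.
cpm_value <- cpm_manual(count = 1000, lib_size = 2e6, norm_factor = 1.25)
print(cpm_value)
```

No dispersion term appears anywhere in the calculation, so estimating dispersions cannot change the result.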

edgeR equalizeLibSizes(dge)$pseudo.counts changes wildly