  • How to get normalized count from DESeq?

    Hi all,

I have several NGS libraries and want to use DESeq to normalize them, so that the counts are comparable across libraries. How can I get the normalized counts (as integers rather than floating-point values)?

In the DESeq manual, there is code like this:
cds <- newCountDataSet( countsTable, conds )   # build the count data set
cds <- estimateSizeFactors( cds )              # estimate library size factors
cds <- estimateVarianceFunctions( cds )        # fit the variance functions
res <- nbinomTest( cds, "T", "N" )             # test condition "T" vs "N"

And about the normalization: I have samples 1, 2 and 3, each with only one replicate. If I want to compare samples 1 and 2, should I normalize between 1 and 2 only, and then, to compare samples 2 and 3, normalize again between 2 and 3? Or can I normalize samples 1, 2 and 3 together and then compare 1 vs 2 or 2 vs 3?


    Thanks!

  • #2
    You can divide your counts by the size factors:

    normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )

    Now, the samples are on a common scale.

    Of course, these are not integer values. However, I cannot quite see why you would want to have integer values.

    Simon



    • #3
Thanks Simon, actually float is OK too, or I can round to integer.
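If integer values are needed downstream, the normalized counts from Simon's snippet can simply be rounded; a minimal sketch, assuming cds is the CountDataSet from above:

```r
## Normalize as above, then round to the nearest integer
normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )
roundedCounts   <- round( normalizedCounts )
```

Note that rounded counts are fine for inspection, but the statistical testing in DESeq should always be run on the raw counts.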



      • #4
Normalized counts on a common scale

        Hi, Simon:

I have a question related to your last post about normalized counts on a common scale. My RNA-seq data consist of seven conditions, each with 2 replicates; it is a time-series experiment on seed embryo development. I was wondering whether I could get normalized counts for all the developmental stages on a common scale, to compare expression across the whole time course.
I tried your code, and the result showed a similar scale only within the two replicates of each stage; between stages they are still quite different. Here are the normalized count sums for each library. Note that the same letter indicates the same stage, and 1, 2 indicate the replicates.
        Code:
        Z1, Z2, O1, O2, G1, G2, H1, H2, T1, T2, B1, B2, M1, M2
25581381.8, 26292834.7, 14023619.9, 12161472.3, 18581478.1, 18749724.2, 10861736.7, 11585558.2, 9996853.0, 10318368.5, 14130401.4, 16239989.6, 30725888.0, 27787233.7
And the original read sums for each library are:
        Code:
        Z1, Z2, O1, O2, G1, G2, H1, H2, T1, T2, B1, B2, M1, M2
        21012147, 19924212, 26002900, 9660245, 17139388, 7649319, 16430105, 20101956, 12920266, 6306742, 44241095, 20094409, 15166090, 23203758
I assume the normalized sums for each stage should also be close to each other, as they are for replicates within a stage, but I am not sure what function DESeq uses to achieve this. edgeR provides common.lib.size for this purpose, and I assume edgeR and DESeq share the same idea. Am I right?
Actually, I tried edgeR's common.lib.size, but I never got similar numbers across the different stages, which is another reason I am posting this problem here.

Did I miss anything, e.g. are only two conditions allowed?
Since most experiments involve two conditions, I tried a subset of my data with only two conditions, and the result looked the same. Is there anything wrong with my data? The raw counts look fine, and other analyses indicated the data are fine.

I tried different normalization methods (TMM, quantile and RLE in edgeR), and also used the sum of each condition without replicates, but I still could not get onto a common scale with comparable count totals for each stage. I have posted this question to the Bioconductor list but have not yet received a reply from the edgeR group. I would appreciate it if you could address this point.
        Thank you!

        Yifang




        • #5
If you suspect that the normalization might be off, the best check is to compare two samples with an MA plot:

          Code:
plot( 
   ( counts(cds)[,1] + counts(cds)[,2] )/2,   # x: mean count of the two samples
   counts(cds)[,2] / counts(cds)[,1],         # y: ratio of the counts
   log="xy", pch="." )
abline( h = sizeFactors(cds)[2] / sizeFactors(cds)[1] )   # expected ratio
If the size factors are fine, the horizontal line should bisect the bulk of the points down the middle, especially at the high-expression end (right-hand side) of the plot.



          • #6
            Re: Normalized counts between treatments.

            Thanks Simon!

My suspicion concerns the normalization between treatments rather than within a treatment. As you suggested, I did the plotting for a cds built from two replicates of the same treatment (within treatment; left image of the attachment) and for one built from one replicate each of two different treatments (between treatments; right image). The within-treatment plot looks fine, but the between-treatment one does not. Is there anything wrong? This has bugged me for a couple of days!

            Thanks again!

            Yifang
            Attached Files



            • #7
It's really hard to see how many points are plotted on top of each other in your plots. Maybe try adding col="#00000040" to the plot command to add some transparency, or use the smoothScatter function. Apart from that, it looks fine to me: the horizontal line cuts through the middle of the big blob. Of course, you have more genes at the far bottom (outside the blob) than at the far top, but probably your treatment simply causes more downregulation than upregulation.
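For instance, the MA plot from post #5 could be redrawn with smoothScatter roughly like this (a sketch assuming the same cds object; the axis labels and filtering are illustrative):

```r
## Density-based MA plot: colour intensity shows how many points overlap
A <- ( counts(cds)[,1] + counts(cds)[,2] ) / 2      # mean count
M <- counts(cds)[,2] / counts(cds)[,1]              # count ratio
keep <- A > 0 & is.finite(M) & M > 0                # drop zero-count genes
smoothScatter( log10(A[keep]), log10(M[keep]),
               xlab = "log10 mean count", ylab = "log10 ratio" )
abline( h = log10( sizeFactors(cds)[2] / sizeFactors(cds)[1] ) )
```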



              • #8
                Re: Normalized counts between treatments.

                Thanks Simon! This is a big relief to me.
The dots are concentrated around the line once I add transparency to the plot. However, my colleague argues that the total counts for each library should come out close to each other; the edgeR user's guide states several times that "the total counts in each library of the pseudocounts agrees well with the common library size" (pages 27 & 44). But I can't see whether DESeq has a similar parameter behind the scenes.
When I check the sums of baseMeanA and baseMeanB as normalized counts for each library, the numbers are still quite different, which is the same as the result from edgeR. So:
1) Does my data violate an assumption of both packages?
2) Does it not matter if the normalized library sizes of the conditions are different?
3) Is the result still meaningful even if the normalized library sizes are different?
4) What could be the reason(s) for the normalized library sizes being so different?
5) Should I remove the low-count genes, as some other people do? After I removed the lower counts (<=40 in >=6 out of 14 samples), the normalized library sizes became very similar.
I can feel my lack of mathematics for these packages. Your explanation would be greatly appreciated. Thank you again!
                Yifang
                Attached Files



                • #9
                  Originally posted by yifangt View Post
However, my colleague argues that the total counts for each library should come out close to each other; the edgeR user's guide states several times that "the total counts in each library of the pseudocounts agrees well with the common library size" (pages 27 & 44).
If we expected the proper normalization factors to be close to the ratios of the total read counts, we could use the latter directly. The whole point of not using them, and instead estimating the factors from the per-gene counts, is that the two sometimes disagree, and then the per-gene estimate is the more reliable one. Robinson and Oshlack have written a whole paper arguing this point, so maybe you want to ask your sceptical colleagues to read it. ;-)
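The estimate DESeq uses is the median of per-gene ratios to a geometric-mean pseudo-reference; a minimal sketch of that computation on a toy matrix (the variable names are illustrative, not DESeq internals):

```r
## Toy matrix: 3 genes (rows) x 3 samples (columns);
## sample 2 is sequenced twice as deep as sample 1, sample 3 half as deep
counts <- matrix(c(100, 200,  50,
                   400, 800, 200,
                   300, 600, 150), nrow = 3, byrow = TRUE)

## Log geometric mean of each gene across samples (the pseudo-reference)
logGeoMeans <- rowMeans(log(counts))

## Size factor per sample: median of its log ratios to the reference,
## over genes where the reference is finite (i.e. no zero counts)
sizeFactors <- apply(counts, 2, function(cnts)
    exp(median((log(cnts) - logGeoMeans)[is.finite(logGeoMeans)])))

sizeFactors   # 1.0, 2.0, 0.5 for this toy matrix
```

Because the median ignores the genes that change most, a handful of very highly expressed, strongly regulated genes can make the total read count a poor proxy, while the median-of-ratios estimate stays stable.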



                  • #10
Can we get the normalized count for each exon using the same approach in the newly released DEXSeq package? In the visualization graphs generated by DEXSeq, normalized counts are drawn for each exon. Are those values derived by dividing the raw counts by the size factors?

                    thanks for your reply.




                    • #11
                      Hi xguo,

                      Yes, they are derived in the same way. In DEXSeq, you can get them using the parameter normalized=TRUE in the function 'counts':

                      counts(ecs, normalized=TRUE)

                      Cheers!



                      • #12
                        I have a question about normalization and downscaling:

                        I have 7 samples with the following number of aligned reads, based on samtools flagstat
Sample 1a: 50M, Sample 1b: 45M, Sample 2a: 95M, Sample 2b: 45M, Sample 2c: 55M, Sample 3a: 115M, Sample 3b: 30M

When I run cds <- estimateSizeFactors(cds), I get the following size factors:
                        Sample 1a: 0.8, Sample 1b: 0.7, Sample 2a: 1.55, Sample 2b: 0.75, Sample 2c: 0.90, Sample 3a: 1.92, Sample 3b: 0.9

The last sample of condition 3 (Sample 3b) has a higher size factor than I think it should, given its read count. I went to a workshop where they suggested downscaling some data so that the libraries all match in number of reads. Is that what I should do? Does this data look peculiar?



                        • #13
                          And what did they mean by "downscaling"?



                          • #14
Hi Simon, I wasn't sure exactly how they meant to do this, but the point was essentially to discard some reads so that all conditions have roughly the same number of reads before running DESeq. Have you not encountered this technique? Do you think it is a problem that one of my samples (Sample 3a) has 4x the depth of another (Sample 3b), but the size factors that DESeq outputs indicate only a twofold difference?
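A quick way to see how far the estimated size factors deviate from plain depth ratios is to put the two side by side; a sketch assuming the cds object from this thread (the geometric-mean scaling just makes the two columns comparable):

```r
## Compare DESeq's size factors with total-read-count ratios
depth <- colSums( counts(cds) )                     # reads per library
cbind( sizeFactor = sizeFactors(cds),
       depthRatio = depth / exp(mean(log(depth))) ) # depth / geometric mean
```

A large gap between the two columns for a sample is not necessarily an error; it often means a sizeable share of that library's reads comes from a few dominant genes, which is exactly the situation the median-of-ratios estimate is designed to be robust against.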



                            • #15
When I run DESeq and look at my dispersion estimates versus normalized expression, it seems that most genes have either very low or very high read counts, even after normalization. I am normalizing data consisting of three time points (50 samples per time point). This looks very different from the dispersion plot in the manual. Is this a case where filtering out genes with very low read counts would be a good idea? If not, would it be worth normalizing each time point separately? When I compare two samples with an MA plot, it looks fairly normal (though about 6000 data points are omitted).
                              dispersion:
                              sample compare: http://tinypic.com/r/aadgxz/8

Edit: I ran it using cdsFilt and it looks much more like the example in the manual.
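For reference, a cdsFilt object like the one mentioned above is typically built by dropping the weakest genes before testing; a sketch along the lines of the DESeq vignette (the 40% quantile threshold is a judgment call, not a fixed rule):

```r
## Keep only genes whose total count across all samples is above
## the 40% quantile of those totals
rs  <- rowSums( counts(cds) )
use <- rs > quantile( rs, probs = 0.4 )
cdsFilt <- cds[ use, ]
```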

