SEQanswers How to get normalized count from DESeq?


07-19-2010, 11:14 AM #1 xhuister Member   Location: China Join Date: Apr 2010 Posts: 41

How to get normalized count from DESeq?

Hi all, I have several NGS libraries and want to use DESeq to normalize them, so that the counts are comparable across libraries. How can I get the normalized counts (as integers rather than floating-point values)? The DESeq manual has code like this:
Code:
```cds <- newCountDataSet( countsTable, conds )
cds <- estimateSizeFactors( cds )
cds <- estimateVarianceFunctions( cds )
res <- nbinomTest( cds, "T", "N" )```
And about the normalization: I have samples 1, 2 and 3, each with only one replicate. If I want to compare samples 1 and 2, should I normalize between 1 and 2, and then, to compare 2 and 3, normalize again between 2 and 3? Or can I normalize samples 1, 2 and 3 together and then compare 1 vs 2 or 2 vs 3? Thanks!
07-25-2010, 11:08 PM #2 Simon Anders Senior Member   Location: Heidelberg, Germany Join Date: Feb 2010 Posts: 994

You can divide your counts by the size factors:
Code:
```normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )```
Now the samples are on a common scale. Of course, these are not integer values; however, I cannot quite see why you would want integer values.

Simon
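For readers who want to see the arithmetic outside of R: this is just a per-column scaling. A minimal sketch in Python/NumPy with made-up numbers (toy data, not DESeq objects):

```python
import numpy as np

# Toy 3-gene x 2-sample count matrix; sample 2 was sequenced at
# twice the depth of sample 1, so its size factor is twice as large.
counts = np.array([[100, 200],
                   [ 50, 100],
                   [ 10,  20]], dtype=float)
size_factors = np.array([1.0, 2.0])

# Divide each column (sample) by its size factor -- the NumPy
# equivalent of t( t(counts(cds)) / sizeFactors(cds) ) in R.
normalized = counts / size_factors
print(normalized)
```

After the division the two columns agree, i.e. the samples are on a common scale; the results are generally not integers, which is exactly Simon's point.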
07-26-2010, 05:21 AM   #3
xhuister
Member

Location: China

Join Date: Apr 2010
Posts: 41

Quote:
 Originally Posted by Simon Anders You can divide your counts by the size factors: normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) ) Now, the samples are on a common scale. Of course, these are not integer values. However, I cannot quite see why you would want to have integer values. Simon
Thanks Simon, actually float is OK too, or I can round to integers.

06-20-2011, 03:15 PM   #4
yifangt
Member

Join Date: Feb 2011
Posts: 61
Normalized counts of common scale!

Hi, Simon:

I have a question related to your last post about normalized counts on a common scale. My RNA-seq data comprise seven conditions, each with two replicates, from a time-series experiment on seed embryo development. I was wondering whether I could get normalized counts on a common scale for all the developmental stages, so as to compare expression across the whole time course.
I tried your code, and the result shows a similar scale only within the two replicates of each stage; between stages the sums are still quite different. Here are the normalized count sums for each library (the same letter indicates the same stage; 1 and 2 indicate replicates):
Code:
```Z1, Z2, O1, O2, G1, G2, H1, H2, T1, T2, B1, B2, M1, M2
25581381.8, 26292834.7, 14023619.9, 12161472.3, 18581478.1, 18749724.2, 10861736.7, 11585558.2, 9996853.0, 10318368.5, 14130401.4, 16239989.6, 30725888.0, 27787233.7```
And the original read sums for each library are:
Code:
```Z1, Z2, O1, O2, G1, G2, H1, H2, T1, T2, B1, B2, M1, M2
21012147, 19924212, 26002900, 9660245, 17139388, 7649319, 16430105, 20101956, 12920266, 6306742, 44241095, 20094409, 15166090, 23203758```
I would assume the normalized sums for the different stages should also be close to each other, as they are for the replicates within a stage. I am not sure what function DESeq uses to achieve this; edgeR provides common.lib.size for this purpose, and I assume edgeR and DESeq share the same idea. Am I right?
Actually, I tried edgeR's common.lib.size, but I never got similar numbers across the different stages, which is another reason I am posting this question here.

Did I miss anything, e.g. are only two conditions allowed?
Since most experiments involve two conditions, I also tried a subset of my data with only two conditions; the result looked the same. Is there anything wrong with my data? The raw counts look fine, and other analyses indicate the data are fine.

I tried different normalization methods (TMM, quantile and RLE in edgeR), and also used the sum of each condition without replicates, but I still could not get to a common scale with comparable count sums for each stage. I have posted this question to the Bioconductor group but have not yet received any reply from the edgeR side. I would appreciate it if you could address this point.
Thank you!

Yifang

Quote:
 Originally Posted by Simon Anders You can divide your counts by the size factors: normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) ) Now, the samples are on a common scale. Of course, these are not integer values. However, I cannot quite see why you would want to have integer values. Simon

Last edited by yifangt; 06-20-2011 at 06:02 PM.

06-20-2011, 11:38 PM #5 Simon Anders Senior Member   Location: Heidelberg, Germany Join Date: Feb 2010 Posts: 994

If you suspect that normalization might be off, the best check is to compare two samples with an MA plot:
Code:
```plot( ( counts(cds)[,1] + counts(cds)[,2] )/2,
      counts(cds)[,2] / counts(cds)[,1],
      log="xy", pch="." )
abline( h = sizeFactors(cds)[2] / sizeFactors(cds)[1] )```
If the size factors are fine, the horizontal line should bisect the bulk of the points in the middle, especially at the high-expression end (right-hand side) of the plot.
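To see what this plot checks: for two samples that differ only in depth, the per-gene ratios scatter around the depth ratio, and at the high-count end their middle should land on the abline. A simulated sanity check of that idea (Python/NumPy standing in for R, with invented data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two replicates of the same condition; sample 2 is
# sequenced at twice the depth, so the true size-factor ratio is 2.
mu = rng.lognormal(mean=4.0, sigma=1.0, size=5000)  # per-gene expression
c1 = rng.poisson(mu)
c2 = rng.poisson(2.0 * mu)

keep = (c1 > 0) & (c2 > 0)
a = (c1[keep] + c2[keep]) / 2.0      # x-axis of the MA plot: mean count
m = c2[keep] / c1[keep]              # y-axis: per-gene ratio

# At the high-expression end the ratios concentrate tightly around
# the depth ratio; the horizontal line should cut through their middle.
high = a > np.median(a)
print(np.median(m[high]))            # should be close to 2
```

If the median ratio of the high-count genes sits far from sizeFactors(cds)[2] / sizeFactors(cds)[1], normalization is off; that is what eyeballing the abline tells you.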
06-21-2011, 07:53 AM   #6
yifangt
Member

Join Date: Feb 2011
Posts: 61
Re: Normalized counts between treatments.

Thanks Simon!

My suspicion concerns the normalization between treatments rather than within a treatment. As you suggested, I made the plot for two replicates of the same treatment (within treatment, left image of the attachment) and for one replicate each of two different treatments (between treatments, right image). The within-treatment plot looks fine, but the between-treatments one does not. Is there anything wrong? This has been bugging me for a couple of days!

Thanks again!

Yifang
Attached Images
 MAplot.png (9.2 KB, 263 views)

06-21-2011, 09:24 AM #7 Simon Anders Senior Member   Location: Heidelberg, Germany Join Date: Feb 2010 Posts: 994

It's really hard to see how many points are plotted on top of each other in your plots. Maybe try adding col="#00000040" to the plot command to add some transparency, or use the smoothScatter function. Apart from that, it looks fine to me: the horizontal line cuts through the middle of the big blob. Of course, you have more genes at the far bottom (outside the blob) than at the far top, but probably your treatment simply causes more downregulation than upregulation.
06-21-2011, 01:21 PM   #8
yifangt
Member

Join Date: Feb 2011
Posts: 61
Re: Normalized counts between treatments.

Quote:
 Originally Posted by Simon Anders It's really hard to see how many points are plotted on top of each other in your plots. Maybe try adding col="#00000040" to the plot command to add some transparency, or use the smoothScatter function. Apart from that, it looks fine to me: the horizontal line cuts through the middle of the big blob. Of course, you have more genes at the far bottom (outside the blob) than at the far top, but probably your treatment simply causes more downregulation than upregulation.
Thanks Simon! This is a big relief to me.
The dots are concentrated across the line after I add the transparency to the plot. However, my colleague argues about the fact that the number of counts for each library should come to close numbers, about which edgeR states several times that "the total counts in each libray of the pseudocounts agrees well with the common library size" (page 27 & 44 of the user's guide). But I can't see if DESeq has similar parameters behind.
After I check the sums of baseMeanA and baseMeanB as normalized counts for each library, the numbers are still quite different, which is the same as the result from edgeR. So,
1) Does my data violate the assumption of both packages?
2) Does the normalized library size of the conditions not matter if they are different?
3) Is the result still meaningful even the normalized library sizes are different?
4) What could probably be the reason(s) to cause the normalized library size so different?
5) Should I remove the smaller number reads as some other people do? After I removed the smaller numbers of counts (<=40 in >=6 out of 14 samples), the normalized library sizes become very similar.
I can feel my lack of mathematics for the packages. Your explanation would be greatly appreciated. Thank you again!
Yifang
Attached Images
 MAplot.jpg (13.5 KB, 158 views)

07-03-2011, 05:02 AM   #9
Simon Anders
Senior Member

Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

Quote:
 Originally Posted by yifangt However, my colleague argues that the total counts for each library should come out close to each other; edgeR states several times that "the total counts in each library of the pseudocounts agrees well with the common library size" (pages 27 & 44 of the user's guide).
If we expected the proper normalization factors to be close to the ratios of the total numbers of reads, we could use the latter directly. However, the whole point of not using them, and instead estimating the factors from the per-gene counts, is that sometimes the two do not agree, and then the per-gene estimate is more reliable. Robinson and Oshlack have written a whole paper arguing this point, so maybe you want to ask your sceptical colleagues to read it. ;-)
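For the curious, the estimator DESeq uses (the "median of ratios" of Anders & Huber, 2010) can be sketched in a few lines. Here it is in Python/NumPy — an illustration of the idea, not the actual R implementation — together with a contrived example where the total read counts would badly mislead:

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors: each gene is compared to its
    geometric mean across samples, and a sample's size factor is the
    median of those per-gene ratios."""
    log_counts = np.log(counts)
    log_geo_means = log_counts.mean(axis=1)           # per-gene reference
    ok = np.isfinite(log_geo_means)                   # drop genes with zeros
    log_ratios = log_counts[ok] - log_geo_means[ok, None]
    return np.exp(np.median(log_ratios, axis=0))

# Four genes, two samples with identical per-gene depth -- except one
# gene that explodes in sample 2 and dominates its column total.
counts = np.array([[10.0,  10.0],
                   [20.0,  20.0],
                   [30.0,  30.0],
                   [40.0, 940.0]])

print(counts.sum(axis=0))    # column totals differ 10-fold
print(size_factors(counts))  # but both size factors are 1: the median
                             # of ratios ignores the single outlier gene
```

This is exactly the situation the total-count ratio gets wrong: one highly expressed gene inflates a library's total, while the bulk of genes are sequenced at the same depth in both samples.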

12-31-2011, 01:13 PM   #10
xguo
Member

Location: Maryland

Join Date: Jul 2008
Posts: 48

Can we get the normalized count for each exon using the same approach in the newly released DEXSeq package? In the visualization graphs generated by DEXSeq, there are normalized counts drawn for each exon. Are those values derived by dividing the raw counts by the size factors?

Quote:
 Originally Posted by Simon Anders If we expected the proper normalization factors to be close to the ratios of the total numbers of reads, we could use the latter directly. However, the whole point of not using them, and instead estimating the factors from the per-gene counts, is that sometimes the two do not agree, and then the per-gene estimate is more reliable. Robinson and Oshlack have written a whole paper arguing this point, so maybe you want to ask your sceptical colleagues to read it. ;-)

01-01-2012, 01:54 AM #11 areyes Senior Member   Location: Heidelberg Join Date: Aug 2010 Posts: 165

Hi xguo,

Yes, they are derived in the same way. In DEXSeq, you can get them by passing normalized=TRUE to the 'counts' function:
Code:
```counts( ecs, normalized=TRUE )```
Cheers!
10-16-2012, 12:41 PM #12 billstevens Senior Member   Location: Baltimore Join Date: Mar 2012 Posts: 120

I have a question about normalization and downscaling. I have 7 samples with the following numbers of aligned reads, based on samtools flagstat:
Sample 1a: 50M, Sample 1b: 45M, Sample 2a: 95M, Sample 2b: 45M, Sample 2c: 55M, Sample 3a: 115M, Sample 3b: 30M
When I run cds <- estimateSizeFactors(cds), I get the following size factors:
Sample 1a: 0.8, Sample 1b: 0.7, Sample 2a: 1.55, Sample 2b: 0.75, Sample 2c: 0.90, Sample 3a: 1.92, Sample 3b: 0.9
The last sample of condition 3 has a higher size factor than I think it should. At a workshop, people suggested downscaling some samples so that they all have roughly the same number of reads. Is that what I should do? Does this data look peculiar?
 10-16-2012, 12:51 PM #13 Simon Anders Senior Member   Location: Heidelberg, Germany Join Date: Feb 2010 Posts: 994 And what did they mean by "downscaling"?
10-20-2012, 11:30 AM #14 billstevens Senior Member   Location: Baltimore Join Date: Mar 2012 Posts: 120

Hi Simon, I wasn't sure how they meant to do it, but the point was essentially to disregard some reads so that all of the conditions have roughly the same number of reads before running DESeq. Have you not encountered this technique? Do you think it is a problem that one of my samples (Sample 3a) has 4x the depth of another (Sample 3b), but the size factors that DESeq outputs indicate only a twofold difference?
11-20-2014, 06:46 PM #15 bpb9 Member   Location: NYC Join Date: Aug 2012 Posts: 24

When I run DESeq and look at my dispersion estimates versus normalized expression, it seems that most genes have either very low or very high read counts even after normalization. I am normalizing data that consist of three time points (50 samples per time point). This looks very different from the dispersion plot in the manual. Is this a case where filtering out genes with very low read counts would be a good idea? If not, is it worth normalizing each time point separately? When I compare two samples with an MA plot, it looks fairly normal (though it omits about 6000 data points).
Dispersion plot and sample comparison: http://tinypic.com/r/aadgxz/8
Edit: I ran it using cdsFilt and it looks much more like the example in the manual.

Last edited by bpb9; 11-20-2014 at 08:23 PM. Reason: additional image, revised analysis
11-21-2014, 12:45 AM #16 Simon Anders Senior Member   Location: Heidelberg, Germany Join Date: Feb 2010 Posts: 994

If you have m samples, then the maximum possible coefficient of variation is reached when a gene has counts in only one sample and zero counts in the (m-1) other samples. In such cases, you will see a dispersion value close to m. So all those points in the top left of your plot are likely genes that appear in only one or very few samples and are absent in most of the others. If you look at a scatter plot of one sample versus another, most of these genes simply vanish in the bottom-left zero-zero corner, which is why you didn't see them there. I don't know what "cdsFilt" is, but removing these genes did make the plot look more normal. However, note that if you zoomed your previous plot in to the region with mean > 10, it would look much the same as the new plot. So your results shouldn't change much, i.e. the filtering might not even have been necessary.
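Simon's bound is easy to verify: for a gene with counts in exactly one of m samples, the squared coefficient of variation works out to m − 1, whatever the count. A quick numerical check (Python, toy numbers):

```python
import numpy as np

m = 50                         # number of samples
x = np.zeros(m)
x[0] = 1000.0                  # gene detected in one sample only

cv2 = x.var() / x.mean() ** 2  # squared coefficient of variation

# cv2 equals m - 1 regardless of the count in the nonzero sample,
# which is why such genes pile up near dispersion ~ m in the plot.
print(cv2)                     # 49.0 for m = 50
```

Changing the 1000.0 to any other positive count leaves cv2 unchanged, so these sparse genes all land in the same top-left band of the dispersion plot.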
11-21-2014, 05:38 AM #17 bpb9 Member   Location: NYC Join Date: Aug 2012 Posts: 24

Hi Simon, by cdsFilt I mean I did the following:
Code:
```rs <- rowSums( counts( cds ) )
use <- ( rs > quantile( rs, 0.4 ) )
table( use )
# FALSE  TRUE
# 10000 15243

# Proceed with only the filtered data (15,243 genes)
cdsFilt <- cds[ use, ]
cdsFilt <- estimateDispersions( cdsFilt )

# View dispersion estimates
str( fitInfo( cdsFilt ) )

# Plot filtered dispersions
plot( rowMeans( counts( cdsFilt, normalized=TRUE ) ),
      fitInfo( cdsFilt )$perGeneDispEsts, pch = '.', log="xy" )
xg <- 10^seq( -.10, 10, length.out=300 )
lines( xg, fitInfo( cds )$dispFun( xg ), col="red" )```
If I am going to use this data to test for differential expression and eQTLs, then I think removing the low-expression transcripts, while not strictly necessary, would be helpful, as it reduces the multiple-testing correction burden.
10-19-2016, 05:37 AM #18 smurmu Junior Member   Location: new delhi Join Date: Oct 2016 Posts: 6

Can anyone please help with normalizing RNA-seq data using EDASeq? Thanks