Old 07-19-2010, 11:14 AM   #1
xhuister
Member
 
Location: China

Join Date: Apr 2010
Posts: 41
Default How to get normalized count from DESeq?

Hi all,

I have several NGS libraries and want to use DESeq to normalize them so that the counts are comparable between libraries. How can I get the normalized counts (ideally as integers rather than floating-point values)?

In the DESeq manual, there is code like this:
cds <- newCountDataSet( countsTable, conds )
cds <- estimateSizeFactors( cds )
cds <- estimateVarianceFunctions( cds )
res <- nbinomTest( cds, "T", "N")

And about the normalization: I have samples 1, 2 and 3, each with only one replicate. If I want to compare samples 1 and 2, should I normalize between 1 and 2 only, and then, to compare 2 and 3, normalize again between 2 and 3? Or can I normalize samples 1, 2 and 3 together and then compare 1 vs 2 or 2 vs 3?


Thanks!
xhuister is offline   Reply With Quote
Old 07-25-2010, 11:08 PM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

You can divide your counts by the size factors:

normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )

Now, the samples are on a common scale.

Of course, these are not integer values. However, I cannot quite see why you would want to have integer values.
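If they are needed anyway, a minimal sketch, assuming a DESeq version in which counts() accepts a normalized argument:

Code:
# same values as dividing the raw counts by the size factors by hand
normalizedCounts <- counts( cds, normalized=TRUE )
# rounded to integers; note these are no longer raw counts and should not
# be fed back into nbinomTest()
roundedCounts <- round( normalizedCounts )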

Simon
Simon Anders is offline   Reply With Quote
Old 07-26-2010, 05:21 AM   #3
xhuister
Member
 
Location: China

Join Date: Apr 2010
Posts: 41
Default

Quote:
Originally Posted by Simon Anders View Post
You can divide your counts by the size factors:

normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )

Now, the samples are on a common scale.

Of course, these are not integer values. However, I cannot quite see why you would want to have integer values.

Simon
Thanks Simon, actually float values are OK too, or I can round them to integers.
xhuister is offline   Reply With Quote
Old 06-20-2011, 03:15 PM   #4
yifangt
Member
 
Location: Canada

Join Date: Feb 2011
Posts: 61
Default Normalized counts on a common scale!

Hi, Simon:

I have a question related to your last post about normalized counts on a common scale. My RNA-seq data consist of seven conditions, each with two replicates; it is a time-series experiment on seed embryo development. I was wondering whether I could get normalized counts for all the developmental stages on a common scale, in order to compare expression across the whole time course.
I tried your code, but the results are on a similar scale only within the two replicates of each stage; between stages they are still quite different. Here are the normalized count sums for each library (the same letter indicates the same stage; 1 and 2 indicate replicates).
Code:
Z1, Z2, O1, O2, G1, G2, H1, H2, T1, T2, B1, B2, M1, M2
25581381.8, 26292834.7, 14023619.9, 12161472.3, 18581478.1, 18749724.2,	10861736.7, 11585558.2, 9996853.0, 10318368.5, 14130401.4, 16239989.6, 30725888.0, 27787233.7
And the original reads sums for each library are:
Code:
Z1, Z2, O1, O2, G1, G2, H1, H2, T1, T2, B1, B2, M1, M2
21012147, 19924212, 26002900, 9660245, 17139388, 7649319, 16430105, 20101956, 12920266, 6306742, 44241095, 20094409, 15166090, 23203758
I assumed the normalized sums for the different stages should also be close to each other, as they are for the replicates within a stage. I am not sure which function DESeq uses to achieve this. edgeR provides common.lib.size for this purpose, and I assume edgeR and DESeq share the same idea. Am I right?
Actually, I tried edgeR's common.lib.size, but I never got similar numbers across the different stages, which is another reason I am posting this problem here.

Did I miss anything, e.g. are only two conditions allowed?
Since most experiments involve only two conditions, I also tried a subset of my data with just two conditions, and the result looked the same. Is there anything wrong with my data? The raw count numbers look fine, and other analyses indicate the data are fine.

I tried different normalization methods (TMM, quantile and RLE in edgeR), and also using the sum of each condition without replicates, but still could not get comparable count totals for each stage on a common scale. I have posted this question to the Bioconductor group but have not yet received a reply from the edgeR group. I would appreciate it if you could address this point.
Thank you!

Yifang

Quote:
Originally Posted by Simon Anders View Post
You can divide your counts by the size factors:

normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )

Now, the samples are on a common scale.

Of course, these are not integer values. However, I cannot quite see why you would want to have integer values.

Simon

Last edited by yifangt; 06-20-2011 at 06:02 PM.
yifangt is offline   Reply With Quote
Old 06-20-2011, 11:38 PM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

If you suspect that the normalization might be off, the best approach is to compare two samples with an MA plot:

Code:
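# MA-style comparison of samples 1 and 2: x = mean of the two raw counts per gene,
# y = ratio of the counts; the abline marks the ratio of the two size factors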
plot( 
   ( counts(cds)[,1] + counts(cds)[,2] )/2, 
   counts(cds)[,2] / counts(cds)[,1], 
   log="xy", pch="." )
abline( h = sizeFactors(cds)[2] / sizeFactors(cds)[1] )
If the size factors are fine, the horizontal line should bisect the bulk of the points, especially at the high-expression end (right-hand side) of the plot.
Simon Anders is offline   Reply With Quote
Old 06-21-2011, 07:53 AM   #6
yifangt
Member
 
Location: Canada

Join Date: Feb 2011
Posts: 61
Default Re: Normalized counts between treatments.

Thanks Simon!

My suspicion concerns the normalization between treatments rather than within a treatment. As you suggested, I made the plot for a cds built from the two replicates of the same treatment (i.e. within treatment, left image of the attachment) and for one replicate each of two different treatments (between treatments, right image). The within-treatment plot looks fine, but the between-treatment one does not. Is there anything wrong? This has been bugging me for a couple of days!

Thanks again!

Yifang
Attached Images
File Type: png MAplot.png (9.2 KB, 263 views)
yifangt is offline   Reply With Quote
Old 06-21-2011, 09:24 AM   #7
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

It's really hard to see how many points are plotted on top of each other in your plots. Maybe try adding col="#00000040" to the plot command to add some transparency, or use the smoothScatter function. Apart from that, it looks fine to me: the horizontal line cuts through the middle of the big blob. Of course, you have more genes at the far bottom (outside the blob) than at the far top, but probably your treatment simply causes more downregulation than upregulation.
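A minimal sketch of both suggestions, assuming the same cds object as in the earlier plot (genes with zero counts give non-finite values and are dropped; the transparency needs a graphics device that supports alpha, e.g. png or pdf):

Code:
m <- ( counts(cds)[,1] + counts(cds)[,2] ) / 2   # mean count per gene
r <- counts(cds)[,2] / counts(cds)[,1]           # per-gene count ratio
# semi-transparent points make the overplotting visible
plot( m, r, log="xy", pch=16, cex=0.3, col="#00000040" )
abline( h = sizeFactors(cds)[2] / sizeFactors(cds)[1], col="red" )
# alternative: density-based scatter plot
ok <- m > 0 & is.finite(r) & r > 0
smoothScatter( log10(m[ok]), log10(r[ok]) )
abline( h = log10( sizeFactors(cds)[2] / sizeFactors(cds)[1] ), col="red" )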
Simon Anders is offline   Reply With Quote
Old 06-21-2011, 01:21 PM   #8
yifangt
Member
 
Location: Canada

Join Date: Feb 2011
Posts: 61
Default Re: Normalized counts between treatments.

Quote:
Originally Posted by Simon Anders View Post
It's really hard to see how many points are plotted on top of each other in your plots. Maybe try adding col="#00000040" to the plot command to add some transparency, or use the smoothScatter function. Apart from that, it looks fine to me: the horizontal line cuts through the middle of the big blob. Of course, you have more genes at the far bottom (outside the blob) than at the far top, but probably your treatment simply causes more downregulation than upregulation.
Thanks Simon! This is a big relief to me.
After I added the transparency, the dots are concentrated around the line. However, my colleague argues that the total counts for each library should come out close to each other; edgeR states several times that "the total counts in each library of the pseudocounts agrees well with the common library size" (pages 27 & 44 of the user's guide). But I cannot see whether DESeq has a similar parameter behind the scenes.
When I check the sums of baseMeanA and baseMeanB as normalized counts for each condition, the numbers are still quite different, which is the same as the result from edgeR. So:
1) Do my data violate the assumptions of both packages?
2) Does it matter if the normalized library sizes of the conditions are different?
3) Is the result still meaningful even if the normalized library sizes are different?
4) What could be the reason(s) for the normalized library sizes being so different?
5) Should I remove genes with low read counts, as some other people do? After I removed the low counts (<=40 in >=6 out of 14 samples), the normalized library sizes became very similar.
I can feel that I lack the mathematical background behind these packages. Your explanation would be greatly appreciated. Thank you again!
Yifang
Attached Images
File Type: jpg MAplot.jpg (13.5 KB, 158 views)
yifangt is offline   Reply With Quote
Old 07-03-2011, 05:02 AM   #9
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by yifangt View Post
However, my colleague argues that the total counts for each library should come out close to each other; edgeR states several times that "the total counts in each library of the pseudocounts agrees well with the common library size" (pages 27 & 44 of the user's guide).
If we expected the proper normalization factors to be close to the ratios of the total read numbers, we could use those ratios directly. The whole point of not using them, and instead estimating the size factors from the per-gene counts, is that the two sometimes disagree, and then the estimate from the per-gene counts is more reliable. Robinson and Oshlack have written a whole paper arguing this point, so maybe you want to ask your sceptical colleagues to read it. ;-)
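For reference, a rough sketch of what estimateSizeFactors does (a median-of-ratios estimate against a pseudo-reference sample, written out by hand; this is not the exact package code, and countsTable stands for the raw count matrix):

Code:
# log geometric mean of each gene across all samples (the pseudo-reference)
loggeomeans <- rowMeans( log( countsTable ) )
# per sample: median ratio to the pseudo-reference, over genes with no zero counts
sf <- apply( countsTable, 2, function(x)
    exp( median( ( log(x) - loggeomeans )[ is.finite(loggeomeans) ] ) ) )
# compare with the ratios of the total read counts
round( sf / mean(sf), 2 )
round( colSums(countsTable) / mean(colSums(countsTable)), 2 )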
Simon Anders is offline   Reply With Quote
Old 12-31-2011, 01:13 PM   #10
xguo
Member
 
Location: Maryland

Join Date: Jul 2008
Posts: 48
Default

Can we get the normalized count for each exon using the same approach in the newly released DEXSeq package? In the visualization plot generated by DEXSeq, normalized counts are drawn for each exon. Are those values derived by dividing the raw counts by the size factors?

Thanks for your reply.

Quote:
Originally Posted by Simon Anders View Post
If we expected the proper normalization factors to be close to the ratios of the total read numbers, we could use those ratios directly. The whole point of not using them, and instead estimating the size factors from the per-gene counts, is that the two sometimes disagree, and then the estimate from the per-gene counts is more reliable. Robinson and Oshlack have written a whole paper arguing this point, so maybe you want to ask your sceptical colleagues to read it. ;-)
xguo is offline   Reply With Quote
Old 01-01-2012, 01:54 AM   #11
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Hi xguo,

Yes, they are derived in the same way. In DEXSeq, you can get them using the parameter normalized=TRUE in the function 'counts':

counts(ecs, normalized=TRUE)

Cheers!
areyes is offline   Reply With Quote
Old 10-16-2012, 12:41 PM   #12
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

I have a question about normalization and downscaling:

I have 7 samples with the following numbers of aligned reads, based on samtools flagstat:
Sample 1a: 50M, Sample 1b: 45M, Sample 2a: 95M, Sample 2b: 45M, Sample 2c: 55M, Sample 3a: 115M, Sample 3b: 30M

When I run cds <- estimateSizeFactors(cds) and look at sizeFactors(cds), I get the following:
Sample 1a: 0.8, Sample 1b: 0.7, Sample 2a: 1.55, Sample 2b: 0.75, Sample 2c: 0.90, Sample 3a: 1.92, Sample 3b: 0.9

The last sample of condition 3 has a higher size factor than I think it should. At a workshop it was suggested to downscale some of the data so that all samples have roughly the same number of reads. Is that what I should do? Do these data look peculiar?
billstevens is offline   Reply With Quote
Old 10-16-2012, 12:51 PM   #13
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

And what did they mean by "downscaling"?
Simon Anders is offline   Reply With Quote
Old 10-20-2012, 11:30 AM   #14
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

Hi Simon, I wasn't sure exactly how they meant to do this, but the point was essentially to discard some reads so that all conditions end up with roughly the same number of reads before running DESeq. Have you not encountered this technique? Do you think it is a problem that one of my samples (Sample 3a) has 4X the depth of another (Sample 3b), while the size factors that DESeq outputs indicate only a twofold difference?
billstevens is offline   Reply With Quote
Old 11-20-2014, 06:46 PM   #15
bpb9
Member
 
Location: NYC

Join Date: Aug 2012
Posts: 24
Default

When I run DESeq and look at my dispersion estimates versus normalized expression, it seems that most genes have either very low or very high read counts even after normalization. I am normalizing data that consist of three time points (50 samples per time point). This looks very different from the dispersion plot in the manual. Is this a case where filtering out genes with very low read counts would be a good idea? If not, is it worth normalizing each time point separately? When I compare two samples with an MA plot, it looks fairly normal (though about 6000 data points are omitted).
dispersion:
sample compare: http://tinypic.com/r/aadgxz/8

Edit: I ran it using cdsFilt and it looks much more like the example in the manual:

Last edited by bpb9; 11-20-2014 at 08:23 PM. Reason: additional image, revised analysis
bpb9 is offline   Reply With Quote
Old 11-21-2014, 12:45 AM   #16
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

If you have m samples, then the maximum possible coefficient of variation is reached if a gene has nonzero counts in only one sample and zero counts in the other (m-1) samples. In that case, you will see a dispersion value close to m. So all these points in the top left of your plot are likely genes that appear in only one or very few samples and are absent in most others. If you look at a scatter plot of one sample versus another, most of these genes will simply vanish in the bottom-left zero-zero corner, which is why you didn't see them there.

I don't know what "cdsFilt" is, but removing these genes did make the plot look more normal. However, note that if you zoomed into the region with mean > 10 in your previous plot, it would look much the same as the new one. So your results shouldn't change much, i.e. the filtering might not even have been necessary.
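A quick numerical check of that bound, with hypothetical counts and 150 samples (three time points of 50 samples each, as above):

Code:
m <- 150
x <- c( 100, rep( 0, m - 1 ) )      # gene counted in only one of the m samples
var(x) / mean(x)^2                  # squared coefficient of variation: exactly m
( var(x) - mean(x) ) / mean(x)^2    # method-of-moments dispersion: just below m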
Simon Anders is offline   Reply With Quote
Old 11-21-2014, 05:38 AM   #17
bpb9
Member
 
Location: NYC

Join Date: Aug 2012
Posts: 24
Default

Hi Simon, by cdsFilt I mean I did the following:

cdsFilt:

rs <- rowSums ( counts ( cds ))
use <- (rs > quantile(rs, 0.4))
table(use)
# output:
# FALSE  TRUE
# 10000 15243
#Proceed with only filtered data (15,243 genes)
cdsFilt <- cds[ use, ]

cdsFilt <- estimateDispersions( cdsFilt )

#View dispersion estimates
str( fitInfo(cdsFilt) )

#Plot filtered dispersions
plot(
rowMeans( counts( cdsFilt, normalized=TRUE ) ),
fitInfo(cdsFilt)$perGeneDispEsts,
pch = '.', log="xy"
)
xg <- 10^seq( -.10, 10, length.out=300 )
lines( xg, fitInfo(cdsFilt)$dispFun( xg ), col="red" )

If I am going to use these data to test for differential expression and eQTLs, then I think removing the low-expression transcripts, while not strictly necessary, would be helpful, as it reduces the burden of multiple-testing correction.
bpb9 is offline   Reply With Quote
Old 10-19-2016, 05:37 AM   #18
smurmu
Junior Member
 
Location: new delhi

Join Date: Oct 2016
Posts: 6
Default

Can anyone please help with normalizing RNA-seq data using EDASeq?
Thanks
smurmu is offline   Reply With Quote