08-04-2011, 08:38 AM   #1
Senior Member
Location: Germany

Join Date: May 2010
Posts: 149
Clarification of RNA-seq normalization

Hi everybody,

Over the last few days I have read a lot of differing opinions about RNA-seq normalization methods.
To be honest, I'm quite confused at the moment, so I would like to ask for your help in clarifying which normalization method to use, and when.

I'm sure there is no straightforward answer to such a question, but I would really appreciate hearing opposing opinions if that helps explain the problem to other users as well.

As far as I understand, there is no "standard" method for normalizing RNA-seq data.

We have one RNA-seq experiment with only one sample for control and one sample for treatment. Although the results cannot be statistically significant without replicates, I would like to understand how to work with RNA-seq data in general.

We would like to look into both differential expression and differences in splice variants between the two conditions.
I have read opinions about how best to normalize the data for identifying differentially expressed genes and for identifying isoforms.
Apparently these two goals call for different analyses.
The best example of that was the discussion between Simon and lpachter about when to normalize and how, here:

I think it shows how controversial this can be. I found the discussion interesting, though it is quite old and a lot has probably changed since.

RPKM measures the relative level of gene expression between experiments, but apparently some people are against it because of certain biases that it can't compensate for. In the posting above, Simon mentions DESeq (edgeR), which is supposed to work better for differential expression.

So my questions are:
(well I will probably have a lot more, but these are to begin with)

1. Would it be better to normalize the data twice, separately for each of the two goals?

2. Does it make sense to apply one normalization after the other to the same data?

3. Can I rely on cuffdiff/cuffcompare to give me a good estimate of the splice variants, and on DESeq/DEGseq to give me a good estimate of the differentially expressed genes?

I would appreciate any comment or discussion.


frymor
08-05-2011, 12:23 AM   #2
Location: London, UK

Join Date: Jul 2009
Posts: 21

Clearly it is important to follow the assumptions and models within each of the tools you mention.

If you want to compile a simple "table of expression", you can produce RPKMs, fold changes, etc. If, however, you use a specific tool such as edgeR, which has its own methodology for normalizing and estimating differences in expression (bearing in mind that edgeR implements a variety of models, as explained in its manual), then you should give it what it expects, i.e. raw read counts.
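To make the "table of expression" idea concrete, here is a minimal Python sketch of RPKM and a log2 fold change. The gene names, lengths, counts, library sizes and the pseudocount of 1 are all made up for illustration; they are assumptions, not values from this thread.

```python
import math

def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)

def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio; the pseudocount (an assumption) avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))

# Hypothetical toy data: gene -> (length in bp, control count, treatment count)
genes = {
    "geneA": (2000, 100, 400),
    "geneB": (500, 50, 50),
}
total_control = 1_000_000    # mapped reads in the control library (made up)
total_treatment = 2_000_000  # mapped reads in the treatment library (made up)

table = {}
for name, (length, c_ctrl, c_trt) in genes.items():
    r_ctrl = rpkm(c_ctrl, length, total_control)
    r_trt = rpkm(c_trt, length, total_treatment)
    table[name] = (r_ctrl, r_trt, log2_fold_change(r_trt, r_ctrl))

for name, (r_ctrl, r_trt, lfc) in table.items():
    print(f"{name}\t{r_ctrl:.1f}\t{r_trt:.1f}\t{lfc:+.2f}")
```

Note that geneB has identical raw counts in both libraries but its RPKM halves, because the treatment library is twice as deep. A table like this is fine for a quick look, but count-based tools such as edgeR or DESeq should still be given the raw counts, not these values.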

Since these are still early days, lab validation of the results is clearly the key to understanding which tools give you the best answers in the end.
eslondon
08-05-2011, 04:32 AM   #3
Senior Member
Location: Stuttgart, Germany

Join Date: Apr 2010
Posts: 192


You are asking for something of a 'holy grail': how to normalize my data.
In my opinion the most crucial step is to know where your data comes from. DE detection between technical replicates needs a different model from DE detection between biological replicates (Poisson vs. negative binomial; see Marioni et al.). In addition, as mentioned above, every method assumes a different distribution of reads.
RPKM 'just' normalizes for gene length and the total number of reads. It does not correct for biases coming from transcript abundance in the library. Your RPKM values should therefore follow a normal distribution, and gene length should not show a linear correlation with transcript abundance. However, since housekeeping genes contribute a large share of the transcripts, you should also consider an additional normalization, for instance quantile normalization. DESeq (and similar tools) want the raw counts so that they can estimate dispersion and distribution and fit their assumptions to the given data as well as possible. So I would run different analyses (i.e. DESeq as well as an RPKM/fold-change analysis) and compare the results. From that comparison you can figure out, at least roughly, which distribution fits your data best.
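As an illustration of why count-based tools ask for raw counts, here is a rough Python sketch of a median-of-ratios size factor, in the spirit of the approach described for DESeq. It is a simplified reimplementation for illustration only, not the package's code, and the counts and gene names are invented.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: dict gene -> list of raw counts, one entry per sample.

    For each sample, returns the median over genes of
    count / (geometric mean of that gene across samples).
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c == 0 for c in gene_counts):
            continue  # geometric mean would be zero, so skip this gene
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical raw counts for two samples (control, treatment)
raw = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [10, 20],
    "housekeeper": [10000, 60000],  # very abundant and strongly changed
}

factors = size_factors(raw)
normalized = {g: [c / f for c, f in zip(cs, factors)] for g, cs in raw.items()}
print(factors)
print(normalized["geneA"])
```

In this toy example the second sample is scaled as twice as deep as the first, because the median ignores the one very abundant gene. Scaling by total read count would instead be dominated by that gene, which is the kind of abundance bias mentioned above.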
sphil
08-06-2011, 07:07 PM   #4
Senior Member
Location: East Coast, US

Join Date: Jun 2010
Posts: 177

Hi frymor,

You may try different methods, but ultimately you must rely on follow-up experiments to validate the results. Say you try 2-3 analysis methods/models: some DE genes will be identified by all methods and others only by some. You need to validate them with independent methods, e.g. qPCR. The field needs enough validation results to see which method is better suited to a given application.
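One simple way to organize such a comparison is to split the DE calls into genes reported by every method and genes reported by only some. This is a small Python sketch; the method names and gene lists are placeholders, not real results.

```python
from collections import Counter

def consensus(calls_by_method):
    """Split DE genes into those called by all methods and by only some."""
    votes = Counter()
    for genes in calls_by_method.values():
        votes.update(set(genes))  # one vote per method per gene
    n_methods = len(calls_by_method)
    by_all = sorted(g for g, v in votes.items() if v == n_methods)
    by_some = sorted(g for g, v in votes.items() if v < n_methods)
    return by_all, by_some

# Hypothetical DE calls from three analysis methods
calls = {
    "method_1": {"geneA", "geneB", "geneC"},
    "method_2": {"geneA", "geneC", "geneD"},
    "method_3": {"geneA", "geneC"},
}

shared, partial = consensus(calls)
print("called by all methods:", shared)
print("called by some methods:", partial)
```

Picking candidates from both groups for qPCR is one possible design: agreement between methods does not prove a call is correct, and the disagreements are where validation tells the methods apart.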
DZhang

gene expression, normalization, rna-seq
