SEQanswers

03-10-2011, 12:50 PM   #1
slny
Member
Location: FL
Join Date: Mar 2011
Posts: 54
RNA seq data normalization question

Hi,

Currently I'm working on mRNA Seq and have a question about data normalization.

If the data is already normalized with RPKM, should I normalize it further, for example with TMM?

Thanks,

slny

03-10-2011, 01:20 PM   #2
bioinfosm
Senior Member
Location: USA
Join Date: Jan 2008
Posts: 482

I am not sure what you mean by TMM.
__________________
--
bioinfosm

03-10-2011, 01:45 PM   #3
Kennels
Senior Member
Location: Sydney
Join Date: Feb 2011
Posts: 149

Hello,

I think that if you computed RPKM first, the values would still carry any RNA library compositional bias that TMM aims to compensate for. So if you want to take the compositional bias into account, perhaps use the scaling factor produced by TMM to adjust the library read counts first, and then compute RPKM? Or just use the edgeR package in its entirety.

Ken

03-11-2011, 04:03 AM   #4
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994

Better tell us what you want to do afterwards with your normalized data. This may influence how you want to normalize.

03-11-2011, 06:01 AM   #5
slny
Member
Location: FL
Join Date: Mar 2011
Posts: 54

Thanks a lot for all the responses.

Currently I have mRNA-seq data for two groups and would like to find differentially expressed genes. I use the countOverlaps function to count the reads for each gene, and then edgeR or DESeq for normalization and differential analysis.

Because the expression level should be the count of reads for each gene divided by the gene length, I wonder whether I should normalize the data with RPKM first and then further normalize the data with TMM in edgeR.

To answer bioinfosm's question: TMM is a normalization method used by the edgeR package. I think it is a kind of global normalization, but I am not sure.

03-11-2011, 06:20 AM   #6
mgogol
Senior Member
Location: Kansas City
Join Date: Mar 2008
Posts: 197

TMM is the trimmed mean of M-values, and it is computed on the counts, not on RPKM values. It's a way to control for samples with different populations of RNA by, in effect, computing a "global fold change" between samples and using a trimmed mean as the scaling factor. If your samples are fairly similar to each other, you might not need it, but if you're worried about different populations of RNAs, TMM normalization might help. You would then use the TMM-normalized read counts to compute differential expression.
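
To make the "global fold change" idea concrete, here is a minimal Python sketch of a simplified TMM-style factor. It is not edgeR's implementation: it trims only on the M-values and takes an unweighted mean, whereas the published method also trims on absolute expression and weights each gene. The function name `simple_tmm_factor` and the toy counts are made up for illustration.

```python
import math

def simple_tmm_factor(sample, reference, trim=0.3):
    """Simplified, unweighted TMM-style scaling factor of `sample` relative to `reference`."""
    n_s, n_r = sum(sample), sum(reference)
    # M-values: log2 ratios of library-size-scaled counts, for genes seen in both samples
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    k = int(len(m_values) * trim)  # number of genes trimmed from each tail
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

# Toy data: identical counts for five genes, one gene hugely up in the sample
reference = [100, 200, 300, 400, 500, 10]
sample = [100, 200, 300, 400, 500, 5000]

factor = simple_tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor
print(factor, effective_library_size)
```

In this toy case the single highly expressed gene inflates the sample's total read count, and the trimmed mean ignores it, so the effective library size comes out equal to the reference library size.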

03-11-2011, 06:29 AM   #7
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994

The normalization methods in DESeq and edgeR are meant to be fed with raw, integer counts. Please do not divide by transcript length before the DE analysis; it will screw up the whole method. For visualization purposes, you may want to divide the normalized counts by transcript length afterwards. (In DESeq, you get normalized counts by dividing the raw counts by the appropriate size factor.) However, think carefully about what to use as transcript length. The original idea of using the sum of all exon lengths was not that good (see, e.g., the Cufflinks paper).
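
As a rough illustration of "dividing the raw counts by the appropriate size factor", here is a small Python sketch of median-of-ratios size factors, the idea behind DESeq's normalization. It is a sketch under assumptions, not the package's code: the function name `size_factors` and the toy count matrix are hypothetical, and genes with a zero count in any sample are simply left out of the estimate.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of genes, each a list of raw integer counts, one per sample."""
    n_samples = len(counts[0])
    # Use only genes with nonzero counts in every sample, so the geometric mean is defined
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    # Size factor of sample j: median over genes of count / geometric mean across samples
    return [
        math.exp(median(math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)))
        for j in range(n_samples)
    ]

# Toy data: sample 2 was sequenced twice as deeply as sample 1
raw = [[10, 20], [100, 200], [50, 100], [0, 3]]
sf = size_factors(raw)
normalized = [[c / f for c, f in zip(row, sf)] for row in raw]
print(sf)
print(normalized)
```

The raw integer counts stay untouched for the DE test itself; the normalized values are only for inspection or plotting.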

03-11-2011, 06:46 AM   #8
slny
Member
Location: FL
Join Date: Mar 2011
Posts: 54

Does TMM consider gene length? If not, how could I adjust the gene expression from the read count for each gene?

03-11-2011, 07:05 AM   #9
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994

Quote:
Originally Posted by slny
Does TMM consider gene length? If not, how could I adjust the gene expression from the read count for each gene?
No, it doesn't, because it doesn't need to.

This is why I asked what you want to do with your data.

If you want to test for differential expression, you want to compare the expression of the same gene in different samples. As the gene has the same length in all your samples, there is no point in dividing by the gene length. You only mask the information on how precise your measurement is.

If you want to compare a gene with another gene, then you may want to divide by gene length, but you should be aware that such a comparison opens a whole new can of worms.
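
A small worked example in Python of the two points above, with made-up numbers: the gene length cancels out of a between-sample fold change, and two genes with the same length-normalized value can have very different raw counts and therefore very different Poisson precision. The helper name `relative_poisson_noise` is hypothetical.

```python
import math

def relative_poisson_noise(count):
    """Coefficient of variation of a Poisson count with this mean: 1 / sqrt(count)."""
    return 1 / math.sqrt(count)

# Same gene in two samples: the length cancels out of the fold change
count_a, count_b, length_kb = 20, 60, 0.5
fold_raw = count_b / count_a
fold_per_kb = (count_b / length_kb) / (count_a / length_kb)

# Two hypothetical genes with the same reads per kilobase but different raw counts
short_gene = {"length_kb": 0.5, "count": 20}
long_gene = {"length_kb": 25.0, "count": 1000}
per_kb_short = short_gene["count"] / short_gene["length_kb"]
per_kb_long = long_gene["count"] / long_gene["length_kb"]

print(fold_raw, fold_per_kb)
print(per_kb_short, per_kb_long)
print(relative_poisson_noise(short_gene["count"]), relative_poisson_noise(long_gene["count"]))
```

After division by length the two genes look identical, but the count of 20 is a much noisier measurement than the count of 1000; that difference is the information a count-based test needs.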

03-11-2011, 07:15 AM   #10
slny
Member
Location: FL
Join Date: Mar 2011
Posts: 54

Perfect explanation. Thanks a lot!

One more question. Should I log transform the count of reads before I normalize the data?

03-11-2011, 09:47 AM   #11
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994

No.

By "normalize", do you mean using DESeq's and edgeR's normalisation methods? They expect raw, integer counts, see above.

Or do you mean dividing by transcript length? This does not make sense on the log scale, for obvious reasons.

03-11-2011, 11:47 AM   #12
slny
Member
Location: FL
Join Date: Mar 2011
Posts: 54

If we use the Poisson distribution or the negative binomial distribution for differential analysis, then we should not log-transform the counts, because these are discrete probability distributions.

Why do we use these discrete probability distributions in sequencing analysis, but the normal distribution in microarray data analysis? Could we log-transform the mRNA-seq data and normalize it with quantile normalization? If so, we could still use a t-test to select differentially expressed genes.

03-12-2011, 01:14 PM   #13
steven
Senior Member
Location: Southern France
Join Date: Aug 2009
Posts: 269

+1: Do not log-transform count data.

03-13-2011, 01:35 AM   #14
A Oshlack
Member
Location: Australia
Join Date: Jun 2010
Posts: 17

Quote:
Originally Posted by slny
Why do we use these discrete probability distributions in sequencing analysis, but the normal distribution in microarray data analysis? Could we log-transform the mRNA-seq data and normalize it with quantile normalization? If so, we could still use a t-test to select differentially expressed genes.
Cloonan et al. (Nature Methods) did exactly what you suggest. However, microarray data is fundamentally different: expression is measured indirectly by the fluorescence of probes, and it seems to be approximately normally distributed on the log scale. For sequencing data this is not the case, i.e. the log of a Poisson-distributed count is not normally distributed. We actually tested the Cloonan method in our simulation for the TMM paper and it performed significantly worse than count-based methods, but I don't think that result made it into the paper.
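
One way to see part of the problem is that the log transform does not stabilize the variance of a Poisson count: low-count genes remain much noisier on the log scale than high-count genes, which is awkward for a t-test-style analysis. The Python sketch below computes the exact variance of log(K + 1) for K ~ Poisson(lam). The pseudocount of 1 and the function name `log_count_variance` are assumptions made for illustration, not something from the Cloonan or TMM papers.

```python
import math

def log_count_variance(lam, pseudocount=1.0):
    """Exact variance of log(K + pseudocount) for K ~ Poisson(lam)."""
    # Truncate the sum far in the upper tail, where the remaining probability is negligible
    upper = int(lam + 20 * math.sqrt(lam) + 20)
    mean = mean_sq = 0.0
    for k in range(upper + 1):
        p = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))  # Poisson pmf via logs
        x = math.log(k + pseudocount)
        mean += p * x
        mean_sq += p * x * x
    return mean_sq - mean * mean

for lam in (2, 20, 200):
    print(lam, round(log_count_variance(lam), 4))
```

For large means the variance on the log scale approaches 1/lam, so it still depends strongly on the expression level.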

One comment on RPKM. In my opinion, one would want to divide by gene length when looking at the absolute expression of a gene, i.e. when comparing between genes rather than between samples. However, to do a proper comparison between genes you really need to take into account other biases, such as sequence composition.
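
For reference, RPKM (reads per kilobase of transcript per million mapped reads) can be written as a one-line Python function. The numbers below are invented, and the sketch deliberately ignores the other biases just mentioned.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

total = 20_000_000  # hypothetical number of mapped reads in the library

# Same read count, different gene lengths: the shorter gene gets the higher RPKM
value_short = rpkm(500, 1_000, total)
value_long = rpkm(500, 4_000, total)
print(value_short, value_long)
```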

03-13-2011, 05:44 AM   #15
pbseq
Member
Location: Italy
Join Date: Feb 2010
Posts: 16

Maybe slightly off topic, but this is related to RNA-seq counting:
I always hear about RPKM, but to me, measuring gene expression by covered bases (and not by number of reads) looks more precise. Base counting instead of read counting is very easy (e.g. with the SeqMonk software), but it is so rarely mentioned that I'm wondering if it's OK for downstream applications.

BTW, for differential expression purposes, I use SeqMonk for harvesting raw data as follows: I select probes of interest (e.g. genes, mRNAs or intergenic regions), I count data by bases (I do not correct for the total number of reads or for gene length, and I don't log-transform), and then I feed the raw data to DESeq or edgeR. Up to now it looks fine to me (at least in my limited experience). Any warnings?
Thanks for any comments!

03-13-2011, 11:53 PM   #16
simonandrews
Simon Andrews
Location: Babraham Inst, Cambridge, UK
Join Date: May 2009
Posts: 871

Quote:
Originally Posted by pbseq
Maybe slightly off topic, but this is related to RNA-seq counting:
I always hear about RPKM, but to me, measuring gene expression by covered bases (and not by number of reads) looks more precise. Base counting instead of read counting is very easy (e.g. with the SeqMonk software), but it is so rarely mentioned that I'm wondering if it's OK for downstream applications.
If your reads are all the same length then counting reads or bases amounts to the same thing. The reason the base counting option in SeqMonk is useful is because in the initial analysis a quantitation is usually carried out for each exon. Spliced reads will be split between exons, so if you simply count reads then a spliced read will be counted twice and thus any spliced reads will exert an undue influence on the quantitation. Doing a base count allows a read to be split proportionally between the exons it overlaps and therefore gives every read the same weight in the quantitation.
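
The difference between the two counting schemes for a spliced read can be sketched in a few lines of Python. This is only an illustration of the principle, not SeqMonk's actual code: the function names, the half-open coordinates, and the choice to express the base count as a fraction of each read are all assumptions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of overlapping bases between two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def quantify(exons, reads):
    """exons: {name: (start, end)}; reads: list of reads, each a list of aligned blocks."""
    read_counts = {name: 0 for name in exons}
    base_fractions = {name: 0.0 for name in exons}
    for blocks in reads:
        read_len = sum(end - start for start, end in blocks)
        for name, (e_start, e_end) in exons.items():
            bases = sum(overlap(s, e, e_start, e_end) for s, e in blocks)
            if bases > 0:
                read_counts[name] += 1                    # whole read counted per exon touched
                base_fractions[name] += bases / read_len  # read split proportionally
    return read_counts, base_fractions

exons = {"exon1": (0, 100), "exon2": (200, 300)}
reads = [
    [(10, 60)],               # unspliced 50-base read inside exon1
    [(75, 100), (200, 225)],  # spliced 50-base read, half in each exon
]
read_counts, base_fractions = quantify(exons, reads)
print(read_counts, base_fractions)
```

With read counting the two reads contribute three counts in total, because the spliced read is counted once per exon; with the proportional base count the contributions sum to exactly two reads.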

03-14-2011, 02:34 AM   #17
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994

Quote:
Originally Posted by pbseq
BTW, for differential expression purposes, I use SeqMonk for harvesting raw data as follows: I select probes of interest (e.g. genes, mRNAs or intergenic regions), I count data by bases (I do not correct for the total number of reads or for gene length, and I don't log-transform), and then I feed the raw data to DESeq or edgeR. Up to now it looks fine to me (at least in my limited experience). Any warnings?
Yes, you get too many false positives. There are two principal sources of noise in RNA-Seq: (a) the actual variation in concentration of a transcript between samples, and (b) the shot noise. Imagine you have two samples with exactly the same concentration of a given transcript. Will you get the same counts? Of course not, because there is still a random element in how many transcripts actually get sequenced. This is called shot noise; it is governed by the Poisson distribution and constitutes the theoretical lower limit of the measurement precision in RNA-Seq. DESeq and edgeR compute the shot noise from the number of reads. (The more reads, the less severe the shot noise is.) If you feed these tools the number of bases instead of the number of reads, they will severely underestimate the within-group variance and call too many hits.
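
The size of that underestimate can be worked out with simple arithmetic. The Python sketch below looks only at the shot-noise part and assumes a fixed read length with every base of every read counted; the read length of 50 and the expected count of 100 are made-up numbers. DESeq and edgeR actually fit negative binomial models, so this covers only the Poisson floor of the variance.

```python
def poisson_variance_assumed(value):
    """Under a Poisson model, the variance equals the mean."""
    return value

read_length = 50      # hypothetical fixed read length
expected_reads = 100  # hypothetical expected number of reads for a gene

# The true shot noise comes from the number of reads
true_var_reads = poisson_variance_assumed(expected_reads)

# Base count = read_length * read count, so its variance scales with read_length squared
expected_bases = read_length * expected_reads
true_var_bases = read_length ** 2 * true_var_reads

# A tool that treats the base counts as Poisson counts assumes variance == mean
assumed_var_bases = poisson_variance_assumed(expected_bases)

underestimate = true_var_bases / assumed_var_bases
print(underestimate)
```

Under these assumptions the shot-noise variance is underestimated by a factor equal to the read length.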

03-15-2011, 06:22 AM   #18
pbseq
Member
Location: Italy
Join Date: Feb 2010
Posts: 16

Simon Andrews and Simon Anders: many thanks for the answers, the issue is much clearer now. Indeed, many tutorials simply use the word "counts", which, as I now understand, is meant to refer solely to "read counts". I also wonder if, in the future, packages and software could implement the option of DEG calculation based on base coverage, which in some cases may prove more accurate!

03-16-2011, 09:59 AM   #19
slny
Member
Location: FL
Join Date: Mar 2011
Posts: 54

Could I simply summarize that RPKM should be used for comparisons between genes, while TMM in edgeR and the normalization in DESeq (sorry, I don't remember the name) should be used for comparisons between samples?

04-10-2011, 09:17 PM   #20
syambmed
Junior Member
Location: Malaysia
Join Date: Mar 2011
Posts: 5
confused newbie

Hi guys,

I have transcriptome data from Illumina and am using CLC Genomics Workbench for data analysis. I am not familiar with other programs for transcriptome analysis. The data are from 1 sample of control cells and 1 sample of treated cells (no replicates for either sample), and I am looking for differentially expressed genes.

The problem is the normalization step. The software offers 3 normalization methods: 1) scaling [option for normalization value = mean or median; baseline = median mean or median median], 2) quantile, 3) total reads per 1 million.

I don't know which one to choose. T_T Help me.

Then there are statistical tests on Gaussian data or on proportions. How do I know which test is suitable for my data? I read that most people use Baggerly's test.

The thing with Baggerly's test (when I explore it in the software) is that the output has both a p-value and a false discovery rate (FDR) corrected p-value. Which one is used for determining differentially expressed genes?


Thank you.
All times are GMT -8.