SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
RNA seq data normalization question slny Bioinformatics 35 10-19-2016 05:32 AM
RNA Seq normalization harshinamdar Bioinformatics 39 03-16-2013 01:12 AM
RNA-Seq: GC-Content Normalization for RNA-Seq Data. Newsbot! Literature Watch 0 12-20-2011 02:00 AM
RNA-seq library DSN normalization megalodon RNA Sequencing 2 09-01-2011 01:44 AM
clarification of rna-seq normalization frymor Bioinformatics 3 08-06-2011 06:07 PM

Reply
 
Thread Tools
Old 09-22-2008, 12:15 AM   #1
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default RNA-seq and normalization numbers

In many of the experiments our lab is doing with Illumina reads, we always seem to end up with the task of normalizing data.
If I have 3 experimental conditions, I've sequenced a lane for each and there needs to be a way to compare counts of my known RNA in mirbase to my sequence reads mapped to the genome (with MAQ / novoalign).

I've read about people doing counts as reads per million and log transforming these values to fit Poisson distribution, but it's sprung multiple ideas in my mind. Would this be as simple as dividing my counts for each experiment by

1) 1 Million
2) the total number of reads sequenced
3) the total number of uniquely mapped reads


I'm inclined to option (3) because that represents the amount of usable sequence data.

I'm just wondering if anybody has a more intelligent way of tackling this problem with nextgen data or perhaps there's some software to help out.

I have :

a) Alignment locations of all reads on a ref. genome for each experiment
b) Location of my reference RNAs on the same genome

I am already able to count the number of overlapping locations with each reference RNA in each experiment, and that gives me raw counts.

I have about 4 experiments, but this varies from study to study.
zee is offline   Reply With Quote
Old 09-22-2008, 02:11 AM   #2
Chema
Junior Member
 
Location: Poznan, Poland

Join Date: Jul 2008
Posts: 6
Default

What we are doing is to apply kind of quantile normalization. After to obtain a score for each miRNA , since we are expecting that most of the miRNA have similar expression in both tissues at study, we normalize the score values to have similar distribution.

The main reason to apply this kind of normalization is that we were very worry about the different coverage of the different samples at study.
Chema is offline   Reply With Quote
Old 09-22-2008, 06:46 AM   #3
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,352
Default

http://www.ncbi.nlm.nih.gov/pubmed/18516045

This paper introduces a concept RPKM for quantifying tags in RNA-seq. Might be worth a look if you haven't seen it already.
ECO is offline   Reply With Quote
Old 09-23-2008, 04:26 AM   #4
Chema
Junior Member
 
Location: Poznan, Poland

Join Date: Jul 2008
Posts: 6
Default

Hi,

The point with RPKM that I do not like, it is that I do not feel that it can handle different coverages. Perhaps I can explain it better through an example.

Let say that we are working with a genome with three genes A, B and C with the same length (I know not very realist, but just an example), and we want to study their expression in two conditions 1 and 2.

The real expression of the genes is:
Condition 1 Condition 2
Gene A 1 1
Gene B 1 1
Gene C 1 0

We run a RNA-seq experiment and we get the next number of reads
Condition 1 Condition 2
Gene A 3*10^5 4.5*10^5
Gene B 3*10^5 4.5*10^5
Gene C 3*10^5 0
Total 9*10^5 9*10^5

Translate to RPKM , since they have the same length, it should be something like:
Condition 1 Condition 2
Gene A 333333 5*10^5
Gene B 333333 5*10^5
Gene C 333333 0

As you can see, it seems that Gene A and B are also differentially express. This is because, since the expression of gene C is lower in condition 2 than 1, we have more reads that will improve of the coverage of the other genes.

Anyway, I think that always it is nice to normalize the data in some way. Mainly, when you are working with so low number of replicates.

Last edited by Chema; 09-23-2008 at 04:30 AM. Reason: change format
Chema is offline   Reply With Quote
Old 10-07-2008, 08:46 AM   #5
vruotti
Member
 
Location: US

Join Date: Feb 2008
Posts: 13
Default RPKM concern?

Hi Chema,

We have used Wold's (RPMK) method and got very good results. Are you saying that if a particular gene has less coverage (coverage=fewer number of mapped reads) this gene will contribute to a change regarding the accuracy of detecting the expression of the other genes? I think the this is the whole point of normalizing. The difference is that the overall expression should not be too different. Also, can you please check your numbers for us? Are they really 333,333 or 333,000,000 for genes a, b and c in condition 1 after converting to RPKM?

Maybe I'm doing this wrong. The formula I see in their paper is:

RPKM = 10^9 x C / NL, which is really just simply C/N

C= the number of mappable reads that felt onto the gene's exons
N= total number of mappable reads in the experiment
L= the sum of the exons in base pairs.

So, let's plug in your numbers.
Condition 1 Condition 2
Gene A 3*10^5 4.5*10^5
Gene B 3*10^5 4.5*10^5
Gene C 3*10^5 0
Total 9*10^5 9*10^5

Translate to RPKM , since they have the same length, it should be something like:
Condition 1 Condition 2
Gene A 333,000,000 500,000,000
Gene B 333,000,000 500,000,000
Gene C 333,000,000 500,000,000

If you look at these numbers you could argue whether the two expression values are differentially expressed. They are not that far apart. Sorry, did I miss your point? Can you explain your concern again?

Victor

Last edited by vruotti; 10-08-2008 at 06:38 AM.
vruotti is offline   Reply With Quote
Old 10-10-2008, 06:45 AM   #6
Chema
Junior Member
 
Location: Poznan, Poland

Join Date: Jul 2008
Posts: 6
Default

Hi Victor,

What I wanted to say is that if you have a group of genes that are less expressed in condition 2 than 1, then, you will have less reads that represent these genes in condition 2 than 1, and therefore the genes that are not in this group will be represented for more reads in condition 2 than in condition 1 (not necessarily, because they are differentially expressed). But I guess this problem will be only important when you have a high number of genes differentially expressed.

Yes, you are right the numbers are as you said. Only:

Translate to RPKM , since they have the same length, it should be something like:
Condition 1 Condition 2
Gene A 333,000,000 500,000,000
Gene B 333,000,000 500,000,000
Gene C 333,000,000 0

About if the difference in gene A and B are significant. Well, I didn’t generate any noise for these data, this is a theorical example, they are not estimations if not the real values of the variable. So, if they are different is because the method will declare it as differentially expressed.

Any way, if you want to see a bigger difference, just run an example with more genes that are downregulated in condition 2. You will see that the diference for gene A and B increase.
Chema is offline   Reply With Quote
Old 10-11-2008, 07:07 AM   #7
vruotti
Member
 
Location: US

Join Date: Feb 2008
Posts: 13
Default normalization

Hi Chema,
Yeah I see what you mean. I guess most of the conditions we ran so far do not have the a huge amount of genes downregulated in comparison with the rest of the genes studied. We are doing whole genome RNA-seq studies and I think the house keeping genes do help avoid this problem.

In case having a huge number of genes that are indeed different, then maybe we have to come up with a slight modification of the RPKM formula/method to cope with this elevated expression or lack of. Maybe include a coverage based on the estimated level of expression variable.

Does anybody else used the RPKM method? Can you share with us the different normalization methods you might use for doing RNA-seq outside of RPKM?

Victor
vruotti is offline   Reply With Quote
Old 11-12-2008, 08:04 AM   #8
fabio25
Member
 
Location: italy

Join Date: Aug 2008
Posts: 13
Default

Dear everybody,
I would be very interested in asking how you fix the miRNA targets.
just use mirBASE or have you a much nicer way?
Thanks a lot,
Fabio
fabio25 is offline   Reply With Quote
Old 02-27-2009, 05:01 AM   #9
doxologist
Member
 
Location: USA

Join Date: Jan 2009
Posts: 96
Default

Hi Victor:

It seems that from your problem, looking at percentage of the tags a certain gene occupies would solve the problem... and ALSO like suggested, quantile normalization (assuming that the differentially regulated genes do not make up a large percentage of your genes).
doxologist is offline   Reply With Quote
Old 01-15-2010, 02:55 AM   #10
steffenp
Junior Member
 
Location: Germany

Join Date: Dec 2009
Posts: 2
Default RPKM for Tag-Seq data

Hello everyone!
My question is also related to the RPKM normalization. Is this measure suited for Tag-Seq. data, where we have only reads at the 3'end and not all over the whole transcript. My suggestion is no. What would be an adequate normalization for this kind of data?
steffenp is offline   Reply With Quote
Old 04-02-2010, 06:42 PM   #11
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
Default

Quote:
The point with RPKM that I do not like, it is that I do not feel that it can handle different coverages.
I totally agree with you Chema. RPKM is biased for testing differential expression for longer genes. See the following paper:

HTML Code:
http://www.biology-direct.com/content/4/1/14
and also other papers by Oshlack and Wakefield.
RockChalkJayhawk is offline   Reply With Quote
Old 04-03-2010, 12:55 AM   #12
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 990
Default

If you want to test for differential expression, it is a good idea to stay on the level of raw, integer counts, and not use RPKM or related data that is normalized by transcript length. This is because significance depends on the number of actual reads that you count. If you have low count you need to see a high fold-change to call significance.

See this thread for more details: http://seqanswers.com/forums/showthread.php?t=4349 (especially from post #6 onwards)

If you work with count data, your testing procedure needs to be aware of the ratios of sequencing depths of the libraries. This functionality is offered by several tools, namely edgeR, DESeq, and cuffdiff. I recommend DESeq, of course, as it is our work. ;-)

Note that this does not alleviate the bias towards longer genes: If two genes have the same expressions (same number of transcript molecules per volume) in two samples and hence the same fold change, the longer one may be called significant and the shorter one not, because the longer one produces more fragments.

If this bothers you, you have a couple of options:
- use Tag-Seq instead of RNA-Seq
- additionally filter with with a rather large threshold on log fold change
- for GO enrichment test and the like, use the test by Young et al. (2010), which takes the length bias into account: http://genomebiology.com/2010/11/2/R14

Here is a figure, that shows, how the log fold change required for significance (red dots: genes with significant DE; black dots: other genes) depends on the counts when using DESeq for testing:



For more information, see our paper: http://dx.doi.org/10.1038/npre.2010.4282.1

Last edited by Simon Anders; 04-03-2010 at 12:57 AM.
Simon Anders is offline   Reply With Quote
Old 04-03-2010, 01:05 AM   #13
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 990
Default

To estimate the library size, simply taking the total number of (mapped or unmapped) reads is, in our experience, not a good idea.

Sometimes, a few very strongly expressed genes are differentially expressed, and as they make up a good part of the total counts, they skew this number. After you divide by total counts, these few strongly expressed genes become equal, and the whole rest looks differentially expressed.

The following simple alternative works much better:

- Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples.

- To get the sequencing depth of a sample relative to the reference, calculate for each gene the quotient of the counts in your sample divided by the counts of the reference sample. Now you have, for each gene, an estimate of the depth ratio.

- Simply take the median of all the quotients to get the relative depth of the library.

This is what the 'estimateSizeFactors' function of our DESeq package does.
Simon Anders is offline   Reply With Quote
Old 04-04-2010, 08:52 PM   #14
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default Re: Bowtie-Tophat SAM output for read count assembly

Quote:
Originally Posted by Simon Anders View Post
If you want to test for differential expression, it is a good idea to stay on the level of raw, integer counts, and not use RPKM or related data that is normalized by transcript length. This is because significance depends on the number of actual reads that you count. If you have low count you need to see a high fold-change to call significance.

See this thread for more details: http://seqanswers.com/forums/showthread.php?t=4349 (especially from post #6 onwards)

If you work with count data, your testing procedure needs to be aware of the ratios of sequencing depths of the libraries. This functionality is offered by several tools, namely edgeR, DESeq, and cuffdiff. I recommend DESeq, of course, as it is our work. ;-)
Hi Simon
I have used Bowtie, Tophat and Cufflinks to align and assemble maize RNA-seq data. Cufflinks reports FPKM values which our statistics collaborator is not able to use as it has already been normalized. Can I use the Bowtie, Tophat generated 'SAM' output in some other program to assemble the data in way that I will have absolute read counts per gene or transcript? If so what other programs would you recommend?

thanks
Siva
Siva is offline   Reply With Quote
Old 04-05-2010, 07:50 AM   #15
lpachter
Member
 
Location: Berkeley, cA

Join Date: Feb 2010
Posts: 40
Default

Dear Siva,

I'd like to understand why your statistics collaborator cannot use the Cufflinks FPKM values together with their confidence intervals?

Regarding absolute counts, the whole point of Cufflinks is that it is not possible to obtain absolute read counts per transcript, because for many reads there is ambiguity as to which transcript they belong to. Cufflinks is probabilistically assigning reads to transcripts and thereby able to estimate expression of individual transcripts.
lpachter is offline   Reply With Quote
Old 04-05-2010, 08:23 AM   #16
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
Default

Quote:
Originally Posted by lpachter View Post
Regarding absolute counts, the whole point of Cufflinks is that it is not possible to obtain absolute read counts per transcript, because for many reads there is ambiguity as to which transcript they belong to. Cufflinks is probabilistically assigning reads to transcripts and thereby able to estimate expression of individual transcripts.
Why not just take the FPKM values categorized by cufflinks for each individual transcripts and multiply them by transcript length and number of aligned sequences? That would give you the tag count per transcript.
RockChalkJayhawk is offline   Reply With Quote
Old 04-05-2010, 08:58 AM   #17
lpachter
Member
 
Location: Berkeley, cA

Join Date: Feb 2010
Posts: 40
Default

Thats correct- the procedure RCJ suggests will give you an estimate of the actual tag count for each transcript.
lpachter is offline   Reply With Quote
Old 04-05-2010, 07:32 PM   #18
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

Quote:
Originally Posted by lpachter View Post
Dear Siva,

I'd like to understand why your statistics collaborator cannot use the Cufflinks FPKM values together with their confidence intervals?

Regarding absolute counts, the whole point of Cufflinks is that it is not possible to obtain absolute read counts per transcript, because for many reads there is ambiguity as to which transcript they belong to. Cufflinks is probabilistically assigning reads to transcripts and thereby able to estimate expression of individual transcripts.
Dear L Pachter
Thank you very much for your reply thanks also to RCJ. I understand your point about probabilistically assigning reads to transcripts. I will get back to our statistical group about this. However the procedure suggested by RCJ and seconded by you has me a bit confused. I used to think that the algorithm that Cufflinks uses assigns reads based on probability and it will be different for each transcript and also vary according to genome location. So how can a mere multiplication of FPKM by transcript length and aligned sequences give us tag count per transcript? Am I missing something too obvious here?

thanks
Siva
Siva is offline   Reply With Quote
Old 04-06-2010, 04:46 AM   #19
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
Default

To answer your question, see this page <a href="http://cufflinks.cbcb.umd.edu/howitworks.html"> here</a>
Quote:
To estimate isoform-level abundances, one must assign fragments to individual transcripts, which may be difficult because a read may align to multiple isoforms of the same gene. Cufflinks uses a statistical model of paired-end sequencing experiments to derive a likelihood for the abundances of a set of transcripts given a set of fragments.
Therefore, cufflinks is using probabilities to assign individual fragments to individual transcripts. Since the units of the counts are in fragments per KB transcript per million reads, you can convert them back to raw integers by the multiplying the length of a given transcript and number of reads in millions.
RockChalkJayhawk is offline   Reply With Quote
Old 04-06-2010, 05:08 AM   #20
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
Default

Unfortunately I just noticed that when cufflinks calls transcripts, it doesn't report thier length. Only in the tmap reference files. All the other files only show the genomic coordinates, which isn't as helpful.

I will e-mail the author to see if this functional;ity can be easily added or worked around.
RockChalkJayhawk is offline   Reply With Quote
Reply

Tags
normalization, rna-seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:19 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO