Old 04-29-2014, 05:12 AM   #1
Location: Germany

Join Date: Nov 2013
Posts: 20
Which normalization methods are used for RNAseqV2 data at TCGA?


I have downloaded the RNAseqV2 data for BRCA. There are different versions of the expression values inside the RNAseqV2 Level 3 folder, depending on the file extension:
.rsem.isoforms.results: contains raw_count and scaled_estimate
.rsem.genes.normalized_results: contains the normalized count

My question is: what is the difference between the normalized count and the scaled estimate? Which normalization methods were used?
watermark
Old 04-29-2014, 06:02 AM   #2
Senior Member
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

The wiki explains how the data was handled - basically, there are two pipelines that were used, and the file names should tell you which files came out of which pipeline.

Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
mbblack
Old 08-05-2014, 03:52 PM   #3
Junior Member
Location: Oxford

Join Date: Aug 2013
Posts: 2

It took me a while to get my head around this, since the column names in the rsem.genes/isoforms.results files don't match the default output of RSEM, neither the version they claim to have used nor the most current version.

The (first) RSEM paper explains that the program calculates two values. One represents the (estimated) number of reads that aligned to a transcript. This value is not an integer because RSEM can only estimate how many of the ambiguously mapping reads belong to each transcript/gene. This number is what TCGA, slightly misleadingly, calls raw counts.

The scaled estimate value on the other hand is the estimated frequency of the gene/transcript amongst the total number of transcripts that were sequenced. Newer versions of RSEM call this value (multiplied by 1e6) TPM - Transcripts Per Million. It's closely related to FPKM, as explained on the RSEM website. The important point is that TPM, like FPKM, is independent of transcript length, whereas "raw" counts are not!
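To make the relationship above concrete, here is a minimal Python sketch (not TCGA's or RSEM's own code). It assumes scaled_estimate is a fraction that sums to roughly 1 over all genes in a sample, and it uses the standard identity that TPM is FPKM rescaled to sum to one million. The gene IDs and numbers are made up for illustration.

```python
def scaled_estimate_to_tpm(scaled_estimates):
    """Convert per-gene scaled_estimate fractions to TPM (fraction x 1e6)."""
    return {gene: value * 1e6 for gene, value in scaled_estimates.items()}


def fpkm_to_tpm(fpkm):
    """Rescale FPKM values so that they sum to one million (TPM)."""
    total = sum(fpkm.values())
    if total == 0:
        raise ValueError("all FPKM values are zero; cannot rescale")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


# Hypothetical values for one sample; real files have thousands of rows.
sample = {"GENE_A": 0.5, "GENE_B": 0.3, "GENE_C": 0.2}
tpm = scaled_estimate_to_tpm(sample)
tpm_from_fpkm = fpkm_to_tpm({"GENE_A": 1.0, "GENE_B": 3.0})
print(tpm)
print(tpm_from_fpkm)
```

Because both conversions are a simple rescaling within one sample, the ranking of genes inside a sample is the same in all three units.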

The *.normalized_results files, on the other hand, just contain a scaled version of the raw_count column. The values are divided by the sample's 75th percentile and multiplied by 1000. This should make the values a bit more comparable between experiments. The Perl code for this upper-quartile normalisation can be found here.
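The scaling described above can be sketched in a few lines of Python. This is an illustration only, not a port of the TCGA Perl script: the percentile interpolation rule and the skip_zeros option are assumptions, since the post does not say how the script handles ties or zero counts. Gene IDs and counts are invented.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values to take a percentile of")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def upper_quartile_normalize(raw_counts, target=1000.0, skip_zeros=False):
    """Divide each count by the sample's 75th percentile, then multiply by target.

    skip_zeros is a hypothetical option: whether genes with zero counts
    are left out when the percentile is computed.
    """
    values = list(raw_counts.values())
    if skip_zeros:
        values = [v for v in values if v > 0]
    q75 = percentile(values, 75)
    if q75 == 0:
        raise ValueError("75th percentile is zero; cannot scale")
    return {gene: count / q75 * target for gene, count in raw_counts.items()}


# Hypothetical raw_count values for one sample.
counts = {"GENE_A": 0.0, "GENE_B": 10.0, "GENE_C": 20.0,
          "GENE_D": 30.0, "GENE_E": 40.0}
normalized = upper_quartile_normalize(counts)
print(normalized)
```

Note that this rescales counts per sample but does nothing about transcript length, which is the distinction from TPM made above.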

In conclusion, I would strongly recommend using the TPM/scaled_estimate values for all intents and purposes. It seems to me to be the more robust and mathematically sound value.

Hope that helps, best wishes,

benjaminsb
Old 11-22-2014, 03:32 AM   #4
Registered Vendor
Location: San Francisco, CA

Join Date: Mar 2014
Posts: 18


Yep, as Benjamin has pointed out, we have found the data in the *.normalized_results files to be the most robust and comparable across samples and experiments. We've also tested them against values we generate ourselves by applying a standard 75th-percentile normalization to the raw counts, and our normalized values are in very high concordance with the values in *.normalized_results (assessed by Pearson correlation of gene-by-gene value comparisons).

In fact, we chose to import the raw counts into our software platform, GenePool. When users work with the RNA-Seq data in GenePool, they can choose between different normalization methods, one of which is the standard 75th-percentile normalization method.

If you're interested in checking out what we've done to bring TCGA data into GenePool, here are some related posts:

Good luck!

GenePool is making genomics data management, analysis, and sharing easier!

Last edited by GenePool; 11-23-2014 at 09:25 PM.
GenePool
Old 02-07-2016, 10:31 PM   #5
Junior Member
Location: Canada

Join Date: May 2015
Posts: 3

I am interested in TCGA analysis. I would like to see the differences in my gene of interest across multiple groups of lung cancer patients. Do you guys think I should use scaled_estimate or raw_count?
Thank you for your help
dreamer2001
Old 02-08-2016, 04:38 AM   #6
Senior Member
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,080

Originally Posted by dreamer2001
I am interested in TCGA analysis. I would like to see the differences in my gene of interest across multiple groups of lung cancer patients. Do you guys think I should use scaled_estimate or raw_count?
Thank you for your help
If you are interested in checking only one gene (or a few genes), you may want to do that at cBioPortal or at the GenePool site mentioned above (if it is really free).
GenoMax
Old 12-02-2016, 04:57 PM   #7
Junior Member
Location: Canada

Join Date: May 2015
Posts: 3

Hello again,
Sorry guys, I am facing an issue here. I used the scaled estimate from the TCGA data to correlate two genes across 550 patients. One reviewer said I should use the normalized count, as used by cBioPortal. Which one is better? And how can I justify using the scaled estimate over the normalized count? To me the scaled estimate made more sense, so I used it because I could understand how it is generated from the raw count.
Thanks for your help.
dreamer2001
Old 12-15-2016, 11:00 PM   #8
Senior Member
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 166

Scaled estimate and normalised count are similar ways of normalising the reads of each sample. Neither one is better and both are fine. Make a scatterplot of scaled estimate vs. normalised count to show the reviewer that they basically provide the same information and complain that there's no good reason to change your analysis and figures.
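As a numeric companion to the scatterplot suggested above, here is a small Python sketch that computes the Pearson correlation between the two measures for one gene across patients. The values are hypothetical, not real TCGA data, and the function is a plain textbook implementation rather than anything from the TCGA pipeline.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("one of the sequences is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for one gene across five patients.
scaled_estimate = [1.2e-5, 3.4e-5, 0.8e-5, 5.1e-5, 2.0e-5]
normalized_count = [410.0, 1150.0, 300.0, 1700.0, 640.0]

r = pearson(scaled_estimate, normalized_count)
print(f"Pearson r = {r:.3f}")
```

The same paired lists can be passed to any plotting library to draw the scatterplot itself. How close r is to 1 on real data depends on how much the per-sample scaling factors of the two methods differ, so it is worth checking rather than assuming.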
Dario1984

bioinfomatics, ngs analysis, ngs data, tcga
