SEQanswers

Old 04-29-2014, 04:12 AM   #1
watermark
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Which normalization methods are used for RNAseqV2 data at TCGA?

Hi,

I have downloaded the RNAseqV2 data for BRCA. There are different versions of the expression values inside the RNAseqV2 Level 3 folder, depending on the file extension:
.rsem.isoforms.results: raw_count and scaled_estimate
.rsem.genes.normalized_results: normalized count

My question is: what is the difference between the normalized count and the scaled estimate? Which normalization methods were used?
Old 04-29-2014, 05:02 AM   #2
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

The wiki explains how the data was handled - basically, there are two pipelines that were used, and the file names should tell you which files came out of which pipeline.

see https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
Old 08-05-2014, 02:52 PM   #3
benjaminsb
Junior Member
 
Location: Oxford

Join Date: Aug 2013
Posts: 2

It took me a while to get my head around this, since the column names in the rsem.genes/isoforms.results files don't match the default output of RSEM, neither the version they claim to have used nor the most current version.

The (first) RSEM paper explains that the program calculates two values. One represents the (estimated) number of reads that aligned to a transcript. This value is not an integer because RSEM only reports an estimate of how many ambiguously mapping reads belong to a transcript/gene. This number is what the TCGA slightly misleadingly calls raw counts.

The scaled estimate value, on the other hand, is the estimated frequency of the gene/transcript among the total number of transcripts that were sequenced. Newer versions of RSEM call this value (multiplied by 1e6) TPM - Transcripts Per Million. It's closely related to FPKM, as explained on the RSEM website. The important point is that TPM, like FPKM, is normalised for transcript length, whereas "raw" counts are not!
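
To make the relationship concrete, here is a minimal Python sketch. It assumes that scaled_estimate is a per-sample fraction summing to about 1, and it uses the textbook TPM definition (count divided by length, rescaled to one million) rather than RSEM's actual estimation procedure. The function names and all numbers are made up for illustration.

```python
def scaled_estimate_to_tpm(scaled_estimates):
    """Convert per-sample fractions (summing to ~1) into TPM."""
    return [s * 1e6 for s in scaled_estimates]


def counts_to_tpm(counts, lengths):
    """Textbook TPM: length-normalise each count, then rescale to one million.

    This is only an illustration of length normalisation; RSEM itself
    estimates abundances with a statistical model.
    """
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Hypothetical fractions for four transcripts in one sample.
tpm_from_fraction = scaled_estimate_to_tpm([0.5, 0.25, 0.125, 0.125])

# Two hypothetical transcripts with equal abundance: the longer one collects
# twice as many reads, yet both end up with the same TPM.
tpm_from_counts = counts_to_tpm(counts=[100, 200], lengths=[1000, 2000])
```

In the second example the "raw" counts differ by a factor of two purely because of transcript length, while the TPM values are identical.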

The *.normalized_results files, on the other hand, just contain a scaled version of the raw_count column: the values are divided by the 75th percentile and multiplied by 1000. This should make the values a bit more comparable between experiments. The Perl code for this upper-quartile normalisation can be found here.
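
A rough Python sketch of that scaling step, as described above, might look like this. It is not the TCGA Perl script: the percentile interpolation method and the treatment of zero counts are assumptions, and the exclude_zeros switch is hypothetical.

```python
import statistics


def upper_quartile_normalize(raw_counts, scale=1000.0, exclude_zeros=False):
    """Divide each count by the sample's 75th percentile, then multiply by `scale`.

    `exclude_zeros` is a hypothetical switch: the thread does not say whether
    zero counts enter the percentile calculation.
    """
    if exclude_zeros:
        reference = [c for c in raw_counts if c > 0]
    else:
        reference = list(raw_counts)
    # quantiles(n=4) returns the three quartile cut points; index 2 is the
    # 75th percentile. The "inclusive" interpolation method is an assumption.
    q75 = statistics.quantiles(reference, n=4, method="inclusive")[2]
    if q75 == 0:
        raise ValueError("75th percentile is zero; cannot normalise this sample")
    return [c / q75 * scale for c in raw_counts]


# Made-up counts for five genes in one sample.
sample_counts = [0.0, 10.0, 20.0, 30.0, 40.0]
normalized = upper_quartile_normalize(sample_counts)
```

After this step the gene sitting at the 75th percentile of each sample has the value 1000, which is what makes samples of different sequencing depth roughly comparable.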

In conclusion, I would strongly recommend using the TPM/scaled_estimate values for all intents and purposes. It seems to me to be the more robust and mathematically sound value.

Hope that helps, best wishes,

Benjamin
Old 11-22-2014, 02:32 AM   #4
GenePool
Registered Vendor
 
Location: San Francisco, CA

Join Date: Mar 2014
Posts: 18

Hi,

Yep, as Benjamin has pointed out, we have found the data in the *.normalized_results files to be the most robust and comparable across samples and experiments. We've also done some testing against values we generate using a standard 75th-percentile normalization approach on the raw counts, and we find our normalized values and the values presented in *.normalized_results to be in very high concordance (assessed by Pearson correlation of gene-by-gene value comparisons).
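
For anyone who wants to repeat this kind of check, a concordance test along these lines can be sketched in plain Python. The two value lists are invented placeholders, not GenePool or TCGA data, and the pearson helper is written out by hand for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for the same five genes in one sample:
# our own normalisation vs. the published normalized values.
in_house = [0.0, 333.3, 666.7, 1000.0, 1333.3]
published = [0.0, 333.3333, 666.6667, 1000.0, 1333.3333]
r = pearson(in_house, published)
```

A value of r very close to 1 indicates that the two sets of normalized values agree gene by gene up to a linear rescaling.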

In fact, we chose to import the raw counts into our software platform, GenePool. When users work with the RNA-Seq data in GenePool, they can choose among different normalization methods, one of which is the standard 75th-percentile normalization method.

If you're interested in checking out what we've done to bring TCGA data into GenePool, here are some related posts:

http://seqanswers.com/forums/showthread.php?t=48485
http://seqanswers.com/forums/showthread.php?t=42471

Good luck!

------------------------------
GenePool is making genomics data management, analysis, and sharing easier!
Products @ www.stationxinc.com

Last edited by GenePool; 11-23-2014 at 08:25 PM.
Old 02-07-2016, 09:31 PM   #5
dreamer2001
Junior Member
 
Location: Canada

Join Date: May 2015
Posts: 3

Hello
I am interested in TCGA analysis. I would like to see the differences in my gene of interest in multiple groups of lung cancer patients. Do you guys think I should use scaled_estimate or raw_count?
Thank you for your help
Old 02-08-2016, 03:38 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,439

Quote:
Originally Posted by dreamer2001
Hello
I am interested in TCGA analysis. I would like to see the differences in my gene of interest in multiple groups of lung cancer patients. Do you guys think I should use scaled_estimate or raw_count?
Thank you for your help
If you are interested in checking only one (or a few) genes, then you may want to do that at cBioPortal (http://www.cbioportal.org/) or the GenePool site mentioned above (if it is really free).
Old 12-02-2016, 03:57 PM   #7
dreamer2001
Junior Member
 
Location: Canada

Join Date: May 2015
Posts: 3

Hello again,
Sorry guys, I am facing an issue here. I used the scaled estimate from the TCGA data to correlate two genes across 550 patients. One reviewer said I should use the normalized count, as used by cBioPortal. Which one is better? And how can I justify using the scaled estimate over the normalized count? To me the scaled estimate made more sense, so I used it because I could understand how it is generated from the raw count.
Thanks for your help.
Old 12-15-2016, 10:00 PM   #8
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 159

Scaled estimate and normalised count are similar ways of normalising the reads of each sample. Neither one is better and both are fine. Make a scatterplot of scaled estimate vs. normalised count to show the reviewer that they basically provide the same information and complain that there's no good reason to change your analysis and figures.
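
As a numerical companion to the scatterplot, one could also report a rank correlation between the two measures for the gene in question. This swaps in Spearman's rho, which is not mentioned above; it only asks whether both measures order the patients the same way. The sketch below is plain Python with hypothetical helper names and invented values for five patients.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Invented values for one gene across five patients.
scaled_estimate = [1.2e-5, 3.4e-5, 0.8e-5, 5.0e-5, 2.1e-5]
normalized_count = [410.0, 1250.0, 300.0, 1600.0, 700.0]
rho = spearman(scaled_estimate, normalized_count)
```

If rho is close to 1 on the real data, both measures rank the patients almost identically, which supports the argument that the choice between them does not change the conclusion.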
Tags
bioinfomatics, ngs analysis, ngs data, tcga
