SEQanswers

Old 02-03-2015, 01:13 PM   #1
Hockeymac18
Junior Member
 
Location: CA - Stanford

Join Date: Feb 2015
Posts: 4
Comparing gene expression of specific genes between samples, datasets, and species

Our questions might be a bit different than most. As background for our lab, we are not experts in RNA-Seq data and are learning it as we go.

There have been a number of studies that have sequenced "normal" individuals, and we are interested in using this public data to answer a few simple questions. Specifically, we're interested to find out the gene expression range between "normal" individuals in the population for a very small number of genes (which we are interested in from our wet-lab work). We are not comparing normal against a disease, or anything like that.

We were wondering if there is a recommended way to normalize this data so that we can compare the gene expression of Gene X in individual A to individual B (letting us ultimately determine the range of expression in Gene X for all individuals).

We know that just using RPKM values is not the way to go. Housekeeping genes are full of many issues (for instance, there is no guarantee that they are actually stable across individuals). We have looked at quantile normalized values, but will this let you compare between individuals the way we want?

Naively, we are thinking that percent ranking of gene X against all genes in all individuals might work well. We are thinking this would also let us compare between studies and potentially even between species. But then this would remove the resolution (for instance, the number 2 gene vs. the number 3 gene might actually have a very large difference in expression, even if their "ranks" are nearly identical).
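To make the resolution concern concrete, here is a minimal Python sketch. The gene names and values are made up for illustration, and `percent_rank` is a hypothetical helper, not part of any RNA-Seq package. It shows that adjacent ranks are always equally spaced, however large the gap in the underlying expression values.

```python
def percent_rank(expression, gene):
    """Fraction of the other genes in one sample expressed strictly below `gene`.

    Assumes at least two genes in `expression`.
    """
    target = expression[gene]
    below = sum(1 for value in expression.values() if value < target)
    return below / (len(expression) - 1)


# Hypothetical expression values for one individual (arbitrary units).
sample = {"geneA": 1000.0, "geneB": 10.0, "geneC": 9.0, "geneD": 1.0}

rank_a = percent_rank(sample, "geneA")  # top gene
rank_b = percent_rank(sample, "geneB")  # 100-fold lower than geneA
rank_c = percent_rank(sample, "geneC")  # almost the same as geneB

# The rank gap A-B equals the rank gap B-C, even though the
# expression gaps differ by orders of magnitude.
print(rank_a, rank_b, rank_c)
```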

We've also thought about using "total expression" as the denominator. That is, we would divide expression for gene X / total expression in each individual. I know people have shied away from using this when looking at differential expression analysis, but if we already know what genes we want to know the expression of, we are thinking this method "could work". But like percent rank, we're not sure if there are any limitations that we are missing.

Does anyone know of a good way to approach this question? We naively thought this would be a simple analysis (i.e. just grab the expression values for each and compare), but as we learn more it seems more complicated than we initially expected.

We appreciate any insight.
Old 02-03-2015, 10:28 PM   #2
mikep
Member
 
Location: Singapore

Join Date: Feb 2011
Posts: 45

TPM (transcripts per million) would be a good way to go. You haven't mentioned how the public data was processed, but you could use RSEM to generate the values. I would worry about any between-sample normalization throwing away biological variation. Quantile normalization would be a really bad idea. And given that RNA-seq is a relative measure of abundance, I'm not sure where you are getting "total expression" from. Did you mean total mapped reads? If so, you are halfway to RPKM.
Old 02-04-2015, 11:28 AM   #3
Hockeymac18
Junior Member
 
Location: CA - Stanford

Join Date: Feb 2015
Posts: 4

Quote:
Originally Posted by mikep View Post
TPM (transcripts per million) would be a good way to go. You haven't mentioned how the public data was processed, but you could use RSEM to generate the values. I would worry about any between-sample normalization throwing away biological variation. Quantile normalization would be a really bad idea. And given that RNA-seq is a relative measure of abundance, I'm not sure where you are getting "total expression" from. Did you mean total mapped reads? If so, you are halfway to RPKM.
Thank you for your response. After doing a bit more reading, it does seem like TPM is what we're after.

"Total expression" as a concept is something our P.I. thought would be good to normalize against conceptually. And yes, I believe from an RNA-Seq perspective, this would mean total mapped reads.

The "public" datasets that we've found that we'd like to use have reported their expression figures as RPKM. We have also seen papers that were reporting expression as quantile-normalized RPKM/FPKM values.

Is it possible to calculate TPM from RPKM/FPKM? I guess for that it would depend on how they calculated RPKM, correct?

Am I correct in that the main difference between TPM and RPKM/FPKM is the length normalization for the transcript? Naively, then, I would think you could multiply RPKM/FPKM by the length of the transcript and get TPM, right? But then we would have to assume that each RPKM/FPKM value in each experiment is using the same length for the transcript...

Or am I missing something fundamental there in the formulas for TPM and RPKM/FPKM (which is quite likely)?
Old 02-04-2015, 04:12 PM   #4
mikep
Member
 
Location: Singapore

Join Date: Feb 2011
Posts: 45

As I said, normalizing by "total expression" as you define it is really just the M part of RPKM. There is no simple way to get TPM from RPKM. You could reverse engineer it if you had the original annotation used in the mapping and knew the total mapped reads, I guess.

You have the "main difference" between TPM and FPKM wrong; you will get a better understanding of it by reading the papers and/or blogs on the subject.

Practically, and what I would do if in your shoes, is to get my hands on the raw data and remap with RSEM, which will do the work for you.
Old 02-04-2015, 04:51 PM   #5
Hockeymac18
Junior Member
 
Location: CA - Stanford

Join Date: Feb 2015
Posts: 4

Quote:
Originally Posted by mikep View Post
As I said, normalizing by "total expression" as you define it is really just the M part of RPKM. There is no simple way to get TPM from RPKM. You could reverse engineer it if you had the original annotation used in the mapping and knew the total mapped reads, I guess.

You have the "main difference" between TPM and FPKM wrong; you will get a better understanding of it by reading the papers and/or blogs on the subject.

Practically, and what I would do if in your shoes, is to get my hands on the raw data and remap with RSEM, which will do the work for you.
I think what I was thinking of is CPM (counts per million).
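For reference, CPM can be sketched in a few lines of Python. The gene names and counts are invented, and `cpm` is a hypothetical helper. It also assumes the library size is the sum of the counts listed, which is a simplification.

```python
def cpm(counts):
    """Counts per million: each gene's count divided by the library size, times 1e6.

    Assumes the library size is the sum of the counts passed in.
    """
    total = sum(counts.values())
    return {gene: count / total * 1e6 for gene, count in counts.items()}


# Hypothetical read counts for one sample.
counts = {"geneA": 500, "geneB": 300, "geneC": 200}
cpm_values = cpm(counts)
print(cpm_values)
```

Note that CPM has no length term, so it is not comparable between genes of different lengths.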


But I think I am missing something with the relationship between TPM and FPKM...

If the formulas for each are:

TPM for any given gene = (count / length of transcript) * (1 / (sum for all genes: count / length of transcript)) * 10^6

RPKM/FPKM for any given gene = count / ((length of transcript/10^3) * (total number of reads/10^6))
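Written out as a minimal Python sketch, the two formulas above look like this. The counts and lengths are made up, the function names are hypothetical, and "total number of reads" is assumed here to be the sum of the counts passed in (real pipelines may define it differently).

```python
def fpkm(counts, lengths):
    """count / ((length / 1e3) * (total reads / 1e6)) for each gene."""
    total_reads = sum(counts.values())  # assumption: total = sum of listed counts
    return {
        gene: counts[gene] / ((lengths[gene] / 1e3) * (total_reads / 1e6))
        for gene in counts
    }


def tpm(counts, lengths):
    """(count / length) divided by the sum of (count / length) over all genes, times 1e6."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rates[gene] / total_rate * 1e6 for gene in counts}


# Hypothetical sample: equal counts, but geneB is four times longer.
counts = {"geneA": 100, "geneB": 100}
lengths = {"geneA": 1000, "geneB": 4000}

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(fpkm_values, tpm_values)
```

Both units apply the same length normalization; what differs is the denominator used to scale each sample.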

Shouldn't you be able to go between the two?


If you have all FPKM values for all genes, shouldn't you be able to get TPM for a given gene by:

TPM = (FPKM for gene / (sum of all FPKM for all genes)) * 10^6
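As a minimal Python sketch of that conversion (the FPKM values are invented and `fpkm_to_tpm` is a hypothetical helper), assuming you have the FPKM values for all genes in the sample and they have not been further transformed:

```python
def fpkm_to_tpm(fpkm_values):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(fpkm_values.values())
    return {gene: value / total * 1e6 for gene, value in fpkm_values.items()}


# Hypothetical FPKM values covering every gene in one sample.
fpkm_values = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
tpm_values = fpkm_to_tpm(fpkm_values)
print(tpm_values)
```

The sum has to run over the complete gene set of that sample. Applying it to a subset of genes would give values that are not TPM.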

This blog seems to confirm that (which references a review by Lior Pachter on transcript quantification methods): https://haroldpimentel.wordpress.com...ression-units/


Also, isn't there supposed to be a proportionality constant in the RPKM/FPKM formula? Or is that "cancelled" out in the equation?

The reason I bring up the proportionality constant is that this has been a main reason that people have recommended not using FPKM/RPKM and have instead recommended using TPM:
Wagner, Kim, and Lynch: http://lynchlab.uchicago.edu/publica...%282012%29.pdf
Lior Pachter blog article: https://liorpachter.wordpress.com/20...he-supplement/
Lior Pachter talk: https://www.youtube.com/watch?v=5NiF...tu.be&t=30m30s


Also, I appreciate your comments about re-analyzing the data. That is something we have thought of. But it also brings up the point about the applicability and usefulness of public data: if you have to download public raw data yourself and re-analyze it, I think the power of public processed data is a bit lessened (or even lost). This also isn't a trivial thing to do for hundreds of samples (the type of comparisons we'd like to make), especially for a lab that is not really set up for full-fledged RNA-Seq analysis (we generally only do 1-2 RNA-Seq experiments a year).

Last edited by Hockeymac18; 02-04-2015 at 05:02 PM.
Old 02-04-2015, 05:51 PM   #6
mikep
Member
 
Location: Singapore

Join Date: Feb 2011
Posts: 45

Quote:
Originally Posted by Hockeymac18 View Post
But I think I am missing something with the relationship between TPM and FPKM...

If the formulas for each are:

TPM for any given gene = (count / length of transcript) * (1 / (sum for all genes: count / length of transcript)) * 10^6

RPKM/FPKM for any given gene = count / ((length of transcript/10^3) * (total number of reads/10^6))

Shouldn't you be able to go between the two?


If you have all FPKM values for all genes, shouldn't you be able to get TPM for a given gene by:

TPM = (FPKM for gene / (sum of all FPKM for all genes)) * 10^6

This blog seems to confirm that (which references a review by Lior Pachter on transcript quantification methods): https://haroldpimentel.wordpress.com...ression-units/
Well, you learn something new every day. That makes sense. What I meant to say was there is no constant scaling factor between the two, it differs according to the samples.
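To illustrate the point about the scaling factor, here is a small Python sketch with invented FPKM values (`tpm_scaling_factor` is a hypothetical helper). The factor that turns FPKM into TPM is 10^6 divided by the sum of FPKM in that sample, so it changes from sample to sample.

```python
def tpm_scaling_factor(fpkm_values):
    """Per-sample factor such that TPM = FPKM * factor."""
    return 1e6 / sum(fpkm_values.values())


# Two hypothetical samples in which geneA has the same FPKM.
sample_1 = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
sample_2 = {"geneA": 10.0, "geneB": 30.0, "geneC": 160.0}

factor_1 = tpm_scaling_factor(sample_1)
factor_2 = tpm_scaling_factor(sample_2)

# Same FPKM for geneA, but different TPM, because the factors differ.
tpm_a_1 = sample_1["geneA"] * factor_1
tpm_a_2 = sample_2["geneA"] * factor_2
print(factor_1, factor_2, tpm_a_1, tpm_a_2)
```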

Quote:
Also, I appreciate your comments about re-analyzing the data. That is something we have thought of. But it also brings up the point about the applicability and usefulness of public data: if you have to download public raw data yourself and re-analyze it, I think the power of public processed data is a bit lessened (or even lost). This also isn't a trivial thing to do for hundreds of samples (the type of comparisons we'd like to make), especially for a lab that is not really set up for full-fledged RNA-Seq analysis (we generally only do 1-2 RNA-Seq experiments a year).
Fair enuff
Old 02-04-2015, 05:53 PM   #7
Hockeymac18
Junior Member
 
Location: CA - Stanford

Join Date: Feb 2015
Posts: 4

Quote:
Originally Posted by mikep View Post
Well, you learn something new every day. That makes sense. What I meant to say was there is no constant scaling factor between the two, it differs according to the samples.



Fair enuff
Thank you for your help on the matter. You helped me work through the issues conceptually, and I learned a great deal about RNA-Seq quantification methods along the way.
Old 02-04-2015, 06:32 PM   #8
Zapages
Member
 
Location: NJ

Join Date: Oct 2012
Posts: 97

I would like to say this was a very insightful topic regarding the TPM vs. RPKM/FPKM situation. Thank you for sharing this information.

As for re-analyzing public data, my suggestion is to create a pipeline and offload the processing to a cloud-based platform, then go from there. This will save you time.

Personally, I did the following:

NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > Cufflinks2 > Cuffmerge2 > Cuffdiff2 > Offline (CummeRbund)

and

NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > edgeR

and

NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > DESeq

This took about 6 to 8 months for about 40 samples. It's definitely doable, but takes a bit of time.

I would suggest trying iPlant Collaborative's Discovery Environment.

All the best with your project.
Old 02-05-2015, 03:07 AM   #9
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319

This has code for going between RPKM and TPM (and also effective counts)

https://haroldpimentel.wordpress.com...ression-units/

A very nice post.

When comparing public data, I also recommend that you try to correct for batch effects using ComBat or a similar program. You might want to convert to log scale before that. Good luck!
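The log-scale step can be sketched in Python as below. The values are invented, `log2_transform` is a hypothetical helper, and the pseudocount of 1 is an assumption (a common choice to avoid taking the log of zero), not something specified in this thread. The batch correction itself is not shown here.

```python
import math


def log2_transform(values, pseudocount=1.0):
    """Return log2(value + pseudocount) for each gene.

    The pseudocount keeps genes with zero expression finite; 1.0 is an assumed default.
    """
    return {gene: math.log2(value + pseudocount) for gene, value in values.items()}


# Hypothetical TPM values for one sample.
tpm_values = {"geneA": 0.0, "geneB": 3.0, "geneC": 1023.0}
log_values = log2_transform(tpm_values)
print(log_values)
```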

Last edited by kopi-o; 02-05-2015 at 04:09 AM.
Old 02-05-2015, 07:36 AM   #10
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

One advantage of starting from raw data and re-normalizing and analyzing yourself is that you can investigate the various original data sets for any potential bias in library size and signal distribution. Your processed data may represent radically different original data sets, and so you may be introducing bias into your meta-analysis by starting from processed data only.

That actually, to me, is the whole point of requiring authors to submit raw data - you really need to begin from that if you want to compare across studies. I just would not be comfortable trying to do any real cross-study meta-analysis from processed data.
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.