**mikep** · 02-03-2015, 10:28 PM

TPM (transcripts per million) would be a good way to go. You haven't mentioned how the public data was processed, but you could use RSEM to generate the values. I would worry about any between-sample normalization throwing away biological variation. Quantile normalization would be a really bad idea. Also, given that RNA-seq is a relative measure of abundance, I'm not sure where you are getting "total expression" from. Did you mean total mapped reads? If so, you are halfway to RPKM.

**Hockeymac18** · 02-04-2015, 11:28 AM

Thank you for your response. After doing a bit more reading, it does seem like TPM is what we're after.

"Total expression" as a concept is something our P.I. thought would be good to normalize against conceptually. And yes, I believe from an RNA-Seq perspective, this would mean total mapped reads.

The "public" datasets that we've found that we'd like to use have reported their expression figures as RPKM. We have also seen papers that were reporting expression as quantile-normalized RPKM/FPKM values.

Is it possible to calculate TPM from RPKM/FPKM? I guess for that it would depend on how they calculated RPKM, correct?

Am I correct in that the main difference between TPM and RPKM/FPKM is the length normalization for the transcript? Naively, then, I would think you could multiply RPKM/FPKM by the length of the transcript and get TPM, right? But then we would have to assume that each RPKM/FPKM value in each experiment is using the same length for the transcript...

Or am I missing something fundamental there in the formulas for TPM and RPKM/FPKM (which is quite likely)?

**mikep** · 02-04-2015, 04:12 PM

As I said, normalizing by "total expression" as you define it is really just the M part of RPKM. There is no simple way to get TPM from RPKM. You could reverse engineer it if you had the original annotation used in the mapping and knew the total mapped reads, I guess.

You have the "main difference" between TPM and FPKM wrong; you will get a better understanding of it by reading the papers and/or blogs on the subject.

Practically, and what I would do if in your shoes, is to get my hands on the raw data and remap with RSEM, which will do the work for you.
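The reverse engineering mentioned above can be sketched as follows. This is a minimal illustration, assuming the RPKM values were computed with plain annotated transcript lengths and a known total mapped read count; the function name and all numbers are hypothetical, and tools that use effective lengths would need those lengths instead.

```python
def rpkm_to_counts(rpkm, lengths_bp, total_mapped_reads):
    """Invert RPKM = count / ((length / 1e3) * (N / 1e6)) for each gene.

    rpkm and lengths_bp are dicts keyed by gene; lengths are in base pairs.
    """
    return {
        gene: value * (lengths_bp[gene] / 1e3) * (total_mapped_reads / 1e6)
        for gene, value in rpkm.items()
    }


# Hypothetical values for illustration only.
rpkm = {"geneA": 50.0, "geneB": 12.5}
lengths_bp = {"geneA": 2000, "geneB": 4000}
counts = rpkm_to_counts(rpkm, lengths_bp, total_mapped_reads=20_000_000)
```

The recovered counts are only as good as the lengths and read total supplied, which is why the original annotation matters.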

**Hockeymac18** · 02-04-2015, 04:51 PM

I think what I was thinking of is CPM (counts per million).
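For reference, CPM only rescales counts by library size and does no length normalization. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def cpm(counts):
    """Counts per million: each gene's count as a share of the library, times 1e6."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total count is zero; CPM is undefined")
    return {gene: c / total * 1e6 for gene, c in counts.items()}


# Hypothetical counts for illustration only.
example_cpm = cpm({"geneA": 300, "geneB": 700})
```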

But I think I am missing something with the relationship between TPM and FPKM...

If the formulas for each are:

TPM for any given gene = (count / length of transcript) * (1 / (sum for all genes: count / length of transcript)) * 10^6

RPKM/FPKM for any given gene = count / ((length of transcript/10^3) * (total number of reads/10^6))

Shouldn't you be able to go between the two?

If you have all FPKM values for all genes, shouldn't you be able to get TPM for a given gene by:

TPM = (FPKM for gene / (sum of all FPKM for all genes)) * 10^6

This blog seems to confirm that (which references a review by Lior Pachter on transcript quantification methods): https://haroldpimentel.wordpress.com...ression-units/
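The conversion proposed above can be written as a short function. This is a sketch, assuming the FPKM values cover every gene in the sample (a filtered table would change the sum and therefore the result); the function name and example values are hypothetical.

```python
def fpkm_to_tpm(fpkm):
    """TPM_i = FPKM_i / sum_j(FPKM_j) * 1e6, computed within a single sample."""
    total = sum(fpkm.values())
    if total == 0:
        raise ValueError("sum of FPKM is zero; TPM is undefined")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


# Hypothetical FPKM values for one sample.
tpm = fpkm_to_tpm({"geneA": 10.0, "geneB": 30.0})
```

By construction the resulting values sum to one million within the sample.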

Also, isn't there supposed to be a proportionality constant in the RPKM/FPKM formula? Or is that "cancelled" out in the equation?

The reason I bring up the proportionality constant is that this has been a main reason that people have recommended not using FPKM/RPKM and have instead recommended using TPM:
Wagner, Kim, and Lynch: http://lynchlab.uchicago.edu/publica...%282012%29.pdf
Lior Pachter blog article: https://liorpachter.wordpress.com/20...he-supplement/
Lior Pachter talk: https://www.youtube.com/watch?v=5NiF...tu.be&t=30m30s
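On whether a constant "cancels": a short derivation from the two formulas in this post suggests it does. The symbols are introduced here for illustration only: $c_i$ is the count for gene $i$, $\ell_i$ its transcript length, and $N$ the total number of reads.

$$
\mathrm{FPKM}_i = \frac{c_i}{(\ell_i / 10^3)\,(N / 10^6)} = \frac{10^9}{N} \cdot \frac{c_i}{\ell_i}
$$

$$
\frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j} \cdot 10^6
= \frac{\frac{10^9}{N} \cdot \frac{c_i}{\ell_i}}{\frac{10^9}{N} \sum_j \frac{c_j}{\ell_j}} \cdot 10^6
= \frac{c_i / \ell_i}{\sum_j c_j / \ell_j} \cdot 10^6
= \mathrm{TPM}_i
$$

The factor $10^9 / N$ is the same for every gene in a sample, so it drops out of the ratio. Any other sample-wide constant multiplying FPKM would cancel in the same way.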

Also, I appreciate your comments about re-analyzing the data. That is something we have thought of. But it also raises a point about the usefulness of public data: if you have to download the raw data and re-analyze it yourself, the value of publicly available processed data is lessened (or even lost). This also isn't a trivial thing to do for hundreds of samples (the scale of comparison we'd like to make), especially for a lab that is not really set up for full-fledged RNA-Seq analysis (we generally only do 1-2 RNA-Seq experiments a year).

**mikep** · 02-04-2015, 05:51 PM

> Originally posted by Hockeymac18:
>
> But I think I am missing something with the relationship between TPM and FPKM...
>
> If the formulas for each are:
>
> TPM for any given gene = (count / length of transcript) * (1 / (sum for all genes: count / length of transcript)) * 10^6
>
> RPKM/FPKM for any given gene = count / ((length of transcript/10^3) * (total number of reads/10^6))
>
> Shouldn't you be able to go between the two?
>
> If you have all FPKM values for all genes, shouldn't you be able to get TPM for a given gene by:
>
> TPM = (FPKM for gene / (sum of all FPKM for all genes)) * 10^6
>
> This blog seems to confirm that (which references a review by Lior Pachter on transcript quantification methods): https://haroldpimentel.wordpress.com...ression-units/

Well, you learn something new every day. That makes sense. What I meant to say was that there is no constant scaling factor between the two; it differs from sample to sample.
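The sample dependence can be illustrated with a small sketch. The function name and the two toy samples are hypothetical; the only assumption is the relation TPM = FPKM / (sum of FPKM) * 10^6 discussed in this thread.

```python
def tpm_scaling_factor(fpkm):
    """Return k such that TPM_i = k * FPKM_i for every gene in this sample."""
    return 1e6 / sum(fpkm.values())


# Two hypothetical samples in which geneA has the same FPKM.
sample_1 = {"geneA": 10.0, "geneB": 30.0}
sample_2 = {"geneA": 10.0, "geneB": 90.0}

k1 = tpm_scaling_factor(sample_1)
k2 = tpm_scaling_factor(sample_2)

# Same FPKM for geneA, but different TPM, because k differs between samples.
gene_a_tpm = (k1 * sample_1["geneA"], k2 * sample_2["geneA"])
```

This is why equal FPKM values in two datasets do not imply equal TPM values.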

> Also, I appreciate your comments about re-analyzing the data. That is something we have thought of. But it also raises a point about the usefulness of public data: if you have to download the raw data and re-analyze it yourself, the value of publicly available processed data is lessened (or even lost). This also isn't a trivial thing to do for hundreds of samples (the scale of comparison we'd like to make), especially for a lab that is not really set up for full-fledged RNA-Seq analysis (we generally only do 1-2 RNA-Seq experiments a year).

Fair enough.

**Hockeymac18** · 02-04-2015, 05:53 PM

Thank you for your help on the matter. You helped me work through the issues conceptually, and I learned a great deal about RNA-Seq quantification methods along the way.

**Zapages** · 02-04-2015, 06:32 PM

I would like to say this was a very insightful topic regarding the TPM vs. RPKM/FPKM question. Thank you for sharing this information.

As for re-analyzing public data, my suggestion is to create a pipeline and offload the processing to a cloud-based platform, then go from there. This will save you time.

Personally, I did the following:

NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > Cufflinks2 > Cuffmerge2 > Cuffdiff2 > Offline (cummeRbund)

and

NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > edgeR

and

NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > DESeq

This took about 6 to 8 months for about 40 samples. It's definitely doable, but it takes a bit of time.

I would suggest trying iPlant Collaborative's Discovery Environment.

All the best with your project.

**kopi-o** · 02-05-2015, 03:07 AM

This has code for going between RPKM and TPM (and also effective counts)

What the FPKM? A review of RNA-Seq expression units

https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

A very nice post.

Also, when comparing public data, I recommend that you try to correct for batch effects using ComBat or a similar program. You might also want to convert to log scale before that. Good luck!
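The log-scale step suggested above can be sketched as follows. The pseudocount of 1 is an assumption made here to keep zero values defined, not something specified in the thread; the function name and values are hypothetical, and the batch correction itself is not shown.

```python
import math


def log2_transform(expression, pseudocount=1.0):
    """Return log2(value + pseudocount) for each gene in one sample."""
    return {
        gene: math.log2(value + pseudocount)
        for gene, value in expression.items()
    }


# Hypothetical TPM values for illustration only.
log_tpm = log2_transform({"geneA": 0.0, "geneB": 7.0, "geneC": 1023.0})
```

The transformed values would then be the input to whichever batch-correction tool is chosen.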

**mbblack** · 02-05-2015, 07:36 AM

One advantage of starting from raw data and re-normalizing and analyzing yourself is that you can investigate the various original data sets for any potential bias in library size and signal distribution. Your processed data may represent radically different original data sets, and so you may be introducing bias into your meta-analysis by starting from processed data only.

That actually, to me, is the whole point of requiring authors to submit raw data - you really need to begin from that if you want to compare across studies. I just would not be comfortable trying to do any real cross-study meta-analysis from processed data.

Comparing gene expression of specific genes between samples, datasets, and species