  • Comparing gene expression of specific genes between samples, datasets, and species

    Our questions might be a bit different than most. As background for our lab, we are not experts in RNA-Seq data and are learning it as we go.

    There have been a number of studies that have sequenced "normal" individuals, and we are interested in using this public data to answer a few simple questions. Specifically, we're interested to find out the gene expression range between "normal" individuals in the population for a very small number of genes (which we are interested in from our wet-lab work). We are not comparing normal against a disease, or anything like that.

    We were wondering if there is a recommended way to normalize this data so that we can compare the gene expression of Gene X in individual A to individual B (letting us ultimately determine the range of expression in Gene X for all individuals).

    We know that just using RPKM values is not the way to go. Housekeeping genes are full of many issues (for instance, there is no guarantee that they are actually stable across individuals). We have looked at quantile normalized values, but will this let you compare between individuals the way we want?

    Naively, we are thinking that percent ranking of gene X against all genes in all individuals might work well. We are thinking this would also let us compare between studies and potentially even between species. But then this would remove the resolution (for instance, the number 2 gene vs. the number 3 gene might actually have a very large difference in expression, even if their "ranks" are nearly identical).
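To make the resolution concern concrete, here is a minimal sketch of percent ranking within one individual. The gene names and expression values are hypothetical, and `percent_rank` is an illustrative helper, not a function from any RNA-Seq package.

```python
def percent_rank(values):
    """Percentile rank (0-100) of each value within one sample.

    Tied values share the mean of the ranks they span.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two genes to rank")
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0  # zero-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return [100.0 * r / (n - 1) for r in ranks]


# Hypothetical expression values for five genes in one individual.
sample = {"geneA": 0.5, "geneB": 1200.0, "geneC": 35.0, "geneD": 1150.0, "geneE": 35.0}
names = list(sample)
pr = dict(zip(names, percent_rank([sample[g] for g in names])))

for gene in names:
    print(gene, sample[gene], pr[gene])
```

In this toy sample geneD sits one rank step below geneB and one rank step above the tied geneC/geneE pair, even though the first gap is 50 expression units and the second is more than 1,000. That is the loss of resolution described above.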

    We've also thought about using "total expression" as the denominator. That is, we would divide expression for gene X / total expression in each individual. I know people have shied away from using this when looking at differential expression analysis, but if we already know what genes we want to know the expression of, we are thinking this method "could work". But like percent rank, we're not sure if there are any limitations that we are missing.
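As a rough illustration of the "total expression" idea, the sketch below divides each gene's count by the sum of all counts in that individual and scales to one million. The counts are hypothetical, and the choice to treat "total expression" as the sum of per-gene read counts is an assumption made for the example.

```python
def counts_per_million(counts):
    """Scale each gene's count by the sample total, expressed per million."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("sample has no counts")
    return {gene: c * 1e6 / total for gene, c in counts.items()}


# Hypothetical counts: geneX has the same raw count in both individuals,
# but geneY is much more abundant in individual B.
individual_a = {"geneX": 100, "geneY": 900}
individual_b = {"geneX": 100, "geneY": 1900}

cpm_a = counts_per_million(individual_a)
cpm_b = counts_per_million(individual_b)

print(cpm_a["geneX"], cpm_b["geneX"])
```

One limitation this exposes: the value for geneX is halved in individual B purely because another gene takes up more of the total, so the result is a relative measure that depends on the composition of the rest of the sample.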

    Does anyone know of a good way to approach this question? We naively thought this would be a simple analysis (i.e. just grab the expression values for each and compare), but as we learn more it seems more complicated than we initially expected.

    We appreciate any insight.

  • #2
    TPM (transcripts per million) would be a good way to go. You haven't mentioned how the public data was processed, but you could use RSEM to generate the values. I would worry about any between-sample normalization throwing away biological variation. Quantile normalization would be a really bad idea. And given that RNA-Seq is a relative measure of abundance, I'm not sure where you are getting "total expression" from. Did you mean total mapped reads? If so, you are halfway to RPKM.


    • #3
      Thank you for your response. After doing a bit more reading, it does seem like TPM is what we're after.

      "Total expression" is something our P.I. thought would be good to normalize against conceptually. And yes, I believe from an RNA-Seq perspective, this would mean total mapped reads.

      The "public" datasets that we've found that we'd like to use have reported their expression figures as RPKM. We have also seen papers that were reporting expression as quantile-normalized RPKM/FPKM values.

      Is it possible to calculate TPM from RPKM/FPKM? I guess for that it would depend on how they calculated RPKM, correct?

      Am I correct in that the main difference between TPM and RPKM/FPKM is the length normalization for the transcript? Naively, then, I would think you could multiply RPKM/FPKM by the length of the transcript and get TPM, right? But then we would have to assume that each RPKM/FPKM value in each experiment is using the same length for the transcript...

      Or am I missing something fundamental there in the formulas for TPM and RPKM/FPKM (which is quite likely)?


      • #4
        As I said, normalizing by "total expression" as you define it is really just the M part of RPKM. There is no simple way to get TPM from RPKM. You could reverse engineer it if you had the original annotation used in the mapping and knew the total mapped reads, I guess.

        You have the "main difference" between TPM and FPKM wrong; you will get a better understanding of it by reading the papers and/or blogs on the subject.

        Practically, and what I would do if in your shoes, is to get my hands on the raw data and remap with RSEM, which will do the work for you.


        • #5
          I think what I was thinking of is CPM (counts per million).


          But I think I am missing something with the relationship between TPM and FPKM...

          If the formulas for each are:

          TPM for any given gene = (count / length of transcript) * (1 / (sum for all genes: count / length of transcript)) * 10^6

          RPKM/FPKM for any given gene = count / ((length of transcript/10^3) * (total number of reads/10^6))

          Shouldn't you be able to go between the two?


          If you have all FPKM values for all genes, shouldn't you be able to get TPM for a given gene by:

          TPM = (FPKM for gene / (sum of all FPKM for all genes)) * 10^6

          This blog seems to confirm that (it references a review by Lior Pachter on transcript quantification methods): https://haroldpimentel.wordpress.com...ression-units/
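A small numerical check of that conversion, as a sketch only. The counts and transcript lengths are hypothetical, the three function names are illustrative, and the "total number of reads" is assumed here to be the sum of the counts supplied.

```python
import math


def fpkm_from_counts(counts, lengths):
    """FPKM per gene, following the RPKM/FPKM formula quoted above.

    Assumption: the total number of reads is the sum of the supplied counts.
    """
    total_reads = sum(counts.values())
    return {
        g: counts[g] / ((lengths[g] / 1e3) * (total_reads / 1e6))
        for g in counts
    }


def tpm_from_counts(counts, lengths):
    """TPM per gene, following the TPM formula quoted above."""
    rates = {g: counts[g] / lengths[g] for g in counts}
    total_rate = sum(rates.values())
    return {g: rates[g] / total_rate * 1e6 for g in counts}


def fpkm_to_tpm(fpkm):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("FPKM values must have a positive sum")
    return {g: v / total * 1e6 for g, v in fpkm.items()}


# Hypothetical counts and transcript lengths (bp) for one sample.
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

tpm_direct = tpm_from_counts(counts, lengths)
tpm_via_fpkm = fpkm_to_tpm(fpkm_from_counts(counts, lengths))

for gene in counts:
    same = math.isclose(tpm_direct[gene], tpm_via_fpkm[gene])
    print(gene, tpm_direct[gene], tpm_via_fpkm[gene], same)
```

Note that the rescaling only works if the FPKM values for every gene in the sample are available; a table restricted to a handful of genes of interest does not give the right denominator.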


          Also, isn't there supposed to be a proportionality constant in the RPKM/FPKM formula? Or is that "cancelled" out in the equation?
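On the cancellation question, taking the two formulas above at face value: write c_i for the count, l_i for the transcript length, and N for the total number of reads (symbols introduced here only for illustration). The only sample-wide factor in FPKM is N, and it drops out once each FPKM is divided by the sum of all FPKM values:

```latex
\begin{align*}
\mathrm{FPKM}_i &= \frac{c_i}{(l_i/10^{3})\,(N/10^{6})} = \frac{10^{9}\,c_i}{l_i\,N} \\
\sum_j \mathrm{FPKM}_j &= \frac{10^{9}}{N}\sum_j \frac{c_j}{l_j} \\
\frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j}\times 10^{6}
  &= \frac{c_i/l_i}{\sum_j c_j/l_j}\times 10^{6} = \mathrm{TPM}_i
\end{align*}
```

So under these definitions no extra constant is needed, provided the sum runs over the same set of genes that the TPM denominator uses.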

          The reason I bring up the proportionality constant is that it has been a main reason people have recommended using TPM instead of FPKM/RPKM:
          Wagner, Kim, and Lynch: http://lynchlab.uchicago.edu/publica...%282012%29.pdf
          Lior Pachter blog article: https://liorpachter.wordpress.com/20...he-supplement/
          Lior Pachter talk: https://www.youtube.com/watch?v=5NiF...tu.be&t=30m30s


          Also, I appreciate your comments about re-analyzing the data. That is something we have thought of. But it also brings up a point about the applicability and usefulness of public data: if you have to download the public raw data yourself and re-analyze it, I think the power of public processed data is a bit lessened (or even lost). This also isn't a trivial thing to do for hundreds of samples (the type of comparisons we'd like to make), especially for a lab that is not really set up for full-fledged RNA-Seq analysis (we generally only do 1-2 RNA-Seq experiments a year).
          Last edited by Hockeymac18; 02-04-2015, 05:02 PM.


          • #6
            Originally posted by Hockeymac18:
            But I think I am missing something with the relationship between TPM and FPKM...

            If the formulas for each are:

            TPM for any given gene = (count / length of transcript) * (1 / (sum for all genes: count / length of transcript)) * 10^6

            RPKM/FPKM for any given gene = count / ((length of transcript/10^3) * (total number of reads/10^6))

            Shouldn't you be able to go between the two?


            If you have all FPKM values for all genes, shouldn't you be able to get TPM for a given gene by:

            TPM = (FPKM for gene / (sum of all FPKM for all genes)) * 10^6

            This blog seems to confirm that (it references a review by Lior Pachter on transcript quantification methods): https://haroldpimentel.wordpress.com...ression-units/
            Well, you learn something new every day. That makes sense. What I meant to say was that there is no constant scaling factor between the two; it differs from sample to sample.
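A toy illustration of the sample-specific factor, as a sketch only. The FPKM values are hypothetical, and `tpm_scaling_factor` is an illustrative name.

```python
def tpm_scaling_factor(fpkm):
    """Factor k such that TPM = k * FPKM for every gene in this one sample."""
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("FPKM values must have a positive sum")
    return 1e6 / total


# Hypothetical FPKM tables for two samples; geneA has the same FPKM in both.
sample_1 = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
sample_2 = {"geneA": 10.0, "geneB": 90.0, "geneC": 100.0}

k1 = tpm_scaling_factor(sample_1)
k2 = tpm_scaling_factor(sample_2)

tpm_a_1 = k1 * sample_1["geneA"]
tpm_a_2 = k2 * sample_2["geneA"]

print(k1, k2)
print(tpm_a_1, tpm_a_2)
```

Because the factor depends on the sum of FPKM values in each sample, identical FPKM values for geneA end up as different TPM values in the two samples.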

            Also, I appreciate your comments about re-analyzing the data. That is something we have thought of. But it also brings up a point about the applicability and usefulness of public data: if you have to download the public raw data yourself and re-analyze it, I think the power of public processed data is a bit lessened (or even lost). This also isn't a trivial thing to do for hundreds of samples (the type of comparisons we'd like to make), especially for a lab that is not really set up for full-fledged RNA-Seq analysis (we generally only do 1-2 RNA-Seq experiments a year).
            Fair enough.


            • #7
              Thank you for your help on the matter. You helped me work through the issues conceptually, and I learned a great deal about RNA-Seq quantification methods along the way.


              • #8
                I would like to say this was a very insightful thread on the TPM vs. RPKM/FPKM question. Thank you for sharing this information.

                As for re-analyzing public data, my suggestion is to create a pipeline and try to offload the processing to a cloud-based platform, then go from there. This will save you time.

                Personally, I did the following:

                NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > Cufflinks2 > Cuffmerge2 > Cuffdiff2 > Offline (cummeRbund)

                and

                NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > edgeR

                and

                NCBI SRA/GEO > EBI-SRA > Trimmomatic > FastQC > Trimmomatic > FastQC > iPlant Collaborative > TopHat2 > DESeq

                This took about 6 to 8 months to accomplish for about 40 samples. It's definitely doable, but it takes a bit of time.

                I would suggest trying iPlant Collaborative's Discovery Environment.

                All the best with your project.


                • #9
                  This has code for going between RPKM and TPM (and also effective counts)

                  This post covers the units used in RNA-Seq that are, unfortunately, often misused and misunderstood. I’ll try to clear up a bit of the confusion here. The first thing one should remember is t…


                  A very nice post.

                  Also when comparing public data, I recommend that you try to correct for batch effects using ComBat or a similar program. Also you might want to convert to log scale before that. Good luck!
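For the log-scale step, a minimal sketch might look like the following. The pseudocount of 1 is an assumption (a common convention to keep zeros finite, not something specified in this thread), the input values are hypothetical, and the batch correction itself is not shown.

```python
import math


def log2_transform(values, pseudocount=1.0):
    """Return log2(value + pseudocount) for each expression value."""
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    return [math.log2(v + pseudocount) for v in values]


# Hypothetical TPM values for one gene across four samples.
tpm_values = [0.0, 1.0, 3.0, 1023.0]
logged = log2_transform(tpm_values)

print(logged)
```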
                  Last edited by kopi-o; 02-05-2015, 04:09 AM.


                  • #10
                    One advantage of starting from raw data and re-normalizing and analyzing yourself is that you can investigate the various original data sets for any potential bias in library size and signal distribution. Your processed data may represent radically different original data sets, and so you may be introducing bias into your meta-analysis by starting from processed data only.

                    That actually, to me, is the whole point of requiring authors to submit raw data - you really need to begin from that if you want to compare across studies. I just would not be comfortable trying to do any real cross-study meta-analysis from processed data.
                    Michael Black, Ph.D.
                    ScitoVation LLC. RTP, N.C.
