  • Transcriptome similarity metric?

    Can anyone think of a clever way to compare transcriptomes? Say, for example, I have a transcriptome (in the form of RPKM values for a number of transcripts) that I consider a gold standard, plus two experimental transcriptomes, and I want to know which of the two is more similar to the gold standard.

    Should I just calculate Pearson's correlation coefficient over the genes for which I have RPKM data above some threshold? Or set unexpressed transcripts to zero and run the same comparison? Any other ideas?

    Thanks,

    ucpete

  • #2
    I tend to use Spearman correlation rather than Pearson, because the latter can be (too) strongly affected by occasional outliers with large FPKM/RPKM values.

    The paper at http://www.biomedcentral.com/1471-2164/12/293 argues for using something called the Kappa statistic instead, but I haven't tried it yet.
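
A minimal sketch of the outlier effect described in this reply, in plain Python with no external libraries. The RPKM values are made up for illustration, and `pearson`, `average_ranks`, and `spearman` are helper names introduced here, not functions from any package.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up RPKM values. The first five transcripts are anti-correlated;
# the last one is a single very highly expressed outlier.
gold = [1.0, 2.0, 3.0, 4.0, 5.0, 5000.0]
sample = [5.0, 4.0, 3.0, 2.0, 1.0, 4800.0]

r_pearson = pearson(gold, sample)
r_spearman = spearman(gold, sample)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

On this toy data the single outlier pulls Pearson close to 1 even though the other five transcripts disagree completely, while Spearman comes out negative.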


    • #3
      Hmmm. Thanks for that. I don't think Cohen's Kappa is applicable in my case because I'm not doing any qualitative categorization of the data. My concern with Spearman is that my RPKM distributions are kind of funky: the vast majority of transcripts have extremely low RPKMs, there is a roughly normal distribution around RPKM 20, and a really long tail extends up to almost 5000, with maybe 1% of the data in total at RPKM > 100. A difference of 0.1 RPKM at the low end (where most of my data are) could translate into a rank shift of something like 10,000, whereas with Pearson it would be negligible. I'm considering lopping off the lower data points and performing Spearman on the top N% of the RPKM values, and I'm also considering log-transforming before doing Pearson. Any other ideas from folks out there?
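
The rank-shift worry above can be made concrete with a small sketch. Everything here is hypothetical: the evenly spaced low-expression values, the helper names `rank_among` and `top_fraction`, and the choice to filter on the gold-standard RPKM are assumptions made for illustration only.

```python
from bisect import bisect_left

# Hypothetical background: 10,000 transcripts spread evenly between 0 and 1 RPKM.
low_rpkms = [i / 10000 for i in range(10000)]  # already sorted ascending

def rank_among(value, sorted_values):
    """0-based rank: the number of values strictly below `value`."""
    return bisect_left(sorted_values, value)

# One transcript whose RPKM changes by 0.1 between two samples.
before, after = 0.25005, 0.35005
rank_shift = rank_among(after, low_rpkms) - rank_among(before, low_rpkms)
print(f"A 0.1 RPKM change moves the transcript {rank_shift} ranks")

def top_fraction(pairs, fraction):
    """Keep the (gold, sample) pairs with the highest gold-standard RPKM."""
    keep = max(1, round(len(pairs) * fraction))
    return sorted(pairs, key=lambda p: p[0], reverse=True)[:keep]

# Restrict a comparison to the top 50% of transcripts by gold-standard RPKM.
pairs = [(1.0, 2.0), (5.0, 4.0), (3.0, 3.0), (10.0, 9.0)]
print(top_fraction(pairs, 0.5))
```

A rank-based statistic computed on the filtered pairs ignores the crowded low end, at the cost of an arbitrary cutoff.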


      • #4
        So actually, in the original RPKM paper (I believe), they log-transform the RPKM values and calculate the Pearson correlation between two datasets that way. That seems reasonable, especially because log-transforming the data is another way to dampen some of the outlier effects.

        Check it out:



        ucpete
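
A rough sketch of the log-then-Pearson idea from this reply, using only the standard library. The pseudocount of 1 (so that transcripts with zero RPKM stay defined), the base-2 logarithm, the helper names, and the RPKM values are all assumptions for illustration; none of them are taken from the paper mentioned above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def log_pearson(x, y, pseudocount=1.0):
    """Pearson correlation of log2(value + pseudocount)."""
    lx = [math.log2(v + pseudocount) for v in x]
    ly = [math.log2(v + pseudocount) for v in y]
    return pearson(lx, ly)

# Made-up RPKM values with one very highly expressed transcript.
gold = [1.0, 2.0, 3.0, 4.0, 5.0, 5000.0]
sample = [5.0, 4.0, 3.0, 2.0, 1.0, 4800.0]

raw_r = pearson(gold, sample)
log_r = log_pearson(gold, sample)
print(f"Pearson on raw RPKM:        {raw_r:.4f}")
print(f"Pearson on log2(RPKM + 1):  {log_r:.4f}")
```

On this toy data the log transform lowers the correlation because the outlier carries less weight, but it does not remove the outlier's influence entirely.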


        • #5
          Yes, I suppose that makes sense. Your point about ranks and low RPKMs is a good one; perhaps log + Pearson is the way to go. I was never quite satisfied with the Spearman approach either, although that's what I have used. I also routinely look at PCA plots of the samples and compare them that way, but even there I feel that the results can be dominated by non-specific effects.


          • #6
            On further review...

            Upon further review, and after reading this post and its follow-up and many posts here and elsewhere on the web, I've changed my mind. It appears that taking the log transform and calculating an R-squared value isn't really legitimate when the log-transformed data aren't homoscedastic. Is this correct?

            I'm in a situation where I'm almost looking for the opposite of differential expression. I have a control dataset and many experimental datasets and I'm looking for the experimental condition that best models the control condition (among other things, but transcriptome comparison is one of my concerns). They're not really technical replicates, but they're also not really biological replicates either because the "experiment" is comparing an aliquot of untreated nucleic acid to a bunch of nucleic acid manipulations. I say it's not really a biological replicate in the traditional sense of the term because one big nucleic acid extraction feeds into all of the experimental and control conditions.

            What I've done is take the proportion of reads mapping to each transcript, arcsine-root transform those proportions, and perform a simple linear regression on the transformed data to calculate an R-squared value. The data look pretty homoscedastic except for a handful of outliers whose residuals fall outside the normal distribution (on the whole, 99% of the residuals are normally distributed). These outliers are essentially the things that are "differentially expressed", right? Is this a legitimate analysis and conclusion?

            I can't really seem to pin down whether people consider RNA-seq counts to be Poisson or Negative Binomial (or something else altogether) distributed, or what I should be looking for in my special case. It appears that these VSTs (Variance-Stabilizing Transformations) may work for Poisson *or* Negative Binomial, so does it even really matter? I think I'm going to try some of the differential expression R packages and see if I can explain the outliers with the biological implications of my manipulations. Any other ideas? Any input is greatly appreciated!

            ucpete
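
The procedure described in this post (read proportions, arcsine-root transform, simple linear regression, R-squared, then a look at the residuals) could be sketched roughly as follows. The read counts are invented, the helper names are hypothetical, and the cutoff of two root-mean-square residuals for flagging outliers is an arbitrary assumption, not something stated in the post.

```python
import math

def arcsine_root(counts):
    """Arcsine-square-root transform of each count's share of the total."""
    total = sum(counts)
    return [math.asin(math.sqrt(c / total)) for c in counts]

def linear_fit(x, y):
    """Least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared, residuals).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot, residuals

def flag_outliers(residuals, n_sd=2.0):
    """Indices whose residual exceeds n_sd times the root-mean-square residual."""
    rms = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
    return [i for i, r in enumerate(residuals) if abs(r) > n_sd * rms]

# Invented read counts per transcript; index 5 is strongly depleted after treatment.
control_counts = [120, 340, 15, 980, 45, 610, 75, 230]
treated_counts = [130, 320, 18, 1010, 40, 150, 80, 250]

x = arcsine_root(control_counts)
y = arcsine_root(treated_counts)
slope, intercept, r_squared, residuals = linear_fit(x, y)
print(f"R-squared: {r_squared:.3f}")
print("Candidate outliers (transcript indices):", flag_outliers(residuals))
```

Because proportions sum to one, a large change in one transcript shifts the proportions of all the others, so flagged residuals are only candidates for follow-up with a proper differential expression method.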


            • #7
              vsd in DESeq

              Hi,

              The variance-stabilizing transformation in DESeq (its result is often stored in a variable named vsd) will transform your count data so that it is approximately homoscedastic.

              -Danielle

              ######################
              Danielle G. Lemay, PhD
              Assistant Professional Researcher
              Genome Center
              University of California at Davis
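
To show what variance stabilization means in general, here is a small simulation using the Anscombe transform, a classic variance-stabilizing transformation for Poisson counts. This is a deliberately simpler stand-in and is not the transformation DESeq applies; the simulated means, sample size, seed, and helper names are all assumptions chosen for illustration.

```python
import math
import random

def anscombe(count):
    """Anscombe transform: approximately unit variance for Poisson counts."""
    return 2.0 * math.sqrt(count + 3.0 / 8.0)

def poisson_sample(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method.

    Adequate for modest means; exp(-mean) underflows for very large ones.
    """
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(0)  # fixed seed so the simulation is repeatable
results = {}
for mean in (10, 100):
    counts = [poisson_sample(mean, rng) for _ in range(5000)]
    raw_var = variance(counts)
    vst_var = variance([anscombe(c) for c in counts])
    results[mean] = (raw_var, vst_var)
    print(f"mean {mean:>3}: raw variance {raw_var:7.2f}, transformed variance {vst_var:.2f}")
```

The raw variance grows with the mean, while the transformed variance stays near 1 at both expression levels, which is the homoscedasticity a regression on the transformed values relies on.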
