  • Is Principal Component Analysis suited for this analysis?

    Hi,

    I have expression values (calculated by RSEM, RNAseq data) for over 30 genes, from 5 samples. Based on these values, I would like to find which two samples display the most 'similar' expression profile. These genes are all from a common pathway for virus defence in plants (RNAi).

    Is PCA suited for this? I realize this is a very small sample set.

    Any advice, comments, recommendations for other tests, greatly appreciated!

  • #2
    PCA won't really do what you want (though I suppose it could vaguely hint at it). Why don't you just directly measure the correlation between the samples? That would seem to more directly answer the question.


    • #3
      Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

      What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.


      • #4
        Originally posted by Kennels View Post
        Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

        What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.
        PCA will let you group samples that are "similar", but it won't tell you whether the expression patterns are correlated. Correlations are good for telling you whether there's a relationship and whether it's positive or negative, which PCA won't tell you. If you're concerned about linearity, use the Spearman rank correlation instead of Pearson.
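To make the Pearson versus Spearman distinction concrete, here is a minimal sketch in plain Python (the thread itself works in R, where `cor(x, y, method = "spearman")` does the same job). The two samples are made-up values, not data from this thread, and the function names are just illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression values for five genes in two samples.
sample_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_2 = [v ** 3 for v in sample_1]  # monotonic, but not linear

print("Pearson: ", round(pearson(sample_1, sample_2), 3))
print("Spearman:", round(spearman(sample_1, sample_2), 3))
```

Because the second sample is a monotonic function of the first, the ranks agree exactly and Spearman reports a perfect correlation, while Pearson is pulled below 1 by the curvature.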


        • #5
          Originally posted by Kennels View Post
          ... But this would be assuming some kind of a linear relationship (?), and there is ....
          If you use Pearson, that would be the case. Maybe have a look at the Kendall tau correlation which, afaik, is also suited to non-linear relationships. What you can use for sure is the information content (mutual information).
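For reference, a rough sketch of Kendall's tau (the tau-b variant, which corrects for ties) in plain Python. It only counts whether each pair of genes is ordered the same way in both samples, so it makes no linearity assumption. The function name and the example values are illustrative, not from the thread; in R the equivalent is `cor(x, y, method = "kendall")`.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length numeric sequences."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both samples: excluded from both terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # pair ordered the same way in both samples
        else:
            discordant += 1  # pair ordered in opposite ways
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    return (concordant - discordant) / sqrt(untied_in_x * untied_in_y)


# Hypothetical values: same gene ordering, very different magnitudes.
print(kendall_tau_b([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))
```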

          cheers...


          • #6
            Originally posted by Kennels View Post
            Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

            What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.
            Fwiw, PCA is also based on linear relationships.


            • #7
              Expression values generally have a log-linear distribution. You might get away with a standard linear Pearson's correlation if you take the log of expression values first.

              The best way to be sure is by graphing and eyeballing. With 5 samples and ~30 genes, you could probably get a quicker idea of the most similar profiles with a scatterplot matrix -- I would do one without transformation, and another with log-transformed values.
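As a small numeric companion to the plots, the sketch below compares Pearson's correlation before and after a log transform. The values are invented for illustration, and the `pseudocount` parameter is an assumption on my part (a common way to avoid taking the log of zero), not something suggested in the thread. Plotting is left out to keep this to the standard library.

```python
from math import log2, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def log_transform(values, pseudocount=1.0):
    """log2(value + pseudocount); the pseudocount guards against log2(0)."""
    return [log2(v + pseudocount) for v in values]


# Hypothetical expression values spanning a wide range.
sample_a = [1.0, 2.0, 4.0, 8.0, 16.0]
sample_b = [3.0 * v ** 2 for v in sample_a]  # linear only on the log scale

raw_r = pearson(sample_a, sample_b)
log_r = pearson(log_transform(sample_a), log_transform(sample_b))
print("raw Pearson:", round(raw_r, 3))
print("log Pearson:", round(log_r, 3))
```

With a pseudocount of 0 these made-up values are exactly linear on the log scale; with the default of 1 they are only approximately so, which is the usual trade-off of adding a pseudocount.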


              • #8
                Thanks everyone for the advice and information. I am currently testing out scatterplots and several correlation tests via R.

                However... and I do apologize if this doesn't seem to be getting through to me ... I am not really trying to find a positive or negative correlation. I will certainly get some value from my data, but I am concerned that biologically that could be misleading because of the complex interplay in this small set of genes.

                For example, genes A and B could be positively correlated in sample 1, but in sample 2 they could be negatively correlated because gene C had a higher expression. Take into account 30-odd genes. If I did a correlation test, I would be trying to put some kind of 'positive or negative' relationship on the samples. But biologically, I want to take into account all that variability and ask:
                "Which samples behave most similarly in the interplay of the expression of these genes?"
                which is what led me to PCA in the first place.

                But perhaps this is exactly what correlation does and I am misunderstanding it? My understanding was that the Pearson/Spearman correlations require the data to be linear or monotonic (somewhat linear), which in my samples I 'believe' they aren't. Of course I have to confirm this with gringer's suggestions.


                • #9
                  For example, genes A and B could be positively correlated in sample 1, but in sample 2 they could be negatively correlated because gene C had a higher expression. Take into account 30-odd genes. If I did a correlation test, I would be trying to put some kind of 'positive or negative' relationship on the samples. But biologically, I want to take into account all that variability and ask:
                  "Which samples behave most similarly in the interplay of the expression of these genes?"
                  You should be checking for correlation between samples, not genes, and that's what everyone is suggesting that you do. The correlation test will relate to how expression is different (or similar) between two samples (e.g. a non-parametric correlation might check if the expression levels go (in increasing order) A,B,E,C,D in both sample 1 and sample 2).
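The ordering idea in that last sentence can be sketched in a few lines of Python. The gene names match the example above, but the expression values are invented for illustration.

```python
def expression_order(sample):
    """Gene names sorted by increasing expression in one sample."""
    return sorted(sample, key=sample.get)


# Hypothetical expression values (gene -> value) for two samples.
sample_1 = {"A": 2.0, "B": 5.0, "C": 40.0, "D": 90.0, "E": 11.0}
sample_2 = {"A": 0.5, "B": 3.0, "C": 300.0, "D": 1200.0, "E": 20.0}

order_1 = expression_order(sample_1)
order_2 = expression_order(sample_2)
print(order_1)
print(order_2)
print("same order:", order_1 == order_2)
```

The magnitudes differ a lot between the two samples, but the genes fall in the same order, which is the situation a rank-based correlation scores as a perfect match.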

                  Principal Component Analysis is a dimension reduction technique that uses linear transformations of multi-dimensional values to reduce them to a simpler (lower-dimensional) representation, commonly down to two dimensions. I believe the usual methods for working out how to do this reduction involve expectations of normally distributed data, and carry out something similar to a correlation analysis to work out how to weight each component [rskr's statement seems to support this] -- someone please correct me if I'm wrong about that. I don't think you can get away completely from linear correlations by trying to hide your data in a PCA.
                  Last edited by gringer; 10-28-2013, 04:32 PM.


                  • #10
                    Originally posted by gringer View Post
                    You should be checking for correlation between samples, not genes, and that's what everyone is suggesting that you do. The correlation test will relate to how expression is different (or similar) between two samples (e.g. a non-parametric correlation might check if the expression levels go (in increasing order) A,B,E,C,D in both sample 1 and sample 2).
                    Thanks gringer for your explanation - it is clearer now, and looking at the correlation output it makes more sense.

                    You were right that the log values seem to show overall a linear relationship (which was a surprise to me for these genes).
                    Would you be able to comment on my interpretation? I did a scatterplot matrix with log values (image below), and a Spearman correlation in R:
                    Code:
                              S1        S2        S3        S4        S5
                    S1 1.0000000 0.8409553 0.6508859 0.7027103 0.7342251
                    S2 0.8409553 1.0000000 0.6067227 0.7691877 0.8000000
                    S3 0.6508859 0.6067227 1.0000000 0.5299720 0.5543417
                    S4 0.7027103 0.7691877 0.5299720 1.0000000 0.7823529
                    S5 0.7342251 0.8000000 0.5543417 0.7823529 1.0000000
                    S3 seems to be the 'least' correlated with all the others, and S1 and S2 the 'most'.
                    I'm not sure what value would be considered a 'good/strong' correlation (obviously 1 is best, but perhaps above 0.9?), but I suppose that depends on the biological context, as well as the number of data points.
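The reading of the matrix above can be checked mechanically. This is a minimal Python sketch that takes the Spearman values exactly as posted and reports the most similar pair and the sample with the lowest average correlation to the others; the helper names are illustrative.

```python
from itertools import combinations

names = ["S1", "S2", "S3", "S4", "S5"]
# Spearman correlation matrix copied from the post above.
rho = [
    [1.0000000, 0.8409553, 0.6508859, 0.7027103, 0.7342251],
    [0.8409553, 1.0000000, 0.6067227, 0.7691877, 0.8000000],
    [0.6508859, 0.6067227, 1.0000000, 0.5299720, 0.5543417],
    [0.7027103, 0.7691877, 0.5299720, 1.0000000, 0.7823529],
    [0.7342251, 0.8000000, 0.5543417, 0.7823529, 1.0000000],
]


def most_similar_pair(matrix):
    """Index pair (i, j), i < j, with the largest off-diagonal value."""
    pairs = combinations(range(len(matrix)), 2)
    return max(pairs, key=lambda p: matrix[p[0]][p[1]])


def mean_with_others(i, matrix):
    """Average correlation of sample i with every other sample."""
    others = [v for j, v in enumerate(matrix[i]) if j != i]
    return sum(others) / len(others)


i, j = most_similar_pair(rho)
least = min(range(len(names)), key=lambda k: mean_with_others(k, rho))
print("most similar pair:", names[i], names[j], rho[i][j])
print("least correlated overall:", names[least])
```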

                    Thanks again!

                    • #11
                      Originally posted by Kennels View Post
                      I'm not sure what value would be considered a 'good/strong' correlation (obviously 1 is best, but perhaps above 0.9?), but I suppose that depends on the biological context, as well as the number of data points.
                      Indeed, significance depends largely on what feels right, the assumptions that have been made, and occasionally who's paying for the research.

                      Your original question was to find the two samples that are 'most similar', and as you have found, it's fairly obvious given the correlation statistics and scatter plots. If you're looking for "most similar", the thing that matters is not the significance of the correlation statistic (for that most similar pairing), but how different it is from the next "most similar" pairing. There are various other tests that can be done to find out the chance of confusion in that regard, but they're [currently] out of the scope of answers in this thread.
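The "how different is it from the next pairing" point can be put in numbers with the Spearman values posted earlier in the thread. The sketch below only ranks the pairs and reports the gap between the top two; it says nothing about whether that gap is statistically meaningful.

```python
# Upper triangle of the Spearman matrix posted earlier in this thread.
pair_rho = {
    ("S1", "S2"): 0.8409553, ("S1", "S3"): 0.6508859,
    ("S1", "S4"): 0.7027103, ("S1", "S5"): 0.7342251,
    ("S2", "S3"): 0.6067227, ("S2", "S4"): 0.7691877,
    ("S2", "S5"): 0.8000000, ("S3", "S4"): 0.5299720,
    ("S3", "S5"): 0.5543417, ("S4", "S5"): 0.7823529,
}

# Sort pairs from most to least correlated.
ranked = sorted(pair_rho, key=pair_rho.get, reverse=True)
best, runner_up = ranked[0], ranked[1]
margin = pair_rho[best] - pair_rho[runner_up]

print("best pair:  ", best, pair_rho[best])
print("runner-up:  ", runner_up, pair_rho[runner_up])
print("margin:     ", round(margin, 7))
```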

                      FWIW, the cor.test function of R will give you p values for your correlation statistic, and you can play around with parametric and non-parametric methods to see how much it changes things if you drop the assumption of normality (see 'help(cor.test)' for more information).
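For anyone without R to hand, here is a rough Python sketch of one distribution-free way to get a p value for a Spearman correlation: a permutation test. To be clear, this is a swapped-in technique and not a description of what `cor.test` computes internally. The data, the number of permutations, and the seed are all arbitrary assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p value: how often shuffling y gives |rho| >= observed."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical expression values for eight genes in two samples.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.3, 0.9, 2.5, 2.6, 7.0, 9.1, 20.0, 55.0]
p = permutation_p_value(x, y)
print("rho:", spearman(x, y), "p:", p)
```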


                      • #12
                        NMF or Isomap

                        If you want to see whether your samples can be subdivided into groups, you could run a non-negative matrix factorization (NMF) with 2 to 3 groups and check whether you get a good separation (cophenetic correlation coefficient).

                        Or, if you especially want to consider non-linear associations, you could use Isomap, a non-linear dimensionality reduction method (similar in purpose to PCA).

                        But how these methods will perform with such a small data set, I don't know...
