
10-27-2013, 06:51 PM  #1
Senior Member
Location: Sydney Join Date: Feb 2011
Posts: 149

Is Principal Component Analysis suited for this analysis?
Hi,
I have expression values (calculated by RSEM, RNAseq data) for over 30 genes, from 5 samples. Based on these values, I would like to find which two samples display the most 'similar' expression profile. These genes are all from a common pathway for virus defence in plants (RNAi). Is PCA suited for this? I realize this is a very small sample set. Any advice, comments, recommendations for other tests, greatly appreciated! 
10-28-2013, 02:25 AM  #2
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480

PCA won't really do what you want (though I suppose it could vaguely hint at it). Why don't you just directly measure the correlation between the samples? That would seem to more directly answer the question.
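A minimal sketch of that suggestion, written in Python purely for illustration (the thread itself works in R, where cor() on a genes-by-samples matrix gives the same table). The expression values and the helper names pearson, pairwise_correlations and most_similar_pair are made up for the example; they are not from the original poster's data.

```python
import math
from itertools import combinations

# Hypothetical expression values (not from the thread): one list per sample,
# one entry per gene, with genes in the same order in every list.
expression = {
    "S1": [12.0, 150.0, 3.5, 40.0, 8.0, 220.0],
    "S2": [10.0, 160.0, 4.0, 35.0, 9.5, 210.0],
    "S3": [55.0, 20.0, 30.0, 5.0, 60.0, 15.0],
}

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def pairwise_correlations(samples):
    """Return {(sample_a, sample_b): r} for every unordered pair of samples."""
    return {
        (a, b): pearson(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }

def most_similar_pair(samples):
    """Pick the pair of samples with the highest correlation."""
    scores = pairwise_correlations(samples)
    return max(scores, key=scores.get)

print(most_similar_pair(expression))
```

With 5 samples there are only 10 pairs, so the whole table can be inspected by eye as well.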

10-28-2013, 03:11 AM  #3
Senior Member
Location: Sydney Join Date: Feb 2011
Posts: 149

Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eyeballing' the expression profiles.
What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that. 
10-28-2013, 05:05 AM  #4
Senior Member
Location: Connecticut Join Date: Jul 2011
Posts: 162

Quote:


10-28-2013, 06:09 AM  #5
Senior Member
Location: Stuttgart, Germany Join Date: Apr 2010
Posts: 192

Quote:
cheers... 

10-28-2013, 06:25 AM  #6
Senior Member
Location: Santa Fe, NM Join Date: Oct 2010
Posts: 250

Quote:


10-28-2013, 12:22 PM  #7
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838

Expression values generally have a log-linear distribution. You might get away with a standard linear Pearson's correlation if you take the log of expression values first.
The best way to be sure is by graphing and eyeballing. With 5 samples and ~30 genes, you could probably get a quicker idea of the most similar profiles with a scatterplot matrix. I would do one without transformation, and another with log-transformed values: http://www.statmethods.net/graphs/scatterplot.html
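To illustrate why the log step can matter, here is a small Python sketch comparing Pearson's correlation before and after a log transform. The two profiles are invented, and the pseudo-count of 1 added before taking log2 (to cope with zero expression) is an assumption of this example, not something specified in the thread.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def log_transform(values, pseudo_count=1.0):
    """log2(value + pseudo_count); the pseudo-count is an assumed convention."""
    return [math.log2(v + pseudo_count) for v in values]

# Hypothetical profiles: the last gene is very highly expressed in both
# samples, while the other four genes disagree between the samples.
sample_a = [5.0, 50.0, 2.0, 30.0, 9000.0]
sample_b = [40.0, 3.0, 35.0, 4.0, 8800.0]

raw_r = pearson(sample_a, sample_b)
log_r = pearson(log_transform(sample_a), log_transform(sample_b))
print(f"raw: {raw_r:.3f}  log: {log_r:.3f}")
```

On the raw scale the single highly expressed gene dominates and the correlation is close to 1; after the transform the disagreement among the other genes carries more weight and the coefficient drops.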
10-28-2013, 05:09 PM  #8
Senior Member
Location: Sydney Join Date: Feb 2011
Posts: 149

Thanks everyone for the advice and information. I am currently testing out scatterplots and several correlation tests via R.
However... and I do apologize if this doesn't seem to be getting through to me... I am not really trying to find a positive or negative correlation. I will certainly get some value from my data, but I am concerned that biologically that could be misleading because of the complex interplay in this small set of genes. For example, gene A and B could be positively correlated in sample 1, but in sample 2 they could be negatively correlated because gene C had a higher expression. Take into account 30-odd genes. If I did a correlation test, I would be trying to put some kind of 'positive or negative' relationship on the samples. But biologically, I want to take into account all that variability and ask: "Which samples behave most similarly in the interplay of the expression of these genes?" which is what led me to PCA in the first place. But perhaps this is exactly what correlation does and I am misunderstanding it? My understanding was that the Pearson/Spearman correlations require the data to be linear or monotonic (somewhat linear), which in my samples I 'believe' they aren't. Of course I have to confirm this with gringer's suggestions.
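On the linear-versus-monotonic point, Spearman's correlation is Pearson's correlation computed on ranks, so it only needs the relationship to be monotonic, not linear. A minimal Python sketch with made-up numbers (the helper names average_ranks and spearman are chosen for this example):

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson's correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values: y rises with x, but along a curve (y = x cubed).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]

print(pearson(x, y), spearman(x, y))
```

Here Pearson's coefficient falls short of 1 because the curve is not a straight line, while Spearman's rho is 1 because the ordering of the values is identical.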
10-28-2013, 05:28 PM  #9
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838

Principal Component Analysis is a dimension reduction technique that uses linear transformations of multidimensional values to allow them to be reduced to a simpler (lower-dimensional) complexity, commonly down to two dimensions. I believe the usual methods for working out how to do this reduction involve expectations of normally distributed data, and carry out something similar to a correlation analysis to work out how to weight each component [rskr's statement seems to support this]. Someone please correct me if I'm wrong about that. I don't think you can get away completely from linear correlations by trying to hide your data in a PCA.
Last edited by gringer; 10-28-2013 at 05:32 PM.
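To make the link to linear correlation concrete: the first principal component is the leading eigenvector of the covariance matrix of the centered data, so it is built from the same linear covariances that Pearson's correlation uses. The Python sketch below finds that component by power iteration on a toy data set. It is an illustration under simplifying assumptions (no scaling of genes, only the first component, invented values), not a replacement for a library routine such as R's prcomp.

```python
import math

def first_principal_component(rows, iterations=200):
    """rows: one list of gene values per sample.

    Returns (axis, scores): the unit-length first principal axis in gene
    space and the coordinate of each sample along that axis.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix between genes (p x p).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    axis = [1.0] * p  # arbitrary starting direction for the power iteration
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * axis[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # no variance in this direction; stop iterating
            break
        axis = [v / norm for v in nxt]
    scores = [sum(r[j] * axis[j] for j in range(p)) for r in centered]
    return axis, scores

# Hypothetical toy data: three samples, two genes, lying on one straight line.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
axis, scores = first_principal_component(toy)
print(axis, scores)
```

Samples with similar scores on the leading components sit close together in a PCA plot, which is the sense in which PCA can hint at similarity without directly ranking pairs of samples.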

10-28-2013, 06:33 PM  #10
Senior Member
Location: Sydney Join Date: Feb 2011
Posts: 149

You were right that the log values seem to show overall a linear relationship (which was a surprise to me for these genes). Would you be able to comment on my interpretation? I did a scatterplot matrix with log values (image below), and a Spearman's correlation in R:
Code:
          S1        S2        S3        S4        S5
S1 1.0000000 0.8409553 0.6508859 0.7027103 0.7342251
S2 0.8409553 1.0000000 0.6067227 0.7691877 0.8000000
S3 0.6508859 0.6067227 1.0000000 0.5299720 0.5543417
S4 0.7027103 0.7691877 0.5299720 1.0000000 0.7823529
S5 0.7342251 0.8000000 0.5543417 0.7823529 1.0000000

I'm not sure what value would be considered a 'good/strong' correlation (obviously 1 is best, but perhaps above 0.9?), but I suppose that depends on the biological context, as well as the number of data points. Thanks again!

10-28-2013, 08:19 PM  #11
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838

Your original question was to find the two samples that are 'most similar', and as you have found it's fairly obvious given the correlation statistics and scatter plots. If you're looking for "most similar", the thing that matters is not the significance of the correlation statistic (for that most similar pairing), but how different it is from the next "most similar" pairing. There are various other tests that can be done to find out the chance of confusion in that regard, but they're [currently] out of the scope of answers in this thread. FWIW, the cor.test function of R will give you p values for your correlation statistic, and you can play around with parametric and nonparametric methods to see how much it changes things if you drop the assumption of normality (see 'help(cor.test)' for more information).
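As a quick illustration of the "how different is it from the next pairing" idea, the Python sketch below ranks the sample pairs using the Spearman values reported in post #10 and prints the gap between the top two. The helper name ranked_pairs is made up for this example, and the gap is only a descriptive number; it says nothing on its own about the chance of confusing the two pairings.

```python
from itertools import combinations

samples = ["S1", "S2", "S3", "S4", "S5"]

# Spearman correlations as reported in post #10 of this thread.
rho = [
    [1.0000000, 0.8409553, 0.6508859, 0.7027103, 0.7342251],
    [0.8409553, 1.0000000, 0.6067227, 0.7691877, 0.8000000],
    [0.6508859, 0.6067227, 1.0000000, 0.5299720, 0.5543417],
    [0.7027103, 0.7691877, 0.5299720, 1.0000000, 0.7823529],
    [0.7342251, 0.8000000, 0.5543417, 0.7823529, 1.0000000],
]

def ranked_pairs(names, matrix):
    """List ((name_a, name_b), value) for each pair, highest value first."""
    pairs = [
        ((names[i], names[j]), matrix[i][j])
        for i, j in combinations(range(len(names)), 2)
    ]
    return sorted(pairs, key=lambda item: item[1], reverse=True)

ranking = ranked_pairs(samples, rho)
(best_pair, best_rho), (next_pair, next_rho) = ranking[0], ranking[1]
gap = best_rho - next_rho

print(best_pair, best_rho)
print(next_pair, next_rho)
print("gap:", round(gap, 7))
```

For these values the top pairing is S1/S2 (0.8409553), followed by S2/S5 (0.8000000), a gap of about 0.041.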

10-29-2013, 12:35 AM  #12
Senior Member
Location: Vienna Join Date: Mar 2010
Posts: 107

NMF or Isomap
If you want to see whether your samples can be subdivided into groups, you could run a non-negative matrix factorization with 2 to 3 groups and check whether you get a good separation (cophenetic correlation coefficient).
Or, if you want to consider non-linear associations in particular, you could use Isomap, a non-linear dimensionality reduction method (similar to PCA). But how these methods will perform with such a small data set, I don't know...