SEQanswers

10-27-2013, 06:51 PM   #1
Kennels
Senior Member

Location: Sydney

Join Date: Feb 2011
Posts: 149
Is Principal Component Analysis suited for this analysis?

Hi,

I have expression values (calculated by RSEM, RNAseq data) for over 30 genes, from 5 samples. Based on these values, I would like to find which two samples display the most 'similar' expression profile. These genes are all from a common pathway for virus defence in plants (RNAi).

Is PCA suited for this? I realize this is a very small sample set.

Any advice, comments, recommendations for other tests, greatly appreciated!

10-28-2013, 02:25 AM   #2
dpryan
Devon Ryan

Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

PCA won't really do what you want (though I suppose it could vaguely hint at it). Why don't you just directly measure the correlation between the samples? That would seem to more directly answer the question.

10-28-2013, 03:11 AM   #3
Kennels
Senior Member

Location: Sydney

Join Date: Feb 2011
Posts: 149

Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.

10-28-2013, 05:05 AM   #4
mcnelson.phd
Senior Member

Location: Connecticut

Join Date: Jul 2011
Posts: 162

Quote:
Originally Posted by Kennels
Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.
PCA will let you group samples that are "similar", but it won't tell you whether there's a correlation between the expression patterns. Correlations are good for telling you whether there's a relationship and whether it's positive or negative, which PCA won't tell you. If you're concerned about linearity, use the Spearman rank correlation instead of Pearson.
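
To make the Pearson/Spearman distinction concrete, here is a minimal Python sketch (not from the thread, which works in R). The two samples and their values are invented for illustration: the second is a monotonic but strongly non-linear function of the first, so the rank-based statistic stays at 1 while Pearson drops below it.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical expression values for six genes in two samples.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sample_b = [math.exp(v) for v in sample_a]  # monotonic, far from linear

print("Pearson :", round(pearson(sample_a, sample_b), 3))
print("Spearman:", round(spearman(sample_a, sample_b), 3))
```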

10-28-2013, 06:09 AM   #5
sphil
Senior Member

Location: Stuttgart, Germany

Join Date: Apr 2010
Posts: 192

Quote:
Originally Posted by Kennels
... But this would be assuming some kind of a linear relationship (?), and there is ....
If you use Pearson, that would be the case. Maybe have a look at the Kendall tau correlation which, afaik, also suits non-linear relationships (as long as they are monotonic). What you can use for sure is the information content (mutual information).

cheers...
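
For readers who have not met Kendall's tau, a small Python sketch of the tau-b variant follows. It counts concordant and discordant gene pairs between two samples; the function name and the expression values are made up for illustration and are not from the thread.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both samples: contributes to neither total
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denominator

# Hypothetical expression values for five genes in two samples.
# Only the third and fourth genes swap order between the samples.
sample_a = [5.0, 12.0, 40.0, 41.0, 300.0]
sample_b = [2.0, 9.0, 80.0, 30.0, 1000.0]

print("Kendall tau-b:", kendall_tau_b(sample_a, sample_b))
```

Because only the ordering of the values matters, any monotonic rescaling of either sample (a log transform, for instance) leaves the result unchanged.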

10-28-2013, 06:25 AM   #6
rskr
Senior Member

Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250

Quote:
Originally Posted by Kennels
Thanks for the reply, I suppose I can do all pairwise comparisons for all samples and find the best correlation. But this would be assuming some kind of a linear relationship (?), and there is high variability between the genes for any two samples based on 'eye-balling' the expression profiles.

What I would like is to use all the data at once and find some kind of pattern from the variability and 'group' the samples based on that.
Fwiw, PCA is also based on linear relationships.

10-28-2013, 12:22 PM   #7
gringer
David Eccles (gringer)

Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Expression values generally have a log-linear distribution. You might get away with a standard linear Pearson's correlation if you take the log of expression values first.

The best way to be sure is by graphing and eyeballing. With 5 samples and ~30 genes, you could probably get a quicker idea of the most similar profiles with a scatterplot matrix -- I would do one without transformation, and another with log-transformed values:

http://www.statmethods.net/graphs/scatterplot.html
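
The log-then-Pearson suggestion above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the values are invented, and the pseudo-count of 1 added before taking log2 is one common way to cope with zero expression values, not something prescribed in the thread.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def log2_plus_one(values):
    """log2(x + 1); the +1 pseudo-count is an assumption to tolerate zeros."""
    return [math.log2(v + 1.0) for v in values]

# Hypothetical expression values spanning several orders of magnitude.
sample_a = [0.5, 3.0, 10.0, 45.0, 200.0, 1500.0]
sample_b = [1.0, 2.0, 30.0, 20.0, 500.0, 900.0]

raw_r = pearson(sample_a, sample_b)
log_r = pearson(log2_plus_one(sample_a), log2_plus_one(sample_b))

print("Pearson on raw values:", round(raw_r, 3))
print("Pearson on log values:", round(log_r, 3))
```

On raw values the few highly expressed genes dominate the statistic; after the transform every gene contributes on a comparable scale.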

10-28-2013, 05:09 PM   #8
Kennels
Senior Member

Location: Sydney

Join Date: Feb 2011
Posts: 149

Thanks everyone for the advice and information. I am currently testing out scatterplots and several correlation tests via R.

However... and I do apologize if this doesn't seem to be getting through to me ... I am not really trying to find a positive or negative correlation. I will certainly get some value from my data, but I am concerned that biologically that could be misleading because of the complex interplay in this small set of genes.

For example, gene A and B could be positively correlated in sample 1, but in sample 2 they could be negatively correlated because gene C had a higher expression. Take into account 30-odd genes. If I did a correlation test, I would be trying to put some kind of 'positive or negative' relationship of the samples. But biologically, I want to take into account all that variability and ask:
"Which samples behave most similarly in the interplay of the expression of these genes?"
which is what led me to PCA in the first place.

But perhaps this is exactly what correlation does and I am misunderstanding it? My understanding was that the Pearson/Spearman correlations require the data to be linear or monotonic (somewhat linear), which in my samples I 'believe' they aren't. Of course I have to confirm this with gringer's suggestions.

10-28-2013, 05:28 PM   #9
gringer
David Eccles (gringer)

Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Quote:
For example, gene A and B could be positively correlated in sample 1, but in sample 2 they could be negatively correlated because gene C had a higher expression. Take into account 30-odd genes. If I did a correlation test, I would be trying to put some kind of 'positive or negative' relationship of the samples. But biologically, I want to take into account all that variability and ask:
"Which samples behave most similarly in the interplay of the expression of these genes?"
You should be checking for correlation between samples, not genes, and that's what everyone is suggesting that you do. The correlation test will relate to how expression is different (or similar) between two samples (e.g. a non-parametric correlation might check if the expression levels go (in increasing order) A,B,E,C,D in both sample 1 and sample 2).
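
A small Python sketch of the between-sample comparison just described: each sample is one vector of gene values, and every pair of samples gets one rank correlation. The gene names, sample names, and numbers are hypothetical. S1 and S2 are built so that the genes rank A, B, E, C, D in both, as in the example above, and the shortcut formula used assumes no tied values.

```python
from itertools import combinations

def ranks(values):
    """1-based ranks; assumes all values are distinct (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

genes = ["A", "B", "C", "D", "E"]
# One list per SAMPLE, values in the gene order above (hypothetical data).
expression = {
    "S1": [1.0, 4.0, 50.0, 200.0, 9.0],   # increasing order: A, B, E, C, D
    "S2": [3.0, 5.0, 20.0, 21.0, 6.0],    # same order, different magnitudes
    "S3": [30.0, 2.0, 8.0, 100.0, 1.0],   # a different ordering
}

rho = {
    (a, b): spearman_no_ties(expression[a], expression[b])
    for a, b in combinations(expression, 2)
}
for (a, b), value in rho.items():
    print(a, b, round(value, 3))
```

The correlation is computed across genes for each pair of samples, so the result is one number per sample pair rather than one per gene pair.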

Principal Component Analysis is a dimension reduction technique that uses linear transformations of multi-dimensional values to allow them to be reduced to a simpler (lower-dimensional) complexity, commonly down to two dimensions. I believe the usual methods for working out how to do this reduction involve expectations of normally distributed data, and carry out something similar to a correlation analysis to work out how to weight each component [rskr's statement seems to support this] -- someone please correct me if I'm wrong about that. I don't think you can get away completely from linear correlations by trying to hide your data in a PCA.
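
As a rough illustration of the dimension reduction described above, the following Python sketch computes first-principal-component scores for a handful of samples. It is only one way to do it and rests on assumptions: genes are mean-centred across samples, no scaling is applied, the leading component is found by power iteration on the sample-by-sample Gram matrix, and the data are invented. It takes no position on the distributional question raised in the post.

```python
import math

def pc1_scores(table, iterations=500):
    """First principal component score for each sample (row) of `table`.

    Assumes the samples are not all identical, so the Gram matrix is non-zero.
    """
    n, p = len(table), len(table[0])
    # Centre each gene (column) on its mean across samples.
    means = [sum(row[g] for row in table) / n for g in range(p)]
    centred = [[row[g] - means[g] for g in range(p)] for row in table]
    # Gram matrix: inner products between centred sample vectors.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centred] for ri in centred]
    # Power iteration converges to the leading eigenvector of the Gram matrix.
    vector = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        product = [sum(gram[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    eigenvalue = sum(
        vector[i] * sum(gram[i][j] * vector[j] for j in range(n)) for i in range(n)
    )
    # Scores are the eigenvector scaled by the singular value; the sign is arbitrary.
    return [math.sqrt(eigenvalue) * v for v in vector]

# Hypothetical log-scale expression values: three samples, four genes.
table = [
    [1.0, 2.0, 3.0, 4.0],   # S1
    [1.1, 2.1, 2.9, 4.2],   # S2, close to S1
    [4.0, 3.0, 2.0, 1.0],   # S3, reversed profile
]
scores = pc1_scores(table)
for name, score in zip(["S1", "S2", "S3"], scores):
    print(name, round(score, 3))
```

Samples with similar profiles land near each other on the component, but the component itself is a linear combination of the gene values, which is the point made by rskr earlier in the thread.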

Last edited by gringer; 10-28-2013 at 05:32 PM.

10-28-2013, 06:33 PM   #10
Kennels
Senior Member

Location: Sydney

Join Date: Feb 2011
Posts: 149

Quote:
Originally Posted by gringer
You should be checking for correlation between samples, not genes, and that's what everyone is suggesting that you do. The correlation test will relate to how expression is different (or similar) between two samples (e.g. a non-parametric correlation might check if the expression levels go (in increasing order) A,B,E,C,D in both sample 1 and sample 2).
Thanks gringer for your explanation - it is clearer now, and looking at the correlation output it makes more sense.

You were right that the log values seem to show overall a linear relationship (which was a surprise to me for these genes).
Would you be able to comment on my interpretation? I did a scatterplot matrix with log values (image below), and a Spearman's correlation in R:
Code:
          S1        S2        S3        S4        S5
S1 1.0000000 0.8409553 0.6508859 0.7027103 0.7342251
S2 0.8409553 1.0000000 0.6067227 0.7691877 0.8000000
S3 0.6508859 0.6067227 1.0000000 0.5299720 0.5543417
S4 0.7027103 0.7691877 0.5299720 1.0000000 0.7823529
S5 0.7342251 0.8000000 0.5543417 0.7823529 1.0000000
S3 seems to be the 'least' correlated with all the others, and S1 and S2 the 'most'.
I'm not sure what value would be considered a 'good/strong' correlation (obviously 1 is best, but perhaps above 0.9?), but I suppose that depends on the biological context, as well as the number of data points.

Thanks again!
Attached Images
File Type: png ScatterPlot.LogtpmsRSEM.png (25.2 KB, 8 views)
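
The reading of the matrix in the post above can be checked mechanically. This Python sketch copies the posted Spearman values, sorts the sample pairs by correlation, and averages each sample's correlation with the other four. The variable names are illustrative only.

```python
from itertools import combinations

samples = ["S1", "S2", "S3", "S4", "S5"]
# Spearman correlation matrix exactly as posted in the thread.
rho = [
    [1.0000000, 0.8409553, 0.6508859, 0.7027103, 0.7342251],
    [0.8409553, 1.0000000, 0.6067227, 0.7691877, 0.8000000],
    [0.6508859, 0.6067227, 1.0000000, 0.5299720, 0.5543417],
    [0.7027103, 0.7691877, 0.5299720, 1.0000000, 0.7823529],
    [0.7342251, 0.8000000, 0.5543417, 0.7823529, 1.0000000],
]

# Every unordered pair of samples, strongest correlation first.
pairs = sorted(
    ((rho[i][j], samples[i], samples[j]) for i, j in combinations(range(5), 2)),
    reverse=True,
)
for value, a, b in pairs:
    print(f"{a}-{b}: {value:.4f}")

# Mean correlation of each sample with the other four.
mean_to_others = {
    samples[i]: sum(rho[i][j] for j in range(5) if j != i) / 4 for i in range(5)
}
least_similar = min(mean_to_others, key=mean_to_others.get)
print("Least similar to the rest:", least_similar)
```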

10-28-2013, 08:19 PM   #11
gringer
David Eccles (gringer)

Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Quote:
Originally Posted by Kennels
I'm not sure what value would be considered a 'good/strong' correlation (obviously 1 is best, but perhaps above 0.9?), but I suppose that depends on the biological context, as well as the number of data points.
Indeed, significance depends largely on what feels right, the assumptions that have been made, and occasionally who's paying for the research.

Your original question was to find the two samples that are 'most similar', and as you have found, it's fairly obvious given the correlation statistics and scatter plots. If you're looking for "most similar", the thing that matters is not the significance of the correlation statistic (for that most similar pairing), but how different it is from the next "most similar" pairing. There are various other tests that can be done to find out the chance of confusion in that regard, but they're [currently] out of the scope of answers in this thread.
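
As a purely descriptive sketch of that comparison, the Python below takes the pairwise Spearman values posted earlier in the thread and reports the gap between the best pairing and the runner-up. It is not one of the formal tests alluded to above, and the variable names are illustrative.

```python
# Pairwise Spearman values copied from the matrix posted in the thread.
pair_rho = {
    ("S1", "S2"): 0.8409553,
    ("S1", "S3"): 0.6508859,
    ("S1", "S4"): 0.7027103,
    ("S1", "S5"): 0.7342251,
    ("S2", "S3"): 0.6067227,
    ("S2", "S4"): 0.7691877,
    ("S2", "S5"): 0.8000000,
    ("S3", "S4"): 0.5299720,
    ("S3", "S5"): 0.5543417,
    ("S4", "S5"): 0.7823529,
}

# Sort pairings from most to least correlated and compare the top two.
ranked = sorted(pair_rho.items(), key=lambda item: item[1], reverse=True)
(best_pair, best_rho), (runner_up, runner_rho) = ranked[0], ranked[1]
margin = best_rho - runner_rho

print("Most similar pair:", best_pair, best_rho)
print("Runner-up pair   :", runner_up, runner_rho)
print("Margin           :", round(margin, 4))
```

Whether a gap of that size is meaningful with about 30 genes is exactly the question the additional tests would have to answer.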

FWIW, the cor.test function in R will give you p-values for your correlation statistic, and you can play around with parametric and non-parametric methods to see how much it changes things if you drop the assumption of normality (see 'help(cor.test)' for more information).
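
For intuition about what a p-value for a rank correlation means, here is a Python sketch of a permutation test. To be clear, this is a swapped-in technique and not how R's cor.test arrives at its p-values. The data, function names, number of permutations, and seed are all assumptions chosen for illustration, and the shortcut formula assumes no tied values.

```python
import random

def ranks(values):
    """1-based ranks; assumes all values are distinct (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: how often shuffled data match or beat the observed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman_no_ties(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any gene-wise pairing between samples
        if abs(spearman_no_ties(x, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical expression values for ten genes in two samples.
sample_1 = [0.4, 1.2, 3.5, 8.0, 15.0, 40.0, 90.0, 250.0, 600.0, 1200.0]
sample_2 = [0.9, 1.0, 5.0, 6.5, 30.0, 22.0, 120.0, 300.0, 450.0, 2000.0]

rho_observed = spearman_no_ties(sample_1, sample_2)
p_value = permutation_p_value(sample_1, sample_2)
print("rho:", round(rho_observed, 4), "permutation p-value:", p_value)
```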

10-29-2013, 12:35 AM   #12
dietmar13
Senior Member

Location: Vienna

Join Date: Mar 2010
Posts: 107
NMF or Isomap

If you want to see whether your samples can be subdivided into groups, you could run a non-negative matrix factorization with 2 to 3 groups and check whether you get a good separation (cophenetic correlation coefficient).

Or, if you especially want to consider non-linear associations, you could use Isomap, a non-linear dimensionality reduction method (similar to PCA).

But how these methods will perform with such a small data set, I don't know...

All times are GMT -8.