SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Pearson correlation between replicates afreund Bioinformatics 9 06-05-2014 01:38 AM
correlation in amplification RNAseq papori Bioinformatics 15 09-05-2012 11:34 AM
Gene expression correlation mgaelle General 0 05-02-2012 07:37 AM
454AlignmentInfo.tsv correlation to contigs Anelda Bioinformatics 8 09-03-2010 04:13 AM
Clusters-reads correlation? cggj Bioinformatics 5 03-10-2010 08:10 PM

Reply
 
Thread Tools
Old 02-19-2013, 09:19 PM   #1
gene_x
Senior Member
 
Location: MO

Join Date: May 2010
Posts: 108
Default Pearson correlation in R

Guys,
I tried to compare two vectors and each vector is a RPKM value I calculated using ChIP-seq reads. I want to use pearson correlation as a way to access the similarity between two ChIP-seq datasets.

The two RPKM values are values in 500bp bins genome-wide, for mouse. So there are more than 5 million bins in total. I checked the difference between two RPKM value files and they are actually very similar. So presumably the pearson correlation coefficient should be very close to 1.

However, when I used cor.test() in R, the output is like this:
Quote:
Pearson's product-moment correlation

data: a$V4 and b$V4
t = -0.0037, df = 5309833, p-value = 0.9971
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.0008521632 0.0008489671
sample estimates:
cor
-1.598034e-06
I couldn't figure out why I got a weird correlation like that.. does anyone have an idea of why it's the case? Did I do something completely wrong?

Thanks a lot!
gene_x is offline   Reply With Quote
Old 02-20-2013, 03:09 AM   #2
mudshark
Senior Member
 
Location: Munich

Join Date: Jan 2009
Posts: 138
Default

can you provide a scatterplot a$V4 vs b$V4 (probably best to subsample)?

just 2 additional cents: pearson correlations do not provide much information on the similarity of ChIPSeq samples given that most of the data points reflect background reads. in other words you get the correlation coefficient of the background noise.
mudshark is offline   Reply With Quote
Old 02-20-2013, 12:08 PM   #3
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

Simply do cor() instead of cor.test() (which performs a statistical test instead of reporting the correlation). You can do cor(x, method="spearman") for a Spearman rank correlation.

However, the I agree on mudshark's caveat; you should actually not trust the results because most of the correlation will come from zero counts or low non-specific (noise) alignments.
kopi-o is offline   Reply With Quote
Old 02-20-2013, 12:50 PM   #4
gene_x
Senior Member
 
Location: MO

Join Date: May 2010
Posts: 108
Default

Quote:
Originally Posted by mudshark View Post
can you provide a scatterplot a$V4 vs b$V4 (probably best to subsample)?

just 2 additional cents: pearson correlations do not provide much information on the similarity of ChIPSeq samples given that most of the data points reflect background reads. in other words you get the correlation coefficient of the background noise.
Thanks for your suggestion. I've plotted a scatterplot and it doesn't look right. Only 3 dots on the plot.. with the help from people i the lab, I found a bug in my code to calculate RPKM... now that problem is fixed.

Regarding the usefulness of pearson correlations on ChIP-Seq data, I agree that the majority of the regions will have a PRKM value of 0.. do you know how much this affects interpretation of correlation? What would you do to compare two ChIP-Seq? Do you call peaks and calculate the overlap of peaks?

My purpose of doing this is to access how saturated my ChIP-seq data is.. so I split my ChIP-Seq data for ONE sample into 2 subsets and then random sample 10%, 20%, 40%, 60% etc of reads from each subset and I use pearson correlation between RPKM values (two vectors) of 2 subsets as a way to access saturation.. I felt pearson correlation would do the job for this purpose. What do you think?

Thanks so much!

Last edited by gene_x; 02-20-2013 at 01:46 PM.
gene_x is offline   Reply With Quote
Old 02-20-2013, 12:52 PM   #5
gene_x
Senior Member
 
Location: MO

Join Date: May 2010
Posts: 108
Default

Quote:
Originally Posted by kopi-o View Post
Simply do cor() instead of cor.test() (which performs a statistical test instead of reporting the correlation). You can do cor(x, method="spearman") for a Spearman rank correlation.

However, the I agree on mudshark's caveat; you should actually not trust the results because most of the correlation will come from zero counts or low non-specific (noise) alignments.
cor() and cor.test() gave the same correlation coefficient actually.

My problem is fixed now. If you are interested.. please take a look at my reply to mudshark's post and let me know what your thoughts are.

Thanks a lot!
gene_x is offline   Reply With Quote
Old 02-20-2013, 02:51 PM   #6
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

I would recommend you to look at some tools for ChIP-seq data comparison such as MANorm (http://bcb.dfci.harvard.edu/~gcyuan/MAnorm/MAnorm.htm) or perhaps SeqMiner (http://bips.u-strasbg.fr/seqminer/tiki-index.php) or CHANCE (http://www.ncbi.nlm.nih.gov/pubmed/23068444).

The thing is that ChIP-seq data actually is mostly noise (to simplify a bit, a small fraction of the reads are in peaks) so one needs to be careful to really look at the informative part of the data.
kopi-o is offline   Reply With Quote
Old 02-20-2013, 02:56 PM   #7
gene_x
Senior Member
 
Location: MO

Join Date: May 2010
Posts: 108
Default

Quote:
Originally Posted by kopi-o View Post
I would recommend you to look at some tools for ChIP-seq data comparison such as MANorm (http://bcb.dfci.harvard.edu/~gcyuan/MAnorm/MAnorm.htm) or perhaps SeqMiner (http://bips.u-strasbg.fr/seqminer/tiki-index.php) or CHANCE (http://www.ncbi.nlm.nih.gov/pubmed/23068444).

The thing is that ChIP-seq data actually is mostly noise (to simplify a bit, a small fraction of the reads are in peaks) so one needs to be careful to really look at the informative part of the data.
Thanks for the links, I'll take a look at them!
gene_x is offline   Reply With Quote
Old 02-21-2013, 02:47 AM   #8
rory
Member
 
Location: Cambridge, UK

Join Date: Aug 2008
Posts: 28
Default

Another tool to try is the Bioconductor package DiffBind:
http://www.bioconductor.org/packages...c/DiffBind.pdf

It gives you a number of ways to compare ChIP-seq samples using a variety of scores and normalisations. It will do exactly these types of correlations and show how samples cluster. The vignette walks through an example in some detail.

It does rely on some sort of intervals of interest being identified beforehand. You can use traditional peak calling, or a windowing scheme. To compare replicates we often use all the promoters (windows 1000bp upstream from each TSS; while this isn;t always appropriate, it usually gives a good idea of how similar the signal in different samples are.
rory is offline   Reply With Quote
Old 02-21-2013, 08:15 AM   #9
gene_x
Senior Member
 
Location: MO

Join Date: May 2010
Posts: 108
Default

Quote:
Originally Posted by rory View Post
Another tool to try is the Bioconductor package DiffBind:
http://www.bioconductor.org/packages...c/DiffBind.pdf

It gives you a number of ways to compare ChIP-seq samples using a variety of scores and normalisations. It will do exactly these types of correlations and show how samples cluster. The vignette walks through an example in some detail.

It does rely on some sort of intervals of interest being identified beforehand. You can use traditional peak calling, or a windowing scheme. To compare replicates we often use all the promoters (windows 1000bp upstream from each TSS; while this isn;t always appropriate, it usually gives a good idea of how similar the signal in different samples are.
OK. Thanks for the info.
gene_x is offline   Reply With Quote
Old 10-28-2015, 09:12 AM   #10
mirror_hc
Junior Member
 
Location: New York

Join Date: Dec 2013
Posts: 1
Default

Quote:
Originally Posted by rory View Post
Another tool to try is the Bioconductor package DiffBind:
http://www.bioconductor.org/packages...c/DiffBind.pdf

It gives you a number of ways to compare ChIP-seq samples using a variety of scores and normalisations. It will do exactly these types of correlations and show how samples cluster. The vignette walks through an example in some detail.

It does rely on some sort of intervals of interest being identified beforehand. You can use traditional peak calling, or a windowing scheme. To compare replicates we often use all the promoters (windows 1000bp upstream from each TSS; while this isn;t always appropriate, it usually gives a good idea of how similar the signal in different samples are.
Hi Rory,

I googled this thread because I want to you do exactly you suggested using DiffBind. However, a "score" column is needed to create a dba object at the beginning. Obviously, the artificial bed files with all promoters cannot have the score column. Please let me know how you import promoter regions in DiffBind.

Many thanks.
Xin
Mount Sinai School of Medicine

BTW, I tried to use "." in the score column and got the following error message:
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : ‘max’ not meaningful for factors
mirror_hc is offline   Reply With Quote
Old 10-29-2015, 08:25 AM   #11
rory
Member
 
Location: Cambridge, UK

Join Date: Aug 2008
Posts: 28
Default

Try:

> samples<-dba(sampleSheet="samples.csv", scoreCol=0)

This tells DiffBind that there are no scores. Note that everything will get an identical score of 1, so clustering/PCA plots won't be very interesting until you get read-based scores using dba.count().

Cheers-

Rory
rory is offline   Reply With Quote
Reply

Tags
pearson correlation

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:22 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO