SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Correlation between sequencing depth and false positives Jane M Bioinformatics 8 07-30-2012 05:52 AM
the correlation of RNA-seq data. kentnf Bioinformatics 6 07-17-2012 10:08 AM
DESeq: question about with replicates and without any replicates. nb509 RNA Sequencing 2 10-25-2011 06:04 AM
454AlignmentInfo.tsv correlation to contigs Anelda Bioinformatics 8 09-03-2010 03:13 AM
Clusters-reads correlation? cggj Bioinformatics 5 03-10-2010 07:10 PM

Reply
 
Thread Tools
Old 04-10-2012, 10:06 AM   #1
afreund
Junior Member
 
Location: Palo Alto, CA

Join Date: Jan 2012
Posts: 4
Default Pearson correlation between replicates

Hi all,

I'm analyzing some ChIP-seq data, and I'd like to see how different my replicate samples are (pairwise comparison only is fine). To be clear, I want to compare the read density across the whole genome, not just the peaks. I know the process of comparing peaks has been covered in other threads, but I couldn't find a description of how to generate a simple scatterplot of read density at all genomic positions (after normalization), with the corresponding pearson correlation coefficient. It seems to be a common form of analysis, and it looks like people generally use ~200 bp windows. It also seems important to eliminate regions with no reads in either sample, to avoid artificially increasing the correlation. Any suggestions on how to tackle this would be much appreciated!

I'm dealing with FASTQ files, already mapped to hg19.
afreund is offline   Reply With Quote
Old 04-10-2012, 12:59 PM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Sounds like you'll be needing R and one of the packages from Bioconductor that can handle SAM/BAM files (unless you prefer to use the C or whatever API).
dpryan is offline   Reply With Quote
Old 04-10-2012, 01:48 PM   #3
afreund
Junior Member
 
Location: Palo Alto, CA

Join Date: Jan 2012
Posts: 4
Default

Thanks, that's definitely a good start, although a little more detail would be helpful. I have used R in the past, and I've looked around Bioconductor's website, but I'm still not sure how to go forward.

Does anyone know if Galaxy's pre-loaded SAMtools offer a way to do this?
afreund is offline   Reply With Quote
Old 04-10-2012, 06:44 PM   #4
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

thats interesting. Why would you need that correlation? why not just use a statistical model, like the ones in useq or macs, and when you input the replicates as case and control, obtain not many significantly different regions.
__________________
--
bioinfosm
bioinfosm is offline   Reply With Quote
Old 04-10-2012, 08:46 PM   #5
afreund
Junior Member
 
Location: Palo Alto, CA

Join Date: Jan 2012
Posts: 4
Default

Hi bioinfosm,

I'm fairly new to this, so correct me if I'm wrong, but although the two approaches are superficially similar (i.e. they both are a measure of the similarity between two samples), the Pearson correlation coefficient (PCC) is a much more robust measurement because it doesn't rely on an arbitrary significance cutoff, as well as being more quantitative, allowing for simpler comparisons. I'd like to assess the reproducibility of my samples between experiments, conditions, etc, and I believe it's easier to do that with the PCC. From what I've read, it's a fairly standard approach.
afreund is offline   Reply With Quote
Old 04-11-2012, 02:03 AM   #6
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

I think it's very easy to go wrong with simple Pearson correlations on genome-wide ChIP-seq tag count profiles. I would advise to go for something like this recent method: http://genomebiology.com/2012/13/3/R16/abstract

Note, I haven't tried it myself yet
kopi-o is offline   Reply With Quote
Old 04-11-2012, 05:12 AM   #7
tir_al
Member
 
Location: Croatia

Join Date: Sep 2010
Posts: 22
Default

If you want to do genome wide correlation in R, you will need 2 bioconductor packages: RSamtools and GenomicRanges, and also the lengths for the chromosomes from the corresponding genome.

The pipeline should go something like this:

readBam -> convert bam to GRanges -> extend the reads (resize function) -> make the coverage (coverage function). Now you need to smooth the coverage a bit because, otherwise the resulting vector will be to big. This can be done easily using the first function in the Rle tips and tricks manual.
Repeat that for each sample, put all of the data in a data frame, and use the cor function on the resulting object.

I would advise you to go chromosome by chromosome (if you have an indexed bam file), and aggregate the data only before doing the correlation analysis.

If you want to do a comparison of peak regions, a very cool method was published in Bioinformatics recently: An effective statistical evaluation of ChIPseq dataset similarity.

Cheers!
tir_al is offline   Reply With Quote
Old 04-11-2012, 01:18 PM   #8
afreund
Junior Member
 
Location: Palo Alto, CA

Join Date: Jan 2012
Posts: 4
Default

Thanks very much, tir_al, I'll give that a try!
afreund is offline   Reply With Quote
Old 06-05-2014, 12:17 AM   #9
archi
Member
 
Location: India

Join Date: Apr 2014
Posts: 16
Default

Hi all, I'm also doing analysis of some nextgen data. right now,I'm handling fastq files of three experimental replicates and want to calculate their correlation(i.i correlation of the three replicates).
Can someone please guide on how to go about this in R using or without using Bioconductor.

I'm from a math background and know how to calculate pearson correlation coefficient between two set of numbers,but how to do that for fastq data??


thanks in advance.
archi is offline   Reply With Quote
Old 06-05-2014, 12:38 AM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The short answer is that you don't.

Firstly, map the reads to the appropriate genome. Then perform whatever type of quantification you need (for ChIP-seq, this would be peak calling; for RNA-seq this would be counting reads per gene/transcript/whatever; etc.) and then calculate the correlation from the resulting metrics.
dpryan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:09 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO