SEQanswers

02-26-2013, 03:49 AM   #1
inesdesantiago
Member
 
Location: LONDON, UNITED KINGDOM

Join Date: Jan 2009
Posts: 44
"Offset" in the correlation between two ChIP-seq biological replicates

Hello.
I was looking at the correlation of signal intensities between two biological replicates (ChIP-Seq) and I got a strange plot. I was wondering if anyone has seen something like this before, or if anyone has any ideas of why this is happening.

I counted the number of reads in 1 kb windows across the whole genome and plotted replicate 1 vs. replicate 2. The image shows both the raw signal and the normalized "reads per million" signal.
There seems to be some kind of an "offset" between two trend lines in the plot:



I have looked at the GC content, and position relative to genes (exons, introns, TSS, TTS) of the windows that are in the two different "offsets", and nothing came out as significantly different.
Any ideas would be very much appreciated!
Many thanks!
ines
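
A minimal sketch of the windowed counting and "reads per million" scaling described in this post. It is not the poster's actual code: the function names and the toy input are hypothetical, and each read is assigned to the window containing its 0-based start position, which is only one of several reasonable conventions.

```python
from collections import Counter

WINDOW = 1000  # 1 kb windows, as in the post


def count_reads_per_window(reads, window=WINDOW):
    """Count reads per (chromosome, window index).

    `reads` is an iterable of (chrom, start) tuples with 0-based start
    positions; each read is assigned to the window containing its start.
    """
    counts = Counter()
    for chrom, start in reads:
        counts[(chrom, start // window)] += 1
    return counts


def reads_per_million(counts):
    """Scale raw window counts by library size (total counted reads)."""
    total = sum(counts.values())
    return {key: n * 1_000_000 / total for key, n in counts.items()}


# Toy input standing in for the aligned reads of one replicate (assumption).
toy_reads = [("chr1", 10), ("chr1", 950), ("chr1", 1000), ("chr2", 2500)]
raw = count_reads_per_window(toy_reads)
rpm = reads_per_million(raw)
```

Running the same two functions on each replicate gives the paired per-window values that go on the two axes of the scatter plot.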
02-26-2013, 04:45 AM   #2
pmcget
Member
 
Location: Dublin, Ireland

Join Date: Nov 2007
Posts: 28

I think you should take a look at some of the extreme points in the plot in a genome browser and see if the raw data shows any evidence of weirdness, e.g. PCR artifacts, which would be visible as read stacking (same read sequence, same start and end position).
If this were the case, then removing duplicate reads from the data would reduce or eliminate the effect in the plots.

If the reads in both samples look normal, then maybe manually check a few peaks to make sure that the code you have written to generate the plots is correct.

Maybe, rather than plotting for the whole genome, break it out by chromosome and see whether the effect is genome-wide or localized?
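
The per-chromosome breakdown suggested above could be sketched as follows. This is an illustration only: it assumes each replicate is held as a dictionary mapping a hypothetical (chromosome, window index) key to a count, and it reports one Pearson correlation per chromosome.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None if either vector has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def correlation_by_chromosome(rep1, rep2):
    """rep1/rep2 map (chrom, window_index) -> count; missing windows count as 0."""
    per_chrom = defaultdict(lambda: ([], []))
    for key in set(rep1) | set(rep2):
        xs, ys = per_chrom[key[0]]
        xs.append(rep1.get(key, 0))
        ys.append(rep2.get(key, 0))
    return {chrom: pearson(xs, ys) for chrom, (xs, ys) in per_chrom.items()}


# Toy counts (assumption): chr1 agrees between replicates, chrY does not.
rep1 = {("chr1", 0): 1, ("chr1", 1): 2, ("chr1", 2): 3, ("chrY", 0): 5, ("chrY", 1): 1}
rep2 = {("chr1", 0): 2, ("chr1", 1): 4, ("chr1", 2): 6, ("chrY", 0): 1, ("chrY", 1): 5}
by_chrom = correlation_by_chromosome(rep1, rep2)
```

A chromosome whose correlation stands apart from the rest would point to a localized effect; similar values everywhere would point to a genome-wide one.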
02-26-2013, 06:32 AM   #3
inesdesantiago
Member
 
Location: LONDON, UNITED KINGDOM

Join Date: Jan 2009
Posts: 44

Hello pmcget,
I tried the same plot with and without duplicates. The one I posted is without duplicates, which I removed using Picard.

I don't think the code has any error. I tried the same code with other datasets and the plots look fine, with good correlations between biological replicates. I also tried different software (DiffBind) to count reads and the same plot is created. I used DiffBind to count reads in regions that are called as peaks in both replicates, so essentially, instead of considering 1 kb windows across the genome, DiffBind considers only the genomic regions overlapped by peaks. But the same plot is produced, with that "double" correlation.

I just looked at the correlation plot by chromosome, and here is the result:
02-26-2013, 07:03 AM   #4
pmcget
Member
 
Location: Dublin, Ireland

Join Date: Nov 2007
Posts: 28

Is it possible that your peaks are enriched for repetitive sequences? There is a lot of inter-individual variation in some of these sequences, e.g. microsatellites/STRs.

You could look at the overlap of the two offset groups with the overall RepeatMasker track, or even with subtypes of repeat.

You could also do a quick check of some of the extreme peaks and see if the reads are piling up over regions that are annotated as repeats, e.g. upload the raw reads and peak information into IGV and then load the RepeatMasker track from UCSC.
It would be very useful to see an example of the 2 sorts of peaks in each replicate (i.e. the raw reads for the peak in a genome viewer).

Are these biological samples from normal tissue/diseased tissue/cell lines/cancer lines? The biological origin of the samples might give some clue...

Maybe the cells in one of your replicates are undergoing synchronized mitosis and the ChIP'd protein is a marker of this process?
02-26-2013, 12:16 PM   #5
inesdesantiago
Member
 
Location: LONDON, UNITED KINGDOM

Join Date: Jan 2009
Posts: 44

Hello pmcget,
These are T47D cells, a human breast cancer cell line, and the ChIP-Seq is for the transcription factor FOXA1.
I have checked a few peaks against the RepeatMasker track, and I didn't find any striking difference or piling up of reads (at least by browsing the regions by eye).

In the meantime, I started to consider a batch effect.
The two biological replicates were done on two different days, alongside other ChIP-Seq experiments.

I tried to remove the batch effects using limma in R with the "removeBatchEffect" function.
My matrix consists of 6 columns and ~2 million rows (each row is a 1 kb window). The 6 columns correspond to 3 ChIP-seq experiments prepared in duplicate, with each duplicate prepared on a different day.

The design matrix is something like this:
sample  cell  replicate  batch
1  T47D  rep1  1
2  T47D  rep2  2
3  MCF7  rep1  1
4  MCF7  rep2  2
5  ZR751  rep1  1
6  ZR751  rep2  2
After running the batch effect correction (limma), I re-plotted the pairwise comparison between two biological replicates (e.g. the two replicates in T47D cells) and it looks a bit better:



I am not sure how to interpret this.. but it seems like the batch effect pushes some of the signal apart and creates that weird looking plot..
What are your thoughts?

Thanks very much!
ines
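
To make the idea of a per-window batch correction concrete, here is a simplified sketch for the 6-sample layout above. It is not limma's implementation and not the code used in this post: the function name and the toy values are hypothetical, the values are assumed to be on a log scale, and the correction simply removes each batch's mean shift within one row (one 1 kb window).

```python
def remove_batch_shift_row(values, batches):
    """Centre each batch of one window's values on the row's overall mean.

    Illustration only: removes a per-batch mean shift within a single row.
    """
    overall = sum(values) / len(values)
    batch_mean = {}
    for b in set(batches):
        members = [v for v, vb in zip(values, batches) if vb == b]
        batch_mean[b] = sum(members) / len(members)
    return [v - (batch_mean[b] - overall) for v, b in zip(values, batches)]


# Column order follows the table: T47D rep1/rep2, MCF7 rep1/rep2, ZR751 rep1/rep2.
batches = [1, 2, 1, 2, 1, 2]
# Toy log-scale values for one window (assumption): batch 2 sits 1 unit higher.
row = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
corrected = remove_batch_shift_row(row, batches)
```

Because a correction like this is fitted per window, it cannot tell a technical shift between days apart from a real difference in enrichment between the two preparations.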
02-26-2013, 01:01 PM   #6
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324

Looks like your ChIP worked very well and that the enrichment is 1.5x higher in replicate 2. Most regions will be negative and have about the same read numbers, with repeats giving the lower arm in the plot. This is evident from chrY, where there are no true binding sites but only misaligned reads.
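
A toy model can illustrate how this explanation produces two arms. The sketch below is an assumption-laden illustration, not an analysis of the actual data: it supposes that background reads (including misaligned reads over repeats) are equal in both replicates, that only the true ChIP signal is 1.5x higher in replicate 2, and all window values are made up.

```python
ENRICHMENT_RATIO = 1.5  # replicate 2 vs. replicate 1, the figure quoted above


def simulate_window(background, signal, ratio=ENRICHMENT_RATIO):
    """Return (rep1, rep2) signal for one window.

    Background is assumed equal in both replicates; only the ChIP signal
    scales with `ratio`.
    """
    return background + signal, background + ratio * signal


# Hypothetical windows: (label, background, signal)
windows = [
    ("negative", 2, 0),
    ("repeat", 200, 0),      # high in both replicates, no true binding
    ("weak peak", 2, 20),
    ("strong peak", 2, 200),
]

for label, background, signal in windows:
    rep1, rep2 = simulate_window(background, signal)
    print(f"{label:12s} rep1={rep1:6.1f} rep2={rep2:6.1f} rep2/rep1={rep2 / rep1:.2f}")
```

Under these assumptions, windows without binding fall on the diagonal however many reads they collect, while bound windows approach a line with slope 1.5 as the signal dominates the background.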
02-26-2013, 04:11 PM   #7
inesdesantiago
Member
 
Location: LONDON, UNITED KINGDOM

Join Date: Jan 2009
Posts: 44

Good point.. The chrY shows the negative regions, and they correspond to the lower arm of the plot.
Thanks!
02-27-2013, 03:24 AM   #8
pmcget
Member
 
Location: Dublin, Ireland

Join Date: Nov 2007
Posts: 28

Hi inesdesantiago,

I think Chipper is correct.

If you want to graphically investigate it further, you could identify all the windows with a substantial repeat overlap, e.g. using BEDTools intersectBed.

Then you could colour the repeat-enriched windows differently from the non-repeats in your plot and see whether they segregate into the two arms of the plot.
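
The labelling step above can be sketched in plain Python as a stand-in for an interval tool. This is an illustration under stated assumptions: the function names are hypothetical, intervals are half-open and on a single chromosome, the repeat intervals do not overlap each other, and the 0.5 cutoff for "substantial" overlap is arbitrary.

```python
def overlap_fraction(window, repeats):
    """Fraction of a half-open window (start, end) covered by repeats.

    `repeats` is a list of non-overlapping half-open (start, end) intervals
    on the same chromosome as the window.
    """
    w_start, w_end = window
    covered = sum(
        max(0, min(w_end, r_end) - max(w_start, r_start))
        for r_start, r_end in repeats
    )
    return covered / (w_end - w_start)


def label_windows(windows, repeats, threshold=0.5):
    """Tag each window 'repeat' or 'non-repeat'; the 0.5 cutoff is arbitrary."""
    return [
        "repeat" if overlap_fraction(w, repeats) >= threshold else "non-repeat"
        for w in windows
    ]


# Toy 1 kb windows and repeat annotations (assumption).
windows = [(0, 1000), (1000, 2000), (2000, 3000)]
repeats = [(100, 300), (1200, 1900), (1950, 2100)]
labels = label_windows(windows, repeats)
```

The resulting labels can then be mapped to two colours when drawing the replicate 1 vs. replicate 2 scatter plot.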
02-27-2013, 09:20 AM   #9
inesdesantiago
Member
 
Location: LONDON, UNITED KINGDOM

Join Date: Jan 2009
Posts: 44

Hello.
Looking at different classes of repeats, it seems like satellite repeats are enriched in the lower arm. They are also quite common on the Y chromosome.
Thanks!
ines

Last edited by inesdesantiago; 02-28-2013 at 02:44 AM.

Tags
chip-seq analysis, correlation, replicates
