Old 10-29-2009, 02:57 PM   #1
Junior Member
Location: san diego

Join Date: Oct 2009
Posts: 1
normalization of ChIP-seq data

Hi all,

If I may, I would like to ask a more general question about how to
normalize ChIP-seq data when comparing multiple experiments. Considering
the Poisson-distribution-based models for peak finding, my question is
the following:

- assuming there are 2 ChIP-seq experiments :

<> minus treatment : 10 mil tags : 10000 peaks
<> plus treatment : 4 mil tags: 3000 peaks

- although the data are differently saturated, could we still compare
10,000 peaks vs 3,000 peaks and say, for instance, that 8,000 peaks are
lost with the treatment, 2,000 peaks remain unchanged and 1,000 peaks
are gained?
- assuming that the cut-off to call a peak is different for minus
treatment (let's say 6 tags) vs plus treatment (let's say 4 tags),
would the comparison described above be statistically legitimate?

Thanks a lot,
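
To illustrate why the tag cut-off shifts with sequencing depth, here is a minimal sketch that assumes a uniform Poisson background. The genome size (3e9 bp), window width (200 bp) and p-value cutoff (1e-5) are illustrative assumptions, not values from the experiments above, and the function names are made up for this example. Real peak callers use more elaborate local background models.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def background_rate(total_tags, window_bp, genome_bp):
    """Expected tags per window if tags were spread uniformly (an assumption)."""
    return total_tags * window_bp / genome_bp

def min_tags_for_peak(lam, p_cutoff):
    """Smallest tag count k whose Poisson tail probability is <= p_cutoff."""
    k = 0
    while poisson_sf(k, lam) > p_cutoff:
        k += 1
    return k

if __name__ == "__main__":
    # Hypothetical settings: 3e9 bp genome, 200 bp windows, p <= 1e-5.
    for name, tags in [("minus treatment", 10_000_000), ("plus treatment", 4_000_000)]:
        lam = background_rate(tags, 200, 3_000_000_000)
        print(name, "needs at least", min_tags_for_peak(lam, 1e-5), "tags per window")
```

Under these assumptions the deeper library needs more tags per window to reach the same significance level, so a different raw cut-off per sample does not by itself make the comparison unfair. It also does not make the peak counts comparable, since the shallower sample has less power to detect weak sites.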

tanasabogdan
Old 10-30-2009, 01:09 AM   #2
Junior Member
Location: California

Join Date: Sep 2008
Posts: 5

IMHO the answer is no. There are several issues I can think of.

First, you cannot compare the number of peaks or get any statistics directly without an estimate of replication noise. If you had replicate experiments for at least one (ideally both) conditions and you ran peak calls on the replicates, you could get an estimate of the variance of the called peaks.

Secondly, it depends on your p-value/enrichment cutoff. If you are restricting to very strong peaks then you could potentially compare the numbers. The reason is that as you relax your threshold for calling peaks, the different experiments could bleed in noisy peaks at different rates. So a tiny change in the p-value threshold could cause massive differences in the number of peaks called. For example, I have seen ample cases of biological and technical replicates of the same experiment giving quite different numbers of peaks for the same threshold with the same peak caller program. The strongest peaks tend to agree, but as you go down the list the consistency gets worse.

Also, hopefully both experiments share a common control; otherwise a head-to-head comparison becomes even harder.

Ideally, you want to rank your peaks by their enrichment/p-value and compute rank statistics on that to estimate how different the two experiments are.
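
One way to sketch the rank-based comparison is a Spearman rank correlation over regions scored in both experiments. This is only one possible rank statistic, not necessarily the one meant above. It assumes you already have one enrichment score per region for each experiment, in the same region order; the function names are made up for this example.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(scores_a, scores_b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(scores_a), average_ranks(scores_b))
```

A value near 1 means the two experiments order the regions similarly. Computing it separately on the strongest peaks and on the full list shows how quickly agreement decays down the ranking.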
akundaje
Old 08-06-2011, 06:00 PM   #3
Location: asia

Join Date: Dec 2009
Posts: 80

May I add another question? I have the same scenario: two ChIP-seq datasets from two different experiments (one in brain and one in heart). The brain ChIP-seq has 10 million tags and the heart one 4 million tags. I want to plot the raw number of tags around promoters, but because of the difference in tag numbers I see no pattern, just a flat line at the top and another at the bottom.

I tried to normalize as follows, but it didn't work at all. Any ideas about normalizing ChIP-seq samples with different numbers of tags from two different experiments?

position_cDNAnorm = (position_cDNA / sum_cDNA) * average_sum_cDNA

* position_cDNAnorm = normalised cDNA value for specific position and specific DBP
* position_cDNA = cDNA value for specific position and specific DBP
* sum_cDNA = total cDNA count for specific DBP
* average_sum_cDNA = average of total cDNA counts of all DBPs
DBP = DNA-binding protein (transcription factor)
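
For reference, the formula above can be sketched as follows. The function and argument names are made up for this example, and the totals are assumed to be the full tag counts of each sample rather than the sum of the promoter profile.

```python
def normalize_profiles(profiles, totals):
    """Scale per-position counts to the average library size.

    profiles: dict mapping sample name -> list of raw counts per position
    totals:   dict mapping sample name -> total tag count of that sample
    """
    average_total = sum(totals.values()) / len(totals)
    return {
        name: [count / totals[name] * average_total for count in counts]
        for name, counts in profiles.items()
    }

if __name__ == "__main__":
    # Hypothetical counts at two positions; totals as in the post above.
    profiles = {"brain": [10, 20], "heart": [4, 8]}
    totals = {"brain": 10_000_000, "heart": 4_000_000}
    print(normalize_profiles(profiles, totals))
```

This only corrects for sequencing depth. If the two samples differ in signal-to-noise ratio or antibody efficiency, depth scaling alone will not bring the profiles onto the same scale.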
repinementer
Old 08-06-2011, 11:08 PM   #4
Senior Member
Location: Western Australia

Join Date: Feb 2010
Posts: 310

I completely agree with akundaje. I would like to emphasize his point that even if you have the same number of tags from two biological or even technical replicates and compare them, you will get different peaks called. Replicates will help weed out the borderline peaks. The number of peaks called does not scale linearly with read count.

Since borderline peaks vary around the arbitrary peak cut-off, I think the thing to do is to run your two ChIP samples through your peak finder as 'treatment' and 'control'. This will identify significant differences between the samples. However, some of these differences may come from differences in chromatin structure and shearing efficiency rather than transcription factor binding, so a second step is required. Take your significant differences and intersect them with your list of peaks, and you should end up with a list of real differences between the two conditions.
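
The intersection step could be sketched like this, assuming both lists are (chrom, start, end) tuples with half-open coordinates. The names are made up for this example, and the quadratic loop is only reasonable for small lists; a dedicated interval tool is the better choice for genome-scale data.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def differential_peaks(diff_regions, peaks):
    """Keep only the peaks that overlap at least one differential region."""
    return [p for p in peaks if any(overlaps(p, d) for d in diff_regions)]

if __name__ == "__main__":
    # Hypothetical regions from the sample-vs-sample run and the normal peak list.
    diff_regions = [("chr1", 100, 200), ("chr2", 50, 80)]
    peaks = [("chr1", 150, 250), ("chr1", 200, 300), ("chr2", 10, 60)]
    print(differential_peaks(diff_regions, peaks))
```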

You should still normalize the read counts and get some replicates.

I made a blog post on my new blog on this subject. So here is the shameless link to it:

This seems like a pretty good way to go about addressing the question at hand, but there may be better ways.

Last edited by ETHANol; 08-07-2011 at 05:12 AM.
ETHANol
Old 08-07-2011, 10:42 AM   #5
Junior Member
Location: Berlin

Join Date: Apr 2011
Posts: 6

In my experience no clear statement can be made without replicates. For example, we had two replicates with about 4K peaks each, and only 100 peaks overlapped. That already tells you quite a lot about peak calling and its interpretation. I saw the same thing when looking at ChIP-seq data from others, but folks tend to pool their replicates before peak calling to get around that. Anyway, if I see people using peak numbers to answer biological questions, the first thing I do is look at the raw data (if it is available). In all but one case I would say that the peak numbers mean nothing.

Another case was the analysis of cells with very low TF protein level upon treatment (like in a KO situation). Peak calling returned twice as many peaks in that situation as in untreated cells with TF binding and normal protein levels.

I did not find any answers on how to rank my peaks to compare different treatments. For me it worked quite well to plot the tag enrichments (Input, IgG, Treated, Untreated) ±3 kb around my peaks in a heat map and do k-means clustering. That identified strongly enriched sites I can trust.
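
A rough sketch of the clustering step, assuming each peak has already been turned into a row of binned tag counts across the ±3 kb window (the binning and the heat map are not shown). This is plain Lloyd's k-means with a deterministic seeding I picked for reproducibility, not necessarily the exact procedure used above; a library implementation would be the more practical choice for real data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iterations=100):
    """Cluster rows into k groups; returns (labels, centroids).

    Seeding: the first row, then repeatedly the row farthest from
    the seeds chosen so far (a simple deterministic choice).
    """
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(
            rows, key=lambda r: min(squared_distance(r, c) for c in centroids)
        )
        centroids.append(list(farthest))
    labels = [None] * len(rows)
    for _ in range(iterations):
        # Assign every row to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(r, centroids[j]))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the column-wise mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Rows that end up in a cluster with high counts in the treated or untreated track and low counts in Input and IgG would correspond to the strongly enriched sites described above.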
howi
Old 08-22-2011, 02:43 PM   #6
Location: usa

Join Date: May 2011
Posts: 59


It is a good solution somehow. But in my case, I would like to compare the two samples to see if these two samples are similar or different. It might need some statistical calculation I guess.
Any suggestions will be highly appreciated.
emilyjia2000
