  • normalization of ChIP-seq data

    Hi all,

    If I may add a more general question about how to normalize ChIP-seq
    data when comparing multiple experiments. Considering the Poisson
    distribution based models for peak finding, my question is the
    following:

    - Assume there are 2 ChIP-seq experiments:

    <> minus treatment: 10 million tags, 10,000 peaks
    <> plus treatment: 4 million tags, 3,000 peaks

    - Although the two data sets are saturated to different degrees, could
    we still compare 10,000 peaks vs 3,000 peaks and say, for instance,
    that 8,000 peaks are lost with the treatment, 2,000 peaks remain
    unchanged, and 1,000 peaks are gained?
    - Assuming that the cut-off to call a peak differs between minus
    treatment (let's say 6 tags) and plus treatment (let's say 4 tags),
    would the comparison described above be statistically legitimate?

    Thanks a lot,

    Bogdan
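
A minimal sketch of why the tag cutoff can shift with sequencing depth under a simple Poisson background model. The window size, genome size, and p-value cutoff below are illustrative assumptions, not values from the thread, and real peak callers typically use more elaborate background estimates.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), from the exact probability mass function."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def min_tags_for_peak(total_tags, window_bp, genome_bp, p_cutoff):
    """Smallest tag count in a window whose Poisson tail probability is <= p_cutoff."""
    # Expected background tags per window if tags fell uniformly on the genome.
    lam = total_tags * window_bp / genome_bp
    k = 1
    while poisson_sf(k, lam) > p_cutoff:
        k += 1
    return k

# Hypothetical parameters: 200 bp windows, 2.7 Gb mappable genome, p <= 1e-5.
WINDOW_BP = 200
GENOME_BP = 2_700_000_000
P_CUTOFF = 1e-5

deep_cutoff = min_tags_for_peak(10_000_000, WINDOW_BP, GENOME_BP, P_CUTOFF)
shallow_cutoff = min_tags_for_peak(4_000_000, WINDOW_BP, GENOME_BP, P_CUTOFF)
print("10M-tag library cutoff:", deep_cutoff)
print("4M-tag library cutoff:", shallow_cutoff)
```

Under these assumptions the deeper library needs more tags per window to reach the same p-value, which is why a fixed tag count does not mean the same thing in the two experiments.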

  • #2
    IMHO the answer is no. There are several issues I can think of.

    First, you cannot compare the number of peaks or get any statistics directly without an estimate of replication noise. If you had replicate experiments for at least one (ideally both) conditions and you ran peak calls on the replicates, you could get an estimate of the variance of the called peaks.

    Secondly, it depends on your p-value/enrichment cutoff. If you restrict yourself to very strong peaks, then you could potentially compare the numbers. The reason is that as you relax your threshold for calling peaks, the different experiments can bleed in noisy peaks at different rates, so a tiny change in the p-value threshold can cause massive differences in the number of peaks called. For example, I have seen ample cases of biological and technical replicates of the same experiment giving quite different numbers of peaks for the same threshold with the same peak caller program. The strongest peaks tend to agree, but as you go down the list the consistency gets worse.

    Also, hopefully the two experiments share a common control; otherwise a head-to-head comparison becomes even harder.

    Ideally, you want to rank your peaks by their enrichment/p-value and compute rank statistics on that to estimate how different the two experiments are.
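
One possible reading of the rank-based suggestion above is a Spearman rank correlation over regions that were scored in both experiments. This is only a sketch, not a method prescribed in the thread, and the enrichment scores are made-up values for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(scores_a, scores_b):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(scores_a), average_ranks(scores_b))

# Hypothetical enrichment scores for the same six regions in two experiments.
minus_treatment = [42.0, 30.5, 18.2, 9.7, 5.1, 3.3]
plus_treatment = [35.1, 33.0, 4.0, 8.8, 6.2, 2.9]
rho = spearman(minus_treatment, plus_treatment)
print("Spearman rho:", round(rho, 3))
```

A value near 1 would indicate that the two experiments order the shared regions similarly; replicates would still be needed to judge how much disagreement is just noise.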



    • #3
      May I add another question? I have the same scenario: two ChIP-seq data sets from two different experiments (one in brain and one in heart). The brain ChIP-seq has 10 million tags and the heart ChIP-seq has 4 million tags. I want to plot the raw number of tags around promoters, but because of the difference in tag numbers I get no pattern, only a flat line at the top and another at the bottom.

      I tried to normalize in the following way, but it didn't work at all. Any ideas about normalizing ChIP-seq samples with different numbers of tags from 2 different experiments?

      position_cDNAnorm = (position_cDNA / sum_cDNA) * average_sum_cDNA

      * position_cDNAnorm = normalised cDNA value for specific position and specific DBP
      * position_cDNA = cDNA value for specific position and specific DBP
      * sum_cDNA = total cDNA count for specific DBP
      * average_sum_cDNA = average of total cDNA counts of all DBPs
      DBP = DNA-binding protein (transcription factor)
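
The formula above can be sketched as follows. The sample names and counts are hypothetical; `totals` plays the role of sum_cDNA and `average_total` the role of average_sum_cDNA.

```python
def scale_to_average_depth(counts, totals):
    """Scale per-position tag counts so every sample matches the average library size.

    counts: dict mapping sample name -> list of tag counts per position
    totals: dict mapping sample name -> total tag count for that sample
    """
    average_total = sum(totals.values()) / len(totals)
    return {
        sample: [c / totals[sample] * average_total for c in positions]
        for sample, positions in counts.items()
    }

# Hypothetical tag counts at three promoter positions.
counts = {"brain": [20, 50, 10], "heart": [8, 20, 4]}
totals = {"brain": 10_000_000, "heart": 4_000_000}

normalized = scale_to_average_depth(counts, totals)
for sample, values in normalized.items():
    print(sample, [round(v, 2) for v in values])
```

In this toy input the two profiles coincide after scaling because they were chosen to be proportional to library size. Scaling by total tag count corrects only for sequencing depth, not for differences in enrichment efficiency or background between experiments.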



      • #4
        I completely agree with akundaje. I would like to emphasize his point that even if you have the same number of tags from two biological or even technical replicates and compare them, you will get different peaks called. Replicates will help weed out the borderline peaks. The relationship between read count and the number of peaks called is not linear.

        Because the set of borderline peaks varies with the arbitrary peak cutoff, I think the thing to do is to run your two ChIP samples through your peak finder as 'treatment' and 'control'. This will identify significant differences between the samples. However, some of these differences may come from differences in chromatin structure and shearing efficiency rather than transcription factor binding, so a second step is required. Take your significant differences and intersect them with your list of peaks, and you should end up with a list of real differences between the two conditions.
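
The second step described above (intersecting the significant differences with the peak list) could look roughly like this. It is a sketch under assumed inputs: regions are hypothetical `(chrom, start, end)` tuples with half-open coordinates, and the simple all-against-all comparison is only meant for small lists.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def peaks_with_differences(peaks, diff_regions):
    """Keep the called peaks that overlap at least one differential region."""
    return [p for p in peaks if any(overlaps(p, d) for d in diff_regions)]

# Hypothetical peaks called against input in one condition.
peaks = [
    ("chr1", 1000, 1400),
    ("chr1", 5000, 5300),
    ("chr2", 200, 600),
]
# Hypothetical regions from running one ChIP sample against the other.
diff_regions = [
    ("chr1", 1200, 1500),   # overlaps the first peak
    ("chr1", 8000, 8200),   # difference outside any peak, e.g. a chromatin effect
    ("chr2", 600, 900),     # only touches the chr2 peak boundary, no shared base
]

real_differences = peaks_with_differences(peaks, diff_regions)
print(real_differences)
```

For genome-scale lists, a sorted sweep or a dedicated interval tool would be a more practical choice than this quadratic loop.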

        You should still normalize the read counts and get some replicates.

        I made a blog post on my new blog on this subject. So here is the shameless link to it:
        This is a question people seem to be having some difficult with, as I’ve seen it asked a few times on SeqAnswers. You have results from two ChIP-seq experiments.  For example, you want to know if N…


        This seems like a pretty good way to go about addressing the question at hand, but there may be better ways.
        Last edited by ETHANol; 08-07-2011, 04:12 AM.
        --------------
        Ethan



        • #5
          In my experience no clear statement can be made without replicates. For example, we had two replicates with about 4K peaks, and the overlap between them was only 100 peaks. That already tells you quite a lot about peak calling and its interpretation. I saw the same thing when looking at ChIP-seq data from others, but people tend to pool their replicates before peak calling to get around it. Anyway, when I see people using peak numbers to answer biological questions, the first thing I do is look at the raw data (if it is available). In all but one case I would say that the peak numbers meant nothing.

          Another case was the analysis of cells with very low TF protein levels upon treatment (like in a KO situation). Peak calling reported twice as many peaks for that condition as for untreated cells with TF binding and normal protein levels.

          I did not find any answers on how to rank my peaks to compare different treatments. For me it worked quite well to plot the tag enrichments (Input, IgG, Treated, Untreated) ±3 kb around my peaks in a heat map and do k-means clustering. That identified strongly enriched sites I can trust.
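
A minimal sketch of the clustering step mentioned above, using plain Lloyd's k-means on binned enrichment profiles. The tool actually used in the post is not stated; the profiles below are hypothetical, with each row holding binned tag enrichment around one peak.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iterations=50, seed=0):
    """Lloyd's k-means; returns (labels, centers) for a list of numeric rows."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical profiles: three bins spanning the window around each peak.
profiles = [
    [10.0, 50.0, 10.0],  # strongly enriched at the peak center
    [12.0, 48.0, 11.0],  # strongly enriched at the peak center
    [1.0, 1.0, 1.0],     # flat, background-like
    [2.0, 1.0, 2.0],     # flat, background-like
]
labels, centers = kmeans(profiles, k=2)
print(labels)
```

Rows sharing a label can then be grouped in the heat map, so that the strongly enriched cluster stands apart from background-like sites.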



          • #6
            ETHANol,

            That is a reasonable solution. But in my case, I would like to compare the two samples to see whether they are similar or different. I guess that would need some statistical calculation.
            Any suggestions would be highly appreciated.
            Thanks
