
  • Fastqc on Chip-Seq library: confusion

    Hello everybody,

    I am working on my first ever ChIP-Seq experiment (transcription factor binding, sequenced on a HiSeq with 51 bp single-end reads) and at the moment I am looking at my libraries using FastQC.

    Among several fails that I could track down the reasons for, the following seems odd to me (pictures are attached):

    Under Sequence content across all bases I find that my data seems quite AT rich.
    Then, under Sequence duplication level, I find that the duplication level is above 95%.
    As suggested in various posts in this forum, I read up on this following this link:


    Is it likely that in the course of library prep or sequencing a bias was created that I now find as a duplication of lots of AT rich reads?
    And if that could be the case, how could I confirm this?

    Maybe I should add that my bioinformatics level is very low, so at the moment I rely solely on the functions found in fastqc and anything that has a GUI.

    Thanks a lot for your input!

    Tobias

  • #2
    What species are you working with and what is the normal GC content of that species' genome? Your Sequence content plot looks perfectly normal if the genome of your species of interest has a GC content of 40%, lots of species do.
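The base-composition check described above can also be done by hand on a few reads. Below is a minimal sketch, assuming the reads are already available as plain strings; the function name `gc_fraction` and the example reads are hypothetical and not part of FastQC. It computes the overall GC fraction so that it can be compared with the known GC content of the genome.

```python
def gc_fraction(reads):
    """Return the fraction of G/C bases among all A/C/G/T bases in the reads.

    Ambiguous bases such as N are ignored in both numerator and denominator.
    """
    gc = 0
    total = 0
    for read in reads:
        seq = read.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return gc / total if total else 0.0


# Made-up reads for illustration only (not data from this thread).
example_reads = ["ATATGCAT", "TTAACGTA", "NNATATAT"]
print(f"GC fraction: {gc_fraction(example_reads):.3f}")
```

A value close to the genome-wide GC content suggests that an AT-rich looking plot simply reflects the organism rather than a bias.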

    Regarding the Sequence duplication plot that may be entirely expected as well. You are doing a ChIP-Seq experiment. How many total sequences did you generate? How big is the genome of your organism? How big is the total target size of your ChIP enrichment? This plot may simply indicate that the target size you were enriching for is not that large and your ChIP enrichment worked very well. If you sequenced very deeply (e.g. 200 million reads) on such a small target you are inevitably going to get a lot of duplicate reads.

    These plots can not be properly interpreted without a more thorough understanding of the biology of your system and what steps were carried out to generate your sequence data.
    Last edited by kmcarr; 03-23-2013, 03:51 AM.



    • #3
      Originally posted by kmcarr
      What species are you working with and what is the normal GC content of that species' genome? Your Sequence content plot looks perfectly normal if the genome of your species of interest has a GC content of 40%, lots of species do.

      I am working in mouse.

      Regarding the Sequence duplication plot that may be entirely expected as well. You are doing a ChIP-Seq experiment. How many total sequences did you generate? How big is the genome of your organism?

      The total genome size should be 2,644,093,988 bases. The total number of reads obtained for the data I posted previously is 30,223,517.

      How big is the total target size of your ChIP enrichment?
      This plot may simply indicate that the target size you were enriching for is not that large and your ChIP enrichment worked very well. If you sequenced very deeply (e.g. 200 million reads) on such a small target you are inevitably going to get a lot of duplicate reads.


      If by target size you are referring to the size the chromatin was fragmented to, then the answer is around 150 bp.

      These plots can not be properly interpreted without a more thorough understanding of the biology of your system and what steps were carried out to generate your sequence data.

      In this ChIP-Seq experiment, I used a Bioruptor to shear the chromatin of mouse neural stem cells to a size of 150 bp after crosslinking. From the following immunoprecipitation I aimed at 2 biological replicates with a yield of 5 nanograms of dsDNA as determined by PicoGreen assay. This DNA, as well as the input and IgG controls, went into an Illumina TruSeq ChIP Sample Prep Kit, and the libraries were then evenly pooled into a 4-plex library for single-end sequencing on one lane of a HiSeq 2000 flow cell. The yield from all libraries was between 1,500 and 2,100 Mbases with 30,000,000 to 42,000,000 reads.
      I am currently unsure about how the starting DNA was treated in terms of PCR conditions, as this and the library prep were carried out by a commercial service, but I am about to find out.

      Thank you very much again for your help!

      Tobias



      • #4
        Originally posted by Tobikenobi
        Originally posted by kmcarr
        What species are you working with and what is the normal GC content of that species' genome? Your Sequence content plot looks perfectly normal if the genome of your species of interest has a GC content of 40%, lots of species do.
        I am working in mouse.
        The %GC of the mouse genome is 41-42% so your base composition plot looks exactly like you would expect it to look.

        How big is the total target size of your ChIP enrichment?
        This plot may simply indicate that the target size you were enriching for is not that large and your ChIP enrichment worked very well. If you sequenced very deeply (e.g. 200 million reads) on such a small target you are inevitably going to get a lot of duplicate reads.
        If by target size you are referring to the size the chromatin was fragmented to, then the answer is around 150 bp.
        No, the fragment size is not what I was referring to. When I say target size I mean how many regions are targeted by your transcription factor, and what their total length is. That is the target of your enrichment in this ChIP experiment. Is it a general transcription factor or one that is highly specific to a relatively small number of promoters? As a mental exercise let's imagine that your transcription factor targets 1,000 genes and the binding site size is ~100bp. This means that your target size is 100,000bp of DNA.

        Now your input was a mouse genome, 2.64 Gbp of DNA. You obtained approximately 1.54 Gbp of DNA sequence data or < 1X coverage. In an unenriched sample the probability of duplicate reads would be close to 0. Honestly I am not that familiar with the normal statistics of ChIP enrichment, but it seems to me that your enrichment would have to be off-the-charts fantastic for the level of duplication you are seeing to be explained by enrichment efficiency alone. I would start to worry that at one point during the ChIP process you ended up with an extremely limiting amount of DNA and subsequent PCR produced a biased, low diversity sample.
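The arithmetic in this post can be sketched in a few lines. This is only a rough model under stated assumptions: reads are drawn uniformly at random, a read counts as a duplicate if it starts at the same position on the same strand as an earlier read, and the number of possible start positions is taken as twice the region length. The function and constant names are hypothetical. The read count, read length and genome size are the figures given in this thread; the 100,000 bp target is the mental-exercise value from above.

```python
import math

N_READS = 30_223_517          # reads reported for the IP library
READ_LEN = 51                 # single-end read length in bp
GENOME_SIZE = 2_644_093_988   # mouse genome size quoted in the thread
TARGET_SIZE = 100_000         # 1,000 genes x ~100 bp binding sites


def coverage(n_reads, read_len, genome_size):
    """Average fold coverage: sequenced bases divided by genome size."""
    return n_reads * read_len / genome_size


def expected_duplicate_fraction(n_reads, n_positions):
    """Expected fraction of reads duplicating an earlier read when n_reads
    are drawn uniformly at random from n_positions possible start sites."""
    if n_reads == 0:
        return 0.0
    # Expected number of distinct start sites hit: M * (1 - exp(-N / M)).
    expected_distinct = n_positions * -math.expm1(-n_reads / n_positions)
    return 1.0 - expected_distinct / n_reads


print(f"coverage: {coverage(N_READS, READ_LEN, GENOME_SIZE):.2f}X")
print(f"duplicates, whole genome: "
      f"{expected_duplicate_fraction(N_READS, 2 * GENOME_SIZE):.4f}")
print(f"duplicates, all reads in target: "
      f"{expected_duplicate_fraction(N_READS, 2 * TARGET_SIZE):.4f}")
```

Under these assumptions an unenriched library should show well under 1% duplicates, while very high duplication from enrichment alone would require nearly every read to fall inside the small target, which is the unrealistic case the post describes.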

        Have you tried mapping the reads to the mouse genome yet to see where they align?



        • #5
          Dear kmcarr,

          thank you very much for your help.
          In fact, I am not looking at a general TF but a rather specific one. People have done FLAG ChIP-Seq for this factor in human cells and have identified about 5,500 target genes. So the target size based on this figure would be about 500,000 bp, I guess.

          I need to apologize, I should have probably attached the duplication level for my input control as well. This should not be enriched in any way, right? Even though the graph looks different, the duplication level is >80% here as well.

          I have mapped the reads using bowtie and tried to look at them in the UCSC browser. In case of low complexity I should see regions that have high numbers of aligned reads vs regions that have low or no aligned reads?

          Thank you again for your help!
          Tobias


          • #6
            Originally posted by Tobikenobi
            I need to apologize, I should have probably attached the duplication level for my input control as well. This should not be enriched in any way, right? Even though the graph looks different, the duplication level is >80% here as well.
            For the input control did you simply sequence some of the starting material, after fragmentation but before any immunoprecipitation? You are saying that this image is NOT from a no antibody IP control?

            If that is the case then there is something significantly wrong with your input DNA. If you are sequencing random mouse genomic DNA and only collecting ~1.5 Gbp of sequence data (< 1X coverage of the genome) there is no way you should be observing read duplication like that. Did you start out with an extremely limiting amount of input DNA? That can lead to a low diversity library. If you started with an adequate amount of genomic DNA then something went wrong with the library prep which drastically reduced the diversity of your sample.



            • #7
              This is in fact the input control, i.e. fragmented chromatin that was put aside before the immunoprecipitation.
              The amount of starting material was indeed limiting in this experiment, as a specific type of neural stem cell was targeted. After discussing with the service facility that provides the library construction and sequencing at our institute, it was agreed that 5 nanograms of immunoprecipitated, double-stranded DNA would be sufficient starting material for the TruSeq ChIP-Seq kit. I assume that a similar amount was used for the input. The yield for the input control was 1.9 Gbp and 38 million reads.

              I guess the bottom line is that I am looking at libraries with very poor complexity. How could that reflect on later peak calling?

              In the meantime I have used bowtie to map the reads to the mm9 reference and filtered for duplicates. I received the following numbers:

              Input: 30,512,219 mapped reads (80%)
              IP: 20,586,367 mapped reads (68%)
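One way to put a number on library complexity after mapping is to count how many reads pile up on identical positions. The sketch below assumes single-end reads, so that reads sharing chromosome, 5' start and strand are treated as duplicates, and it assumes the alignments have already been parsed into `(chrom, start, strand)` tuples. That input format and the name `complexity_stats` are hypothetical, and parsing the bowtie output is left out.

```python
from collections import Counter


def complexity_stats(alignments):
    """Summarise position-level duplication.

    alignments: iterable of (chrom, start, strand) tuples, one per mapped read.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    distinct = len(counts)
    return {
        "total_reads": total,
        "distinct_positions": distinct,
        # Reads beyond the first at each position are counted as duplicates.
        "duplicate_fraction": 1 - distinct / total if total else 0.0,
        # Height of the tallest stack of identical reads.
        "max_stack": max(counts.values(), default=0),
    }


# Toy input: four identical reads plus two unique ones (not real data).
toy = [("chr1", 100, "+")] * 4 + [("chr1", 100, "-"), ("chr2", 500, "+")]
print(complexity_stats(toy))
```

In a low-complexity library the duplicate fraction is high and the reads appear as tall stacks at scattered positions, rather than as broad regions of enrichment.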

              Thank you very much for your Input!

              Tobias



              • #8
                Originally posted by Tobikenobi
                I guess the bottom line is that I am looking at libraries with very poor complexity. How could that reflect on later peak calling?
                Clearly the input control doesn't represent the true background (the whole mouse genome). Furthermore, you cannot know that the amplification bias in your IP sample was the same as the bias during amplification of the control. Given these results I would be skeptical about the validity of any "peaks" observed in your IP sample.



                • #9
                  That certainly does not make things easier for me.
                  In any case your help is much appreciated!



                  • #10
                    I had a sequence duplication of like 90% once with mouse tissues... to fix it we now do library size selection after adapter ligation. Good luck.



                    • #11
                      Originally posted by silkiechicken
                      I had a sequence duplication of like 90% once with mouse tissues... to fix it we now do library size selection after adapter ligation. Good luck.
                      Hi!
                      Could you please explain that in more detail?
                      What was the size of your libraries before and after the adapters were ligated and which size did you purify?
                      How much starting material did you use?

                      Thank you very much!

                      Tobias



                      • #12
                        So I was doing a ChIP-seq with embryonic tissues dissected from mouse. Samples were fixed and sonicated to fragment sizes between 200-500bp.

                        These samples were then IP'ed and we were able to recover about 15 ng of total DNA from about 500 µg of starting chromatin.

                        Starting from that 15 ng of ChIPed DNA and using the Illumina TruSeq kits as described for the input and ChIP libraries, we got low diversity and over 90% duplicate reads, randomly distributed, i.e. not adapter dimers and not from the IP. This was even though Bioanalyzer results verified that the resulting product was indeed centered around 275 bp. In the second round, we requested that the gel size selection be done after adapter ligation and amplification. This gave a similar Bioanalyzer result and, when run on the sequencer, only about 10% non-unique reads.

                        Does that make more sense? I can be rather confusing.



                        • #13
                          That makes it very clear!
                          Thank you very much for your input!



                          • #14
                            Actually still confused

                            Sorry, I think I still don't get it. I just went back to the Illumina TruSeq DNA protocol and, if I understand correctly, the gel excision step there is after adapter ligation. How does this differ from your protocol?



                            • #15
                              We did the gel extraction as the very last step, i.e. after ligation and PCR amplification. Our guess is that we originally lost too much DNA during gel purification, which resulted in amplification of only a small subset of our sample.

                              eta: We didn't gel extract twice, we just moved it to the very last step.
