  • Concern about short fragment size and high duplication rate in paired-end ChIP-Seq

    We just did our first set of paired end 2x100 bp ChIP-Sequencing. I've aligned and looked at the results and they look pretty decent, but I'm wondering if a good amount of data is being lost because of both shorter fragments and higher duplication. We did the IP and sent purified DNA to a facility for Illumina HiSeq 2000 sequencing, and I wanted some feedback before mentioning my concerns to the sequencing facility and maybe looking foolish.

    1. Fragment size before IP looked brightest at about 250 bp but the average fragment size of the final aligned data is about 175 bp so most of the reads overlap. Is this normal? Would this be because of how the facility did their size selection or should I sonicate less next time?

    2. We have 40 million paired end reads (= 80 million individual reads) for each of 5 samples. The 5 samples were multiplexed and run together in a single lane. The 3 IP samples have 50-60% duplicate sequences as determined by FastQC and by checking for duplicates after alignment. The 2 input samples have only 5-10% duplication. Is this high duplication in the IP samples normal? I imagine this would be affected by antibody/number of expected binding sites, and since input should be evenly distributed I can understand why it would show a lower duplication rate. I don't know how much DNA the facility started with or how much PCR they did, but they required at least 10 ng and we sent 20-30 ng to be safe.
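For the fragment-size question in point 1, the overlap between mates is simple arithmetic. Here is a minimal sketch, assuming both mates have the same read length and ignoring adapter trimming; the function name `mate_overlap` is made up for illustration.

```python
def mate_overlap(fragment_len: int, read_len: int) -> int:
    """Number of fragment bases covered by both mates of a pair.

    Assumes both mates have the same read length. If the fragment is
    shorter than one read, each mate covers the whole fragment, so the
    overlap equals the fragment length.
    """
    if fragment_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    covered_per_mate = min(read_len, fragment_len)
    return max(0, 2 * covered_per_mate - fragment_len)


# Values taken from the post: 2x100 bp reads, ~175 bp average fragment,
# ~250 bp brightest band before IP.
overlap_observed = mate_overlap(175, 100)
overlap_expected = mate_overlap(250, 100)
```

With a 175 bp fragment the two 100 bp mates share 25 bp, while a 250 bp fragment would leave a 50 bp unsequenced gap and no overlap.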

    I've analyzed 2 other datasets (different antibodies/proteins) from the GEO database, both published in good journals. The sample with 40 million single end reads also had ~50% duplication while the sample with 20 million single end reads had very low duplication, so maybe duplication becomes unavoidable when you have more sequences? Next time we could multiplex more samples and aim for, say, only 30 million reads each if we're going to get so many duplicates with 40 million. Regarding fragment size, I used the SISSRs peak finding program on these other samples and it can predict fragment sizes from single end reads. I don't know how accurate it is but the predicted sizes were always around 170-190 bp, which is pretty close to what we got with our paired end sequencing, so maybe this is normal?
    Last edited by biznatch; 09-26-2012, 12:20 PM.

  • #2
    For your high duplication level you might just be saturating your peaks. If your ChIP is really good then you're only looking at a limited region of your genome so eventually duplication becomes inevitable from a random selection of a diverse library. You should be able to see from your results whether you're getting incomplete or uneven coverage in your peaks which might suggest that the duplication is more technical and problematic. If the peaks look smooth and evenly covered then I'd not worry about it too much.
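The saturation argument can be put in rough numbers. This is only a sketch under a strong assumption: reads are drawn uniformly at random, with replacement, from a fixed pool of distinct fragments. Real ChIP libraries are far from uniform, and the function name and the pool sizes below are hypothetical.

```python
import math


def expected_duplicate_fraction(n_reads: int, n_distinct: int) -> float:
    """Expected fraction of reads that are duplicates.

    Model: n_reads are sampled uniformly with replacement from
    n_distinct possible fragments. The expected number of distinct
    fragments seen is n_distinct * (1 - (1 - 1/n_distinct) ** n_reads).
    """
    if n_reads <= 0 or n_distinct <= 0:
        raise ValueError("n_reads and n_distinct must be positive")
    # expm1/log1p keep the calculation stable for very large pools.
    expected_unique = -n_distinct * math.expm1(
        n_reads * math.log1p(-1.0 / n_distinct)
    )
    return 1.0 - expected_unique / n_reads


# Hypothetical pool sizes: a small pool (enriched IP) versus a large
# pool (input spread over the genome), both sequenced to 40 million reads.
ip_like = expected_duplicate_fraction(40_000_000, 30_000_000)
input_like = expected_duplicate_fraction(40_000_000, 500_000_000)
```

Under this model the duplicate fraction rises with depth for any fixed pool, and a smaller pool saturates sooner, which is consistent with IP samples showing more duplication than input at the same depth.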

    For the fragment size it's difficult to know why you're seeing a shift in average size but normally the only size selection during library preparation would be to avoid adapter dimers, which are small, so it would seem odd if the library preparation decreased the average insert size.

    For ChIP you really want short insert sizes so you get more specific information about binding locations. If your data looks good then I wouldn't worry about messing around with your protocol.


    • #3
      Thank you, this is good to hear. It sounds like the results are pretty much as expected then, and based on what I've looked at so far the peaks do look smooth and evenly covered. It makes sense that for ChIP you want shorter sizes for more specific binding, so I'm wondering: is 2x100 bp very common for ChIP, or do people tend to use 2x50 or 2x75 or even single end reads? The facility we sent it to said that they pretty much only do 2x100 bp now for everything (ChIP, RNA, etc.). There's nothing wrong with getting extra data but I think it's usually cheaper to do shorter reads.


      • #4
        Actually we tend to do 1 x 50 for a lot of our ChIP. As long as you know the expected insert size for your library you can simply extend the single end reads to infer where the whole insert would have been. Makes things even cheaper and still seems to work OK if you've got a decent antibody.
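The read extension described above can be sketched as follows. It assumes 0-based, half-open coordinates and a single expected insert size for the whole library; `extend_read` and its parameters are names chosen for illustration, not part of any particular tool.

```python
def extend_read(start: int, read_len: int, strand: str,
                fragment_len: int, chrom_len: int) -> tuple[int, int]:
    """Infer the full insert from one single-end alignment.

    `start` is the leftmost aligned base (0-based); the returned
    interval is half-open. Forward-strand reads are extended to the
    right from their start; reverse-strand reads are extended to the
    left from their rightmost base. The result is clipped to the
    chromosome bounds.
    """
    if strand == "+":
        frag_start = start
        frag_end = min(start + fragment_len, chrom_len)
    elif strand == "-":
        frag_end = start + read_len
        frag_start = max(frag_end - fragment_len, 0)
    else:
        raise ValueError("strand must be '+' or '-'")
    return frag_start, frag_end


# A 50 bp read extended to a 175 bp expected insert on each strand.
forward = extend_read(1000, 50, "+", 175, 1_000_000)
reverse = extend_read(1000, 50, "-", 175, 1_000_000)
```

The key point is that the extension direction depends on strand, because the read always starts at the 5' end of the sequenced fragment.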


        • #5
          Ok that's kind of what I thought. The place we sent it said that since they do mostly 2x100 now it would take a lot longer if we did anything else, I guess because they have to wait until they have enough 1x50 requests to fill the machine? I'm not sure exactly how that works, but we only used 1 lane. The cost even for 2x100 was cheaper than other places with shorter reads so it wasn't a big deal, but in future we'll have to consider other options.

          We did 1x50 a year or so ago at a different facility but for our 5 samples this time it was actually cheaper to do 2x100 at the new place vs 1x50 at the old place.
          Last edited by biznatch; 09-27-2012, 12:19 AM.


          • #6
            Biznatch,

            Did you do this analysis at TCAG? I am thinking of doing the same thing right now and was wondering exactly what you were regarding the read length, and whether to do single end instead of paired to avoid excess redundancy. Did everything work out okay with your data? Would you have done things differently looking back?

            cheers


            • #7
              Hi mitcherr, yes it was TCAG. Everything worked out ok with the data; we actually just got our second set back today and I'm in the process of aligning it. The paired end reads seem to give fewer artifacts in a few places. There's one site in particular near a gene of interest that always shows a large peak of non-specific alignment in the 50 bp single end samples and inputs but not in the 2x100 paired end reads, but maybe 1x100 would look fine too.

              I don't think paired end reads would increase redundancy. I think you start getting redundancy once you get a certain amount of reads, regardless of whether you have single or paired end reads. The only problem with paired end reads is that maybe you're paying a lot more money for only a small increase in alignment accuracy. From a biological/technical perspective I think paired end can only help.

              With the new data set we went with the same 2x100 reads again because the facility couldn't estimate a turnaround time for anything else, and because the 2x100 at TCAG was the same price or less than shorter single end reads elsewhere. But if it wasn't for the turnaround time issue I think single end reads would be fine and we would have gone with that. I'd suggest contacting TCAG and asking about single end reads, maybe it will be faster now.


              • #8
                Thanks for the reply. Pretty funny that I could figure out what facility you used via read length and country of origin lol


                • #9
                  on the advantage/cost of PE vs. SE

                  Originally posted by simonandrews:
                  "Actually we tend to do 1 x 50 for a lot of our ChIP. As long as you know the expected insert size for your library you can simply extend the single end reads to infer where the whole insert would have been. Makes things even cheaper and still seems to work OK if you've got a decent antibody."

                  Originally posted by biznatch:
                  "The only problem with paired end reads is that maybe you're paying a lot more money for only a small increase in alignment accuracy. [...] But if it wasn't for the turnaround time issue I think single end reads would be fine and we would have gone with that."

                  Aren't paired end reads better for detecting and removing duplicates?
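To illustrate the point about duplicates: with single end reads only the 5' position of a fragment is known, so distinct fragments that happen to start at the same base look identical, while paired end reads reveal both ends. The sketch below is a toy comparison on made-up fragments, assuming read 1 always comes from the left end; it is not how any specific deduplication tool is implemented.

```python
from collections import Counter


def count_duplicates(keys) -> int:
    """Count items beyond the first occurrence of each key."""
    counts = Counter(keys)
    return sum(n - 1 for n in counts.values())


def single_end_key(fragment):
    """Single-end view: only chromosome and the 5' start are observed."""
    chrom, start, _end = fragment
    return (chrom, start)


def paired_end_key(fragment):
    """Paired-end view: chromosome and both fragment ends are observed."""
    chrom, start, end = fragment
    return (chrom, start, end)


# Made-up fragments as (chrom, start, end). The first and third are a
# true duplicate pair; the second shares a start but ends elsewhere.
fragments = [
    ("chr1", 100, 275),
    ("chr1", 100, 260),
    ("chr1", 100, 275),
    ("chr1", 500, 680),
]

se_duplicates = count_duplicates(single_end_key(f) for f in fragments)
pe_duplicates = count_duplicates(paired_end_key(f) for f in fragments)
```

In this toy case the single end key flags two reads as duplicates while the paired end key flags only one, so paired ends avoid discarding a fragment that merely shares a start position.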


                  • #10
                    @biznatch
                    Hi, as you can see, nearly all the NGS data on the Illumina platform are 2x100 bp now. However, I cannot find suitable analysis software for ChIP-seq with paired reads. MACS can only accept the ELANDMULTI format for paired reads. If the format is SAM/BAM, which is the most widely used format for mapped reads, MACS will just keep the left mate (5' end) tag. That will work, but I don't think it makes good use of the paired information. Any suggestions?
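One possible workaround, sketched here under assumptions, is to collapse each properly paired alignment into a single fragment interval before peak calling, so the real insert is used instead of only the left mate. The sketch relies on the standard SAM columns FLAG, RNAME, POS and TLEN; the function name `fragment_from_sam_line` is hypothetical, this is not a MACS feature, and aligners can differ in how they report TLEN.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned


def fragment_from_sam_line(line: str):
    """Return (chrom, start, end) for the leftmost mate of a proper pair.

    Coordinates are 0-based, half-open (BED style). Returns None for
    reads that are not properly paired, and for the rightmost mate
    (negative TLEN) so each fragment is reported only once.
    """
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    chrom = fields[2]
    pos = int(fields[3])    # 1-based leftmost aligned position
    tlen = int(fields[8])   # signed template length
    if not (flag & PROPER_PAIR) or tlen <= 0:
        return None
    start = pos - 1
    return chrom, start, start + tlen


# Made-up alignment records for one pair with a 175 bp insert.
left_mate = "r1\t99\tchr1\t1001\t60\t100M\t=\t1076\t175\t*\t*"
right_mate = "r1\t147\tchr1\t1076\t60\t100M\t=\t1001\t-175\t*\t*"

fragment = fragment_from_sam_line(left_mate)
skipped = fragment_from_sam_line(right_mate)
```

The resulting intervals could be written out as BED and treated as already-extended fragments, which avoids guessing the insert size.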
