SEQanswers

Old 09-26-2012, 12:16 PM   #1
biznatch
Senior Member
 
Location: Canada

Join Date: Nov 2010
Posts: 126
Concern about short fragment size and high duplication rate in paired-end ChIP-Seq

We just did our first set of paired-end 2x100 bp ChIP-seq. I've aligned the reads and looked at the results, and they look pretty decent, but I'm wondering if a good amount of data is being lost because of both shorter fragments and higher duplication. We did the IP and sent purified DNA to a facility for Illumina HiSeq 2000 sequencing, and I wanted some feedback before mentioning my concerns to the sequencing facility and maybe looking foolish.

1. Fragment size before IP looked brightest at about 250 bp but the average fragment size of the final aligned data is about 175 bp so most of the reads overlap. Is this normal? Would this be because of how the facility did their size selection or should I sonicate less next time?

2. We have 40 million paired end reads (= 80 million individual reads) for each of 5 samples. The 5 samples were multiplexed and run together in a single lane. The 3 IP samples have 50-60% duplicate sequences as determined by FastQC and by checking for duplicates after alignment. The 2 input samples have only 5%-10% duplication. Is this high duplication in the IP samples normal? I imagine this would be affected by antibody/number of expected binding sites and since input should be evenly distributed I can understand why it would show lower duplication rate. I don't know how much DNA the facility started with or how much PCR they did but they required at least 10 ng and we sent 20-30 to be safe.

I've analyzed 2 other datasets (different antibodies/proteins) from the GEO database, both published in good journals. The sample with 40 million single-end reads also had ~50% duplication, while the sample with 20 million single-end reads had very low duplication, so maybe duplication becomes unavoidable when you have more sequences? Next time we could multiplex more samples and aim for, say, only 30 million reads each if we're going to get so many duplicates with 40 million. Regarding fragment size, I used the SISSRs peak finding program on these other samples, and it can predict fragment sizes from single-end reads. I don't know how accurate it is, but the predicted sizes were always around 170-190 bp, which is pretty close to what we got with our paired-end sequencing, so maybe this is normal?
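
The post-alignment duplicate check described above can be sketched roughly as follows. This is only a minimal illustration, assuming each aligned pair has already been reduced to a (chromosome, start, end) tuple; the function name `duplication_rate` and the toy coordinates are hypothetical and are not taken from FastQC or from any aligner.

```python
from collections import Counter


def duplication_rate(fragments):
    """Return the fraction of fragments that repeat an already-seen position.

    `fragments` is an iterable of (chrom, start, end) tuples, one per
    aligned read pair. Two fragments count as duplicates of each other
    when all three fields are identical.
    """
    counts = Counter(fragments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unique = len(counts)
    # Every copy beyond the first at a given position is a duplicate.
    return (total - unique) / total


# Toy data (hypothetical coordinates): three copies of one fragment
# plus two fragments that occur once each.
toy_fragments = [
    ("chr1", 100, 275),
    ("chr1", 100, 275),
    ("chr1", 100, 275),
    ("chr1", 150, 330),
    ("chr2", 500, 660),
]
print(duplication_rate(toy_fragments))
```

Running the same function separately on the IP and input samples would give numbers comparable to the 50-60% versus 5-10% figures quoted above, provided duplicates are defined the same way in both cases.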

Last edited by biznatch; 09-26-2012 at 12:20 PM.
Old 09-26-2012, 11:37 PM   #2
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

For your high duplication level you might just be saturating your peaks. If your ChIP is really good then you're only looking at a limited region of your genome so eventually duplication becomes inevitable from a random selection of a diverse library. You should be able to see from your results whether you're getting incomplete or uneven coverage in your peaks which might suggest that the duplication is more technical and problematic. If the peaks look smooth and evenly covered then I'd not worry about it too much.
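
The saturation argument can be made concrete with a back-of-the-envelope model. The sketch below assumes that reads are drawn uniformly at random from a fixed pool of distinct fragments and uses the Poisson approximation for the expected number of distinct fragments seen; both are simplifications (real peaks are not uniformly covered), and the function name and the pool size are hypothetical.

```python
import math


def expected_duplicate_fraction(n_reads, n_distinct):
    """Expected duplicate fraction when sampling uniformly with replacement.

    n_reads    -- number of reads (or read pairs) sequenced
    n_distinct -- number of distinct fragments in the library (assumed)

    Poisson approximation: expected distinct fragments observed is
    n_distinct * (1 - exp(-n_reads / n_distinct)).
    """
    if n_reads <= 0 or n_distinct <= 0:
        raise ValueError("n_reads and n_distinct must be positive")
    # -expm1(-x) computes 1 - exp(-x) accurately even for small x.
    expected_unique = n_distinct * -math.expm1(-n_reads / n_distinct)
    return 1.0 - expected_unique / n_reads


# Hypothetical library with 20 million distinct fragments, sequenced
# to increasing depth.
pool = 20_000_000
for depth in (10_000_000, 20_000_000, 40_000_000, 80_000_000):
    frac = expected_duplicate_fraction(depth, pool)
    print(f"{depth:>11,d} reads -> {frac:.1%} expected duplicates")
```

In this toy model, sequencing to about twice the number of distinct fragments already gives roughly 57% duplicates with no PCR problem at all, which is why a good ChIP sequenced deeply can show high duplication while the input, drawn from a much larger pool, does not.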

For the fragment size it's difficult to know why you're seeing a shift in average size but normally the only size selection during library preparation would be to avoid adapter dimers, which are small, so it would seem odd if the library preparation decreased the average insert size.

For ChIP you really want short insert sizes so you get more specific information about binding locations. If your data looks good then I wouldn't worry about messing around with your protocol.
Old 09-26-2012, 11:47 PM   #3
biznatch
Senior Member
 
Location: Canada

Join Date: Nov 2010
Posts: 126

Thank you, this is good to hear. It sounds like the results are pretty much as expected then, and based on what I've looked at so far the peaks do look smooth and evenly covered. It makes sense that for ChIP you want shorter sizes for more specific binding, so I'm wondering: is 2x100 bp very common for ChIP, or do people tend to use 2x50 or 2x75 or even single-end reads? The facility we sent it to said that they pretty much only do 2x100 bp now for everything (ChIP, RNA, etc.). There's nothing wrong with getting extra data, but I think it's usually cheaper to do shorter reads.
Old 09-26-2012, 11:54 PM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

Actually we tend to do 1 x 50 for a lot of our ChIP. As long as you know the expected insert size for your library you can simply extend the single end reads to infer where the whole insert would have been. Makes things even cheaper and still seems to work OK if you've got a decent antibody.
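
The read extension described here can be sketched as below. This is a minimal illustration, assuming 0-based half-open coordinates and a single expected insert size for the whole library; the function name `extend_read` and the 175 bp value (borrowed from the fragment size mentioned earlier in the thread) are illustrative assumptions, not the behaviour of any specific tool.

```python
def extend_read(start, end, strand, fragment_size, chrom_length=None):
    """Extend a single-end read to the inferred full insert.

    start, end    -- 0-based, half-open alignment coordinates of the read
    strand        -- "+" or "-"
    fragment_size -- expected insert size for the library (assumed known)
    chrom_length  -- optional chromosome length used to clip the result

    A forward-strand read marks the left edge of the insert, so it is
    extended to the right; a reverse-strand read marks the right edge,
    so it is extended to the left.
    """
    if strand == "+":
        frag_start, frag_end = start, start + fragment_size
    elif strand == "-":
        frag_start, frag_end = end - fragment_size, end
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    # Clip to the chromosome boundaries.
    frag_start = max(frag_start, 0)
    if chrom_length is not None:
        frag_end = min(frag_end, chrom_length)
    return frag_start, frag_end


# A 50 bp read on each strand, extended to an assumed 175 bp insert.
print(extend_read(1000, 1050, "+", 175))  # (1000, 1175)
print(extend_read(1000, 1050, "-", 175))  # (875, 1050)
```

With paired-end data this step is unnecessary, because the two mates give the insert boundaries directly.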
Old 09-27-2012, 12:15 AM   #5
biznatch
Senior Member
 
Location: Canada

Join Date: Nov 2010
Posts: 126

OK, that's kind of what I thought. The place we sent it to said that since they mostly do 2x100 now, anything else would take a lot longer, I guess because they have to wait until they have enough 1x50 requests to fill the machine? I'm not sure exactly how that works, but we only used 1 lane. Even for 2x100 the cost was lower than at other places with shorter reads, so it wasn't a big deal, but in future we'll have to consider other options.

We did 1x50 a year or so ago at a different facility but for our 5 samples this time it was actually cheaper to do 2x100 at the new place vs 1x50 at the old place.

Last edited by biznatch; 09-27-2012 at 12:19 AM.
Old 02-14-2013, 11:47 AM   #6
mitcherr
Junior Member
 
Location: Canada

Join Date: Feb 2013
Posts: 4

Biznatch,

Did you do this analysis at TCAG? I am thinking of doing the same thing right now and was wondering about exactly the same things you were regarding the read length, and whether to do single end instead of paired end to avoid excess redundancy. Did everything work out okay with your data? Would you have done things differently looking back?

cheers
Old 02-14-2013, 01:45 PM   #7
biznatch
Senior Member
 
Location: Canada

Join Date: Nov 2010
Posts: 126

Hi mitcherr, yes it was TCAG. Everything worked out OK with the data; we actually just got our second set back today and I'm in the process of aligning it. The paired-end reads seem to give fewer artifacts in a few places. There's one site in particular, near a gene of interest, that always shows a large peak of non-specific alignment in the 50 bp single-end samples and inputs but not in the 2x100 paired-end reads, though maybe 1x100 would look fine too.

I don't think paired-end reads would increase redundancy. I think you start getting redundancy once you reach a certain number of reads, regardless of whether you have single- or paired-end reads. The only problem with paired-end reads is that maybe you're paying a lot more money for only a small increase in alignment accuracy. From a biological/technical perspective I think paired end can only help.

With the new data set we went with the same 2x100 reads again because the facility couldn't estimate a turnaround time for anything else, and because 2x100 at TCAG was the same price as or cheaper than shorter single-end reads elsewhere. But if it weren't for the turnaround time issue, I think single-end reads would be fine and we would have gone with that. I'd suggest contacting TCAG and asking about single-end reads; maybe it will be faster now.
Old 02-15-2013, 07:29 AM   #8
mitcherr
Junior Member
 
Location: Canada

Join Date: Feb 2013
Posts: 4

Thanks for the reply. Pretty funny that I could figure out what facility you used via read length and country of origin lol
Old 05-16-2013, 06:25 AM   #9
syfo
Just a member
 
Location: Southern EU

Join Date: Nov 2012
Posts: 103
on the advantage/cost of PE vs. SE

Quote:
Originally Posted by simonandrews
Actually we tend to do 1 x 50 for a lot of our ChIP. As long as you know the expected insert size for your library you can simply extend the single end reads to infer where the whole insert would have been. Makes things even cheaper and still seems to work OK if you've got a decent antibody.
Quote:
Originally Posted by biznatch
The only problem with paired end reads is that maybe you're paying a lot more money for only a small increase in alignment accuracy. [...] But if it wasn't for the turnaround time issue I think single end reads would be fine and we would have gone with that.
Aren't paired-end reads better for detecting and removing duplicates?
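
The intuition behind this question can be sketched with a toy comparison. With single-end data the only available duplicate key is the read's 5' position and strand, whereas with paired-end data both ends of the insert are known. The code below is a minimal illustration of that idea under those assumptions; the helper `count_duplicates` and the coordinates are hypothetical, and this is not the algorithm of any particular duplicate-marking tool.

```python
def count_duplicates(keys):
    """Count how many items repeat a key that has already been seen."""
    seen = set()
    duplicates = 0
    for key in keys:
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Three hypothetical fragments that all start at the same position.
# The first and third are true duplicates; the second has a different
# end, so it must come from a different original molecule.
example_fragments = [
    ("chr1", 100, 275),
    ("chr1", 100, 290),
    ("chr1", 100, 275),
]

# Single-end view: only the forward read's start and strand are known.
se_keys = [(chrom, start, "+") for chrom, start, _end in example_fragments]
# Paired-end view: both ends of the insert are known.
pe_keys = list(example_fragments)

print("single-end duplicates:", count_duplicates(se_keys))  # 2
print("paired-end duplicates:", count_duplicates(pe_keys))  # 1
```

In this example the single-end key marks a distinct fragment as a duplicate, while the paired-end key keeps it, which matters most inside strong peaks where many independent fragments share a start position.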
Old 08-13-2013, 10:51 PM   #10
mxqian
Junior Member
 
Location: shanghai

Join Date: Sep 2011
Posts: 3

@biznatch
Hi, as you can see, nearly all NGS data from the Illumina platform is 2x100 bp now. However, I cannot find suitable analysis software for ChIP-seq with paired reads. MACS only accepts the ELANDMULTI format for paired reads. If the format is SAM/BAM, which is the most widely used format for mapped reads, MACS will just keep the left mate (5' end) tag. That will work, but I don't think it makes good use of the pairing information. Any suggestions?