We just did our first set of paired-end 2x100 bp ChIP-sequencing. I've aligned the reads and the results look pretty decent, but I'm wondering whether a good amount of data is being lost to both short fragments and high duplication. We did the IP and sent purified DNA to a facility for Illumina HiSeq 2000 sequencing, and I wanted some feedback before raising my concerns with the sequencing facility and maybe looking foolish.
1. Fragment size before IP looked brightest at about 250 bp, but the average fragment size in the final aligned data is about 175 bp, so most of the read pairs overlap. Is this normal? Is it a result of how the facility did their size selection, or should I sonicate less next time?
2. We have 40 million paired-end reads (= 80 million individual reads) for each of 5 samples. The 5 samples were multiplexed and run together in a single lane. The 3 IP samples have 50-60% duplicate sequences, as determined both by FastQC and by checking for duplicates after alignment. The 2 input samples have only 5-10% duplication. Is this high duplication in the IP samples normal? I imagine it depends on the antibody and the number of expected binding sites, and since input should be evenly distributed across the genome I can understand why it would show a lower duplication rate. I don't know how much DNA the facility started with or how much PCR they did, but they required at least 10 ng and we sent 20-30 ng to be safe.
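To put a number on the overlap mentioned in point 1, here is a minimal sketch of the arithmetic. The function name `mate_overlap` is made up for illustration, and it assumes both mates are sequenced to the full read length (or to the end of the fragment, whichever comes first) with no adapter or quality trimming.

```python
def mate_overlap(fragment_len: int, read_len: int = 100) -> int:
    """Number of bases covered by both mates of a read pair.

    Assumes each mate starts at one end of the fragment and reads inward.
    Returns 0 when the fragment is long enough that the mates never meet.
    """
    # A mate cannot read past the end of the fragment.
    covered_per_mate = min(read_len, fragment_len)
    # Bases sequenced twice = total bases read minus the fragment length.
    return max(0, 2 * covered_per_mate - fragment_len)


if __name__ == "__main__":
    for frag in (250, 175, 120):
        print(f"{frag} bp fragment: {mate_overlap(frag)} bp covered by both mates")
```

With 2x100 bp reads, any fragment shorter than 200 bp has some overlap, so at an average of about 175 bp a share of the sequenced bases is redundant rather than lost.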
I've analyzed 2 other datasets (different antibodies/proteins) from the GEO database, both published in good journals. The sample with 40 million single-end reads also had ~50% duplication, while the sample with 20 million single-end reads had very low duplication, so maybe duplication becomes unavoidable as you sequence deeper? Next time we could multiplex more samples and aim for, say, only 30 million reads each if we're going to get so many duplicates at 40 million.
Regarding fragment size, I ran the SISSRs peak-finding program on these other samples, and it can predict fragment sizes from single-end reads. I don't know how accurate it is, but the predicted sizes were always around 170-190 bp, which is pretty close to what we got with our paired-end sequencing, so maybe this is normal?
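On the question of whether duplication is unavoidable at higher depth, one rough way to frame it is as sampling with replacement from a library containing a fixed number of distinct fragments. The sketch below uses the standard Poisson approximation for that model. The function name and the complexity value of 30 million distinct fragments are hypothetical, chosen only to show the trend, and the assumption that every fragment is equally likely to be sequenced does not hold for an IP, where reads pile up in enriched regions.

```python
import math


def expected_duplicate_fraction(n_reads: float, library_complexity: float) -> float:
    """Expected fraction of reads that are duplicates.

    Model (an assumption, not a measured property of any library):
    n_reads are drawn uniformly, with replacement, from
    library_complexity distinct fragments.
    """
    # Expected number of distinct fragments seen at least once
    # (Poisson approximation); expm1 keeps precision for shallow sequencing.
    expected_unique = library_complexity * -math.expm1(-n_reads / library_complexity)
    return 1.0 - expected_unique / n_reads


if __name__ == "__main__":
    complexity = 30e6  # hypothetical number of distinct fragments in the library
    for depth in (20e6, 30e6, 40e6):
        frac = expected_duplicate_fraction(depth, complexity)
        print(f"{depth / 1e6:.0f}M reads: {frac:.0%} expected duplicates")
```

Under this model the duplicate fraction rises steadily with depth for a fixed library, and a less complex library (for example, one dominated by a limited set of enriched sites, or one made from little starting DNA) reaches a given duplication level at lower depth.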