Hi all,
I have been performing an in-depth quality analysis of some of our 454 whole-genome shotgun runs for a fungal species (~35-70 Mb genome) and a plant species (~1 Gb genome), from both FLX and Titanium runs. In both datasets, between 15 and 35% of the reads in each individual run are duplicates, i.e. the first 100 nt or more are exactly the same and they start at exactly the same nucleotide. Even though both genomes are repetitive (to some extent), this is far more than expected by chance alone. Our hypothesis at the moment is that these duplicates result from the emulsion PCR step, but we think the percentage is really on the high side! Between different runs of the same library there are not as many duplicates, so it does not appear to be a library issue. Furthermore, we observe roughly the same numbers for paired-end libraries, which is consistent with our hypothesis that this is an emPCR problem.
Does anyone here have any experience with such analyses, and if so, do you find similar numbers?
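To make the numbers easier to compare, here is a minimal sketch of how a prefix-based duplicate count like the one described above could be computed. It is an illustration under stated assumptions, not the actual pipeline used: the function name `duplicate_fraction` is hypothetical, reads are assumed to be plain nucleotide strings already extracted from the run, and two reads are treated as starting at the same nucleotide whenever their first 100 nt are identical.

```python
from collections import Counter


def duplicate_fraction(reads, prefix_len=100):
    """Estimate the fraction of reads that duplicate an earlier read.

    Assumption: reads sharing an identical prefix of `prefix_len` nt are
    considered duplicates. Reads shorter than `prefix_len` are skipped,
    since they cannot satisfy the "first 100 nt or more" criterion.
    """
    # Group reads by their (case-normalised) prefix.
    counts = Counter(
        read[:prefix_len].upper()
        for read in reads
        if len(read) >= prefix_len
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # In each group, every read beyond the first counts as a duplicate.
    duplicates = total - len(counts)
    return duplicates / total
```

Running this per run, and then on the pooled reads of two runs from the same library, would give the within-run and between-run figures to compare.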