SEQanswers (Bioinformatics forum)

HESmith 11-28-2011 09:58 AM

how to filter unaligned duplicate reads
I've been given a data set (PE-100 reads from both standard and mate-pair libraries) for de novo assembly that's likely to contain a significant fraction of duplicates, based on the number of PCR cycles used to amplify the libraries. I'm aware of tools that filter duplicates based on alignment, but I'd like to do the same for the unaligned reads before attempting assembly (by identifying reads that have identical sequences at both the 5' and 3' ends). Any recommendations?


swbarnes2 11-29-2011 08:52 AM

That sounds horribly memory-intensive, which is probably why almost no one does it that way.

HESmith 11-29-2011 09:59 AM

I agree, but without a reference genome for alignment it seems like the only option. A simplistic approach would be to build a hash table keyed on the first 10 nucleotides of read 1 and read 2, and keep only one read pair per key. It doesn't account for sequencing errors, but would probably be good enough for my purposes (or at least give me a sense of how much duplication is present). Alternatively, I suppose I could build an assembly from the whole data set, then align the reads to that assembly to identify duplicates.
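A minimal Python sketch of that prefix-key idea, for illustration only. It assumes two FASTQ files with four lines per record and mates listed in the same order in both files; the function names `read_fastq` and `dedup_pairs` are hypothetical, not from any existing tool. It does nothing about sequencing errors, and memory use grows with the number of distinct keys.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ source."""
    lines = iter(handle)  # works for open files and for lists of lines
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header, seq, qual


def dedup_pairs(reads1, reads2, prefix_len=10):
    """Yield the first read pair seen for each (read 1 prefix, read 2 prefix) key.

    Assumption: reads1 and reads2 are in the same order, so zip() pairs mates.
    """
    seen = set()
    for rec1, rec2 in zip(reads1, reads2):
        # Key on the first prefix_len bases of each mate (index 1 is the sequence).
        key = (rec1[1][:prefix_len], rec2[1][:prefix_len])
        if key in seen:
            continue  # treated as a duplicate; only the first pair is kept
        seen.add(key)
        yield rec1, rec2
```

With two open file handles `fh1` and `fh2` (hypothetical names), `dedup_pairs(read_fastq(fh1), read_fastq(fh2))` streams the retained pairs. Comparing the number of pairs kept with the number of pairs read gives a rough estimate of the duplication level.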

Any advice/recommendations/alternative approaches would be welcome.

stuka 11-29-2011 10:03 AM

I've developed a naive tool that uses Hadoop to brute-force compare reads and do some basic duplicate removal.

rudi283 11-30-2011 02:13 AM

In Genomics Workbench, from CLC bio, you can remove PCR duplicates before alignment.
