11-29-2011, 09:59 AM   #3

I agree, but without a reference genome to align against, it seems like the only option. A simplistic approach would be to build a hash table keyed on the first 10 nucleotides of read 1 plus the first 10 nucleotides of read 2, and keep only one read pair per key. It doesn't account for sequencing errors (a single mismatch in those 20 bases makes a true duplicate look unique), but it would probably be good enough for my purposes, or at least give me a sense of how much duplication is present. Alternatively, I suppose I could build an assembly from the whole data set, then align the reads back to that assembly to identify duplicates.
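
To make the hash-table idea concrete, here is a rough sketch in Python. It assumes the read pairs have already been parsed into (read 1, read 2) sequence strings in matching order; the function names and the choice to keep the first pair seen for each key are mine, purely for illustration, and nothing here handles sequencing errors or base qualities.

```python
from typing import Iterable, Iterator, List, Tuple

# A read pair as plain sequence strings: (read 1, read 2).
# Assumption: FASTQ parsing has already been done elsewhere.
Pair = Tuple[str, str]


def dedup_by_prefix(pairs: Iterable[Pair], k: int = 10) -> Iterator[Pair]:
    """Yield the first read pair seen for each (read1[:k], read2[:k]) key."""
    seen = set()
    for r1, r2 in pairs:
        key = (r1[:k], r2[:k])
        if key in seen:
            continue  # treated as a duplicate of an earlier pair
        seen.add(key)
        yield r1, r2


def duplication_rate(pairs: List[Pair], k: int = 10) -> float:
    """Fraction of pairs dropped by prefix-based deduplication."""
    if not pairs:
        return 0.0
    kept = sum(1 for _ in dedup_by_prefix(pairs, k))
    return 1.0 - kept / len(pairs)


if __name__ == "__main__":
    example = [
        ("ACGTACGTACGG", "TTTTGGGGCCAA"),
        ("ACGTACGTACTT", "TTTTGGGGCCGG"),  # same first 10 nt in both reads
        ("GGGGACGTACGG", "TTTTGGGGCCAA"),
    ]
    print(list(dedup_by_prefix(example)))
    print(duplication_rate(example))
```

The set holds one 2k-character key per unique pair, so memory grows with the number of distinct keys rather than with total read length. Raising k makes unrelated pairs less likely to collide, but gives sequencing errors more chances to split true duplicates.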

Any advice/recommendations/alternative approaches would be welcome.
HESmith