This topic has been discussed a fair bit on seqanswers, but I haven't found the answer to this exact question, so I'm throwing it out there again. I have some 100 bp PE metagenomic data sets from rather low-complexity samples. Analysis of duplication levels in read 1 and read 2 separately shows rather high levels of duplication (25-50%). I'm interested in identifying true PCR duplicates by analyzing read 1 and read 2 together -- i.e., true duplicates should be identical at both ends of the molecule. This data is from a community without reference genomes, so mapping and then using samtools rmdup is not an option. Is there an existing tool for non-mapping-based duplicate removal of PE reads, or do I need to cobble something together? I think something like the following could work:
1. separate out the first xx bp of read 1 and read 2, then merge using FASTQ joiner in Galaxy or the like
2. remove exact duplicates using fastx_collapser from the FASTX-Toolkit in Galaxy
3. extract the list of non-duplicated reads from the fastx_collapser output
4. pull these reads out of the original fastq files for read 1 and read 2
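The four steps above could also be collapsed into a single pass, sketched below in Python. This is a minimal, unofficial sketch, not an existing tool: it assumes four-line FASTQ records, reads in the same order in both files, and uses the concatenated first `PREFIX_LEN` bp of read 1 + read 2 as the duplicate key (the "xx bp" above), keeping only the first pair seen for each key.

```python
# Map-free PE duplicate removal sketch. Assumptions (not from any standard
# tool): 4-line FASTQ records, R1/R2 files in matching order, duplicates
# defined as identical first PREFIX_LEN bp on both ends.
import itertools

PREFIX_LEN = 20  # the "xx bp" prefix length; tune for your error profile


def read_fastq(handle):
    """Yield (header, seq, qual) records from a 4-line-per-record FASTQ."""
    while True:
        lines = list(itertools.islice(handle, 4))
        if not lines:
            break
        header, seq, plus, qual = (l.rstrip("\n") for l in lines)
        yield header, seq, qual


def dedupe_pairs(r1_handle, r2_handle, prefix_len=PREFIX_LEN):
    """Return (kept_r1, kept_r2) lists with exact prefix duplicates removed,
    keeping the first occurrence of each R1+R2 prefix combination."""
    seen = set()
    kept1, kept2 = [], []
    for rec1, rec2 in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        key = rec1[1][:prefix_len] + rec2[1][:prefix_len]
        if key not in seen:
            seen.add(key)
            kept1.append(rec1)
            kept2.append(rec2)
    return kept1, kept2
```

Holding one set of sequence prefixes in memory is the main cost; for very large runs you could hash the key instead of storing the raw string.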