SEQanswers > Bioinformatics > removing duplicate PE reads from unmappable data

greigite 07-31-2012 11:31 AM

removing duplicate PE reads from unmappable data
This topic has been discussed a fair bit on SEQanswers, but I haven't found the answer to this exact question, so I'm throwing it out there again. I have some 100 bp PE metagenomic data sets from rather low-complexity samples. Analyzing duplication levels in read 1 and read 2 separately shows rather high duplication (25-50%). I'm interested in identifying true PCR duplicates by analyzing read 1 and read 2 together -- i.e., true duplicates should be identical at both ends of the molecule. These data come from a community without reference genomes, so mapping followed by samtools rmdup is not an option. Is there an existing tool for non-map-based duplicate removal of PE reads, or do I need to cobble something together? I think something like the following could work:

1. separate out the first xx bp of read 1 and read 2, then merge them using the FASTQ joiner in Galaxy or the like
2. remove exact duplicates using fastx_collapser from the Galaxy FASTX-Toolkit
3. extract the list of non-duplicated reads from the fastx_collapser output
4. pull these reads out of the original fastq files for read 1 and read 2
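In case it helps anyone, the four steps above can be collapsed into a single pass in plain Python: key each pair on the first N bp of both mates and keep only the first pair seen per key. This is just a minimal sketch, not a polished tool -- `dedup_pairs`, `parse_fastq`, and `prefix_len` are made-up names, and it holds all keys in memory, so it assumes the key set fits in RAM.

```python
from itertools import islice

def parse_fastq(handle):
    """Yield (header, seq, qual) tuples from a FASTQ file handle.

    Assumes plain 4-line FASTQ records (no wrapped sequence lines).
    """
    while True:
        block = list(islice(handle, 4))
        if not block:
            break
        header, seq, _plus, qual = (line.rstrip("\n") for line in block)
        yield header, seq, qual

def dedup_pairs(fq1, fq2, prefix_len=20):
    """Keep only the first read pair for each (R1 prefix, R2 prefix) key.

    fq1 / fq2: open handles (or iterables of lines) for read 1 and read 2,
    assumed to be in the same order. prefix_len is the 'xx bp' from each
    mate used to define a duplicate. Returns two lists of
    (header, seq, qual) tuples, one per mate.
    """
    seen = set()
    kept1, kept2 = [], []
    for r1, r2 in zip(parse_fastq(fq1), parse_fastq(fq2)):
        # A pair is a duplicate only if BOTH ends match a pair seen before.
        key = (r1[1][:prefix_len], r2[1][:prefix_len])
        if key not in seen:
            seen.add(key)
            kept1.append(r1)
            kept2.append(r2)
    return kept1, kept2
```

Because both mates go into the key, a pair that matches another pair at read 1 but differs at read 2 is kept, which is exactly the "identical at both ends" criterion above. For real-sized data you would stream the kept records straight to output files instead of accumulating lists, and hash the keys to save memory.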
