11-28-2011, 09:58 AM   #1
Senior Member
Location: Bethesda MD

Join Date: Oct 2009
Posts: 510
how to filter unaligned duplicate reads

I've been given a data set (PE-100 reads from both standard and mate-pair libraries) for de novo assembly that's likely to contain a significant fraction of duplicates, based on the number of PCR cycles used to amplify the libraries. I'm aware of tools that filter duplicates based on alignment, but I'd like to do the same for the unaligned reads before attempting assembly (by identifying reads that have identical sequences at both the 5' and 3' ends). Any recommendations?


Last edited by HESmith; 11-28-2011 at 11:06 AM. Reason: typos
HESmith
11-29-2011, 08:52 AM   #2
Senior Member
Location: San Diego

Join Date: May 2008
Posts: 912

That sounds horribly memory-intensive, which is probably why almost no one does it that way.
swbarnes2
11-29-2011, 09:59 AM   #3
Senior Member
Location: Bethesda MD

Join Date: Oct 2009
Posts: 510

I agree, but without a reference genome for alignment it seems like the only option. A simplistic approach would be to build a hash table keyed on the first 10 nucleotides of read 1 and read 2, keeping only one read pair per key. This doesn't account for sequencing errors, but it would probably be good enough for my purposes (or at least give me a sense of how much duplication is present). Alternatively, I suppose I could build an assembly from the whole data set, then align the reads back to that assembly to identify duplicates.

Any advice/recommendations/alternative approaches would be welcome.
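
A minimal Python sketch of the hash-table idea above, in case it helps frame the question. It assumes two FASTQ files with exactly four lines per record and with read 1 and read 2 in the same order; the function names `read_fastq` and `dedup_pairs` are hypothetical, not from any existing tool. A set of keys stands in for the hash table, and the first pair seen for each key is the one kept.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a 4-line-per-record FASTQ handle."""
    while True:
        block = list(islice(handle, 4))
        if len(block) < 4:
            return  # end of file (or a truncated trailing record)
        header, seq, _plus, qual = (line.rstrip("\n") for line in block)
        yield header, seq, qual


def dedup_pairs(reads1, reads2, k=10):
    """Keep the first read pair seen for each (read1[:k], read2[:k]) key.

    Assumption: reads1 and reads2 are in the same order, so zip() pairs mates.
    """
    seen = set()
    for r1, r2 in zip(reads1, reads2):
        key = (r1[1][:k], r2[1][:k])  # first k bases of each mate
        if key in seen:
            continue  # treated as a duplicate pair
        seen.add(key)
        yield r1, r2


# Tiny in-memory example: pairs "a" and "b" share the first 10 bases of both mates.
fq1 = io.StringIO(
    "@a/1\nACGTACGTACGG\n+\nIIIIIIIIIIII\n"
    "@b/1\nACGTACGTACTT\n+\nIIIIIIIIIIII\n"
    "@c/1\nTTTTACGTACGG\n+\nIIIIIIIIIIII\n"
)
fq2 = io.StringIO(
    "@a/2\nGGGGCCCCAATT\n+\nIIIIIIIIIIII\n"
    "@b/2\nGGGGCCCCAAGG\n+\nIIIIIIIIIIII\n"
    "@c/2\nGGGGCCCCAATT\n+\nIIIIIIIIIIII\n"
)
kept = list(dedup_pairs(read_fastq(fq1), read_fastq(fq2), k=10))
kept_ids = [r1[0] for r1, _ in kept]
```

Memory grows with the number of distinct keys rather than with total read length, since only the 2 x k-base prefixes are stored. As noted above, a sequencing error within the first k bases of either mate would let a duplicate slip through.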
HESmith
11-29-2011, 10:03 AM   #4
Junior Member
Location: OK

Join Date: Oct 2008
Posts: 3

I've developed a naive Hadoop-based tool that does some basic duplicate removal by brute-force comparison.
stuka
11-30-2011, 02:13 AM   #5
Location: europe

Join Date: Sep 2010
Posts: 27

In Genomics Workbench, from CLC Bio, you can remove PCR duplicates before alignment.
rudi283

All times are GMT -8.