SEQanswers

Old 06-07-2010, 02:26 PM   #1
Bueller_007
Member
 
Location: Earth

Join Date: May 2010
Posts: 16
Default Removing duplicate reads from multigig .csfasta

Hi all.

I'm trying to do a de novo transcriptome assembly using ABI SOLiD data. I'm trying to use Velvet/Oases at the moment, and I've found that PCR duplicates seem to be a serious problem during the postprocessing step when the double-encoded contigs are converted back into colour-space reads prior to the final assembly. This step takes at least 72 hours, which is an order of magnitude greater than the time required by the Velvet/Oases assemblers themselves. The postprocessing output file just keeps swelling in size because there are so many PCR duplicates.

So the question is: is there an efficient program out there I can use to remove duplicate reads from my .csfasta (and preferably the corresponding _QV.qual) file prior to assembly? I know there's an option to do this filtering on the SOLiD machine itself, but the person who did the sequencing didn't enable it.

Thanks.
Old 06-07-2010, 03:02 PM   #2
drio
Senior Member
 
Location: 41°17'49"N / 2°4'42"E

Join Date: Oct 2008
Posts: 323
Default

I don't think there is anything like that out there. You need alignments to detect duplicates.
About the SOLiD instrument filtering, perhaps you are talking about dropping reads with low quality?
__________________
-drd
Old 06-07-2010, 03:18 PM   #3
Bueller_007
Member
 
Location: Earth

Join Date: May 2010
Posts: 16
Default

Quote:
Originally Posted by drio View Post
I don't think there is anything like that out there. You need alignments to detect duplicates.
About the SOLiD instrument filtering, perhaps you are talking about dropping reads with low quality?
I don't think I need alignments, as I'm talking about identical ~reads~. Removing these duplicates can be performed by Corona prior to data output using the --noduplicates option. However, I can't find an equivalent for data that has already been outputted by the SOLiD system.

There are multiple programs available for filtering out low-quality reads. That's not what I need.
Old 06-07-2010, 03:50 PM   #4
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by Bueller_007 View Post
I don't think I need alignments, as I'm talking about identical ~reads~. Removing these duplicates can be performed by Corona prior to data output using the --noduplicates option. However, I can't find an equivalent for data that has already been outputted by the SOLiD system.

There are multiple programs available for filtering out low-quality reads. That's not what I need.
A few lines of your favorite programming language should be able to do it. Lexicographically sort by sequence and remove duplicates.
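A minimal sketch of that idea in Python, assuming the usual one-line-per-record layout where record i in the _QV.qual file matches record i in the .csfasta file (the function name is illustrative, not from any tool):

```python
def dedup_fasta_qual(csfasta_lines, qual_lines):
    """Keep the first occurrence of each identical colour-space read,
    dropping the matching _QV.qual record as well. Both inputs are
    flat lists of (header, record) line pairs in the same order."""
    seen = set()
    out_fa, out_qv = [], []
    for i in range(0, len(csfasta_lines), 2):
        header, seq = csfasta_lines[i], csfasta_lines[i + 1]
        if seq not in seen:  # first time this exact sequence appears
            seen.add(seq)
            out_fa += [header, seq]
            out_qv += [qual_lines[i], qual_lines[i + 1]]
    return out_fa, out_qv
```

For multi-gigabyte inputs the set of distinct sequences may not fit in RAM; the sort-based variant (external-sort the records by sequence, then drop adjacent duplicates) trades memory for disk.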
Old 06-07-2010, 06:19 PM   #5
drio
Senior Member
 
Location: 41°17'49"N / 2°4'42"E

Join Date: Oct 2008
Posts: 323
Default

Quote:
Originally Posted by nilshomer View Post
A few lines of your favorite programming language should be able to do it. Lexicographically sort by sequence and remove duplicates.
Something like this: http://github.com/drio/dups.fasta.qual
__________________
-drd
Old 06-25-2010, 10:58 AM   #6
Bueller_007
Member
 
Location: Earth

Join Date: May 2010
Posts: 16
Default

Thanks. I didn't get email notifications that people had replied to my post, so I didn't find these until just now.

For what it's worth, I believe that FASTX_collapser ( http://hannonlab.cshl.edu/fastx_toolkit/ ) can also do this, with the caveat that your .csfasta and _QV.qual have to be merged into a .fastq first (with the .csfasta double-encoded) if you also want to remove the duplicates from your _QV.qual file.
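In case it helps anyone later, the double-encoding step can be sketched in a few lines of Python. This assumes the common convention of dropping the leading primer base plus the first colour (which depends on that base) and mapping the remaining colour digits 0/1/2/3 to A/C/G/T; the function names and the quality handling are illustrative, not taken from any particular tool:

```python
COLOUR_TO_BASE = str.maketrans("0123", "ACGT")

def double_encode(cs_read):
    """Turn a colour-space read such as 'T01230' into double-encoded
    pseudo-bases: drop the primer base and the first colour, then
    map 0->A, 1->C, 2->G, 3->T."""
    return cs_read[2:].translate(COLOUR_TO_BASE)

def to_fastq(name, cs_read, quals, offset=33):
    """Merge one read with its _QV.qual values into a FASTQ record,
    dropping the first quality value to match the trimmed read."""
    bases = double_encode(cs_read)
    qual_str = "".join(chr(q + offset) for q in quals[1:])
    return "@%s\n%s\n+\n%s" % (name, bases, qual_str)
```

fastx_collapser can then collapse identical records in the resulting pseudo-base .fastq.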
Old 06-26-2010, 03:57 PM   #7
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Wouldn't removing all identical reads result in an enrichment of reads with errors? Perhaps filtering on the first part of the read and allowing some duplicates would work better.
Old 06-26-2010, 04:07 PM   #8
Bueller_007
Member
 
Location: Earth

Join Date: May 2010
Posts: 16
Default

Quote:
Originally Posted by Chipper View Post
Wouldn't removing all identical reads result in an enrichment of reads with errors? Perhaps filtering on the first part of the read and allowing some duplicates would work better.
Probably true. That's why it's better to remove duplicates after alignment/assembly. Unfortunately, I'm feeding the end-product to CLC Genomics Workbench and they don't have duplicate removal yet. The dupes are messing up my SNP discovery pretty badly.

I'd turn on a maximum coverage limit, but since it's a transcriptome, the coverage varies with expression level, so I'm hesitant to omit highly covered regions. I've tried exporting to BAM, removing dupes with Picard and importing back in, but the reimport didn't work for whatever reason.