#1
Member | Location: Earth | Join Date: May 2010 | Posts: 16
Hi all,

I'm trying to do a de novo transcriptome assembly from ABI SOLiD data, currently with Velvet/Oases, and I've found that PCR duplicates are a serious problem during the postprocessing step in which the double-encoded contigs are converted back into colour-space reads prior to the final assembly. This step takes at least 72 hours, an order of magnitude longer than the Velvet/Oases assemblies themselves, and the postprocessing output file just keeps swelling because of all the PCR duplicates.

So the question is: is there an efficient program out there that can remove duplicate reads from my .csfasta (and preferably the corresponding _QV.qual) file before assembly? I know the SOLiD machine itself can do this filtering, but the person who ran the sequencing didn't enable it.

Thanks.
#2
Senior Member | Location: 41°17'49"N / 2°4'42"E | Join Date: Oct 2008 | Posts: 323
I don't think there's anything like that out there; you need alignments to detect duplicates.

As for the SOLiD instrument filtering, are you perhaps talking about dropping low-quality reads?
__________________
-drd
#3
Member | Location: Earth | Join Date: May 2010 | Posts: 16
Quote:

There are multiple programs available for filtering out low-quality reads. That's not what I need.
#4
Nils Homer | Location: Boston, MA, USA | Join Date: Nov 2008 | Posts: 1,285
Quote:
#5
Senior Member | Location: 41°17'49"N / 2°4'42"E | Join Date: Oct 2008 | Posts: 323
Quote:

__________________
-drd
#6
Member | Location: Earth | Join Date: May 2010 | Posts: 16
Thanks. I didn't get email notifications that people had replied to my post, so I didn't find these until just now.
For what it's worth, I believe that FASTX_collapser ( http://hannonlab.cshl.edu/fastx_toolkit/ ) can also do this, with the caveat that your .csfasta and _QV.qual files have to be merged into a .fastq first (with the .csfasta double-encoded) if you also want the duplicates removed from your _QV.qual file.
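If you'd rather skip the merge step, the direct route is easy to script. Here's a minimal Python sketch (untested; all filenames are placeholders) that drops exact-duplicate reads from a .csfasta and the matching _QV.qual in one pass, assuming single-line records listed in the same order in both files:

```python
# Minimal sketch, not production code: remove exact-duplicate reads
# from a .csfasta and its matching _QV.qual in lockstep. Assumes one
# sequence line per record and the same read order in both files.
# "reads.csfasta" / "reads_QV.qual" are placeholder names.

def records(path):
    """Yield (header, data) pairs, skipping '#' comment lines."""
    with open(path) as fh:
        header = None
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("#"):
                continue
            if line.startswith(">"):
                header = line
            elif header is not None:
                yield header, line
                header = None

seen = set()
with open("dedup.csfasta", "w") as out_seq, \
     open("dedup_QV.qual", "w") as out_qual:
    for (h_seq, seq), (h_qual, qual) in zip(records("reads.csfasta"),
                                            records("reads_QV.qual")):
        if seq in seen:
            continue                      # PCR duplicate: skip in both files
        seen.add(seq)
        out_seq.write(h_seq + "\n" + seq + "\n")
        out_qual.write(h_qual + "\n" + qual + "\n")
```

Note that the set of unique sequences has to fit in memory; for a very large run you could store hashes of the sequences instead.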
#7
Senior Member | Location: Sweden | Join Date: Mar 2008 | Posts: 324
Wouldn't removing all identical reads result in enrichment of reads with errors? Perhaps filtering on the first part of the read and allowing some duplicates would work better.
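Something along these lines is what I mean (an untested sketch; the prefix length and per-prefix cap are arbitrary choices):

```python
# Untested sketch: instead of collapsing all identical reads, key on a
# seed prefix and cap how many reads may share it. Errors later in the
# read then can't rescue every PCR duplicate, while genuinely high
# coverage isn't collapsed down to a single read.
from collections import defaultdict

PREFIX_LEN = 20   # arbitrary seed length (in colours)
MAX_COPIES = 5    # arbitrary cap on reads sharing a prefix

prefix_counts = defaultdict(int)

def keep(read):
    """Return True until MAX_COPIES reads with this prefix have passed."""
    prefix = read[:PREFIX_LEN]
    prefix_counts[prefix] += 1
    return prefix_counts[prefix] <= MAX_COPIES
```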
#8
Member | Location: Earth | Join Date: May 2010 | Posts: 16
Quote:
I'd turn on a maximum coverage limit, but since this is a transcriptome, coverage varies with expression level, so I'm hesitant to throw away highly covered regions. I've tried exporting to BAM, removing duplicates with Picard, and importing back in, but the re-import didn't work for whatever reason.
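For the record, the Picard step itself was roughly the usual MarkDuplicates run with REMOVE_DUPLICATES=true (sketched here in Python with placeholder paths); it was the import back in that failed, not this step:

```python
# Rough sketch of the Picard duplicate-removal step (placeholder paths).
# REMOVE_DUPLICATES=true drops duplicates instead of only flagging them.
import subprocess

subprocess.run(
    ["java", "-jar", "MarkDuplicates.jar",
     "INPUT=aligned.bam",
     "OUTPUT=dedup.bam",
     "METRICS_FILE=dup_metrics.txt",
     "REMOVE_DUPLICATES=true"],
    check=True,   # raise if Picard exits non-zero
)
```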