12-17-2013, 08:43 AM   #1
Senior Member
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141
Dereplication tools needed

It's been a while since I got my hands on shotgun Illumina metagenomic data. I've found that it's quite important to dereplicate before doing any downstream analysis to avoid problems with assembly and inaccurate quantification. The last time around I used usearch --derep_fulllength on a subset of the data to filter out artificial replicate reads, but it is choking on the larger datasets I have now. My approach was to identify a high quality subsection of R1 and dereplicate that, then filter out reads from the raw data. The reason for this is that often there can be a single cycle with high error, and there is always higher error at the end of the read, so some actual replicates could be missed if the whole read is used.
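For illustration, the window-based approach described above could be sketched in plain Python roughly as follows. This is only a minimal sketch, not the usearch workflow itself: the function names, the `start` and `length` parameters, and the choice to keep the first read seen for each window are all assumptions made for the example, and it holds every distinct window in memory.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def derep_by_window(handle, start, length):
    """Yield the first read seen for each distinct seq[start:start+length].

    Reads too short to cover the whole window are passed through unchanged
    (an arbitrary choice for this sketch).
    """
    seen = set()
    for header, seq, qual in read_fastq(handle):
        key = seq[start:start + length]
        if len(key) < length:
            yield header, seq, qual
            continue
        if key not in seen:
            seen.add(key)
            yield header, seq, qual


# Toy data: r2 differs from r1 only in its last base, outside the window.
demo = io.StringIO(
    "@r1\nAACCGGTT\n+\nIIIIIIII\n"
    "@r2\nAACCGGTA\n+\nIIIIIII#\n"
    "@r3\nAAGGCCTT\n+\nIIIIIIII\n"
)
for header, seq, qual in derep_by_window(demo, start=2, length=4):
    print(header)
```

With paired data, the headers of the R1 reads that survive could then be used to pull the matching records out of the raw R1 and R2 files.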

Can anyone recommend a good current tool for dereplicating Illumina reads? My datasets are about 20-30 million reads each. I came across Fulcrum through a Google search; does anyone have experience with it? (paper)
greigite
02-07-2014, 10:56 AM   #2
Junior Member
Location: New Orleans

Join Date: May 2013
Posts: 5

Here is a Python script that I wrote to dereplicate larger FASTA files (on GitHub).
The only requirement is that Biopython is on your Python path. You can use the -h option for more information. An example usage would be -i somefasta.fna
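
For readers who just want the general idea, full-length FASTA dereplication can be sketched with the standard library alone. To be clear, this is not the script linked above and it does not use Biopython; the function names are made up for the example. It stores a 16-byte digest per distinct sequence rather than the sequence itself, which saves memory on large files at the cost of a (very unlikely) hash collision.

```python
import hashlib
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def derep_fasta(handle):
    """Yield the first record seen for each distinct (case-insensitive) sequence."""
    seen = set()
    for header, seq in read_fasta(handle):
        digest = hashlib.blake2b(seq.upper().encode("ascii"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield header, seq


# Toy data: record b repeats record a (different case and line wrapping).
demo = io.StringIO(">a\nACGT\nAC\n>b\nacgtac\n>c\nACGTAA\n")
for header, seq in derep_fasta(demo):
    print(">" + header)
    print(seq)
```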

I hope it helps you!
gblanchard4
02-26-2014, 04:28 PM   #3
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

The BBMap package contains a tool for dereplication. It's intended for assembly dereplication, but it is written so that it also works with paired-end FASTQ reads.

Usage: in=reads.fq out=fixed.fq maxsubs=0 int=t ac=f

If your OS does not process bash shell scripts, you can run the tool directly with "java -Xmx31g -cp /path/to/current jgi.Dedupe" in place of the script name, where 31g should be adjusted to about 80% of your physical memory.

"maxsubs=0" means that only exact matches are allowed; you can raise that number to tolerate substitutions. "int=t" indicates that the data is paired and interleaved. If your data is paired and in 2 files, you need to interleave it first: in1=reads1.fq in2=reads2.fq out=interleaved.fq
(or java -Xmx200m -cp /path/to/current jgi.ReformatReads)
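
To make "interleaved" concrete: the output alternates one record from the first file with its mate from the second. The following is only a rough Python sketch of that layout, not how the BBMap tool does it; the function names are hypothetical, and it assumes well-formed 4-line FASTQ records in the same order in both files.

```python
import io
from itertools import zip_longest


def records(handle):
    """Yield 4-line FASTQ records as lists of newline-terminated lines."""
    while True:
        rec = [handle.readline() for _ in range(4)]
        if not rec[0]:
            return
        # Normalise line endings so a missing final newline cannot glue records together.
        yield [line.rstrip("\n") + "\n" for line in rec]


def interleave(r1, r2, out):
    """Write R1/R2 records alternately to out; return the number of pairs written."""
    pairs = 0
    for rec1, rec2 in zip_longest(records(r1), records(r2)):
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of records")
        out.writelines(rec1)
        out.writelines(rec2)
        pairs += 1
    return pairs


# Toy data standing in for reads1.fq and reads2.fq.
r1 = io.StringIO("@a/1\nAC\n+\nII\n@b/1\nGT\n+\nII\n")
r2 = io.StringIO("@a/2\nTT\n+\nII\n@b/2\nGG\n+\nII\n")
out = io.StringIO()
interleave(r1, r2, out)
print(out.getvalue(), end="")
```

With real files, the same function would take handles from open("reads1.fq"), open("reads2.fq") and open("interleaved.fq", "w").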

Other options are random subsampling to reduce the volume of data, and error-correction to help better detect duplicates. The BBMap package contains an error-correction program, though its effectiveness depends on the makeup of the metagenomic community and on data volume, because it relies on high kmer depth. 30 million reads for a real metagenome is pretty small.
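
As a rough illustration of the random subsampling option, here is a small Python sketch that keeps each read pair of an interleaved FASTQ file with a fixed probability, so mates always stay together. It is not a BBMap tool; the function name, the `fraction` and `seed` parameters, and the 8-lines-per-pair assumption are all choices made for the example.

```python
import io
import random


def subsample_pairs(handle, out, fraction, seed=0):
    """Copy each interleaved pair (8 lines) to out with probability `fraction`.

    Returns (pairs_kept, pairs_seen). A fixed seed makes the subset reproducible.
    """
    rng = random.Random(seed)
    kept = total = 0
    while True:
        pair = [handle.readline() for _ in range(8)]
        if not pair[0]:
            break
        total += 1
        if rng.random() < fraction:
            out.writelines(pair)
            kept += 1
    return kept, total


# Toy interleaved data: three pairs, keep roughly half of them.
demo = io.StringIO(
    "".join("@p%d/1\nAC\n+\nII\n@p%d/2\nGT\n+\nII\n" % (i, i) for i in range(3))
)
out = io.StringIO()
kept, total = subsample_pairs(demo, out, fraction=0.5, seed=1)
print("kept %d of %d pairs" % (kept, total))
```

Because each pair is an independent coin flip, the kept count only approximates fraction times the total; an exact count would need something like reservoir sampling instead.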

Last edited by Brian Bushnell; 02-18-2015 at 09:27 AM.
