Old 02-26-2014, 04:28 PM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

The BBMap package contains a tool for dereplication. It's intended for assembly dereplication, but it is written so that it also works with paired-end fastq reads.

Usage:
dedupe.sh in=reads.fq out=fixed.fq maxsubs=0 int=t ac=f

If your OS does not process bash shell scripts, you can replace "dedupe.sh" with "java -Xmx31g -cp /path/to/current jgi.Dedupe", where 31g should be adjusted to around 80% of your physical memory.
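
To make the "around 80% of physical memory" rule concrete, here is a minimal sketch that derives the -Xmx value on Linux. It assumes /proc/meminfo is available and reports MemTotal in kB; the helper name xmx_from_kb is made up for this example and is not part of BBMap. It only prints the command so you can check it before running it.

```shell
#!/bin/sh
# Sketch only: derive a -Xmx value of roughly 80% of physical RAM (Linux).
# xmx_from_kb is a hypothetical helper, not something shipped with BBMap.

# Print 80% of a kB figure as whole gigabytes plus "g" (never below 1g).
xmx_from_kb() {
    gb=$(( $1 * 80 / 100 / 1024 / 1024 ))
    if [ "$gb" -lt 1 ]; then
        gb=1
    fi
    echo "${gb}g"
}

if [ -r /proc/meminfo ]; then
    # MemTotal is reported in kB on Linux.
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    xmx=$(xmx_from_kb "$total_kb")
    # Print the command rather than running it; /path/to/current is the
    # same placeholder used in the post.
    echo "java -Xmx${xmx} -cp /path/to/current jgi.Dedupe in=reads.fq out=fixed.fq maxsubs=0 int=t ac=f"
else
    echo "No /proc/meminfo here; set -Xmx by hand." >&2
fi
```

On a machine with 40 GB of RAM this prints a command with -Xmx32g.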

"maxsubs=0" means that only exact matches are allowed; you can make that number higher if you want. "int=t" is used to indicate that the data is paired and interleaved. If your data is paired and in 2 files, you need to interleave it first:

reformat.sh in1=reads1.fq in2=reads2.fq out=interleaved.fq
(or java -Xmx200m -cp /path/to/current jgi.ReformatReads)
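
For anyone unsure what "interleaved" means in terms of file layout, the sketch below does the same thing with plain awk: it writes record 1 of the first file, then record 1 of the second, and so on. It assumes strict 4-line fastq records (no wrapped sequence lines) and the same read order in both files; interleave_fastq is a name made up for this example. Use reformat.sh for real data, since it is the tool intended for this.

```shell
#!/bin/sh
# Sketch only: interleave two 4-line-per-record fastq files.
# interleave_fastq is a hypothetical helper, not part of BBMap.
# Usage: interleave_fastq reads1.fq reads2.fq > interleaved.fq
interleave_fastq() {
    awk -v mate="$2" '
        { print }                       # copy the current line of the first file
        NR % 4 == 0 {                   # a full 4-line record has been written
            for (i = 0; i < 4; i++)     # so emit the matching record of the mate file
                if ((getline line < mate) > 0) print line
        }
    ' "$1"
}
```

The result has each read immediately followed by its mate, which is the layout "int=t" expects.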

Other options are random subsampling to reduce the volume of data, and error-correction to help better detect duplicates. The BBMap package contains an error-correction program (ecc.sh), though its effectiveness depends on the makeup of the metagenomic community and the data volume, because it needs high kmer depth. 30 million reads is pretty small for a real metagenome.
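
On the subsampling point: with interleaved data, mates have to be kept or dropped together. The sketch below is a generic awk illustration of that idea, not a description of how any BBMap tool does it. It assumes an interleaved fastq with strict 4-line records, so each pair occupies 8 consecutive lines; subsample_pairs and its arguments are made up for this example.

```shell
#!/bin/sh
# Sketch only: randomly subsample an interleaved fastq, keeping pairs intact.
# subsample_pairs is a hypothetical helper, not part of BBMap.
# Usage: subsample_pairs interleaved.fq 0.1 [seed] > sampled.fq
subsample_pairs() {
    awk -v rate="$2" -v seed="${3:-1}" '
        BEGIN { srand(seed) }                    # fixed seed makes the run repeatable
        NR % 8 == 1 { keep = (rand() < rate) }   # decide once per 8-line pair
        keep { print }                           # print all 8 lines or none of them
    ' "$1"
}
```

A rate of 0.1 keeps roughly 10% of the pairs; the exact number varies with the seed.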

Last edited by Brian Bushnell; 02-18-2015 at 09:27 AM.