**gblanchard4** · 02-07-2014, 11:56 AM

Here is a Python script that I wrote to dereplicate larger FASTA files (hosted on GitHub).
The only requirement is that BioPython be on your Python path. Use the -h option for more information. An example usage would be: derep_seqs.py -i somefasta.fna

I hope it helps you!
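For readers who want to see the idea behind exact dereplication, here is a minimal sketch in plain Python. It is not the derep_seqs.py script mentioned above and does not use BioPython; the function names `read_fasta` and `dereplicate` and the `count=` header annotation are made up for illustration. It keeps the first record seen for each distinct sequence (case-insensitive) and counts the copies.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def dereplicate(records):
    """Keep the first header for each distinct sequence and count its copies."""
    first_header = {}  # upper-cased sequence -> header of its first occurrence
    counts = {}        # upper-cased sequence -> number of identical copies
    for header, seq in records:
        key = seq.upper()
        if key not in first_header:
            first_header[key] = header
            counts[key] = 0
        counts[key] += 1
    # Dicts preserve insertion order, so output follows first appearance.
    return [(first_header[key], key, counts[key]) for key in first_header]

# Tiny in-memory example; a real run would pass an open file instead.
example = io.StringIO(">a\nACGT\n>b\nacgt\n>c\nTTTT\n")
unique = dereplicate(read_fasta(example))
for header, seq, n in unique:
    print(f">{header} count={n}\n{seq}")
```

This holds every distinct sequence in memory, so for very large files a hash of each sequence could be stored as the key instead.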

**Brian Bushnell** · 02-26-2014, 05:28 PM

The BBMap package contains a tool for dereplication. It's intended for assembly dereplication, but it is written so that it also works with paired-end fastq reads.

Usage:
dedupe.sh in=reads.fq out=fixed.fq maxsubs=0 int=t ac=f

If your OS does not process bash shell scripts, you can replace "dedupe.sh" with "java -Xmx31g -cp /path/to/current jgi.Dedupe", where 31g should be adjusted to around 80% of your physical memory.
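As a rough illustration of the 80% rule of thumb, the sketch below turns a physical memory size into a -Xmx value in whole gigabytes. The helper name `xmx_flag` is hypothetical and not part of BBMap; it only does the arithmetic.

```python
def xmx_flag(physical_bytes, percent=80):
    """Return a JVM -Xmx flag sized to a percentage of physical memory.

    Integer arithmetic is used so the result is rounded down to whole
    gigabytes without floating-point surprises.
    """
    gib = physical_bytes * percent // 100 // (1024 ** 3)
    return f"-Xmx{max(gib, 1)}g"

# On Linux, physical memory could be read with
# os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES");
# here a 40 GiB machine is simply assumed.
print(xmx_flag(40 * 1024 ** 3))  # -Xmx32g
```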

"maxsubs=0" means that only exact matches are allowed; you can make that number higher if you want. "int=t" is used to indicate that the data is paired and interleaved. If your data is paired and in 2 files, you need to interleave it first:

reformat.sh in1=reads1.fq in2=reads2.fq out=interleaved.fq
(or java -Xmx200m -cp /path/to/current jgi.ReformatReads)
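To make clear what "interleaved" means here, this is a minimal Python sketch of the operation: read 1 and read 2 of each pair are written one after the other into a single stream. It is not how reformat.sh is implemented; the function names are hypothetical, and it assumes plain four-line FASTQ records with the two files in the same read order.

```python
import io
from itertools import islice

def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped FASTQ)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record

def interleave(handle1, handle2, out):
    """Write read 1 then read 2 of each pair to out; return the pair count."""
    pairs = 0
    mates = fastq_records(handle2)
    for read1 in fastq_records(handle1):
        read2 = next(mates, None)
        if read2 is None:
            raise ValueError("second file has fewer reads than the first")
        out.writelines(read1)
        out.writelines(read2)
        pairs += 1
    if next(mates, None) is not None:
        raise ValueError("second file has more reads than the first")
    return pairs

# In-memory stand-ins for reads1.fq and reads2.fq.
reads1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nTTTT\n+\nIIII\n")
reads2 = io.StringIO("@r1/2\nCCCC\n+\nIIII\n@r2/2\nGGGG\n+\nIIII\n")
interleaved = io.StringIO()
n_pairs = interleave(reads1, reads2, interleaved)
print(n_pairs)
print(interleaved.getvalue(), end="")
```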

Other options are random subsampling, to reduce the volume of data, and error correction, to help detect duplicates more reliably. The BBMap package contains an error-correction program (ecc.sh), though its effectiveness depends on the makeup of the metagenomic community and on data volume, since it relies on high kmer depth. 30 million reads is pretty small for a real metagenome.
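For the random subsampling option, one generic approach is reservoir sampling, which picks a fixed number of records in a single pass without knowing the total in advance. The sketch below is only an illustration of that technique, not what any BBMap tool does internally; `reservoir_sample` is a hypothetical name. For paired data, each record passed in should be a whole pair so that mates stay together.

```python
import random

def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep record i with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample

# Stand-in for a stream of reads (or read pairs).
reads = [f"read{i}" for i in range(1000)]
picked = reservoir_sample(reads, 10, seed=1)
print(len(picked))
```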


Dereplication tools needed
