**gblanchard4** · 02-07-2014, 11:56 AM

Here is a Python script that I wrote to dereplicate larger FASTA files; it is available on GitHub.
The only requirement is that BioPython is on your Python path. Use the -h option for more information. An example usage would be: derep_seqs.py -i somefasta.fna
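For readers who just want to see the idea, here is a minimal sketch of exact-match dereplication. It is not the derep_seqs.py script above: the function names `read_fasta` and `dereplicate` are made up for illustration, it uses only the standard library rather than BioPython, and it assumes that two sequences are duplicates when they are identical after upper-casing.

```python
import hashlib


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(records):
    """Keep the first record seen for each distinct sequence.

    Returns a list of (header, sequence, copies) tuples in first-seen order.
    A digest is used as the dictionary key (hypothetical design choice)
    so that long sequences are not stored twice.
    """
    seen = {}
    for header, seq in records:
        key = hashlib.sha256(seq.upper().encode()).hexdigest()
        if key in seen:
            seen[key][2] += 1
        else:
            seen[key] = [header, seq, 1]
    return [tuple(entry) for entry in seen.values()]
```

To run it on a file, something like `dereplicate(read_fasta(open("somefasta.fna")))` should work, and the copy count can be appended to each header when writing the output.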

I hope it helps you!

**Brian Bushnell** · 02-26-2014, 05:28 PM

The BBMap package contains a tool for dereplication. It's intended for assembly dereplication, but written so that it works with paired-end fastq reads.

Usage:
dedupe.sh in=reads.fq out=fixed.fq maxsubs=0 int=t ac=f

If your OS does not run bash shell scripts, you can replace "dedupe.sh" with "java -Xmx31g -cp /path/to/current jgi.Dedupe", where 31g should be adjusted to around 80% of your physical memory.

"maxsubs=0" means that only exact matches are allowed; you can make that number higher if you want. "int=t" is used to indicate that the data is paired and interleaved. If your data is paired and in 2 files, you need to interleave it first:

reformat.sh in1=reads1.fq in2=reads2.fq out=interleaved.fq
(or java -Xmx200m -cp /path/to/current jgi.ReformatReads)
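
To make clear what "interleaved" means here, a rough Python sketch follows. It is not how reformat.sh is implemented; the names `fastq_records` and `interleave` are hypothetical, and it assumes plain 4-line FASTQ records with both files listing mates in the same order.

```python
import io
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes 4-line records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(reads1, reads2, out):
    """Write read 1, then its mate from read 2, for every pair.

    Returns the number of pairs written and raises ValueError if the
    two inputs do not contain the same number of records.
    """
    mates = fastq_records(reads2)
    pairs = 0
    for record1 in fastq_records(reads1):
        record2 = next(mates, None)
        if record2 is None:
            raise ValueError("second file has fewer records than the first")
        out.writelines(record1)
        out.writelines(record2)
        pairs += 1
    if next(mates, None) is not None:
        raise ValueError("second file has more records than the first")
    return pairs


# Tiny in-memory demonstration with one made-up read pair.
demo_r1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n")
demo_r2 = io.StringIO("@r1/2\nTTGG\n+\nIIII\n")
demo_out = io.StringIO()
demo_pairs = interleave(demo_r1, demo_r2, demo_out)
```

With real data you would pass open file handles instead of the `io.StringIO` objects.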

Other options are random subsampling to reduce the volume of data, and error correction to help detect duplicates more reliably. The BBMap package contains an error-correction program (ecc.sh), though its effectiveness depends on the makeup of the metagenomic community and on data volume, because error correction needs high k-mer depth. 30 million reads is pretty small for a real metagenome.
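
As an illustration of the subsampling idea only, here is a small Python sketch that keeps each read pair with a fixed probability, so that mates stay together. The function name `subsample_pairs` and its parameters are assumptions for this example and do not correspond to any BBMap option.

```python
import random


def subsample_pairs(pairs, fraction, seed=0):
    """Yield each item of `pairs` with probability `fraction`.

    `pairs` can be any iterable in which one item represents a whole
    read pair, so both mates are kept or dropped together. A fixed
    seed makes the selection reproducible.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    for pair in pairs:
        # rng.random() is in [0, 1), so fraction=1.0 keeps everything
        # and fraction=0.0 keeps nothing.
        if rng.random() < fraction:
            yield pair
```

Because the decision is made pair by pair, the input can be streamed and the output size is only approximately `fraction` times the input size.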


Dereplication tools needed
