SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Illumina/Solexa (http://seqanswers.com/forums/forumdisplay.php?f=6)
-   -   Dereplication tools needed (http://seqanswers.com/forums/showthread.php?t=39270)

greigite 12-17-2013 08:43 AM

Dereplication tools needed
 
It's been a while since I got my hands on shotgun Illumina metagenomic data. I've found that it's quite important to dereplicate before doing any downstream analysis to avoid problems with assembly and inaccurate quantification. The last time around I used usearch --derep_fulllength on a subset of the data to filter out artificial replicate reads, but it is choking on the larger datasets I have now. My approach was to identify a high quality subsection of R1 and dereplicate that, then filter out reads from the raw data. The reason for this is that often there can be a single cycle with high error, and there is always higher error at the end of the read, so some actual replicates could be missed if the whole read is used.

Can anyone recommend a good current tool for dereplicating Illumina reads? My datasets are about 20-30 million reads each. I came across Fulcrum with google search--any experiences with that? (paper)

gblanchard4 02-07-2014 10:56 AM

Here is a python script that I wrote to dereplicate larger fastas github
The only requirement is BioPython needs to be in your python path. You can use the -h option for more information. An example usage would be derep_seqs.py -i somefasta.fna

I hope it helps you!

Brian Bushnell 02-26-2014 04:28 PM

The BBMap package contains a tool for dereplication. It's intended for assembly dereplication, but written so that it works with paired-end fastq reads.

Usage:
dedupe.sh in=reads.fq out=fixed.fq maxsubs=0 int=t ac=f

If your OS does not process bash shellscripts, you can replace "dedupe.sh" with "java -Xmx31g -cp /path/to/current jgi.Dedupe" where 31g should be adjusted to be around 80% of your physical memory.

"maxsubs=0" means that only exact matches are allowed; you can make that number higher if you want. "int=t" is used to indicate that the data is paired and interleaved. If your data is paired and in 2 files, you need to interleave it first:

reformat.sh in1=reads1.fq in2=reads2.fq out=interleaved.fq
(or java -Xmx200m -cp /path/to/current jgi.ReformatReads)

Other options are random subsampling to reduce the volume of data, and error-correction to help better detect duplicates. The BBMap package contains an error-correction program (ecc.sh), though the effectiveness depends on the makeup of the metagenomic community and data volume, as it depends on high kmer depth. 30m reads for a real metagenome is pretty small.


All times are GMT -8. The time now is 01:26 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.