SEQanswers > Applications Forums > Metagenomics

Old 03-01-2017, 11:14 AM   #1
Junior Member
Location: La Jolla

Join Date: Mar 2017
Posts: 2
Default Advice on assembling very large metagenomic dataset?

I need to assemble a large metagenomic dataset from Illumina NextSeq reads. Read depth is approximately 20 million reads per sample across 28 samples, and the concatenated R1 and R2 files are 130 GB each. I'm using 64 threads and it's still not enough.

I've been using metaSPAdes, which has been doing a great job. This is the command I ran:

python /usr/local/packages/spades-3.9.0/bin/metaspades.py -t 64 -m 1000 -1 ./paired_1.fastq -2 ./paired_2.fastq -o . > spades.log
It crashed and here's the end of the output log:

==> spades.log <==
576G / 944G INFO General (distance_estimation.cpp : 226) Processing library #0
576G / 944G INFO General (distance_estimation.cpp : 132) Weight Filter Done
576G / 944G INFO DistanceEstimator (distance_estimation.hpp : 185) Using SIMPLE distance estimator
<jemalloc>: Error in malloc(): out of memory. Requested: 256, active: 933731762176
It's obviously a memory issue. Has anyone had success with (1) another assembler, (2) a method to collapse the data beforehand, or (3) preprocessing that could still give unbiased assemblies?

I do not want to assemble in stages because it is difficult to collapse the data into a single dataset.

We thought about randomly subsampling R1 and R2 read pairs, but is there another method?
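If we did go the subsampling route, the key point is keeping R1 and R2 synchronized: both mates of a pair must be kept or dropped together (tools like seqtk sample do this when run with the same random seed on both files). A minimal Python sketch of the idea, with a made-up function name:

```python
import random

def subsample_pairs(r1_records, r2_records, fraction, seed=100):
    """Randomly keep `fraction` of read pairs, dropping both mates
    of a pair together so R1 and R2 stay synchronized."""
    assert len(r1_records) == len(r2_records)
    rng = random.Random(seed)  # fixed seed -> reproducible subsample
    kept1, kept2 = [], []
    for rec1, rec2 in zip(r1_records, r2_records):
        if rng.random() < fraction:  # one decision applied to the whole pair
            kept1.append(rec1)
            kept2.append(rec2)
    return kept1, kept2
```

For files of this size you would stream the FASTQ records rather than hold lists in memory, but the pairing logic is the same.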

This method for unsupervised clustering of the reads beforehand seems interesting, but I haven't seen any practical, application-ready implementations.

Last edited by jol.espinoz; 03-01-2017 at 11:16 AM.
Old 03-01-2017, 02:47 PM   #2
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553

There are several possible approaches here. First, you can try other assemblers:

Megahit - we use this routinely for metagenome assemblies because the resource requirements (time and memory) are much lower than those of SPAdes.

Disco - an overlap-based assembler designed for metagenomes, whose memory usage is roughly similar to the size of the input data.

Second, you can reduce the memory footprint of the data through preprocessing. This involves filtering and trimming the data, and potentially error-correcting it and/or discarding reads whose coverage is very high, or too low to assemble. An example is posted here; at least, the first 5 steps apply. For a large metagenome, I also recommend removing human reads (just prior to error-correction) as a way to reduce memory consumption.
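As a toy illustration of the trimming/filtering step (real tools such as BBDuk also handle adapter removal and are far more efficient; the function name and thresholds here are made-up examples):

```python
def quality_trim(seq, quals, min_qual=10, min_len=50):
    """Trim low-quality bases (Phred scores) off the 3' end, then
    discard the read entirely if what remains is too short to be
    useful for assembly. Returns (seq, quals) or None if discarded."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1  # chop trailing low-quality bases
    if end < min_len:
        return None  # too short to assemble: dropping it shrinks the dataset
    return seq[:end], quals[:end]
```

Every base and read removed this way is memory the assembler never has to spend on likely-erroneous k-mers.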

Normalization can be done like this:

Code: bbnorm.sh in1=./paired_1.fastq in2=./paired_2.fastq out=normalized.fq target=100 min=3
That will reduce coverage to a maximum of 100x and discard reads with coverage under 3x, which can greatly increase speed and reduce memory consumption. Sometimes it also results in a better assembly, but that depends on the data. Ideally, normalization should be done after error correction.
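For intuition about what coverage normalization is doing, here is a conceptual one-pass sketch: stream the reads, track the k-mer counts of reads kept so far, and keep a new read only while the median count of its k-mers is still below the target. (BBNorm's real implementation uses approximate counting structures and multiple passes; this toy version is mine, not its algorithm.)

```python
from collections import defaultdict

def normalize(reads, target=100, k=21):
    """One-pass digital normalization sketch: keep a read only if the
    median count of its k-mers among reads kept so far is below
    `target`, capping coverage at roughly `target`x."""
    counts = defaultdict(int)
    kept = []
    for seq in reads:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        median = sorted(counts[x] for x in kmers)[len(kmers) // 2]
        if median < target:  # region not yet saturated: keep the read
            kept.append(seq)
            for x in kmers:
                counts[x] += 1
    return kept
```

The low-coverage side (the min=3 cutoff that discards likely error reads) needs a separate counting pass and is omitted here for simplicity.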

Tags: assemblers, assembly, big data, large dataset, metagenomics
