Advice on assembling very large metagenomic dataset?

jol.espinoz

Junior Member

Join Date: Mar 2017

Posts: 2
#1

03-01-2017, 12:14 PM

I need to assemble a large metagenomics dataset from Illumina NextSeq reads. My read depth is approximately 20 million reads per sample (28 samples), and the concatenated R1 and R2 files are 130 GB each. I'm using 64 threads and it's still not enough.

I've been using metaSPAdes, which has been doing a great job. This is the command I ran:

python /usr/local/packages/spades-3.9.0/bin/metaspades.py -t 64 -m 1000 -1 ./paired_1.fastq -2 ./paired_2.fastq -o . > spades.log

It crashed and here's the end of the output log:

==> spades.log <==
576G / 944G INFO General (distance_estimation.cpp : 226) Processing library #0
576G / 944G INFO General (distance_estimation.cpp : 132) Weight Filter Done
576G / 944G INFO DistanceEstimator (distance_estimation.hpp : 185) Using SIMPLE distance estimator
<jemalloc>: Error in malloc(): out of memory. Requested: 256, active: 933731762176
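For scale, the `active` figure in the jemalloc line can be converted to more familiar units. The sketch below is only arithmetic on the number printed in the log; the helper name `to_units` is made up for illustration.

```python
# Byte count copied from the jemalloc error line in the log above.
active_bytes = 933731762176


def to_units(n_bytes):
    """Return the size in decimal gigabytes (GB) and binary gibibytes (GiB)."""
    return n_bytes / 1e9, n_bytes / 2**30


gb, gib = to_units(active_bytes)
print(f"{gb:.1f} GB, {gib:.1f} GiB")  # prints "933.7 GB, 869.6 GiB"
```

So the allocator was holding roughly 870 GiB when the final 256-byte request failed, which is close to the `-m 1000` limit passed to metaspades.py.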

It's obviously a memory issue. Has anyone had any success with: (1) another assembler; (2) a method to collapse the data beforehand; or (3) data processing that could still give unbiased assemblies?

I do not want to assemble in stages because it is difficult to collapse the data into a single dataset.

We thought about randomly subsampling the R1 and R2 reads, but is there another method?

This method for unsupervised clustering of the reads beforehand seems interesting, but I haven't seen any application-based implementations.
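A minimal sketch of random subsampling that keeps R1 and R2 in sync, assuming plain 4-line FASTQ records and that both files list the pairs in the same order. The function names and the `fraction` and `seed` parameters are illustrative, not taken from any tool mentioned in this thread.

```python
import random


def read_fastq(handle):
    """Yield one FASTQ record at a time as a list of 4 lines."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:  # end of file
            return
        yield record


def subsample_pairs(r1_path, r2_path, out1_path, out2_path, fraction, seed=0):
    """Keep each read pair with probability `fraction`; return pairs kept."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = 0
    with open(r1_path) as r1, open(r2_path) as r2, \
            open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        # Walk both files in lockstep so one random draw decides the
        # fate of both mates and the outputs stay properly paired.
        for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

The point of the design is that the keep/discard decision is made once per pair rather than once per file, which is what goes wrong if R1 and R2 are sampled independently. Uniform subsampling lowers coverage of every organism by the same factor, so rare community members may drop below assemblable depth.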

Last edited by jol.espinoz; 03-01-2017, 12:16 PM.
Tags: assemblers, assembly, big data, large dataset, metagenomics
Brian Bushnell

Super Moderator

Join Date: Jan 2014

Posts: 2709
#2

03-01-2017, 03:47 PM

There are several possible approaches here. First, you can try other assemblers:

MEGAHIT - we use this routinely for metagenome assemblies because its resource requirements (time and memory) are much lower than those of SPAdes.

Disco - an overlap-based assembler designed for metagenomes, which uses an amount of memory roughly equal to the size of the input data.

Second, you can reduce the memory footprint of the data through preprocessing. This involves filtering and trimming the data, and potentially error-correcting it and/or discarding reads whose coverage is very high or too low to assemble. An example is posted here; at least, the first 5 steps. For a large metagenome, I also recommend removing human reads (just prior to error-correction) as a way to reduce memory consumption.

Normalization can be done like this:

Code:

bbnorm.sh in1=./paired_1.fastq in2=./paired_2.fastq out=normalized.fq target=100 min=3

That will reduce coverage to a maximum of 100x and discard reads with coverage under 3x, which can greatly increase speed and reduce memory consumption. Sometimes it also results in a better assembly, but that depends on the data. Ideally, normalization should be done after error-correction.
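To make the `target` and `min` semantics concrete, here is a toy illustration of depth-based normalization. This is not BBNorm's implementation: it assumes the reads fit in memory, counts exact k-mers with a `Counter`, ignores read pairing, and uses the median k-mer count of a read as its depth estimate. All names and the default `k=5` are chosen only for the example.

```python
import random
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=5, target=100, min_depth=3, seed=0):
    """Drop low-depth reads and thin high-depth reads toward `target`."""
    rng = random.Random(seed)
    # Pass 1: count every k-mer across the whole dataset.
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = []
    # Pass 2: estimate each read's depth and decide whether to keep it.
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue  # read is shorter than k
        depth = median(counts[km] for km in kms)
        if depth < min_depth:
            continue  # too little coverage to assemble (likely errors or rare)
        # Reads at or below the target are always kept; deeper reads are
        # kept with probability target/depth, flattening coverage.
        if depth <= target or rng.random() < target / depth:
            kept.append(read)
    return kept
```

With `target=100` and `min_depth=3`, a sequence present 1000 times is thinned to roughly 100 copies, one present 10 times is kept in full, and one present twice is discarded.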