
  • Advice on assembling very large metagenomic dataset?

    I need to assemble a large metagenomic dataset from Illumina NextSeq reads. My read depth is approximately 20 million reads per sample (28 samples), and the concatenated R1 and R2 files are 130 GB each. Even with 64 threads and a 1000 GB memory limit, it's still not enough.

    I've been using metaSPAdes, which has been doing a great job. This is the command I ran:

    python /usr/local/packages/spades-3.9.0/bin/metaspades.py -t 64 -m 1000 -1 ./paired_1.fastq -2 ./paired_2.fastq -o . > spades.log
    It crashed and here's the end of the output log:

    ==> spades.log <==
    576G / 944G INFO General (distance_estimation.cpp : 226) Processing library #0
    576G / 944G INFO General (distance_estimation.cpp : 132) Weight Filter Done
    576G / 944G INFO DistanceEstimator (distance_estimation.hpp : 185) Using SIMPLE distance estimator
    <jemalloc>: Error in malloc(): out of memory. Requested: 256, active: 933731762176
    It's obviously a memory issue. Has anyone had any success with (1) another assembler; (2) a method to collapse the data beforehand; or (3) data processing that could give unbiased assemblies?

    I do not want to assemble in stages because it is difficult to collapse the data into a single dataset.

    We thought about randomly subsampling the R1 and R2 reads, but is there another method?

    This method for unsupervised clustering of the reads beforehand seems interesting, but I haven't seen any application-based implementations.
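
    For reference, a minimal sketch of the random subsampling idea, assuming plain 4-line FASTQ records and R1/R2 files that list mates in the same order. The function names, the output paths and the fraction are hypothetical; the point is only that one random draw per pair keeps mates in sync. Existing tools such as seqtk offer the same operation.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield one FASTQ record at a time as a tuple of 4 lines."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_path, r2_path, out1_path, out2_path, fraction, seed=0):
    """Keep each read pair with probability `fraction`; return pairs kept."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = 0
    with open(r1_path) as r1, open(r2_path) as r2, \
            open(out1_path, "w") as out1, open(out2_path, "w") as out2:
        # Stream both files in lockstep so memory use stays constant.
        for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
            # A single draw decides the fate of both mates.
            if rng.random() < fraction:
                out1.writelines(rec1)
                out2.writelines(rec2)
                kept += 1
    return kept
```

    With the file names from the command above, a call might look like subsample_pairs("paired_1.fastq", "paired_2.fastq", "sub_1.fastq", "sub_2.fastq", fraction=0.25, seed=42). Uniform subsampling thins every organism equally, so low-abundance genomes lose coverage first.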
    Last edited by jol.espinoz; 03-01-2017, 12:16 PM.

  • #2
    There are several possible approaches here. First, you can try other assemblers:

    MEGAHIT - we use this routinely for metagenome assemblies because its resource requirements (time and memory) are much lower than those of SPAdes.

    Disco - an overlap-based assembler designed for metagenomes, which needs roughly as much memory as the size of the input data.

    Second, you can reduce the memory footprint of the data through preprocessing. This involves filtering and trimming the data, and potentially error-correcting it and/or discarding reads with very high coverage or with coverage too low to assemble. An example is posted here; at least, the first 5 steps. For a large metagenome, I also recommend removing human reads (just prior to error-correction) as a way to reduce memory consumption.

    Normalization can be done like this:

    Code:
    bbnorm.sh in1=./paired_1.fastq in2=./paired_2.fastq out=normalized.fq target=100 min=3
    That will reduce coverage to a maximum of 100x and discard reads with coverage under 3x, which can greatly increase speed and reduce memory consumption. Sometimes it also results in a better assembly, but that depends on the data. Ideally, normalization should be done after error-correction.
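
    To make the effect of target and min concrete, here is a toy sketch of depth normalization. It is not BBNorm's actual implementation: it assumes a read's depth can be estimated as the median count of its k-mers, holds exact counts in memory, and ignores reverse complements and pairing. The function names and the small k are illustrative only.

```python
import random
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=5, target=100, min_depth=3, seed=0):
    """Toy normalization: drop low-depth reads, downsample high-depth ones.

    `reads` must be a list, because it is traversed twice.
    """
    rng = random.Random(seed)
    # First pass: count every k-mer across the whole dataset.
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = []
    # Second pass: decide read by read.
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # shorter than k, so no depth estimate is possible
        # Estimate the read's depth as the median count of its k-mers.
        depth = median(counts[km] for km in read_kmers)
        if depth < min_depth:
            continue  # below the minimum depth: discard
        # Reads at or under the target are always kept; deeper reads are
        # kept with probability target/depth, which caps expected depth.
        if depth <= target or rng.random() < target / depth:
            kept.append(read)
    return kept
```

    In this sketch a read seen 1000 times is kept about one time in ten with target=100, a read seen once is dropped with min_depth=3, and anything in between passes through unchanged.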
