04-18-2017, 04:20 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

There are a few options for this. First off, preprocessing can reduce the number of kmers present, which typically reduces memory requirements (some example command lines are sketched after the list):

Adapter-trimming
Quality-trimming (at least to get rid of those Q2 trailing bases)
Contaminant removal (even if your dataset is 0.1% human, that's still the whole human genome...)
Normalization (helpful if you have a few organisms with extremely high coverage that constitute the bulk of the data; this happens in some metagenomes)
Error-correction
Read merging (useful for many assemblers, but it generally has a negative impact on Megahit; it should still reduce the kmer space, though)
Duplicate removal, if the library was PCR-amplified or was sequenced on a platform prone to duplicates, such as NextSeq, HiSeq 3000/4000, or NovaSeq
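
For a rough idea of what the first few steps might look like with BBTools, assuming paired files r1.fq.gz/r2.fq.gz (the file names, reference files, and parameters here are just placeholders to adapt to your data; check each tool's usage message for details):

Code:
# Adapter-trimming plus quality-trimming (removes those trailing Q2 bases)
bbduk.sh in1=r1.fq.gz in2=r2.fq.gz out1=trimmed_1.fq.gz out2=trimmed_2.fq.gz \
  ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=r trimq=10

# Contaminant removal: map against a (masked) human reference, keep only unmapped reads
bbmap.sh in1=trimmed_1.fq.gz in2=trimmed_2.fq.gz ref=human_masked.fa minid=0.95 \
  outu1=clean_1.fq.gz outu2=clean_2.fq.gz outm=human.fq.gz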

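The remaining steps could look something like the following; these are independent examples rather than a fixed pipeline order, again with placeholder names and parameters:

Code:
# Normalization to a target depth, if a few very-high-coverage organisms dominate
bbnorm.sh in=clean_1.fq.gz in2=clean_2.fq.gz out=norm_1.fq.gz out2=norm_2.fq.gz target=100 min=5

# Error-correction
tadpole.sh in=clean_1.fq.gz in2=clean_2.fq.gz out=ecc_1.fq.gz out2=ecc_2.fq.gz mode=correct

# Read merging (probably skip this one if you are using Megahit)
bbmerge.sh in1=clean_1.fq.gz in2=clean_2.fq.gz out=merged.fq.gz outu=unmerged.fq.gz

# Duplicate removal; add the "optical" flag for NextSeq/HiSeq3000-4000/NovaSeq tile-edge duplicates
clumpify.sh in=clean_1.fq.gz in2=clean_2.fq.gz out=dedup_1.fq.gz out2=dedup_2.fq.gz dedupe
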
All of these will reduce the data volume and kmer space somewhat. If they are not sufficient, you can also discard reads that won't assemble; for example, those with a kmer depth of 1 across the entire read. Dividing the reads randomly is generally not a good idea, but there are some read-based binning tools that use features such as tetramers and depth to try to bin reads by organism prior to assembly. There are also some distributed assemblers, like Ray, Disco, and MetaHipMer, that allow you to use memory across multiple nodes. Generating a kmer-depth histogram can help indicate what kind of preprocessing and assembly strategies might be useful.
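
For example, generating the histogram and doing a low-depth ("high-pass") filter might look roughly like this (again, placeholder names and parameters; verify the flags against the tools' usage messages):

Code:
# Kmer-depth histogram of the preprocessed reads
khist.sh in=clean_1.fq.gz in2=clean_2.fq.gz khist=khist.txt

# High-pass filter: discard reads with apparent kmer depth below 2, which won't assemble anyway
bbnorm.sh in=clean_1.fq.gz in2=clean_2.fq.gz out=highpass_1.fq.gz out2=highpass_2.fq.gz \
  target=999999999 min=2 passes=1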