SEQanswers

Old 04-18-2017, 02:13 PM   #1
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 2
Default coassemble massive amount of data (>3TB)

Hello, I am attempting to assemble a massive amount of Illumina 2x150 bp paired-end read data (>3 TB). I am considering using Megahit, as it is the least resource-intensive assembler I have used and it still gives reasonably good results.

What are the typical strategies when the dataset is too large for typical resource limits? I am thinking of dividing the reads into smaller pools, but of course that is not ideal. Thanks
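For reference, this is roughly the kind of Megahit invocation I have in mind (file names, thread count, and memory fraction below are just placeholders, not my actual command):

Code:
megahit -1 pooled_1.fq.gz -2 pooled_2.fq.gz -o megahit_out \
  --presets meta-large --kmin-1pass -m 0.9 -t 32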
Old 04-18-2017, 04:20 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,473
Default

There are a few options for this. First off, preprocessing can reduce the number of kmers present, which typically reduces memory requirements (rough example commands follow the list):

Adapter-trimming
Quality-trimming (at least to get rid of those Q2 trailing bases)
Contaminant removal (even if your dataset is 0.1% human, that's still the whole human genome...)
Normalization (helpful if you have a few organisms with extremely high coverage that constitute the bulk of the data; this happens in some metagenomes)
Error-correction
Read merging (useful for many assemblers, but it generally has a negative impact on Megahit; it should still reduce the kmer space).
Duplicate removal, if the library is PCR-amplified or for certain platforms like NextSeq, HiSeq3000/4000, or NovaSeq.
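
For concreteness, here is a rough sketch of what those steps can look like with BBTools. File names, reference files, and trimming thresholds are placeholders to adapt to your data; this is an illustration, not a tested pipeline:

Code:
# Adapter trimming + quality trimming (removes the Q2 trailing bases)
bbduk.sh in=raw_1.fq.gz in2=raw_2.fq.gz out=trim_1.fq.gz out2=trim_2.fq.gz \
  ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=r trimq=10 minlen=50

# Contaminant removal by kmer matching against a reference (e.g. masked human)
bbduk.sh in=trim_1.fq.gz in2=trim_2.fq.gz out=clean_1.fq.gz out2=clean_2.fq.gz \
  ref=human_masked.fa k=31 hdist=1

# Duplicate removal (PCR or patterned-flowcell duplicates)
clumpify.sh in=clean_1.fq.gz in2=clean_2.fq.gz out=dedup_1.fq.gz out2=dedup_2.fq.gz dedupe=t

# Error correction
tadpole.sh in=dedup_1.fq.gz in2=dedup_2.fq.gz out=ecc_1.fq.gz out2=ecc_2.fq.gz mode=correct

# Normalization (flattens extreme coverage; discards reads below the min depth)
bbnorm.sh in=ecc_1.fq.gz in2=ecc_2.fq.gz out=norm_1.fq.gz out2=norm_2.fq.gz \
  target=100 min=2 prefilter=t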

All of these will reduce the data volume and kmer space somewhat. If they are not sufficient, you can also discard reads that won't assemble, for example those with a kmer depth of 1 across the entire read. Dividing the reads randomly is generally not a good idea, but there are read-based binning tools that use features such as tetranucleotide frequency and depth to bin reads by organism prior to assembly. There are also distributed assemblers, such as Ray, Disco, and MetaHipMer, that allow you to use memory across multiple nodes. Generating a kmer-depth histogram can help indicate which preprocessing and assembly strategies might be useful.
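
For the kmer-depth histogram and the depth-1 read filtering mentioned above, something along these lines can work (khist.sh is the BBNorm-based histogram wrapper; file names are placeholders and the exact flag spellings are from memory, so check each script's built-in help):

Code:
# Kmer depth histogram; prefilter=t reduces memory on very large inputs
khist.sh in=reads_1.fq.gz in2=reads_2.fq.gz hist=khist.txt peaks=peaks.txt prefilter=t

# High-pass filter: discard reads with apparent kmer depth below 2
bbnorm.sh in=reads_1.fq.gz in2=reads_2.fq.gz out=filtered_1.fq.gz out2=filtered_2.fq.gz \
  target=999999 min=2 passes=1 prefilter=t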
Old 04-18-2017, 05:36 PM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,239
Default

What is the expected genome size/ploidy level? Massive oversampling of data will not guarantee good assemblies.
Old 04-19-2017, 04:01 PM   #4
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 2
Default

It's more like a collection of environmental microbiome datasets, so there are no real expectations for genome size or ploidy level.

@Brian, I have found that at this size, normalization (I use BBNorm) becomes so difficult that it would almost certainly exceed my university cluster's time limit (7 days); I could not finish the job even after down-sampling the total reads 10x. I suppose I could normalize each sample individually (since they were amplified individually), pool the normalized reads, and maybe run another round of normalization in case any "duplication" happens between samples (roughly the loop sketched below). It also seems to me that read binning would achieve something very similar to normalization anyway; algorithmically I can't see it being more time- or memory-efficient.
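The per-sample idea would look roughly like this; sample names, paths, and the target depth are made up:

Code:
# Normalize each sample on its own, then pool the normalized reads
for s in sampleA sampleB sampleC; do
  bbnorm.sh in=${s}_1.fq.gz in2=${s}_2.fq.gz \
    out=${s}_norm_1.fq.gz out2=${s}_norm_2.fq.gz target=100 min=2 prefilter=t
done
cat *_norm_1.fq.gz > pooled_1.fq.gz
cat *_norm_2.fq.gz > pooled_2.fq.gz
# then possibly a second normalization pass on the pooled files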

I was also not able to generate a kmer-depth histogram when dealing with the multi-TB dataset directly. Perhaps you know of something much more efficient?

Thanks

Tags
assembly, big data
