SEQanswers

Old 04-18-2017, 03:13 PM   #1
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 9
Default coassemble massive amount of data (>3TB)

Hello, I am attempting to assemble a massive amount of Illumina 2x150bp paired-end read data (>3TB). I am considering using Megahit, as it is one of the least resource-intensive assemblers I have used and it still gives reasonably good results.

What are the typical strategies for assembling a dataset that is beyond the usual size limits? I am thinking of dividing the reads into smaller pools, but of course that's not ideal. Thanks.
confurious is offline   Reply With Quote
Old 04-18-2017, 05:20 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

There are a few options for this. First off, preprocessing can reduce the number of kmers present, which typically reduces memory requirements:

Adapter-trimming
Quality-trimming (at least to get rid of those Q2 trailing bases)
Contaminant removal (even if your dataset is 0.1% human, that's still the whole human genome...)
Normalization (helpful if you have a few organisms with extremely high coverage that constitute the bulk of the data; this happens in some metagenomes)
Error-correction
Read merging (useful for many assemblers, but generally has a negative impact on Megahit; it should still reduce the kmer space, though).
Duplicate removal, if the library is PCR-amplified or for certain platforms like NextSeq, HiSeq3000/4000, or NovaSeq.

All of these will reduce the data volume and kmer space somewhat. If they are not sufficient, you can also discard reads that won't assemble; for example, those with a kmer depth of 1 across the entire read. Dividing randomly is generally not a good idea, but there are some read-based binning tools that use features such as tetramers and depth to try to bin reads by organism prior to assembly. There are also some distributed assemblers, like Ray, Disco, and MetaHipMer, that allow you to use memory across multiple nodes. Generating a kmer-depth histogram can help indicate what kind of preprocessing and assembly strategies might be useful.
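For concreteness, one possible chain of BBTools commands for these steps might look something like the sketch below; the file names, flag values, and ordering are illustrative and would need tuning for a real metagenome:

Code:
# adapter- and quality-trimming (adapters.fa = adapter reference shipped with BBTools; path assumed)
bbduk.sh in=raw.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 qtrim=r trimq=10 minlen=50
# contaminant removal: map against a human reference and keep only the unmapped reads
bbmap.sh ref=human.fa in=trimmed.fq outu=nonhuman.fq minid=0.95
# duplicate removal (for PCR-amplified or patterned-flowcell libraries)
clumpify.sh in=nonhuman.fq out=dedup.fq dedupe
# error-correction
tadpole.sh in=dedup.fq out=ecc.fq mode=correct
# (optional) read merging with bbmerge.sh; note the Megahit caveat above
# normalization of high-depth components
bbnorm.sh in=ecc.fq out=normalized.fq target=50 passes=1 prefilter=t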
Brian Bushnell is offline   Reply With Quote
Old 04-18-2017, 06:36 PM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,548
Default

What is the expected genome size/ploidy level? Having massive oversampling of data will not guarantee good assemblies.
GenoMax is offline   Reply With Quote
Old 04-19-2017, 05:01 PM   #4
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 9
Default

It's a collection of environmental microbiome datasets, so there are no real expectations for genome size/ploidy level.

@Brian, I have found that at this scale, normalization (I use BBNorm) becomes so demanding that it would almost certainly exceed the walltime limit on my university's cluster (7 days); I could not finish the job even after down-sampling the total reads 10x. I suppose I could normalize each sample individually (since they were amplified individually), pool them together, and maybe try another round of normalization in case any "duplication" happens between samples. It also seems to me that read binning would basically achieve something very similar to normalization anyway (algorithmically, I can't see it being more time- or memory-efficient)?

I was also not able to generate a kmer-depth histogram from the multi-TB dataset directly; perhaps you know of something much more efficient?

Thanks
confurious is offline   Reply With Quote
Old 04-24-2017, 11:37 AM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

The difference between binning and normalization would be that binning seeks to divide the reads into different organisms prior to assembly, so they can be assembled independently, using less time and memory per job. Normalization simply attempts to reduce the coverage of high-depth organisms/sequences, but still keeps the dataset intact. With no high-depth component, normalization will basically do nothing (unless you configure it to throw away low-depth reads, which is BBNorm's default behavior), but binning should still do something.

Working with huge datasets is tough when you have compute time limitations. But, BBNorm should process data at ~20Mbp/s or so (with "passes=1 prefilter", on a 20-core machine), which would be around 1.7Tbp/day, so it should be possible to normalize or generate a kmer-depth histogram from a several-Tbp sample in 7 days...

But, another option is to assemble independently, deduplicate those assemblies together, then co-assemble only the reads that don't map to the deduplicated combined assembly. The results won't be as good as a full co-assembly, but it is more computationally tractable.
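As a rough sketch, that last workflow might look like this, assuming Megahit for the per-sample assemblies and BBTools for deduplication and mapping (file names are illustrative, and the handling of half-mapped pairs is simplified):

Code:
# 1) assemble each sample independently
megahit -1 sampleA_R1.fq -2 sampleA_R2.fq -o asm_sampleA
megahit -1 sampleB_R1.fq -2 sampleB_R2.fq -o asm_sampleB
# 2) pool and deduplicate the resulting contigs
cat asm_sample*/final.contigs.fa > all_contigs.fa
dedupe.sh in=all_contigs.fa out=contigs_dedupe.fa
# 3) map all reads to the deduplicated contigs; keep the unmapped fraction
bbmap.sh ref=contigs_dedupe.fa in=all_reads.fq outu=unmapped.fq
# 4) co-assemble only the reads that did not map (interleaved input)
megahit --12 unmapped.fq -o asm_unmapped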

Last edited by Brian Bushnell; 04-24-2017 at 11:39 AM.
Brian Bushnell is offline   Reply With Quote
Old 04-24-2017, 12:41 PM   #6
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 9
Default

Quote:
Originally Posted by Brian Bushnell View Post
The difference between binning and normalization would be that binning seeks to divide the reads into different organisms prior to assembly, so they can be assembled independently, using less time and memory per job. [...]
This is excellent advice. I have never tried read binning, but I am very familiar with tetranucleotide signatures of genomes. Is such a thing possible even at the read level? If so, could you point me to the best software package to use?

I have indeed tried BBNorm, but it is significantly slower for me and uses far more memory than I anticipated; however, I have not used "passes=1 prefilter", and that might be why. By the way, BBNorm says that memory should not be a hard cap for the program to run, but I am unsure how much memory I should request; the best I could do is probably 1.2TB. Could you please give me some advice on that?

I have attached the message I got after 4 days of running, even though I specified only 10 million reads (the whole thing is 100 times more). So it's much slower for me. Thanks so much!

Code:
java -ea -Xmx131841m -Xms131841m -cp /home/jiangch/software/bbmap/current/ jgi.KmerNormalize bits=32 in=mega.nonhuman.fastq interleaved=true threads=16 prefilter=t fixspikes=t target=50 out=faster.fastq prefiltersize=0.5 reads=10000000
Executing jgi.KmerNormalize [bits=32, in=mega.nonhuman.fastq, interleaved=true, threads=16, prefilter=t, fixspikes=t, target=50, out=faster.fastq, prefiltersize=0.5, reads=10000000]

BBNorm version 37.02
Set INTERLEAVED to true
Set threads to 16

   ***********   Pass 1   **********   


Settings:
threads:          	16
k:                	31
deterministic:    	true
toss error reads: 	false
passes:           	1
bits per cell:    	16
cells:            	24.16B
hashes:           	3
prefilter bits:   	2
prefilter cells:  	193.29B
prefilter hashes: 	2
base min quality: 	5
kmer min prob:    	0.5

target depth:     	200
min depth:        	3
max depth:        	250
min good kmers:   	15
depth percentile: 	64.8
ignore dupe kmers:	true
fix spikes:       	false

Made prefilter:   	hashes = 2   	 mem = 44.99 GB   	cells = 193.22B   	used = 47.819%
Made hash table:  	hashes = 3   	 mem = 44.96 GB   	cells = 24.14B   	used = 87.988%
Warning:  This table is very full, which may reduce accuracy.  Ideal load is under 60% used.
For better accuracy, use the 'prefilter' flag; run on a node with more memory; quality-trim or error-correct reads; or increase the values of the minprob flag to reduce spurious kmers.  In practice you should still get good normalization results even with loads over 90%, but the histogram and statistics will be off.

Estimated kmers of depth 1-3: 	51336579492
Estimated kmers of depth 4+ : 	11502226335
Estimated unique kmers:     	62838805828

Table creation time:		290885.545 seconds.
Writing interleaved.

Last edited by Brian Bushnell; 04-24-2017 at 12:59 PM.
confurious is offline   Reply With Quote
Old 04-24-2017, 01:13 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

The flags/documentation for BBNorm might be a little misleading in this case... BBNorm acts in 3 phases per pass:

1) Prefilter construction (which finished). This uses all reads.
2) Main hashtable construction (which finished). This uses all reads.
3) Normalization (which started but did not finish). This uses the number of reads you specify with the "reads" flag.

So, the reason it took so long is that it was using all reads in phases 1 and 2 (otherwise, the depths would be incorrect).

To reduce the number of reads used in the first two phases, you would need to specify "tablereads"; e.g. "reads=10000000 tablereads=10000000". Or, more simply, just pull out the first 10m reads first:

Code:
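# pull out the first 10 million reads as a test subset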
reformat.sh in=reads.fq out=10m.fq reads=10m
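# normalize only that subset (one pass, with the prefilter to save memory)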
bbnorm.sh prefilter=t passes=1 target=50 in=10m.fq out=normalized.fq
For binning, you might try MetaBAT. Binning tools generally work better on assemblies, where tetramer frequencies are more accurate. But they can work with reads as well, particularly when you have multiple samples of the same metagenome (preferably from slightly different locations / conditions / times), as they can use read-depth covariance to assist in binning. I don't know how well it will work or how long it will take, though. Also, the longer the reads are, the better; so if the reads mostly overlap, it's prudent to merge them first.
Brian Bushnell is offline   Reply With Quote
Old 04-24-2017, 02:01 PM   #8
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 9
Default

I see, that makes a lot of sense; it explains why it was also slow when I specified only 1 million reads. I just tried adding tablereads= and it runs fast now! I noticed you are the author of BBTools, and apparently your toolkit covers the purposes of multiple scripts I have written for myself, but better implemented and covering more. RIP me! I guess it's still a great way for me to learn bioinformatics, though.

Why don't you just go for a quick application note in Bioinformatics to publish your toolkit? Or do publications no longer matter to you?
confurious is offline   Reply With Quote
Old 04-25-2017, 01:06 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

It's not that I don't want publications; it's just that writing them takes time away from development/support...

And I agree, there's no better way to learn the intricacies of a problem than by attempting to solve it yourself, whether or not the solution is optimal!
Brian Bushnell is offline   Reply With Quote
Old 04-25-2017, 01:17 PM   #10
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 9
Default

The only reason I was suggesting a publication is that it would make citation easier for me.

I am trying to use dedupe.sh on the aggregate of individually assembled contigs as an alternative strategy to co-assembly. Since it is already overlap-aware, I am wondering whether it is capable of simply "merging" the overlapping contigs, instead of reporting them in different clusters (kind of like Minimus2)? If not, would that be something you would consider adding to the module?
confurious is offline   Reply With Quote
Old 04-25-2017, 01:27 PM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I agree, I have been planning to add merging to Dedupe for a long time! Hopefully I'll have a chance sometime this year...
Brian Bushnell is offline   Reply With Quote
Old 04-29-2017, 01:53 AM   #12
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799
Default

All you need to do is package a release and put it on zenodo.org. That will give you a citable DOI and an "eternal" data record.

It's up to the journals (and reviewers) to accept such a citation in submitted research papers. For what it's worth, I haven't had a problem with this at F1000Research (in fact, they encourage it).
gringer is offline   Reply With Quote
Old 05-08-2017, 04:50 PM   #13
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 9
Default

Quote:
Originally Posted by Brian Bushnell View Post
The difference between binning and normalization would be that binning seeks to divide the reads into different organisms prior to assembly, so they can be assembled independently, using less time and memory per job. [...]
For the alternative strategy, which is to assemble the samples independently and deduplicate the contigs using dedupe.sh: should I then map the normalized reads to the deduplicated contigs, or the original reads (10x bigger)? It makes sense to me to just go with the normalized reads, since I want to co-assemble them afterwards anyway.

Thanks.
confurious is offline   Reply With Quote
Old 05-08-2017, 05:00 PM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

If you plan to normalize the reads after mapping but before assembly anyway, then yes, it's logical to just map with the normalized reads... but only if the reads were normalized together (which they probably weren't). If they were normalized independently, you need to either combine them all, then normalize, then map, or map, then combine them all, then normalize. Normalizing independently will lose the low-coverage part of each library (depending on your settings) that could still be assembled once the libraries are combined.
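In command form, the "combine, then normalize, then map" ordering might look roughly like this (file names are illustrative; contigs_dedupe.fa stands for the deduplicated combined assembly):

Code:
cat sample*.fq > combined.fq
bbnorm.sh in=combined.fq out=combined_norm.fq target=50 passes=1 prefilter=t
bbmap.sh ref=contigs_dedupe.fa in=combined_norm.fq outu=unmapped.fq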
Brian Bushnell is offline   Reply With Quote
Old 05-08-2017, 05:05 PM   #15
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 9
Default

Quote:
Originally Posted by Brian Bushnell View Post
If you plan to normalize the reads after mapping but before assembly anyway, then yes, it's logical to just map with the normalized reads... but only if the reads were normalized together (which they probably weren't). [...]
Thanks for the reply. The reads were indeed normalized together, following your suggestion; it took about 5 days to go through 7 TB of data, which is faster than I expected. I still want to approximate the co-assembly strategy as closely as I can, even if I can't assemble the normalized reads directly.
confurious is offline   Reply With Quote
Old 05-08-2017, 06:52 PM   #16
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

In that case, mapping the co-normalized reads and assembling the unmapped portion should be fine. I recommend with this strategy that, if you have a pair in which only one read maps, you use both reads for assembly, since presumably at least one was not assembled, and the other may have mapped to something it did not actually come from. Of course, if that generates too much data to assemble, you can also just discard half-mapped pairs.
Brian Bushnell is offline   Reply With Quote
Old 05-11-2017, 11:01 AM   #17
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 9
Default

Hi Brian, I am looking for a tool to merge multiple assemblies together, but I have not had any luck yet. Are you aware of any such tool? Minimus2 would not be ideal because it merges only two sets, although I am not sure whether I could simply split my total concatenated file into two files to "fool" it. Thanks!
confurious is offline   Reply With Quote
Old 05-11-2017, 11:07 AM   #18
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Sorry, I'm not aware of such a tool. We used to use Dedupe followed by Minimus2 to merge multiple assemblies. I don't know the details of how Minimus2 was used, but if it is restricted to two files at a time, you could proceed iteratively:

assembly1 + assembly2 -> combined12
combined12 + assembly3 -> combined123
combined123 + assembly4 -> combined1234

...etc
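In bash, that iteration might be sketched as follows; the toAmos/Minimus2 invocation is the usual AMOS merging recipe as I recall it, so check the Minimus2 documentation, and all file names are illustrative (you could run dedupe.sh on each assembly or on the pooled contigs first, as described above):

Code:
combined=assembly1.fa
i=1
for next in assembly2.fa assembly3.fa assembly4.fa; do
    i=$((i+1))
    cat "$combined" "$next" > pair.fa
    rm -rf pair.afg pair.bnk                                # clear leftovers from the previous round
    toAmos -s pair.fa -o pair.afg                           # convert FASTA to an AMOS message file
    minimus2 pair -D REFCOUNT=$(grep -c '^>' "$combined")   # first REFCOUNT sequences act as the reference set
    cat pair.fasta pair.singletons.seq > combined$i.fa      # merged contigs plus unmerged leftovers
    combined=combined$i.fa
done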
Brian Bushnell is offline   Reply With Quote
Old 05-12-2017, 09:13 PM   #19
confurious
Junior Member
 
Location: california

Join Date: Apr 2017
Posts: 9
Default

Hi Brian, sorry to ask for your help again. I am trying to use your bbcountunique.sh, but I would like to change the kmer size to something much bigger than 31, say 127, because I think 31 is not enough in my case.

I examined the CalcUniqueness.java file and see the assert statement that caps k at 31. I tried to change it and recompile the class with javac, but for some reason it reports a lot of errors in the Java file, probably due to Windows-to-Linux conversion. However, I am not a Java guru, so I am stuck at this point. I could message you my email address if it's easier for you to quickly change that (assuming that's the only thing needed) and send me the updated CalcUniqueness.class file.

Thanks so much!
confurious is offline   Reply With Quote
Old 05-12-2017, 10:13 PM   #20
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Sorry, but the reason for the assertion is that the data structures fundamentally won't handle kmers longer than 31, since kmers are stored in 64-bit integers (at 2 bits per base, a 31-mer already takes 62 bits). It's possible to modify the code to allow K>31, because I now have some classes that support unlimited-length kmers, but that would take quite a lot of work. You can always disable assertions by adding the flag -da when the program runs: "bbcountunique.sh -da <other arguments>". But it won't give correct results for K>31.

Generally, I don't see a useful purpose for K>31 with that program: the longer K is, the more kmers contain errors, which inflates the apparent uniqueness, and K=31 should be sufficient for determining whether you have seen a read before. You won't saturate the K=31 kmer space for read-uniqueness plots with Illumina HiSeq machines; that would take 600 million terabases at 2x150bp.

Last edited by Brian Bushnell; 05-12-2017 at 10:45 PM.
Brian Bushnell is offline   Reply With Quote