04-24-2017, 11:41 AM   #6

Originally Posted by Brian Bushnell:
The difference between binning and normalization would be that binning seeks to divide the reads into different organisms prior to assembly, so they can be assembled independently, using less time and memory per job. Normalization simply attempts to reduce the coverage of high-depth organisms/sequences, but still keeps the dataset intact. With no high-depth component, normalization will basically do nothing (unless you configure it to throw away low-depth reads, which is BBNorm's default behavior), but binning should still do something.

Working with huge datasets is tough when you have compute time limitations. But, BBNorm should process data at ~20Mbp/s or so (with "passes=1 prefilter", on a 20-core machine), which would be around 1.7Tbp/day, so it should be possible to normalize or generate a kmer-depth histogram from a several-Tbp sample in 7 days...
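
The throughput arithmetic in the paragraph above can be checked with a short sketch. The only input is the quoted ~20 Mbp/s figure; the function and variable names are made up for illustration, and actual speed will depend on the machine and the flags used.

```python
# Back-of-the-envelope check of the throughput figures quoted above.
# Assumption: a steady ~20 Mbp/s, as quoted for "passes=1 prefilter"
# on a 20-core machine. Real runs may be faster or slower.

SECONDS_PER_DAY = 24 * 60 * 60


def bases_per_day(rate_bp_per_s):
    """Convert a processing rate in bases/second to bases/day."""
    return rate_bp_per_s * SECONDS_PER_DAY


rate = 20e6                      # ~20 Mbp/s (quoted figure)
per_day = bases_per_day(rate)    # bases processed in one day
per_week = 7 * per_day           # bases processed in a 7-day job limit

print(f"{per_day / 1e12:.2f} Tbp/day")         # 1.73 Tbp/day
print(f"{per_week / 1e12:.1f} Tbp in 7 days")  # 12.1 Tbp in 7 days
```

At that rate a 7-day limit would cover roughly 12 Tbp, which is where the "several-Tbp sample in 7 days" estimate comes from.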

But, another option is to assemble independently, deduplicate those assemblies together, then co-assemble only the reads that don't map to the deduplicated combined assembly. The results won't be as good as a full co-assembly, but it is more computationally tractable.
This is excellent advice. I have never tried read binning, but I am very familiar with the tetranucleotide signatures of genomes. Is such binning possible even at the read level? If so, could you please point me to the best software package to use?

I have indeed tried BBNorm, but it is significantly slower for me and uses far more memory than I anticipated. I have not, however, used "passes=1 prefilter", and that might be why. By the way, BBNorm says that memory should not be a hard cap for the program to run, but I am unsure how much memory I should request. The most I could get is probably 1.2 TB; could you please give me some advice on that?

I have attached the error message that I got after 4 days of running, even though I only specified 10 million reads (the whole dataset is 100 times larger). So it is far slower for me. Thanks so much!

java -ea -Xmx131841m -Xms131841m -cp /home/jiangch/software/bbmap/current/ jgi.KmerNormalize bits=32 in=mega.nonhuman.fastq interleaved=true threads=16 prefilter=t fixspikes=t target=50 out=faster.fastq prefiltersize=0.5 reads=10000000
Executing jgi.KmerNormalize [bits=32, in=mega.nonhuman.fastq, interleaved=true, threads=16, prefilter=t, fixspikes=t, target=50, out=faster.fastq, prefiltersize=0.5, reads=10000000]

BBNorm version 37.02
Set threads to 16

   ***********   Pass 1   **********   

threads:          	16
k:                	31
deterministic:    	true
toss error reads: 	false
passes:           	1
bits per cell:    	16
cells:            	24.16B
hashes:           	3
prefilter bits:   	2
prefilter cells:  	193.29B
prefilter hashes: 	2
base min quality: 	5
kmer min prob:    	0.5

target depth:     	200
min depth:        	3
max depth:        	250
min good kmers:   	15
depth percentile: 	64.8
ignore dupe kmers:	true
fix spikes:       	false

Made prefilter:   	hashes = 2   	 mem = 44.99 GB   	cells = 193.22B   	used = 47.819%
Made hash table:  	hashes = 3   	 mem = 44.96 GB   	cells = 24.14B   	used = 87.988%
Warning:  This table is very full, which may reduce accuracy.  Ideal load is under 60% used.
For better accuracy, use the 'prefilter' flag; run on a node with more memory; quality-trim or error-correct reads; or increase the values of the minprob flag to reduce spurious kmers.  In practice you should still get good normalization results even with loads over 90%, but the histogram and statistics will be off.

Estimated kmers of depth 1-3: 	51336579492
Estimated kmers of depth 4+ : 	11502226335
Estimated unique kmers:     	62838805828

Table creation time:		290885.545 seconds.
Writing interleaved.
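
For anyone reading the log above, here is a rough sketch of how the reported table sizes and run time relate to the settings. It assumes table memory is simply cells × bits per cell, and that the "GB" in the log means 2^30 bytes; both are assumptions that happen to reproduce the printed numbers, not something confirmed from the BBNorm source. The cell counts are the rounded values from the "Made ..." lines, and the function name is made up.

```python
# Rough check of the memory and time figures in the log above.
# Assumptions (not confirmed against BBNorm's code):
#   - table memory = cells * bits_per_cell / 8 bytes
#   - "GB" in the log means 2**30 bytes (GiB)
# Cell counts are the rounded values printed in the "Made ..." lines.

GIB = 2**30
MIB = 2**20


def table_gib(cells, bits_per_cell):
    """Size in GiB of a table with the given cell count and cell width."""
    return cells * bits_per_cell / 8 / GIB


main_gib = table_gib(24.14e9, 16)       # hash table: 16 bits per cell
prefilter_gib = table_gib(193.22e9, 2)  # prefilter: 2 bits per cell
heap_gib = 131841 * MIB / GIB           # -Xmx131841m (m = MiB in Java)
days = 290885.545 / 86400               # "Table creation time" in days

print(f"hash table: {main_gib:.2f} GiB")       # 44.96
print(f"prefilter:  {prefilter_gib:.2f} GiB")  # 44.99
print(f"heap limit: {heap_gib:.2f} GiB")       # 128.75
print(f"table creation: {days:.2f} days")      # 3.37
```

Under these assumptions the two tables together take about 90 GiB of the roughly 129 GiB heap, and table creation alone accounts for about 3.4 of the 4 days.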

Last edited by Brian Bushnell; 04-24-2017 at 11:59 AM.
confurious