01-23-2015, 12:49 PM   #4
Brian Bushnell
Location: Walnut Creek, CA
Quote:
Originally Posted by sarvidsson View Post
How does BBNorm compare to normalize_by_median from the khmer package? The implementation (apart from language and possibly better usage of processor cores) sounds very similar.
The implementation is a bit different in a couple of respects. Normalization can preferentially retain reads with errors, since they have a low apparent coverage; as a result, normalized data, particularly from single cells, will often have a much higher error rate than the original data, even if low-depth reads are discarded. BBNorm, by default, uses 2-pass normalization, which allows it, if there is sufficient initial depth, to preferentially discard low-quality reads and still hit the target depth with a very narrow peak. So if you look at the post-normalization kmer frequency histogram, BBNorm's output will have substantially fewer error kmers and a substantially narrower peak. This can be confirmed by mapping; the error rate in the resulting data is much lower.
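To illustrate why reads containing errors look low-coverage, here is a minimal Python sketch (toy data and function names of my own; this is not BBNorm's actual code, which uses a compact multithreaded count structure rather than a dictionary):

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy dataset: 20 identical error-free reads plus one read carrying a
# single substitution. A real tool streams billions of kmers into a
# compact count structure instead of a Counter.
reads = ["ACGTACGTAC"] * 20 + ["ACGTACGAAC"]
k = 5
table = Counter(km for r in reads for km in kmers(r, k))

def estimated_depth(read, k, table):
    # Median kmer frequency serves as the read's apparent depth. Kmers
    # spanning a sequencing error are nearly unique in the dataset, so a
    # single error drags the median (and the apparent depth) down.
    return median(table[km] for km in kmers(read, k))

print(estimated_depth("ACGTACGTAC", k, table))  # error-free read: high depth
print(estimated_depth("ACGTACGAAC", k, table))  # same read, one error: much lower
```

Because the error-containing read's apparent depth is dragged down, a naive single-pass normalizer keeps it preferentially, which is exactly the enrichment of errors described above.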

I'm working on publishing BBNorm, which will have comparative benchmarks versus other normalization tools, but in my informal testing it's way faster and yields better assemblies than the two other normalizers I have tested. The specific way the decision is made about whether to discard a read has a huge impact on the end results, as do the way pairs are handled, exactly how kmers are counted, and how a kmer's frequency is used to estimate a read's depth. BBNorm has various heuristics in place for each of these that substantially improved assemblies compared to leaving the heuristic disabled; my earlier description of discarding a read or not based on the median frequency of the read's kmers is actually a gross oversimplification. Also, using error-correction in conjunction with normalization leads to different results, as it can make it easier to correctly determine the depth of a read.
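The median-depth discard decision, in the oversimplified single-pass form mentioned above, can be sketched like this (parameter names are hypothetical; BBNorm's actual logic layers the pair-handling and quality heuristics described above on top of something like this):

```python
import random

def keep_read(depth, target=40, mindepth=5):
    # Reads with very low apparent depth are dominated by errors or rare
    # contaminants; a normalizer may drop them or divert them to a "low" bin.
    if depth < mindepth:
        return False
    # Otherwise keep with probability target/depth: a 400x region is
    # downsampled ~10x toward the target, while a 40x region is untouched.
    return random.random() < min(1.0, target / depth)
```

A two-pass scheme can afford to be stricter on the first pass, discarding the lowest-quality reads while depth is still abundant, and then normalize the survivors to the target on the second pass.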

I guess I would say the theory is similar, but the implementation is probably very different from other normalizers.