Old 07-06-2014, 08:20 PM   #5
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707


There are two passes because, if no precautions are taken, normalization can selectively enrich the dataset for reads with errors, since reads with errors appear to contain rare kmers. To combat this, the first pass normalizes down to a higher-than-requested level; specifically, reads that appear error-free are normalized to the high level (140 in this case), and reads that appear to contain errors are normalized to the low level (35 in this case).

So, after the first pass all reads will still have minimum depth 35 (so nothing was lost), but the new dataset will be selectively enriched for error-free reads. The second pass normalizes the remaining reads to the final target regardless of whether errors are suspected.
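The per-read decision described above can be sketched roughly as follows. This is a minimal illustration, not BBNorm's actual implementation: the per-read depth estimate and the "looks error-free" heuristic are assumed to come from a kmer-counting step, and the target values are the ones from this example.

```python
import random

# Illustrative sketch only -- not BBNorm's actual code.
HIGH_TARGET = 140   # pass-1 target for reads that look error-free
LOW_TARGET = 35     # pass-1 target for reads that look error-containing
FINAL_TARGET = 35   # pass-2 target for all remaining reads

def keep_probability(read_depth, target):
    """Probability of keeping a read so expected depth drops to target."""
    if read_depth <= target:
        return 1.0
    return target / read_depth

def pass1_keep(read_depth, looks_error_free):
    """Pass 1: error-free-looking reads get the lenient target."""
    target = HIGH_TARGET if looks_error_free else LOW_TARGET
    return random.random() < keep_probability(read_depth, target)

def pass2_keep(read_depth):
    """Pass 2: everything goes to the final target, errors or not."""
    return random.random() < keep_probability(read_depth, FINAL_TARGET)
```

Because pass 1 keeps error-free-looking reads at the 140x rate but error-containing reads only at 35x, the surviving pool is enriched for clean reads before pass 2 flattens it to 35x.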

It's not really possible to do this in a single pass because if (for example) half your reads contain errors, and error-free reads are randomly discarded at the target rate but error-containing reads are discarded at a higher rate, you will ultimately achieve only half of the desired final coverage.
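Here is that shortfall as a toy calculation (the 200x raw depth and 50% error fraction are assumed numbers, chosen only to make the arithmetic concrete):

```python
# Toy numbers for illustration -- not measured values.
raw = 200.0        # raw depth at some reference position
err_frac = 0.5     # half the reads covering it contain errors
target = 35.0      # requested final depth

# Single pass: every read is sampled at the overall target rate,
# so error-free reads survive with probability target/raw.
keep_rate = target / raw
good_cov = raw * (1 - err_frac) * keep_rate   # depth from clean reads

# Error-containing reads are discarded at a much higher rate;
# in the worst case essentially all of them are dropped.
err_cov = 0.0

final = good_cov + err_cov
print(final)   # 17.5 -- only half of the requested 35x
```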

You can set "passes=1" when running normalization, and look at the kmer frequency histogram afterward with "histout=x.txt". The histogram from a run with 2-pass normalization will have far fewer error kmers, which is obvious from a quick visual examination of the graph.
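One way to make that comparison quantitative rather than visual is to tally the low-depth tail of each histogram. This sketch assumes the histogram is tab-separated "depth count" lines with "#"-prefixed header lines (check your actual histout file's format), and the depth cutoff for calling a kmer an error kmer is an arbitrary choice:

```python
def low_depth_fraction(lines, cutoff=4):
    """Fraction of kmers with depth below `cutoff` -- a rough proxy
    for error kmers. `lines` is an iterable of 'depth<TAB>count'
    strings; lines starting with '#' are treated as comments."""
    low = total = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        depth, count = map(int, line.split()[:2])
        total += count
        if depth < cutoff:
            low += count
    return low / total if total else 0.0
```

For example, `low_depth_fraction(open("x.txt"))` run on the 1-pass and 2-pass histograms should show a clearly smaller fraction for the 2-pass run.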

Normalization is not useful for mapping in most circumstances. The only advantage it would convey is a decrease in runtime by reducing the size of the input, which is useful if you are using a huge reference (like nt) or a slow algorithm (like BLAST), but I don't use it for that. Error-correction (done by the same program as normalization, run in error-correction mode instead of normalization mode) may be useful before mapping, though - it will increase mapping rates, but always with the possibility that the ultimate output may be slightly altered. So, I would not error-correct data before mapping, either, except when using a mapping algorithm that is very intolerant of errors.

I designed BBNorm as a preprocessing step for assembly. Normalizing and/or error-correcting data with highly uneven coverage - single cell amplified data and metagenomes (and possibly transcriptomes, though I have not tried it) - can yield a much better assembly, particularly with assemblers that are designed for isolates with a fairly flat distribution, but even on assemblers designed for single-cell or metagenomic assembly. Also, normalization can allow assembly of datasets that would otherwise run out of memory or take too long. Depending on the circumstance, it often yields better assemblies with isolates too, but not always.

I'm sure there are other places where BBNorm can be useful, but I don't recommend it as a general-purpose preprocessing step prior to any analysis - just for cases where you can achieve better results by reducing data volume, flattening the coverage, or reducing the error rate.

Oh... and as for the error reads/pairs/types, that shows a summary of the input data quality, and fractions of reads or pairs that appear to contain errors. The "types" indicate which heuristic classified the read as appearing to contain an error; that's really there for testing and I may remove it.