SEQanswers

01-22-2015, 10:52 PM   #1
Brian Bushnell

Introducing BBNorm, a read normalization and error-correction tool

I'd like to introduce BBNorm, a member of the BBTools package.

BBNorm is a kmer-based normalization tool for NGS reads, which may also be used for error-correction and generating kmer-frequency plots. It is extremely fast and memory-efficient due to the use of atomic counters and probabilistic data structures.

First, what is normalization? Many assemblers perform poorly in the presence of too much data, or of data with irregular coverage, such as MDA-amplified single cells or metagenomes. And even if an assembler handles these datasets well (SPAdes, for example, does a very good job with single cells, though it still benefits from normalization in my tests), more data will increase the runtime and memory usage, potentially to the point that the data cannot be assembled.

Subsampling randomly discards some percentage of the reads to reduce the average coverage, and is computationally cheap (you can do it quickly with reformat). However, if you have one area with 10000x coverage and another with 10x coverage, subsampling to 1% will reduce the 10000x area to 100x - a good level for some assemblers - but the 10x area to 0.1x, which cannot be assembled. So with irregular coverage, subsampling is not ideal.

Normalization discards reads with a probability based on the local coverage. So, for example, if you normalize to a target of 40x coverage, reads in regions with 10x coverage will all be retained; in an 80x region they will be discarded with a 0.5 probability; and in a 10000x area they will be retained with a probability of only 0.004 (i.e., discarded with a 0.996 probability). As a result, after normalization, areas with coverage above the target will be reduced to the target, and areas below the target will be left untouched. This generally makes the data much easier to assemble and typically increases continuity (L50) by a substantial amount.
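
As a rough sketch of that decision rule (illustrative Python only, not BBNorm's actual code, which uses additional heuristics described later in this thread):

import random

def retention_probability(read_depth, target=40):
    # At or below the target, reads are always kept; above it, the
    # keep probability scales as target / depth.
    if read_depth <= target:
        return 1.0
    return target / read_depth

def keep_read(read_depth, target=40, mindepth=2):
    # Reads whose apparent depth is below mindepth are discarded
    # outright, since such low depth usually indicates errors.
    if read_depth < mindepth:
        return False
    return random.random() < retention_probability(read_depth, target)

# 10x -> always kept; 80x -> kept ~50% of the time; 10000x -> kept ~0.4% of the time.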

To normalize to 40x coverage with BBNorm, and discard reads with an apparent depth under 2x (which typically indicates the reads have errors):

bbnorm.sh in=reads.fq out=normalized.fq target=40 mindepth=2

Error-correction is also useful in many cases: when you have sufficient depth and wish to avoid false-positive variant calls, achieve a higher mapping rate, merge paired reads via overlap, assemble high-error-rate data with an assembler that is not very tolerant of errors, or reduce the memory usage of a de Bruijn assembler. Like normalization, error-correction is not universally a good idea - JGI does not normalize and error-correct all data prior to use, for example - but it is highly beneficial in many situations. Also, BBNorm only corrects substitution errors, not indels, since substitutions are the error mode that occurs in Illumina data. In other words, it will NOT error-correct PacBio or 454 data, which feature indel errors. BBNorm is currently being used for error-correction by the OLC assembler Omega.

To error-correct reads with BBNorm:

ecc.sh in=reads.fq out=corrected.fq

To error-correct and normalize at the same time, just add the flag "ecc" when running BBNorm.
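
For example, combining the flags shown above:

bbnorm.sh in=reads.fq out=normalized.fq target=40 mindepth=2 ecc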

Lastly, BBNorm can be used for producing kmer frequency histograms, and for binning reads based on coverage depth. The histograms are useful for determining things like the ploidy of an organism, the genome size and coverage, the heterozygosity rate, and the presence of contaminants (which typically have drastically different coverage than the genome of interest).

To generate a kmer frequency histogram:

khist.sh in=reads.fq hist=histogram.txt

You can also use the "hist" flag during normalization or error-correction to get the histogram of the input reads for free; "histout" will generate the histogram of the output reads, but costs an additional pass.
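
For example, something like this should write both histograms during a normalization run (the file names here are just placeholders):

bbnorm.sh in=reads.fq out=normalized.fq target=40 hist=khist_in.txt histout=khist_out.txt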

So, what is the difference between bbnorm.sh, ecc.sh, and khist.sh? They all call the same program, just with different default parameters (you can see the exact parameters by looking at the bottom of the shellscripts). If you specify all parameters manually, they are equivalent.

How does BBNorm work, and why is it better than other tools?

BBNorm counts kmers; by default, 31-mers. It reads the input once to count them, then reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with a probability based on the ratio of the desired coverage to the median of the counts of a read's kmers. For error-correction, it detects situations where adjacent kmers in a read have drastically different frequencies - for example, differing by a factor of 180 - and alters the offending base if doing so restores the kmer to a frequency similar to that of its neighbors.
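
A toy sketch of those two ideas (illustrative Python only; the real implementation stores counts in a count-min sketch and applies further heuristics to the decisions):

import statistics

K = 31  # default kmer length

def kmer_counts(read, count, k=K):
    # 'count' maps each kmer to its (approximate) dataset-wide frequency.
    return [count.get(read[i:i + k], 0) for i in range(len(read) - k + 1)]

def apparent_depth(read, count, k=K):
    # Median of the kmer counts; robust to a handful of error kmers.
    return statistics.median(kmer_counts(read, count, k))

def suspicious_positions(read, count, k=K, ratio=180):
    # Flag positions where the kmer frequency drops sharply relative to the
    # previous kmer, suggesting a substitution error nearby. The ratio of
    # 180 simply mirrors the example in the paragraph above.
    counts = kmer_counts(read, count, k)
    return [i for i in range(1, len(counts))
            if counts[i] * ratio < counts[i - 1]]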

BBNorm is memory-efficient because it does not explicitly store kmers - everything is kept in a probabilistic data structure called a count-min sketch. As a result, BBNorm will never run out of memory, slow down, or use disk, no matter how much data you have or how big the genome is. Rather, the accuracy declines as the table's loading increases - but because kmers are not explicitly stored, it can hold several times more of them than an explicit data structure (such as Google Sparse Hash). And for normalization, the reduction in accuracy at extremely high loading does not matter, because the median is used - so even if multiple kmers within a read have an incorrectly high count, they will not even be considered, and thus the results will not be affected at all. In practice, you should use all available memory even for a tiny genome with a small number of reads; and even for a huge genome with very high coverage, BBNorm will still work and produce good results quickly on a computer with limited memory.
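
For reference, a minimal count-min sketch looks roughly like this (a generic illustration of the data structure, not BBNorm's actual Java implementation, which uses atomic counters and packed cells):

import hashlib

class CountMinSketch:
    def __init__(self, width=1 << 20, depth=3):
        self.width = width                            # cells per row
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, kmer):
        for row in range(len(self.rows)):
            h = hashlib.blake2b(kmer.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(h[:8], "little") % self.width

    def add(self, kmer):
        for row, cell in self._cells(kmer):
            self.rows[row][cell] += 1

    def count(self, kmer):
        # Collisions can only inflate a cell, so the minimum over the rows
        # is an estimate that is never lower than the true count.
        return min(self.rows[row][cell] for row, cell in self._cells(kmer))

Because the kmers themselves are never stored, memory use is fixed by the table size; high loading only makes some counts overestimates, which the median-based depth estimate tolerates well.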

Speedwise, BBNorm is multithreaded in all stages, using atomic counters which do not require locking - this allows it to scale efficiently with processor core counts.

BBTools has another program functionally related to BBNorm, "kmercountexact.sh". It does NOT use probabilistic data structures, and it uses locking rather than atomic counters; as a result it may not scale as well, and it will run out of memory on large datasets. However, it is still extremely fast and memory-efficient, using ~15 bytes per kmer (with an optional count-min-sketch prefilter to remove low-count error kmers). It cannot normalize or error-correct, but it can generate the exact kmer count of a dataset as well as the exact kmer frequency histogram (and do rudimentary peak calling for genome size estimation). In practice, when I am interested in kmer frequency histograms, I use KmerCountExact for isolates and BBNorm for metagenomes.

BBNorm (and all BBTools) can be downloaded here:

http://sourceforge.net/projects/bbmap/

Edit: Note! Most programs in the BBMap package run in Java 6 or higher, but BBNorm requires Java 7 or higher.

01-23-2015, 01:42 AM   #2
sarvidsson

How does BBNorm compare to normalize_by_median from the khmer package? The implementation (apart from language and possibly better usage of processor cores) sounds very similar.

01-23-2015, 03:47 AM   #3
titusbrown

There are a number of similar tools now --

Digital normalization, http://ivory.idyll.org/blog/diginorm-paper-posted.html

Trinity's in silico read normalization, based on Jellyfish and custom Perl scripts: http://trinityrnaseq.sourceforge.net...alization.html

NeatFreq, written in Java (I think): http://www.biomedcentral.com/1471-2105/15/357/abstract

Mira also contains an implementation of a similar approach.

I'd love to see a comparison of the algorithms in use! I know what Trinity's approach does, but I haven't looked into NeatFreq, BBNorm, or Mira.

--titus

01-23-2015, 12:49 PM   #4
Brian Bushnell

Quote:
Originally Posted by sarvidsson
How does BBNorm compare to normalize_by_median from the khmer package? The implementation (apart from language and possibly better usage of processor cores) sounds very similar.
The implementation is a bit different in a couple of respects. Normalization can preferentially retain reads with errors, since they have a low apparent coverage; as a result, normalized data - particularly from single-cells - will often have a much higher error rate than the original data, even if low-depth reads are discarded. BBNorm, by default, uses 2-pass normalization which allows it - if there is sufficient initial depth - to preferentially discard low-quality reads, and still hit the target depth with a very narrow peak. So, if you look at the post-normalization kmer frequency histogram, BBNorm's output will have substantially fewer error kmers and a substantially narrower peak. This can be confirmed by mapping; the error rate in the resulting data is much lower.

I'm working on publishing BBNorm, which will have comparative benchmarks versus other normalization tools, but in my informal testing it's way faster and yields better assemblies than the two other normalizers I have tested. The specific way that the decision is made on whether or not to discard a read has a huge impact on the end results, as does the way in which pairs are handled, and exactly how kmers are counted, and how a kmer's frequency is used to estimate a read's depth. BBNorm has various heuristics in place for each of these that substantially improved assemblies compared to leaving the heuristic disabled; my earlier description of discarding a read or not based on the median frequency of the read's kmers is actually a gross oversimplification. Also, using error-correction in conjunction with normalization leads to different results, as it can make it easier to correctly determine the depth of a read.

I guess I would say the theory is similar, but the implementation is probably very different from other normalizers.

01-29-2015, 11:04 AM   #5
jazz710

Hi Brian,

I'm trying to do some normalization but I want to set my target coverage to 10X rather than 40X. Is there any way to change that in BBNorm? I tried target=10, but it still says 40X on the run description.

01-29-2015, 11:30 AM   #6
Brian Bushnell

By default, BBNorm runs 2 passes. The first pass normalizes to some depth higher than the ultimate desired depth, and the second pass normalizes to the target depth; this allows preferential discarding of low-quality reads during the first pass. So the result from the second pass should still hit your target of 10x.

You can instead set "passes=1" which will aim for the target on the first pass and not do a second pass. This is slightly faster but will typically yield data with more errors. Neither is universally better, though.

If you are going to target a depth of 10x, it's important to also reduce "mindepth" - by default it is 6, which is appropriate for 40x but not for 10x. Probably 2 would be better. Everything with apparent depth below that gets discarded.
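
For example, something like this should do it:

bbnorm.sh in=reads.fq out=normalized.fq target=10 mindepth=2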

02-04-2015, 01:21 PM   #7
damiankao

Hi Brian,

This tool looks great. Is there a way to accept multiple fastq.gz files as input? I want to run all my reads (multiple fastq.gz files) through bbnorm.

02-04-2015, 01:30 PM   #8
Brian Bushnell

At this point, BBNorm does not accept multiple input files (other than dual files for paired reads). You would have to concatenate them first:

cat a.fastq.gz b.fastq.gz > all.fq.gz

...which works fine for gzipped files. Most of my programs can accept piped input from stdin, but not BBNorm since it needs to read the files twice.

02-09-2015, 09:08 PM   #9
sathiyamurthi

Dear Brian Bushnell

Can BBNorm be used to normalize mate-pair libraries made with a Nextera kit (2k - 20k inserts) in order to reduce the input size?

02-09-2015, 09:28 PM   #10
Brian Bushnell

Yes, it can. By default (this can be changed), BBNorm discards pairs based on the depth of the lower-depth mate: if read 1 has high coverage and read 2 has low coverage, the pair will not be discarded. If both mates are high depth, the pair will be discarded.
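
In sketch form (same caveat as the earlier snippets - illustrative Python, not the actual implementation):

import random

def keep_pair(depth1, depth2, target=40):
    # The pair is judged by the lower of the two apparent depths, so it
    # only becomes a candidate for discarding when BOTH mates look deep.
    pair_depth = min(depth1, depth2)
    if pair_depth <= target:
        return True
    return random.random() < target / pair_depth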

02-09-2015, 09:48 PM   #11
sathiyamurthi

Dear Brian Bushnell

Thank you for your valuable response and tool - your tools reduced my time by 80%.

I have a few more doubts; please write your suggestions.

If the libraries come from different platforms (HiSeq, MiSeq, and NextSeq) or have different insert sizes (2k, 4k, 8k, ...):

1) Which is the best way to normalize - pool everything together and normalize, or normalize per sequencing platform?

Another issue: after pre-processing, the read lengths will vary according to sequencing artifacts.

2) So is it better to normalize before or after pre-processing?

3) If I want to use only 40x out of 120x coverage of the given genome (estimated size: 1.2 Gbp), should the normalized data be <= (40 * 1.2 Gbp), or will BBNorm give more than that?

4) Can it be used on RNA-seq libraries before performing de novo assembly? Will it affect isoform detection, or is there a chance of missing transcripts?

Thank you

02-19-2015, 11:30 AM   #12
Brian Bushnell

Sorry, I somehow missed your post!

1) This is kind of tricky. Typically, though, I would recommend normalizing data independently if it is different (such as different insert size) since it has a different use, and you don't want it all mixed together anyway. If it is the same type - for example, 2x150bp reads with short inserts - then I would normalize it all together regardless of whether it came from a different platform or library, because it will all be used the same way.

2) I recommend pre-processing (adapter trimming, contaminant removal, quality-trimming or filtering) prior to normalization, because those processes all remove spurious kmers that make it harder to determine read depth, and thus improve the normalization results.

3) If you target 40x coverage for a 1.2 Gbp genome, BBNorm should output approximately 40 * 1.2 Gbp of data. Normally it will go a little bit over, to try to ensure that everywhere has at least 40x.

4) Normalizing RNA-seq data can certainly be done prior to assembly. But if you have 2 isoforms of a gene - one that uses exons 1, 2, and 3, and one that only uses exons 1 and 3, and one of them is expressed 100x more highly than the other, then after normalization, the less-expressed isoform may not get assembled, only the more abundant one. So there are definite disadvantages. But, it's worth trying if you get a bad assembly from the raw data.

02-22-2015, 11:43 PM   #13
sathiyamurthi

Dear Brian Bushnell

Thank You !!!

Could you please point me to the article that explains the BBNorm methodology in detail, both for a complete understanding and for citation?

02-23-2015, 12:13 AM   #14
Brian Bushnell

I am currently collaborating with another person on writing BBNorm's paper and we plan to submit it in March. I will post here once it gets accepted.

03-10-2015, 01:27 PM   #15
dcard

Optimal depth for read error correcting

Hi Brian and others,

I am wondering what depth you need and what depth is optimal (if the two differ) for proper read error correcting using BBMap or any other error correcting program. The Quake website mentioned >15x coverage but a quick round of Googling hasn't given me much more than that.

The reason I ask is that I have a couple of lanes of MiSeq data (600-cycle PE sequencing), which each total about 3x coverage of my genome. Therefore, kmer-based error correction wouldn't work well, even if I were to concatenate the two together. We do have an additional HiSeq lane (100bp PE) and a few GAII lanes (so 50-60x coverage total), so we have the option of concatenating all of the datasets together (though one GAII lane isn't paired-end). However, we would then have to separate the individual lanes back out, since we next plan to merge the MiSeq reads to create longer, 454-like reads.

Therefore, my second question is: what workflow would be best to accomplish this task? Are there settings in ecc.sh or the like that would allow decent error correction with low coverage? Or, alternatively, is there an easy way of separating data from different lanes if we were to concatenate a bunch together to get the coverage necessary to correct confidently? Thanks in advance for the help.

03-10-2015, 01:48 PM   #16
Brian Bushnell

At default settings, BBNorm will correct using kmers that occur a minimum of 22 times (the echighthresh flag). Whether 50x is sufficient to do a good job depends on things like the ploidy of the organism, how heterozygous it is, and the uniformity of your coverage. 50x of fairly uniform coverage is plenty for a haploid organism, or for a diploid organism like human with a low het rate, but not for a wild tetraploid plant. You can always reduce the setting, of course, but I would not put it below ~16x typically. You can't get good error-correction with 3x depth no matter what you do. Bear in mind that the longer the kmer is compared to the read length, the lower the kmer depth will be compared to the read depth.
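
To put a rough number on that last point, kmer depth scales with read depth as (read_length - k + 1) / read_length (a standard back-of-the-envelope relation, not anything BBNorm-specific):

def kmer_depth(read_depth, read_length, k=31):
    # Each read of length L contributes L - k + 1 kmers, so kmer depth
    # is lower than read depth by that ratio.
    return read_depth * (read_length - k + 1) / read_length

print(kmer_depth(50, 100))  # ~35x kmer depth from 50x of 100 bp reads
print(kmer_depth(50, 250))  # ~44x kmer depth from 50x of 250 bp reads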

To deal with multiple different data sources, you can run BBNorm with the "extra" flag to use additional data to build kmer frequency tables but not as output, like this:

ecc.sh in=miseq.fq out=miseq_ecc.fq extra=gaII_1.fq,gaII_2.fq,gaII_3.fq

That would take extra processing time, since all the data would have to be reprocessed for every output file you generate. Alternatively, you can do this:

bbrename.sh miseq.fq addprefix prefix=miseq
bbrename.sh gaII_1.fq addprefix prefix=gaII_1

...etc

Then cat all the files together, and error-correct them:
ecc.sh in=combined.fq out=ecc.fq ordered int=f

Then demultiplex:
demuxbyname.sh in=ecc.fq out=demuxed_%.fq names=miseq,gaII_1 int=f
(note the % symbol; it will be replaced by a name)

That will keep the read order the same. So, if all the reads are initially either single-ended or interleaved (i.e. one file per library), pairs will be kept together, and you can de-interleave them afterward if you want. You can convert between 2-file and interleaved formats with reformat.sh.

03-10-2015, 07:18 PM   #17
inkang

Dear Brian,

Is it possible to use BBNorm to normalize filtered subreads from PacBio sequencing?

BBNorm seems to work well for my Illumina MiSeq data.
But, when I tried it for PacBio filtered subreads (fastq format), the resulting output file was empty.

Actually, I'm working on PacBio sequencing data from MDA-amplified single cell genomes. When I assembled the data using HGAP (SMRT Analysis), the results were not good (too many rather short contigs). Some of my colleagues told me that uneven coverage might be the cause of bad assembly. So, I've been trying to normalize raw data.

Thanks.

03-10-2015, 07:45 PM   #18
Brian Bushnell

BBNorm cannot be used on raw PacBio data, as the error rate is too high; it throws everything away because the apparent depth is always 1, since virtually all the kmers are unique. Normalization uses 31-mers (by default), which requires that, on average, the error rate be below roughly 1/40 (so that a large fraction of kmers are error-free). However, raw PacBio data has on average a 1/7 error rate, so most programs that use long kmers will not work on it at all. BBMap, on the other hand, uses short kmers (typically 9-13) and can process PacBio data, but it does not do normalization - a longer kmer is needed for that.
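
For a sense of why the error rate matters so much with 31-mers, here is the simple arithmetic (assuming independent errors; the 1/40 and 1/7 figures are the ones quoted above):

def error_free_kmer_fraction(error_rate, k=31):
    # Probability that all k bases of a kmer are correct.
    return (1 - error_rate) ** k

print(error_free_kmer_fraction(1 / 40))  # ~0.46: plenty of usable kmers
print(error_free_kmer_fraction(1 / 7))   # ~0.008: nearly every kmer looks unique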

PacBio CCS or "Reads of Insert" that are self-corrected, with multiple passes to drop the error rate below 3% or so, could be normalized by BBNorm. So, if you intentionally fragment your reads to around 3kbp or less, but run long movies, then self-correct, normalization should work.

PacBio data has a very flat coverage distribution, which is great, and means that typically it does not need normalization. But MDA'd single cells have highly variable coverage regardless of the platform, and approaches like HGAP to correct by consensus of multiple reads covering the same locus will not work anywhere that has very low coverage. I think your best bet is really to shear to a smaller fragment size, self-correct to generate "Reads of Insert", and use those to assemble. I doubt normalization will give you a better assembly with error-corrected single-cell PacBio data, but if it did, you would have to use custom parameters to not throw away low-coverage data (namely, "mindepth=0 lowthresh=0"), since a lot of the single-cell contigs have very low coverage. BBNorm (and, I imagine, all other normalizers) have defaults set for Illumina reads.

09-05-2015, 12:48 AM   #19
suzumar

Coverage Binning

Hi Brian
I am trying to assemble a genome from single-cell amplification using SPAdes, and subsequently using tetramer + coverage analysis (i.e. CONCOCT) to remove "contaminating reads" which seem to have been introduced during sequencing. If I understood correctly, BBNorm will keep enough information to allow coverage to be used subsequently to bin contigs. Is that the case?

I also would like to know your opinion on how bbnorm compares to the strategy described in http://www.sciencemag.org/cgi/conten.../6047/1296/DC1

I quote:

"High-depth k-mers,presumably derived from MDA amplification bias, cause problems in the assembly, especially if the k-mer depth varies in orders of magnitude for different regions of the genome. We removed reads representing high-abundance k-mers (>64x k-mer depth) and
trimmed reads that contain unique k-mers. The results of the k-mer based coverage normalization are shown in table S8."

09-05-2015, 06:17 AM   #20
titusbrown

Hi suzumar, if you apply any kind of coverage normalization (whether with khmer or bbnorm) prior to assembly, you won't be able to use those reads to calculate the coverage of the contigs. However, you can go back and use the unnormalized reads to calculate the contig coverage.

Re the Science paper, I don't know what they actually did; if you can find the software and/or detailed instructions in their methods, I'd be happy to comment.

Tags
assembly, bbnorm, bbtools, error correction, kmer frequency histogram, kmer genome size, kmercountexact, kmerfreq, metagenome assembly, normalization, quality control, single cell, soap, spades, subsample, velvet