#1
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
I'd like to introduce BBNorm, a member of the BBTools package.
BBNorm is a kmer-based normalization tool for NGS reads, which may also be used for error correction and for generating kmer-frequency plots. It is extremely fast and memory-efficient due to the use of atomic counters and probabilistic data structures.

First, what is normalization? Many assemblers perform poorly in the presence of too much data, or of data with irregular coverage, such as MDA-amplified single cells or metagenomes. And even if an assembler performs well with these datasets (SPAdes, for example, does a very good job with single cells, though it still benefits from normalization in my tests), more data will increase the runtime and memory usage, potentially to the point that the data cannot be assembled.

Subsampling randomly discards some percentage of the reads to reduce the average coverage, and is computationally cheap (you can do this quickly with reformat.sh). However, if you have one area with 10000x coverage and another area with 10x coverage, subsampling to 1% will reduce the 10000x area to 100x - a good level for some assemblers - but the 10x area to 0.1x, which cannot be assembled. So with irregular coverage, subsampling is not ideal.

Normalization instead discards reads with a probability based on the local coverage. For example, if you normalize to a target of 40x coverage, reads in regions with 10x coverage will all be retained; in an 80x region they will be discarded with a probability of 0.5; and in a 10000x area they will be retained with a probability of only 0.004 (40/10000). As a result, after normalization, areas with coverage above the target will be reduced to the target, and areas below the target will be left untouched. This generally makes the data much easier to assemble and typically increases continuity (L50) by a substantial amount.

To normalize to 40x coverage with BBNorm, and discard reads with an apparent depth under 2x (which typically indicates that the reads have errors):

bbnorm.sh in=reads.fq out=normalized.fq target=40 mindepth=2

Error correction is also useful in many cases: when you have sufficient depth and wish to avoid false-positive variant calls, achieve a higher mapping rate, merge paired reads via overlap, assemble high-error-rate data with an assembler that is not very tolerant of errors, or reduce the memory usage of a de Bruijn assembler. Like normalization, error correction is not universally a good idea - JGI does not normalize and error-correct all data prior to use, for example - but it is highly beneficial in many situations. Also, BBNorm only corrects substitution errors, not indels, since that is the error mode that occurs in Illumina data. In other words, it will NOT error-correct PacBio or 454 data, which feature indel errors. BBNorm is currently being used for error correction by the OLC assembler Omega.

To error-correct reads with BBNorm:

ecc.sh in=reads.fq out=corrected.fq

To error-correct and normalize at the same time, just add the flag "ecc" when running BBNorm.

Lastly, BBNorm can be used for producing kmer-frequency histograms and for binning reads based on coverage depth. The histograms are useful for determining things like the ploidy of an organism, the genome size and coverage, the heterozygosity rate, and the presence of contaminants (which typically have drastically different coverage than the genome of interest).
To generate a kmer-frequency histogram:

khist.sh in=reads.fq hist=histogram.txt

You can also use the "hist" flag during normalization or error correction to get the histogram of the input reads for free; "histout" will generate the histogram of the output reads, but costs an additional pass.

So, what is the difference between bbnorm.sh, ecc.sh, and khist.sh? They all call the same program, just with different default parameters (you can see the exact parameters by looking at the bottom of the shellscripts). If you specify all parameters manually, they are equivalent.

How does BBNorm work, and why is it better than other tools? BBNorm counts kmers - by default, 31-mers. It reads the input once to count them, then reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with a probability based on the ratio of the desired coverage to the median of the counts of a read's kmers. For error correction, it detects situations where adjacent kmers in a read have drastically different frequencies - for example, differing by a factor of 180 - and the offending base is altered to a different base if doing so restores the kmer to a frequency similar to that of its neighbors.

BBNorm is memory-efficient because it does not explicitly store kmers - everything lives in a probabilistic data structure called a count-min sketch. As a result, BBNorm will never run out of memory, slow down, or use disk, no matter how much data you have or how big the genome is. Rather, the accuracy will decline as the table's loading increases - but because kmers are not explicitly stored, it can hold several times more of them than an explicit data structure (such as Google Sparse Hash). And for normalization, the reduction in accuracy at extremely high loading does not matter, because the median is used - so even if multiple kmers within a read have an incorrectly high count, they will not even be considered, and thus the results will not be affected at all. In practice, you should use all available memory even for a tiny genome with a small number of reads; but even for a huge genome with very high coverage, BBNorm will still work, and produce good results quickly on a computer with limited memory. Speed-wise, BBNorm is multithreaded in all stages, using atomic counters which do not require locking; this allows it to scale efficiently with processor core count.

BBTools has another program functionally related to BBNorm, kmercountexact.sh. It does NOT use probabilistic data structures, and it uses locking rather than atomic counters, so it may not scale as well and will run out of memory on large datasets. However, it is still extremely fast and memory-efficient - using ~15 bytes per kmer (with an optional count-min-sketch prefilter to remove low-count error kmers). It cannot normalize or error-correct, but it can generate the exact kmer count of a dataset as well as the exact kmer-frequency histogram (and do rudimentary peak calling for genome size estimation). In practice, when I am interested in kmer-frequency histograms, I use KmerCountExact for isolates and BBNorm for metagenomes.

BBNorm (and all BBTools) can be downloaded here: http://sourceforge.net/projects/bbmap/

Edit: Note! Most programs in the BBMap package run in Java 6 or higher, but BBNorm requires Java 7 or higher.

Last edited by Brian Bushnell; 01-29-2015 at 09:47 AM.
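As a consolidated sketch of the flags described in this post (file names are placeholders, and the boolean flag is written here in its explicit ecc=t form), a single run that normalizes to 40x, error-corrects, and writes kmer histograms of both the input and output reads might look like:

bbnorm.sh in=reads.fq out=normalized_ecc.fq target=40 mindepth=2 ecc=t hist=khist_in.txt histout=khist_out.txt

As noted above, hist= is essentially free, while histout= costs an additional pass.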
#2
Senior Member
Location: Berlin, Germany Join Date: Jan 2015
Posts: 137
How does BBNorm compare to normalize_by_median from the khmer package? The implementation (apart from the language and possibly better use of processor cores) sounds very similar.
#3
Junior Member
Location: Midwest Join Date: Aug 2013
Posts: 8
There are a number of similar tools now --
Digital normalization: http://ivory.idyll.org/blog/diginorm-paper-posted.html

Trinity's in silico read normalization, based on Jellyfish and custom Perl scripts: http://trinityrnaseq.sourceforge.net...alization.html

NeatFreq, written in Java (I think): http://www.biomedcentral.com/1471-2105/15/357/abstract

Mira also contains an implementation of a similar approach.

I'd love to see a comparison of the algorithms in use! I know what Trinity's approach does, but I haven't looked into NeatFreq, BBNorm, or Mira.

--titus
#4
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
I'm working on publishing BBNorm, which will include comparative benchmarks against other normalization tools, but in my informal testing it is much faster and yields better assemblies than the two other normalizers I have tested.

The specific way the decision is made on whether or not to discard a read has a huge impact on the end results, as do the way pairs are handled, exactly how kmers are counted, and how a kmer's frequency is used to estimate a read's depth. BBNorm has various heuristics in place for each of these that substantially improved assemblies compared to leaving the heuristic disabled; my earlier description of discarding a read or not based on the median frequency of the read's kmers is actually a gross oversimplification. Also, using error correction in conjunction with normalization leads to different results, as it can make it easier to correctly determine the depth of a read.

I guess I would say that the theory is similar, but the implementation is probably very different from that of other normalizers.
#5
Member
Location: Iowa Join Date: Oct 2012
Posts: 41
Hi Brian,
I'm trying to do some normalization but I want to set my target coverage to 10X rather than 40X. Is there any way to change that in BBNorm? I tried target=10, but it still says 40X on the run description.
#6
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
By default, BBNorm runs 2 passes. The first pass normalizes to some depth higher than the ultimate target, and the second pass normalizes to the target depth. This allows low-quality reads to be preferentially discarded in the first pass, so the result of the second pass should still hit your 10x target.

You can instead set "passes=1", which will aim for the target on the first pass and not do a second pass. This is slightly faster but will typically yield data with more errors; neither is universally better, though. If you are going to target a depth of 10x, it's also important to reduce "mindepth" - by default it is 6, which is appropriate for 40x but not for 10x; 2 would probably be better. Everything with apparent depth below mindepth gets discarded.
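For example, a minimal sketch of the single-pass approach using the flags discussed above (file names are placeholders):

bbnorm.sh in=reads.fq out=normalized_10x.fq target=10 mindepth=2 passes=1

Drop passes=1 to keep the default two-pass behavior; either way, lowering mindepth to 2 keeps reads from lower-coverage regions from being discarded outright.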
#7
Member
Location: UK Join Date: Jan 2010
Posts: 49
Hi Brian,
This tool looks great. Is there a way to give it multiple fastq.gz files as input? I want to run all my reads (multiple fastq.gz files) through bbnorm.
#8
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
At this point, BBNorm does not accept multiple input files (other than dual files for paired reads). You would have to concatenate them first:
cat a.fastq.gz b.fastq.gz > all.fq.gz

...which works fine for gzipped files (concatenated gzip streams are still valid gzip). Most of my programs can accept piped input from stdin, but not BBNorm, since it needs to read the files twice.
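If the libraries are paired and stored in dual files, one way to keep pairs intact is to concatenate the R1 and R2 files separately and then use the dual-file parameters; a sketch (file names are placeholders; in2=/out2= are the usual BBTools dual-file flags):

cat a_R1.fastq.gz b_R1.fastq.gz > all_R1.fq.gz
cat a_R2.fastq.gz b_R2.fastq.gz > all_R2.fq.gz
bbnorm.sh in=all_R1.fq.gz in2=all_R2.fq.gz out=normalized_R1.fq.gz out2=normalized_R2.fq.gz target=40

Alternatively, interleaving the pairs into a single file (reformat.sh can convert between 2-file and interleaved formats) also keeps pairs together after concatenation.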
#9
Junior Member
Location: south korea Join Date: Jun 2011
Posts: 3
Dear Brian Bushnell,

Can BBNorm be used to normalize mate-pair libraries made with the Nextera kit (insert sizes of roughly 2 kb - 20 kb) in order to reduce the input size?
#10
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Yes, it can. BBNorm will (by default; this can be changed) decide whether to discard a pair based on the depth of the lower-depth mate, so if read 1 has high coverage and read 2 has low coverage, the pair will not be discarded. If both mates have high depth, the pair will be discarded.
#11
Junior Member
Location: south korea Join Date: Jun 2011
Posts: 3
Dear Brian Bushnell,

Thank you for your valuable response and for the tool - your tools have cut my processing time by about 80%. I have a few more questions; please share your suggestions.

If the libraries come from different platforms (HiSeq, MiSeq, and NextSeq) or have different insert sizes (2 kb, 4 kb, 8 kb, ...), what is the best way to normalize?

1) Should I pool everything together and normalize, or normalize per sequencing platform?

2) Pre-processing changes the read lengths because of sequencing artifacts. Is it better to normalize before or after pre-processing?

3) If I want to keep only 40X out of 120X for the given genome (estimated size: 1.2 Gbp), should the normalized data be <= 40 * 1.2 Gbp, or will BBNorm output more than that?

4) Can I use it on RNA-seq libraries before de novo assembly? Will it affect isoform detection or risk missing transcripts?

Thank you
#12
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Sorry, I somehow missed your post!
1) This is kind of tricky. Typically, though, I would recommend normalizing data independently if it is different (such as a different insert size), since it has a different use and you don't want it all mixed together anyway (see the sketch below). If it is the same type - for example, 2x150bp reads with short inserts - then I would normalize it all together, regardless of whether it came from a different platform or library, because it will all be used the same way.

2) I recommend pre-processing (adapter trimming, contaminant removal, quality-trimming or filtering) prior to normalization, because those processes all remove spurious kmers that make it harder to determine read depth, and thus improve the normalization results.

3) If you target 40x coverage for a 1.2 Gbp genome, BBNorm should output approximately 40 * 1.2 Gbp of data. Normally it will go a little bit over, to try to ensure that everywhere has at least 40x.

4) Normalizing RNA-seq data can certainly be done prior to assembly. But if you have 2 isoforms of a gene - one that uses exons 1, 2, and 3, and one that uses only exons 1 and 3 - and one of them is expressed 100x more highly than the other, then after normalization the less-expressed isoform may not get assembled, only the more abundant one. So there are definite disadvantages. But it's worth trying if you get a bad assembly from the raw data.
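A sketch of the "normalize each library independently" case from 1) above (hypothetical file names; one run per insert-size library):

bbnorm.sh in=lib_2kb.fq.gz out=lib_2kb_norm.fq.gz target=40
bbnorm.sh in=lib_8kb.fq.gz out=lib_8kb_norm.fq.gz target=40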
#13
Junior Member
Location: south korea Join Date: Jun 2011
Posts: 3
Dear Brian Bushnell,

Thank you! Could you please point me to an article that explains BBNorm's methodology in detail, both for a complete understanding and so that I can cite it?
#14
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
I am currently collaborating with another person on writing BBNorm's paper and we plan to submit it in March. I will post here once it gets accepted.
#15
Junior Member
Location: Arlington, TX Join Date: Mar 2013
Posts: 5
Hi Brian and others,
I am wondering what depth you need, and what depth is optimal (if the two differ), for proper read error correction using BBMap or any other error-correcting program. The Quake website mentions >15x coverage, but a quick round of Googling hasn't given me much more than that.

The reason I ask is that I have a couple of lanes of MiSeq data (600-cycle PE sequencing), which individually total about 3x coverage of my genome each. Therefore, a kmer-based error correction wouldn't work well, even if I were to concatenate the two together. We do have an additional HiSeq lane (100bp PE) and a few GAII lanes (so 50-60x coverage total), so we have the option of concatenating all of the datasets together (though one GAII lane isn't paired-end). However, we would then have to separate the individual lanes back out, since we next plan to merge the MiSeq reads to create longer, 454-like reads.

Therefore, my second question is about what workflow would be best to accomplish this task. Are there settings in ecc.sh or the like that would allow decent error correction with low coverage? Or alternatively, is there an easy way of separating data from different lanes if we were to concatenate a bunch together to get the coverage necessary to confidently correct?

Thanks in advance for the help.
#16
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
At default settings, BBNorm will correct using kmers that occur a minimum of 22 times (the echighthresh flag). Whether 50x is sufficient to do a good job depends on things like the ploidy of the organism, how heterozygous it is, and the uniformity of your coverage. 50x of fairly uniform coverage is plenty for a haploid organism, or a diploid organism like human with a low het rate, but not for a wild tetraploid plant. You can always reduce the setting, of course, but I would typically not put it below ~16x. You can't get good error correction with 3x depth no matter what you do. Bear in mind that the longer the kmer is relative to the read length, the lower the kmer depth will be compared to the read depth.
To deal with multiple different data sources, you can run BBNorm with the "extra" flag to use additional data to build the kmer frequency tables, but not include it in the output, like this:

ecc.sh in=miseq.fq out=miseq_ecc.fq extra=gaII_1.fq,gaII_2.fq,gaII_3.fq

That would take extra processing time, since all the data would have to be reprocessed for every output file you generate. Alternately, you can do this:

bbrename.sh miseq.fq addprefix prefix=miseq
bbrename.sh gaII_1.fq addprefix prefix=gaII_1
...etc

Then cat all the files together, and error-correct them:

ecc.sh in=combined.fq out=ecc.fq ordered int=f

Then demultiplex:

demuxbyname.sh in=ecc.fq out=demuxed_%.fq names=miseq,gaII_1 int=f

(note the % symbol; it will be replaced by a name)

That will keep all the read order the same. So, if all the reads are initially either single-ended or interleaved (i.e. one file per library), pairs will be kept together, and you can de-interleave them afterward if you want. You can convert between 2-file and interleaved formats with reformat.sh.
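Put together, the rename/concatenate/correct/demultiplex route above might look like the following sketch (file names are placeholders; I have added explicit in=/out= arguments and the =t boolean form, which the one-liners above leave implicit):

bbrename.sh in=miseq.fq out=miseq_renamed.fq addprefix=t prefix=miseq
bbrename.sh in=gaII_1.fq out=gaII_1_renamed.fq addprefix=t prefix=gaII_1
cat miseq_renamed.fq gaII_1_renamed.fq > combined.fq
ecc.sh in=combined.fq out=ecc.fq ordered int=f
demuxbyname.sh in=ecc.fq out=demuxed_%.fq names=miseq,gaII_1 int=f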
#17
Junior Member
Location: South Korea Join Date: Mar 2015
Posts: 1
Dear Brian,
Is it possible to use BBNorm to normalize filtered subreads from PacBio sequencing? BBNorm seems to work well on my Illumina MiSeq data, but when I tried it on PacBio filtered subreads (fastq format), the resulting output file was empty.

I'm working on PacBio sequencing data from MDA-amplified single-cell genomes. When I assembled the data using HGAP (SMRT Analysis), the results were not good (too many rather short contigs). Some of my colleagues told me that uneven coverage might be the cause of the bad assembly, so I've been trying to normalize the raw data. Thanks.
#18
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
BBNorm cannot be used on raw PacBio data, as the error rate is too high; it throws everything away because the apparent depth is always 1, since virtually all the kmers are unique. Normalization uses 31-mers (by default), which requires that the average error rate be below roughly 1/40 (so that a large fraction of kmers are error-free). Raw PacBio data, however, averages around a 1/7 error rate, so most programs that use long kmers will not work on it at all. BBMap, on the other hand, uses short kmers (typically 9-13) and can process PacBio data, but it does not do normalization - a longer kmer is needed for that.
PacBio CCS or "Reads of Insert" data that has been self-corrected, with enough passes to drop the error rate below 3% or so, could be normalized by BBNorm. So if you intentionally fragment your reads to around 3 kbp or less, run long movies, and then self-correct, normalization should work.

PacBio data has a very flat coverage distribution, which is great, and means that it typically does not need normalization. But MDA'd single cells have highly variable coverage regardless of the platform, and approaches like HGAP, which correct by consensus of multiple reads covering the same locus, will not work anywhere coverage is very low.

I think your best bet is really to shear to a smaller fragment size, self-correct to generate "Reads of Insert", and use those to assemble. I doubt normalization will give you a better assembly with error-corrected single-cell PacBio data, but if it does, you would have to use custom parameters so that low-coverage data is not thrown away (namely "mindepth=0 lowthresh=0"), since a lot of single-cell contigs have very low coverage. BBNorm (and, I imagine, all other normalizers) has defaults set for Illumina reads.
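If you do end up normalizing error-corrected, single-cell PacBio reads, a starting point using the custom parameters mentioned above might be (a sketch only; the file names and the 40x target are placeholders):

bbnorm.sh in=reads_of_insert.fq out=normalized.fq target=40 mindepth=0 lowthresh=0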
#19
Junior Member
Location: France Join Date: Sep 2012
Posts: 2
Hi Brian
I am trying to assemble a genome from a single-cell amplification using SPAdes, and subsequently using tetramer + coverage analysis (i.e. CONCOCT) to remove "contaminating reads" which seem to have been introduced during sequencing. If I understood correctly, bbnorm will keep enough information to allow coverage to be used subsequently to bin contigs. Is that the case?

I would also like to know your opinion on how bbnorm compares to the strategy described in http://www.sciencemag.org/cgi/conten.../6047/1296/DC1 I quote:

"High-depth k-mers, presumably derived from MDA amplification bias, cause problems in the assembly, especially if the k-mer depth varies in orders of magnitude for different regions of the genome. We removed reads representing high-abundance k-mers (>64x k-mer depth) and trimmed reads that contain unique k-mers. The results of the k-mer based coverage normalization are shown in table S8."
#20
Junior Member
Location: Midwest Join Date: Aug 2013
Posts: 8
Hi suzumar, if you apply any kind of coverage normalization (whether with khmer or bbnorm) prior to assembly, you won't be able to use those reads to calculate the coverage of the contigs. However, you can go back and use the unnormalized reads to calculate the contig coverage.
Re: the Science paper, I don't know what they actually did; if you can find the software and/or detailed instructions in their methods, I'd be happy to comment.
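On the first point - getting per-contig coverage from the unnormalized reads - one option (a sketch; covstats= is BBMap's per-reference coverage summary, and the file names are placeholders) is simply to map the raw reads back to the assembly:

bbmap.sh ref=contigs.fa in=raw_reads.fq covstats=contig_coverage.txt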