Old 11-02-2016, 10:14 AM   #41
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by evanname View Post
Brian, thank you so much for the excellent tools!

Is it possible to say at what level the error correction would be able to distinguish between sequencing errors and heterogeneity in the source sample?

For example, if the source was a 500bp PCR product and 2% of the molecules had a substitution at base 100, would BBnorm flag that as an error? Is there an approximate percent heterogeneity at any particular base that serves as the dividing line between 'error' and 'SNP'?

Thanks!
Sorry for the very late reply, but anyway -

I recommend using Tadpole for error-correction now; it is substantially better than BBNorm because it uses exact kmer counts and algorithms designed to take advantage of them. I now use BBNorm only for normalization and for plotting kmer-frequency histograms of datasets too big to fit into memory, not for error-correction.

I don't recommend doing error-correction at all on data for which you are hoping to find rare SNPs. That said, by default, BBNorm determines a base to be in error if there is at least a 1:140 ratio of kmer counts between it and the adjacent kmers, so a 2% SNP should be safe. Tadpole, on the other hand, defaults to a 1:16 ratio for detecting errors, which is much more aggressive and would wipe out a 2% SNP. Why is it more aggressive? Well... I tried to optimize the parameters for the best Spades assemblies, and Spades seems to perform best with pretty aggressive error-correction. You can change that threshold, though.
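For reference, a Tadpole error-correction run looks roughly like this (a minimal sketch with placeholder filenames; check tadpole.sh --help for the current options):

tadpole.sh in=reads.fq out=corrected.fq mode=correct
tadpole.sh in=r1.fq in2=r2.fq out=c1.fq out2=c2.fq mode=correct

The second form is for paired reads kept in two files.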
Old 01-12-2017, 05:33 AM   #42
jov14
Member
 
Location: Germany

Join Date: Oct 2014
Posts: 16
Default A question regarding the partitioning option of BBNorm

Hi,
I want to preferentially assemble the genome of a low-abundance community member from a metagenome, so I am interested in the partitioning option of BBNorm.

I have some questions on how to choose the best parameters though:

-For the other BBNorm workflows (normalization, filtering, error correction) you recommend the "prefilter" option. Is this also advisable for the partitioning workflow? (This option is used in most of the example usages of BBNorm in the documentation EXCEPT the partitioning workflow.)

-From the description, I assumed that by giving the "outlow", "outmid" and "outhigh" arguments, the usual normalization workflow would be overridden and ALL reads would be grouped into one of these categories. However, the preliminary output of BBNorm states that a "target depth" of 100 and a "min depth" of 5 are being applied. Does that mean that all reads below a coverage of five will be discarded? Do I need to adjust the "mindepth" parameter as well?

-Our job-submission pipeline requires specifying a maximum RAM usage for every script started. However, BBNorm keeps exceeding this value (which leads to termination of the job). I kept increasing the memory limit of BBNorm using the "-Xmx" argument up to 200G, but BBNorm always exceeds the allotted limit (even when using the "prefilter" option above).
Do I have to consider any additional memory requirements of the script, in addition to the "-Xmx" limit? How would I determine how much memory is needed?
(The dataset consists of about 84,547,019 read pairs;
loglog.sh calculated a "Cardinality" of 5,373,179,884, but I do not know how exactly to interpret this value.)

Thanks for any suggestions.
Old 01-12-2017, 09:02 AM   #43
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Whether or not to use "prefilter" depends on the amount of memory you have rather than on the workflow. It makes BBNorm take roughly twice as long, but increases accuracy when the dataset is very large relative to memory; there is no accuracy penalty for using it, but the gain is trivial if you have plenty of memory. So if you have lots of RAM or a small dataset, you don't need it.

In your case the dataset has approximately 5 billion unique kmers (which is what the output of loglog.sh means).

As for BBNorm's memory use:

-Xmx is a Java flag that specifies how much heap memory Java will use. This is most, but not all, of the memory that your job will use - there is some overhead. Normally BBNorm will auto-detect how much memory is available and everything should be fine without you needing to specify -Xmx, but that depends on the job manager and system configuration. If you manually specify memory with -Xmx, it must be lower than the memory you request from the scheduler, not higher. I recommend about 84% for our cluster, but this varies. So, basically, if you submit a job requesting 100G, then set -Xmx84g. If it still gets killed by the scheduler, decrease -Xmx rather than increasing it.
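For example (a hedged sketch only; the scheduler syntax and filenames are placeholders), a SLURM job requesting 100 GB might pair with a call like:

#SBATCH --mem=100G
bbnorm.sh -Xmx84g in=reads.fq out=normalized.fq target=100 mindepth=5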

For 5 billion unique kmers, I recommend using the prefilter flag. The overall command would be something like:

bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=10 highbindepth=80 prefilter

Even though BBNorm will mention "target depth" and "min depth", those values will not affect your outputs - they only affect reads that go to the "out=" stream (which you did not specify), not reads that go to "outlow=" and so forth. Sorry, it's a little confusing.
Old 01-12-2017, 09:06 AM   #44
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Do you have a paper or something similar that explains the algorithm behind BBNorm?
Old 01-12-2017, 09:14 AM   #45
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I've described the algorithm in some detail in /bbmap/docs/guides/BBNormGuide.txt. I also wrote this a while back:

Quote:
Overview

This program accepts input files of single or paired reads in fasta or fastq format, correcting substitution-type errors and normalizing the read depth to some desired target before outputting the result. All stages are multithreaded, allowing very high processing speed while still (optionally) maintaining strictly deterministic output.

Phase 1: Gather Kmer Frequencies

An input file of sequence data is read and processed. Each read is translated into a set of all constituent kmers of fixed k (default 31). Each kmer’s count is incremented in a shared table (a count-min sketch) whenever it is seen, so at the end of this phase, the frequencies of all kmers are known.

Phase 2: Correct and Normalize Reads

The input file is read a second time. Each read is again translated into an array of kmers, and each kmer’s count is read from the table. An error-free read is expected to have a relatively smooth profile of kmer counts, so each read is scanned for the presence of adjacent kmers with discrepant counts. For such a pair of adjacent kmers, the one with the high count is considered a “good kmer” (or genomic kmer) and the one with the low count is considered a possible “error kmer”. The single base covered by the error kmer but not the good kmer is considered the suspect “error base”. In addition to absolute cutoffs for counts of good and bad kmers, data with very high coverage is handled with a relative cutoff for the ratio of adjacent kmer counts.

All 4 possible replacements of the error base (A, C, G, T) are considered. For each replacement, the kmer covering the error base is regenerated, and its count read from the table. If exactly one of the four replacements has a count sufficiently high to be considered a good kmer and the other three are sufficiently low to be considered error kmers, then the error base is replaced accordingly, and the error is considered corrected. Otherwise, the error cannot be corrected; any prior corrections are rolled back, and the read is output unchanged.

If normalization is desired, the kmer counts from correction are re-used to determine whether a read should be discarded. If the median count is below a cutoff, the read is discarded as noise. Reads between the lower cutoff and the target depth are all retained. Otherwise, the median is above the target depth and the read is discarded with probability 1-(target/median). For paired reads, the greater median is compared to the cutoff, and the lesser median is compared to the target; the pair is either kept or discarded together. Normalization may be run using multiple passes for greater precision.
Note that I don't recommend BBNorm for error-correction anymore, though, since Tadpole does a much better job (which is possible because it uses exact kmer counts). So I just use BBNorm for normalization and depth partitioning.
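As a worked example of the normalization rule above: with target=40 and the lower cutoff at 2, a read whose median kmer count is 25 is always kept; a read with median count 200 is discarded with probability 1 - (40/200) = 0.8, i.e., kept about 20% of the time; and a read with median count 1 falls below the cutoff and is discarded as noise. (The numbers are chosen purely for illustration.)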
Old 01-13-2017, 12:09 AM   #46
jov14
Member
 
Location: Germany

Join Date: Oct 2014
Posts: 16
Default

@Brian Bushnell
Thanks a lot. Now BBnorm completed successfully.
Old 07-27-2018, 10:26 PM   #47
kokyriakidis
Member
 
Location: Thessaloniki, Greece

Join Date: Jul 2018
Posts: 12
Default

Hello! I have RNA-Seq data that I am processing with BBTools for adapter trimming, quality trimming, contaminant filtering, and error correction. I want to use these data to support a de novo genome assembly with ABySS, which has this ability. Should I normalize them? What target number should I set in BBNorm? Can someone explain how to calculate the right number?

Last edited by kokyriakidis; 07-28-2018 at 12:38 AM.
Old 12-04-2018, 05:01 PM   #48
shimingt
Junior Member
 
Location: singapore

Join Date: May 2016
Posts: 9
Default

Hello there,

I have performed metagenomic sequencing of my samples on an Illumina HiSeq. How do I determine the target coverage value to use?

Thanks in advance.
Old 01-27-2019, 04:17 AM   #49
mayabritstein
Junior Member
 
Location: Israel

Join Date: Nov 2015
Posts: 2
Default Exception in thread "Thread-175" java.lang.AssertionError:

Hello,

Thanks for the great software!

I want to use BBNorm to normalize single-end RNA-seq library reads.

I'm using the software on my university cluster (Linux).

This is the exception message I get:

Exception in thread "Thread-175" java.lang.AssertionError:
NB501112:39:HKM3VBGXY:4:11401:7344:1020 1:N:0:CCAGTT 1 -1 + -1 -1 1000000000000000000 1 0 0 CTTTACATCAAATCCTAATGTAGTTACAGGTGATTCAATTAATCTATCACCTAATGATTGTGAACGTTG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE/EEE . 35 . .

null
at stream.ReadStreamByteWriter.writeFastq(ReadStreamByteWriter.java:460)
at stream.ReadStreamByteWriter.processJobs(ReadStreamByteWriter.java:97)
at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:42)
at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:28)


this is the script I used:
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 24
#SBATCH -p hive7d
#SBATCH -J ecc_bbnormSE # Job name
#SBATCH --mem=128000
#SBATCH --time=2-23:00:00 # Runtime in D-HH:MM:SS
#SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=mayabritstein@gmail.com # Email to send notifications to

. /etc/profile.d/modules.sh
module purge
module load java/jre1.8.0

export PATH=/data/home/steindler/mbritstei/programs/anaconda2/bin:$PATH

source activate bbtools

fastq_DATA=/data/home/steindler/mbritstei/Petrosia_transcriptomes/transcriptome_assembly/All_Single_end


bbnorm.sh in=$fastq_DATA/petrosia_SE.fastq out= out2=petrosia_SE_normalized_ecc.fastq target=100 min=5 ecc prefilter


source deactivate



## Can you please tell me what is wrong?

# thanks!!

Maya
Old 01-27-2019, 04:34 AM   #50
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

@Maya: Can you explicitly add -Xmx128g (is that what you are asking for?) and threads=24 to your bbnorm.sh command?

I also don't understand this part: in=$fastq_DATA/petrosia_SE.fastq out= out2=petrosia_SE_normalized_ecc.fastq. There should be just one "out=".
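Something along these lines, perhaps (an untested sketch reusing the paths and parameters from your script; -Xmx could also be set somewhat below the full 128G request, following the ~84% guideline earlier in the thread):

bbnorm.sh -Xmx128g threads=24 in=$fastq_DATA/petrosia_SE.fastq out=petrosia_SE_normalized_ecc.fastq target=100 min=5 ecc prefilter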

Normalizing RNA-seq data is not appropriate: you are going to lose vital count information. See the section on "When not to normalize" in the BBNorm guide here.

Last edited by GenoMax; 01-27-2019 at 04:38 AM.
Old 01-27-2019, 06:23 AM   #51
mayabritstein
Junior Member
 
Location: Israel

Join Date: Nov 2015
Posts: 2
Default

Thanks @GenoMax

I did not see that "out= " there... will try again, also with the suggested flags.

I'm using the normalization just for assembly, not for quantification.
Old 04-10-2020, 02:42 PM   #52
silverfox
Junior Member
 
Location: Perú

Join Date: Jul 2019
Posts: 6
Default

Quote:
Originally Posted by Brian Bushnell View Post
I'd like to introduce BBNorm, a member of the BBTools package.

BBNorm is a kmer-based normalization tool for NGS reads, which may also be used for error-correction and generating kmer-frequency plots. It is extremely fast and memory-efficient due to the use of atomic counters and probabilistic data structures.

First, what is normalization? Many assemblers perform poorly in the presence of too much data, and data with irregular coverage, such as MDA-amplified single cells or metagenomes. And even if an assembler performs well with these datasets (Spades, for example, does a very good job with single cells, though still benefits from normalization in my tests), more data will increase the runtime and memory usage, potentially to the point that the data cannot be assembled.

Subsampling randomly discards some percentage of the reads, to reduce the average coverage, and is computationally cheap (you can do that quickly with reformat). However, if you have one area with 10000x coverage and another area with 10x coverage, subsampling to 1% will reduce the 10000x area to 100x - a good level for some assemblers - but the 10x area to 0.1x, which cannot be assembled. So with irregular coverage, it is not ideal!

Normalization discards reads with a probability based on the local coverage. So, for example, if you normalize to a target 40x coverage, reads in regions with 10x coverage will all be retained; in an 80x region they will be discarded with a 0.5 probability; and in a 10000x area they will be discarded with a 0.996 probability (i.e., retained with a probability of 0.004). As a result, after normalization, areas with coverage above the target will be reduced to the target, and areas below the target will be left untouched. This generally makes the data much easier to assemble and typically increases contiguity (L50) by a substantial amount.

To normalize to 40x coverage with BBNorm, and discard reads with an apparent depth under 2x (which typically indicates the reads have errors):

bbnorm.sh in=reads.fq out=normalized.fq target=40 mindepth=2

Error-correction is also useful in many cases when you have sufficient depth and wish to avoid false-positive variant calls, or achieve a higher mapping rate, or want to merge paired reads via overlap, or assemble high error-rate data with an assembler that is not very tolerant of errors, or reduce the memory usage of a de Bruijn assembler. Like normalization, error-correction is not universally a good idea - JGI does not normalize and error-correct all data prior to use, for example - but it is highly beneficial in many situations. Also, BBNorm only corrects substitution errors, not indels, since that is the error mode that occurs in Illumina data. In other words, it will NOT error-correct PacBio or 454 data, which feature indel errors. BBNorm is currently being used for error-correction by the OLC assembler Omega.

To error-correct reads with BBNorm:

ecc.sh in=reads.fq out=corrected.fq

To error-correct and normalize at the same time, just add the flag "ecc" when running BBNorm.

Lastly, BBNorm can be used for producing kmer frequency histograms, and binning reads based on coverage depth. The histograms are useful to determine things like the ploidy of an organism, the genome size and coverage, the heterozygosity rate, and the presence of contaminants (which typically have drastically different coverage than the genome of interest).

To generate a kmer frequency histogram:

khist.sh in=reads.fq hist=histogram.txt

You can also use the "hist" flag during normalization or error-correction to get the histogram of input reads for free; "histout" will generate the histogram of output reads, but costs an additional pass.

So, what is the difference between bbnorm.sh, ecc.sh, and khist.sh? They all call the same program, just with different default parameters (you can see the exact parameters by looking at the bottom of the shellscripts). If you specify all parameters manually, they are equivalent.

How does BBNorm work, and why is it better than other tools?

BBNorm counts kmers; by default, 31-mers. It reads the input once to count them. Then it reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with probability based on the ratio of the desired coverage to the median of the counts of a read's kmers. For error-correction, situations where there are adjacent kmers in a read with drastically different frequencies - for example, differing by a factor of 180 - are detected; the offending base is altered to a different base, if doing so will restore the kmer to a similar frequency as its adjacent kmers.

BBNorm is memory-efficient because it does not explicitly store kmers - everything is in a probabilistic data structure called a count-min sketch. As a result, BBNorm will never run out of memory, slow down, or use disk, no matter how much data you have or how big the genome is. Rather, the accuracy will decline as the table's loading increases - but because kmers are not explicitly stored, it can store several times more than an explicit data structure (such as Google Sparse Hash). And for normalization, the reduction in accuracy at extremely high loading does not matter, because the median is used - so even if multiple kmers within a read have an incorrectly high count, they will not even be considered, and thus the results will not be affected at all. As a result - in practice, you should use all available memory even for a tiny genome with a small number of reads; but even for a huge genome with very high coverage, BBNorm will still work, and produce good results quickly on a computer with limited memory.

Speedwise, BBNorm is multithreaded in all stages, using atomic counters which do not require locking - this allows it to scale efficiently with processor core counts.

BBTools has another program functionally related to BBNorm, "kmercountexact.sh". It does NOT use probabilistic data structures, and uses locking rather than atomic counters, and as a result may not scale as well, and will run out of memory on large datasets. However, it is still extremely fast and memory-efficient - using ~15 bytes per kmer (with an optional count-min-sketch prefilter to remove low-count error kmers). It cannot normalize or error-correct, but it can generate the exact kmer count of a dataset as well as the exact kmer frequency histogram (and do rudimentary peak calling for genome size estimation). In practice, when I am interested in kmer frequency histograms, I use KmerCountExact for isolates, and BBNorm for metagenomes.

BBNorm (and all BBTools) can be downloaded here:

http://sourceforge.net/projects/bbmap/

Edit: Note! Most programs in the BBMap package run in Java 6 or higher, but BBNorm requires Java 7 or higher.
Hi!!! A question, please.

Can BBNorm also be used to normalize the coverage of PacBio reads?

Thank you in advance!

edited:
I just read that it can indeed be used on PacBio reads but doesn't perform error-correction. That's fine for me: I have HiFi PacBio reads and wanted to try this normalization step.

Last edited by silverfox; 04-10-2020 at 03:00 PM.
Old 04-12-2020, 04:57 PM   #53
silverfox
Junior Member
 
Location: Perú

Join Date: Jul 2019
Posts: 6
Default normalized reads file empty

Hi Brian and everyone! I'm using BBNorm but I keep running into a problem.

I ran the command:

$ bbnorm.sh -Xmx64g t=18 in=pt.raw.fastq out=pt.raw.normalized.fq target=90 mindepth=2

Everything looked good, but when the process ended, I realized the file pt.raw.normalized.fq was empty.

edited:

I just ran the following command:

$ bbnorm.sh in=pt.raw.fastq out=pt.raw.normalized3.fq target=90 min=2

But at the end, my pt.raw.normalized3.fq file was still empty, like before T-T

I think the problem could be here:

In the second pass, it says:

Code:
Made hash table:        hashes = 3       mem = 65.66 GB         cells = 35.25B          used = 0.000%

Estimated unique kmers:         0

Table creation time:            17.804 seconds.
Started output threads.
Table read time:                0.012 seconds.          0.00 kb/sec
Total reads in:                 0               NaN% Kept
Total bases in:                 0               NaN% Kept
Error reads in:                 0               NaN%
Error type 1:                   0               NaN%
Error type 2:                   0               NaN%
Error type 3:                   0               NaN%
Total kmers counted:            0
Please, can someone tell me what I did wrong?
Thanks a lot in advance!

Editing:

I just found one of your comments:
Quote:
Originally Posted by Brian Bushnell View Post
BBNorm cannot be used for raw PacBio data, as the error rate is too high; it is throwing everything away as the apparent depth is always 1, since all the kmers are unique. Normalization uses 31-mers (by default) which requires that, on average, the error rate is below around 1/40 (so that a large number of kmers will be error-free). However, raw PacBio data has on average a 1/7 error rate, so most programs that use long kmers will not work on it at all. BBMap, on the other hand, uses short kmers (typically 9-13) and it can process PacBio data, but does not do normalization - a longer kmer is needed.

PacBio CCS or "Reads of Insert" that are self-corrected, with multiple passes to drop the error rate below 3% or so, could be normalized by BBNorm. So, if you intentionally fragment your reads to around 3kbp or less, but run long movies, then self-correct, normalization should work.

PacBio data has a very flat coverage distribution, which is great, and means that typically it does not need normalization. But MDA'd single cells have highly variable coverage regardless of the platform, and approaches like HGAP to correct by consensus of multiple reads covering the same locus will not work anywhere that has very low coverage. I think your best bet is really to shear to a smaller fragment size, self-correct to generate "Reads of Insert", and use those to assemble. I doubt normalization will give you a better assembly with error-corrected single-cell PacBio data, but if it did, you would have to use custom parameters to not throw away low-coverage data (namely, "mindepth=0 lowthresh=0"), since a lot of the single-cell contigs have very low coverage. BBNorm (and, I imagine, all other normalizers) have defaults set for Illumina reads.
I will try to normalize my HiFi PacBio (PacBio CCS) reads then
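Concretely, I plan to try something like this (just my guess at the command, adapted from the examples earlier in the thread, with placeholder filenames):

bbnorm.sh in=hifi_ccs.fastq out=hifi_normalized.fastq target=40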

Can I not reduce the kmer length? (default=31)

Last edited by silverfox; 04-12-2020 at 06:25 PM.
Old 08-12-2020, 12:51 PM   #54
uloeber
Member
 
Location: Germany

Join Date: Mar 2013
Posts: 44
Default

Hi all,
does anyone have experience with processing ancient DNA using BBTools? My question is how short the kmer length for BBNorm should be, given the issues with fragmented DNA. Thanks in advance!