Old 11-02-2016, 11:14 AM   #41
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by evanname
Brian, thank you so much for the excellent tools!

Is it possible to say at what level the error correction would be able to distinguish between sequencing errors and heterogeneity in the source sample?

For example, if the source was a 500bp PCR product and 2% of the molecules had a substitution at base 100, would BBnorm flag that as an error? Is there an approximate percent heterogeneity at any particular base that serves as the dividing line between 'error' and 'SNP'?

Thanks!
Sorry for the very late reply, but anyway -

I recommend using Tadpole for error-correction now; it is substantially better than BBNorm because it uses exact kmer counts and algorithms designed to take advantage of them. I now use BBNorm only for normalization and for plotting kmer-frequency histograms of datasets too big to fit into memory, not for error-correction.
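A typical Tadpole correction command looks something like this (a minimal sketch; in=, out= and mode=correct are standard tadpole.sh flags, but check its built-in help for your version):

tadpole.sh in=reads.fq out=corrected.fq mode=correct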

I don't recommend doing error-correction at all on data in which you are hoping to find rare SNPs. That said, by default BBNorm only declares a base to be in error if there is at least a 1:140 ratio of kmer counts between it and the adjacent kmers, so a 2% SNP (roughly a 1:50 ratio) should be safe. Tadpole, on the other hand, defaults to a 1:16 ratio for detecting errors, which is much more aggressive and would wipe out a 2% SNP. Why is it more aggressive? Well... I tried to optimize the parameters for the best SPAdes assemblies, and SPAdes seems to perform best with pretty aggressive error-correction. You can change that threshold, though.
Old 01-12-2017, 06:33 AM   #42
jov14
Member
 
Location: Germany

Join Date: Oct 2014
Posts: 14
Default A question reagarding the partitioning option of BBnorm

Hi,
I want to preferentially assemble the genome of a low abundant community member from a metagenome, so I am interested in the partitioning option of BBnorm.

I have some questions on how to choose the best parameters though:

- For the other BBNorm workflows (normalization, filtering, error correction) you recommend the "prefilter" option. Is it also advisable for the partitioning workflow? (This option is used in most of the example usages of BBNorm in the documentation EXCEPT the partitioning workflow.)

- From the description, I assumed that by giving "outlow", "outmid" and "outhigh" arguments, the usual normalization workflow would be overridden and ALL reads would be grouped into one of these categories. However, the preliminary output of BBNorm states that a "target depth" of 100 and a "min depth" of 5 are being applied. Does that mean that all reads below a coverage of five will be discarded? Do I need to adjust the "mindepth" parameter as well?

- Our job-submission pipeline requires specifying a maximum RAM usage for every script started. However, BBNorm keeps exceeding this value, which leads to termination of the job. I kept increasing BBNorm's memory limit via the "-Xmx" argument, up to 200G, but BBNorm always exceeds the allotted limit (even when using the "prefilter" option mentioned above).
Do I have to consider any additional memory requirements of the script, beyond the "-Xmx" limit? How would I determine how much memory is needed?
(The dataset consists of about 84,547,019 read pairs; loglog.sh calculated a "Cardinality" of 5,373,179,884, but I do not know exactly how to interpret this value.)

Thanks for any suggestions.
Old 01-12-2017, 10:02 AM   #43
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Whether or not to use "prefilter" depends on how much memory you have relative to the dataset, not on the workflow. It makes BBNorm take roughly twice as long, but it increases accuracy when the dataset is very large compared to available memory; there is no accuracy penalty for using it, though the gain is trivial if you have plenty of memory. So if you have lots of RAM or a small dataset, you don't need it.
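Enabling it is just a matter of adding the flag to whatever run you are doing, e.g. (a sketch; target=100 is BBNorm's default target depth):

bbnorm.sh in=reads.fq out=normalized.fq target=100 prefilter=t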

In your case the dataset has approximately 5 billion unique kmers (which is what the output of loglog.sh means).

As for BBNorm's memory use:

-Xmx is a Java flag that specifies how much heap memory Java will use. This is most, but not all, of the memory your job will use - there is some overhead. Normally BBNorm will auto-detect how much memory is available and everything should be fine without you specifying -Xmx, but that depends on the job manager and system configuration. If you manually specify memory with -Xmx, it must be lower than the memory you request from the scheduler, not higher - about 84% of the request is what I recommend for our cluster, but this depends. So, basically, if you submit a job requesting 100G, set -Xmx84g. If that still gets killed by the scheduler, decrease -Xmx rather than increasing it.
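For example, on a SLURM-style scheduler the pairing of the two limits would look something like this (a hypothetical submission; the sbatch flags are an assumption, so adjust for your own job manager):

sbatch --mem=100G --wrap="bbnorm.sh -Xmx84g in=reads.fq out=normalized.fq"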

For 5 billion unique kmers, I recommend using the prefilter flag. The overall command would be something like:

bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=10 highbindepth=80 prefilter=t

Even though BBNorm will mention "target depth" and "min depth", those values will not affect your outputs - they only affect reads that go to the "out=" stream (which you did not specify), not reads that go to "outlow=" and so forth. Sorry, it's a little confusing.
Old 01-12-2017, 10:06 AM   #44
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Do you have a paper or something similar that explains the algorithm behind BBNorm?
Old 01-12-2017, 10:14 AM   #45
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I've described the algorithm in some detail in /bbmap/docs/guides/BBNormGuide.txt. I also wrote this a while back:

Quote:
Overview

This program accepts input files of single or paired reads in fasta or fastq format, correcting substitution-type errors and normalizing the read depth to some desired target before outputting the result. All stages are multithreaded, allowing very high processing speed while still (optionally) maintaining strictly deterministic output.

Phase 1: Gather Kmer Frequencies

An input file of sequence data is read and processed. Each read is translated into a set of all constituent kmers of fixed k (default 31). Each kmer’s count is incremented in a shared table (a count-min sketch) whenever it is seen, so at the end of this phase, the frequencies of all kmers are known.

Phase 2: Correct and Normalize Reads

The input file is read a second time. Each read is again translated into an array of kmers, and each kmer’s count is read from the table. An error-free read is expected to have a relatively smooth profile of kmer counts, so each read is scanned for the presence of adjacent kmers with discrepant counts. For such a pair of adjacent kmers, the one with the high count is considered a “good kmer” (or genomic kmer) and the one with the low count is considered a possible “error kmer”. The single base covered by the error kmer but not the good kmer is considered the suspect “error base”. In addition to absolute cutoffs for counts of good and bad kmers, data with very high coverage is handled with a relative cutoff for the ratio of adjacent kmer counts.

All 4 possible replacements of the error base (A, C, G, T) are considered. For each replacement, the kmer covering the error base is regenerated, and its count read from the table. If exactly one of the four replacements has a count sufficiently high to be considered a good kmer and the other three are sufficiently low to be considered error kmers, then the error base is replaced accordingly, and the error is considered corrected. Otherwise, the error cannot be corrected; any prior corrections are rolled back, and the read is output unchanged.

If normalization is desired, the kmer counts from correction are re-used to determine whether a read should be discarded. If the median count is below a cutoff, the read is discarded as noise. Reads between the lower cutoff and the target depth are all retained. Otherwise, the median is above the target depth and the read is discarded with probability 1-(target/median). For paired reads, the greater median is compared to the cutoff, and the lesser median is compared to the target; the pair is either kept or discarded together. Normalization may be run using multiple passes for greater precision.
Note that I don't recommend BBNorm for error-correction anymore, though, since Tadpole does a much better job (which is possible because it uses exact kmer counts). So I just use BBNorm for normalization and depth partitioning.
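
To make the normalization rule above concrete, here is a minimal sketch of the per-read decision in Python (illustrative only, not BBNorm's actual code; the function names are invented, and min_depth=5 / target=100 mirror the defaults mentioned earlier in the thread):

Code:
import random

def median(counts):
    # Median kmer count of a read, using counts from the phase-1 table.
    s = sorted(counts)
    return s[len(s) // 2]

def keep_read(kmer_counts, min_depth=5, target=100):
    # Per-read normalization decision as described above.
    m = median(kmer_counts)
    if m < min_depth:
        return False                      # below cutoff: discard as noise
    if m <= target:
        return True                       # between cutoff and target: keep
    return random.random() < target / m   # keep with probability target/median

def keep_pair(counts1, counts2, min_depth=5, target=100):
    # Pairs are kept or discarded together: the greater median is
    # compared to the cutoff, the lesser median to the target.
    m1, m2 = median(counts1), median(counts2)
    if max(m1, m2) < min_depth:
        return False
    lesser = min(m1, m2)
    if lesser <= target:
        return True
    return random.random() < target / lesser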
Old 01-13-2017, 01:09 AM   #46
jov14
Member
 
Location: Germany

Join Date: Oct 2014
Posts: 14
Default

@Brian Bushnell
Thanks a lot. BBNorm now completes successfully.