Old 09-05-2015, 10:28 PM   #21
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I'll read the article in a few days and comment on it then. As Titus stated, you cannot do binning by depth after normalization - it destroys that information. Furthermore, MDA-amplified single cells cannot be individually binned for contaminants based on depth, as the depth is exponentially distributed and essentially random across the genome.

I use BBNorm (with the settings target=100 min=2) to preprocess amplified single cells prior to assembly with SPAdes, as it vastly reduces the total runtime and memory use, meaning the jobs are much less likely to crash or be killed. If you want to reduce contamination, though, I have a different tool called CrossBlock, which is designed to eliminate cross-contamination between multiplexed single-cell libraries. You need to first assemble all the libraries, then run CrossBlock with all of the libraries and their reads (raw, not normalized!) as input; it essentially removes contigs from assemblies that have greater coverage from another library than from their own library. Incidentally, CrossBlock does in fact use BBNorm.
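For reference, a minimal sketch of that preprocessing step (file names are placeholders, and the SPAdes invocation shown just uses its standard single-cell options rather than anything specific to this workflow, so adjust to your data):

bbnorm.sh in=reads_1.fq.gz in2=reads_2.fq.gz out1=normalized_1.fq.gz out2=normalized_2.fq.gz target=100 min=2
spades.py --sc -1 normalized_1.fq.gz -2 normalized_2.fq.gz -o spades_out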

The latest version of SPAdes does not really have too much trouble with high-abundance kmers, unless they get extremely high or you have a limited amount of memory. So, you don't HAVE to normalize before running SPAdes, but normalization tends to give a comparable assembly with a small fraction of the resources - typically with slightly better contiguity and slightly lower misassembly rates, with slightly lower genome recovery but a slightly higher rate of long genes being called (according to QUAST).

On the other hand, if you want to assemble MDA-amplified single-cell data with an assembler designed for isolate data, normalization is pretty much essential for a decent assembly.

Last edited by Brian Bushnell; 09-05-2015 at 10:32 PM.
Brian Bushnell is offline   Reply With Quote
Old 09-24-2015, 08:33 AM   #22
C. Olsen
Junior Member
 
Location: San Francisco/Raleigh

Join Date: Sep 2015
Posts: 1
Default Comparison paper

Hello Brian,

You mentioned the intent to submit a paper comparing BBNorm with other methods back in March. Were you able to scrape up any time to submit something? I'm keen to read it.
C. Olsen is offline   Reply With Quote
Old 09-24-2015, 10:40 AM   #23
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I have the manuscript mostly written, but it's not really ready to submit anywhere yet. However, there is a postdoc who is eager to get started on preparing it for submission, so... hopefully soon?
Brian Bushnell is offline   Reply With Quote
Old 12-04-2015, 09:09 PM   #24
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default Unique kmers from BBNorm?

Brian,

I see that in the output of BBNorm there are counts of unique kmers. Is there an option within BBNorm to export the unique kmers as a list or FASTA?

Best,

Bob
jazz710 is offline   Reply With Quote
Old 12-04-2015, 09:58 PM   #25
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Bob,

That's not possible with BBNorm, as it uses a lossy data structure called a count-min sketch to store counts. However, you can do that with KmerCountExact, which is faster than BBNorm, but less memory-efficient. Usage:

kmercountexact.sh in=reads.fq out=kmers.fasta


That will print them in fasta format; for 2-column tsv, add the flag "fastadump=f". There are also flags to suppress storing or printing of kmers with low counts.
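For example, a sketch of the tsv variant (the file names are just placeholders):

kmercountexact.sh in=reads.fq out=counts.tsv fastadump=f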
Brian Bushnell is offline   Reply With Quote
Old 12-05-2015, 09:15 AM   #26
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

Oh cool! Can I specify the kmer size? And can it accept paired end FASTQ files like the other tools? Ideally, I'd like to take a pair of PE FASTQs and extract the unique 31-mers (or some other value of k depending on needs). Currently I use Jellyfish for this.

Last edited by jazz710; 12-05-2015 at 09:17 AM. Reason: More complete
jazz710 is offline   Reply With Quote
Old 12-05-2015, 09:46 AM   #27
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by jazz710 View Post
Oh cool! Can I specify the kmer size? And can it accept paired end FASTQ files like the other tools? Ideally, I'd like to take a pair of PE FASTQs and extract the unique 31-mers (or some other value of k depending on needs). Currently I use Jellyfish for this.
Yes and yes!
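For example, something along these lines should work (the file names are placeholders; in2= and k= follow the usual BBTools conventions, so check the kmercountexact.sh help output for details):

kmercountexact.sh in=reads_R1.fq in2=reads_R2.fq k=31 out=kmers31.fasta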
Brian Bushnell is offline   Reply With Quote
Old 03-05-2016, 08:28 PM   #28
348056755@qq.com
Junior Member
 
Location: perth

Join Date: Mar 2016
Posts: 2
Default

Hi Brian,

I am using bbnorm.sh to filter out low-depth reads and error-correct the remaining reads in the same run. I have a question about it:
The raw paired-end read files (.gz) are 223 + 229 MB.
After running bbnorm.sh, the outputs (out, out2 and outt) are 111 + 113 + 64 MB.
So what is the difference between the 64 MB of reads in outt and the total amount of reads that were removed (223+229-111-113-64 = 164 MB)?
This is the command I ran:
bbnorm.sh in=input_fq1.gz in2=input_fq2.gz zerobin=t prefilter=t maxdepth=1000 lowbindepth=10 highbindepth=500 ecc=t out1=bbnorm.fq1.gz out2=bbnorm.fq2.gz outt=exclued.fq.gz

Thanks,
Xiao
348056755@qq.com is offline   Reply With Quote
Old 03-05-2016, 09:58 PM   #29
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Xiao,

The size difference is likely due to compression, and the fact that error-free reads compress better than reads with errors. Comparing the file size of compressed files tends to be confusing. If you want to know the truth, look at the actual amount of data in the files. For example, "reformat.sh in=file.fq.gz" will tell you the exact number of bases in the file.
Brian Bushnell is offline   Reply With Quote
Old 03-06-2016, 02:10 AM   #30
348056755@qq.com
Junior Member
 
Location: perth

Join Date: Mar 2016
Posts: 2
Default

Hi Brian,

Thanks very much for your quick reply. Following your suggestion, I ran reformat.sh to calculate the exact number of bases in each file, but the discrepancy does not seem to be explained by compression. See below:
1) Reads before running bbnorm.sh:
reads 1 - Input: 2728414 reads, 336952602 bases
reads 2 - Input: 2728414 reads, 338676300 bases

2) Reads after:
reads 1 - Input: 1307784 reads, 162040282 bases
reads 2 - Input: 1307784 reads, 162053968 bases
excluded reads - Input: 767030 reads, 95017289 bases

Thanks,
Xiao
348056755@qq.com is offline   Reply With Quote
Old 03-06-2016, 09:05 PM   #31
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Oh... sorry, the explanation is a bit different here. By default BBNorm runs in 2-pass mode, which gives the best normalization. However, that generates temp files (which are later deleted), and the final outputs come only from the second pass - reads discarded in the first pass disappear completely, so they never show up in outt.

For what you are doing I recommend this command:

bbnorm.sh in=input_fq1.gz in2=input_fq2.gz zerobin=t prefilter=t target=1000 min=10 passes=1 ecc=t out1=bbnorm.fq1.gz out2=bbnorm.fq2.gz outt=excluded.fq.gz

Then the output numbers should add up as expected.
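To double-check, you can count the reads in each output with reformat.sh as before; with passes=1 the kept and excluded reads should sum to the input. A sketch using the file names from the command above:

reformat.sh in=bbnorm.fq1.gz in2=bbnorm.fq2.gz
reformat.sh in=excluded.fq.gz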
Brian Bushnell is offline   Reply With Quote
Old 04-04-2016, 08:43 AM   #32
balaena
Member
 
Location: Germany

Join Date: Jan 2015
Posts: 29
Default

Hi Brian

I am having difficulty controlling the load of bbnorm on our server. Regardless of which number I enter for threads=, it always uses all idle cores, and the load average eventually goes above the number of cores.

our java is:

java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

Any solution?
Best
balaena is offline   Reply With Quote
Old 04-04-2016, 09:02 AM   #33
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,583
Default

While @Brian will be along later with an official answer, I suspect this may not be directly related to BBMap itself. If you have pigz installed on your machine, then BBMap tools use it by default to uncompress files, and that program may be starting additional threads that overwhelm your system.

If pigz is installed, you could turn it off by adding "pigz=f unpigz=f" to your BBMap commands and see if that stops the problem. Do keep using the threads= option. You are not running this under a job scheduler, correct?
GenoMax is online now   Reply With Quote
Old 04-04-2016, 09:44 AM   #34
balaena
Member
 
Location: Germany

Join Date: Jan 2015
Posts: 29
Default

Hi

Thanks for the suggestion, but it didn't help; the load still goes through the roof. I ran it like so:

bbnorm.sh -Xmx100g in= in2= out= out2= target=200 mindepth=3 threads=4 pigz=f unpigz=f

I could use many more than 4 threads, but I just wanted to see what happens.

Best
balaena is offline   Reply With Quote
Old 04-04-2016, 09:51 AM   #35
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,583
Default

Can you test these two options, "gunzip=f bf2=f", and report what happens?
GenoMax is online now   Reply With Quote
Old 04-04-2016, 10:03 AM   #36
balaena
Member
 
Location: Germany

Join Date: Jan 2015
Posts: 29
Default

Sorry. No change.
balaena is offline   Reply With Quote
Old 04-04-2016, 10:06 AM   #37
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,583
Default

Will have to wait on @Brian.

For reference: how many cores and how much memory are available on this system? What is the size of the dataset?
GenoMax is online now   Reply With Quote
Old 04-04-2016, 10:12 AM   #38
balaena
Member
 
Location: Germany

Join Date: Jan 2015
Posts: 29
Default

40 cores: Intel(R) Xeon(R) CPU E7-8860 @ 2.27GHz

1 TB RAM

Dataset: 2 x 75 million 125 bp reads

Thanks!
balaena is offline   Reply With Quote
Old 04-21-2016, 01:03 PM   #39
evanname
Junior Member
 
Location: US

Join Date: Jan 2016
Posts: 1
Default

Brian, thank you so much for the excellent tools!

Is it possible to say at what level the error correction would be able to distinguish between sequencing errors and heterogeneity in the source sample?

For example, if the source was a 500 bp PCR product and 2% of the molecules had a substitution at base 100, would BBNorm flag that as an error? Is there an approximate percentage of heterogeneity at any particular base that serves as the dividing line between 'error' and 'SNP'?

Thanks!
evanname is offline   Reply With Quote
Old 10-11-2016, 06:37 PM   #40
cerebis
Junior Member
 
Location: Sydney, AU

Join Date: Oct 2016
Posts: 1
Default Garbage collection

If you are using Oracle's JVM (or perhaps others too), what you're seeing as excess CPU consumption from bbnorm might actually stem from garbage collection within the JVM. This really depends on the application's behaviour.

There has been a lot of work on the performance of garbage collectors in Java, and there are a few to choose from.

As a quick validation test, you could try forcing the single-threaded collector by adding the following option to the java invocation inside the bbnorm.sh script. (Sorry, there doesn't seem to be a way of passing that in from the command line.)

Code:
-XX:+UseSerialGC
You can also specify thread limits for the parallel collector. Normally you don't have to restrict it completely to see changes in concurrency; 4 is actually quite strict for a modern multicore CPU.

Code:
-XX:ParallelGCThreads=4 -XX:ConcGCThreads=4
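Purely as an illustration of where the flags go (the memory setting, classpath variable and main class below are guesses, not the exact contents of bbnorm.sh - copy the existing java line from your script and splice the -XX options in after "java"):

Code:
java -ea -Xmx100g -XX:ParallelGCThreads=4 -XX:ConcGCThreads=4 -cp "$CP" jgi.KmerNormalize "$@"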

Lots of further information can be found in Oracle's documentation of JVM options.

Keep in mind that the serial GC means the program will likely pause briefly at GC events, so at best you should expect a runtime penalty if the parallel GC was already working quite hard.
cerebis is offline   Reply With Quote