| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| Introducing BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries | Brian Bushnell | Bioinformatics | 64 | 03-28-2020 04:54 AM |
| LSC - a fast PacBio long read error correction tool. | LSC | Bioinformatics | 9 | 08-21-2015 07:06 AM |
| LSC - a fast PacBio long read error correction tool. | LSC | Pacific Biosciences | 55 | 02-14-2014 06:34 AM |
| Reptile error correction tool: fastq not readable | stepa_t | Bioinformatics | 2 | 07-25-2013 07:49 PM |
| BFAST and read error correction (with SAET or similar tool) | javijevi | Bioinformatics | 4 | 01-27-2010 01:46 PM |
#21
Super Moderator
Location: Walnut Creek, CA | Join Date: Jan 2014 | Posts: 2,707
I'll read the article in a few days, and comment on it then. As Titus stated, you cannot do binning by depth after normalization - it destroys that information. Furthermore, MDA'd single cells cannot be individually binned for contaminants based on depth, as the depth is exponentially random across the genome.
I use BBNorm (with the settings target=100 min=2) to preprocess amplified single cells prior to assembly with Spades, as it vastly reduces the total runtime and memory use, meaning that the jobs are much less likely to crash or be killed.

If you want to reduce contamination, though, I have a different tool called CrossBlock, which is designed to eliminate cross-contamination between multiplexed single-cell libraries. You need to first assemble all the libraries, then run CrossBlock with all of the libraries and their reads (raw, not normalized!) as input; it essentially removes contigs from assemblies that have greater coverage from another library than from their own. Incidentally, CrossBlock does in fact use BBNorm.

The latest version of Spades does not really have much trouble with high-abundance kmers, unless they get extremely high or you have a limited amount of memory. So, you don't HAVE to normalize before running Spades, but normalization tends to give a comparable assembly with a small fraction of the resources - typically with slightly better continuity, slightly lower misassembly rates, and slightly lower genome recovery, but a slightly higher rate of long genes being called (according to Quast). On the other hand, if you want to assemble MDA-amplified single-cell data with an assembler designed for isolate data, normalization is pretty much essential for a decent assembly.

Last edited by Brian Bushnell; 09-05-2015 at 10:32 PM.
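To make that preprocessing step concrete, here is a minimal sketch of the workflow described above (file names are placeholders; the SPAdes flags assume interleaved paired-end input in single-cell mode):

Code:
# Normalize MDA-amplified single-cell reads to ~100x target depth,
# discarding kmers seen only once (likely errors)
bbnorm.sh in=raw_reads.fq.gz out=normalized.fq.gz target=100 min=2

# Assemble with SPAdes in single-cell mode, reading interleaved pairs
spades.py --sc --12 normalized.fq.gz -o spades_out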
#22
Junior Member
Location: San Francisco/Raleigh | Join Date: Sep 2015 | Posts: 1
Hello Brian,
You mentioned the intent to submit a paper in March comparing BBNorm with other methods. Were you able to scrape up any time to submit something? I'm keen to read it.
#23
Super Moderator
Location: Walnut Creek, CA | Join Date: Jan 2014 | Posts: 2,707
I have the manuscript mostly written, but it's not really ready to submit anywhere yet. However, there is a postdoc who is eager to get started on preparing it for submission, so... hopefully soon?
#24
Member
Location: Iowa | Join Date: Oct 2012 | Posts: 41
Brian,
I see that in the output of BBNorm there are counts of unique kmers. Is there an option within BBNorm to export the unique kmers as a list or FASTA?

Best,
Bob
#25
Super Moderator
Location: Walnut Creek, CA | Join Date: Jan 2014 | Posts: 2,707
Hi Bob,
That's not possible with BBNorm, as it uses a lossy data structure called a count-min sketch to store counts. However, you can do that with KmerCountExact, which is faster than BBNorm but less memory-efficient. Usage:

Code:
kmercountexact.sh in=reads.fq out=kmers.fasta

That will print the kmers in fasta format; for 2-column tsv, add the flag "fastadump=f". There are also flags to suppress storing or printing of kmers with low counts.
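As a concrete example, dumping the kmers with their counts as two-column tsv instead (using only the flags named above):

Code:
kmercountexact.sh in=reads.fq out=kmer_counts.tsv fastadump=f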
#26
Member
Location: Iowa | Join Date: Oct 2012 | Posts: 41
Oh cool! Can I specify the kmer size? And can it accept paired-end FASTQ files like the other tools? Ideally, I'd like to take a pair of PE FASTQs and extract the unique 31-mers (or some other value of k depending on needs). Currently I use Jellyfish for this.

Last edited by jazz710; 12-05-2015 at 09:17 AM. Reason: More complete
#27
Super Moderator
Location: Walnut Creek, CA | Join Date: Jan 2014 | Posts: 2,707
Yes and yes!
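For instance, pulling the unique 31-mers out of a pair of PE FASTQs might look like this (a sketch following the usage above plus the standard BBTools in2=/k= conventions; check kmercountexact.sh's help output for the exact flags):

Code:
kmercountexact.sh in=reads_R1.fq in2=reads_R2.fq k=31 out=31mers.fasta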
#28
Junior Member
Location: Perth | Join Date: Mar 2016 | Posts: 2
Hi Brian,
I am using bbnorm.sh to filter out low-depth reads and error-correct the remaining reads at the same time. I have a question about the output:

The raw paired-end read files (.gz) are 223 + 229 MB. After running bbnorm.sh I get 111 + 113 + 64 MB (out1, out2, and outt). So what is the difference between the 64 MB of discarded reads in outt and the rest of the missing data (223 + 229 - 111 - 113 - 64 = 164 MB), which does not appear in any output file?

This is my command:

Code:
bbnorm.sh in=input_fq1.gz in2=input_fq2.gz zerobin=t prefilter=t maxdepth=1000 lowbindepth=10 highbindepth=500 ecc=t out1=bbnorm.fq1.gz out2=bbnorm.fq2.gz outt=excluded.fq.gz

Thanks,
Xiao
#29
Super Moderator
Location: Walnut Creek, CA | Join Date: Jan 2014 | Posts: 2,707
Hi Xiao,
The size difference is likely due to compression: error-free reads compress better than reads with errors, so comparing the sizes of compressed files tends to be misleading. If you want to know the truth, look at the actual amount of data in the files. For example, this will report the exact number of reads and bases in a file:

Code:
reformat.sh in=file.fq.gz
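For paired files, both can be checked in a single invocation (in2= is the standard BBTools paired-input flag):

Code:
reformat.sh in=bbnorm.fq1.gz in2=bbnorm.fq2.gz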
#30
Junior Member
Location: Perth | Join Date: Mar 2016 | Posts: 2
Hi Brian,
Thanks very much for your quick reply. Following your suggestion, I ran reformat.sh to get the exact number of bases in all the files, but the cause does not seem to be compression. See below:

1) Reads before running bbnorm.sh:
reads 1: 2,728,414 reads, 336,952,602 bases
reads 2: 2,728,414 reads, 338,676,300 bases

2) Reads after:
reads 1: 1,307,784 reads, 162,040,282 bases
reads 2: 1,307,784 reads, 162,053,968 bases
excluded reads: 767,030 reads, 95,017,289 bases

Thanks,
Xiao
#31
Super Moderator
Location: Walnut Creek, CA | Join Date: Jan 2014 | Posts: 2,707
Oh... sorry, the explanation is a bit different here. By default BBNorm runs in 2-pass mode, which gives the best normalization. However, that generates temp files (which are later deleted). The final outputs are only from the second pass - reads discarded in the first pass would disappear completely.
For what you are doing, I recommend this command:

Code:
bbnorm.sh in=input_fq1.gz in2=input_fq2.gz zerobin=t prefilter=t target=1000 min=10 passes=1 ecc=t out1=bbnorm.fq1.gz out2=bbnorm.fq2.gz outt=excluded.fq.gz

Then the output numbers should add up as expected.
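A quick sanity check after a passes=1 run (a sketch; with a single pass, the input read count should equal the kept reads plus the discarded reads):

Code:
# counts reported by reformat.sh should satisfy: input = out1 + out2 + outt
reformat.sh in=bbnorm.fq1.gz in2=bbnorm.fq2.gz
reformat.sh in=excluded.fq.gz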
#32
Member
Location: Germany | Join Date: Jan 2015 | Posts: 29
Hi Brian
I am having difficulty controlling the load bbnorm puts on our server. Regardless of which number I enter for threads=, it always uses up all idle cores, and the load average eventually goes above the number of cores.

Our java is:

Code:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

Any solution?

Best
#33
Senior Member
Location: East Coast USA | Join Date: Feb 2008 | Posts: 7,083
While @Brian will be along later with an official answer, I feel that this may not be directly related to BBMap. If you have pigz installed on your machine, then BBMap tools use it by default to uncompress files, and that program may be starting additional threads that overwhelm your system.

If pigz is installed, you could turn it off by adding "pigz=f unpigz=f" to your BBMap tool commands and see if that stops the problem. Do keep using the threads= option.

You are not running this under a job scheduler, correct?
#34
Member
Location: Germany | Join Date: Jan 2015 | Posts: 29
Hi
Thanks for the suggestion, but it didn't help; the load still goes through the roof. I used it like so:

Code:
bbnorm.sh -Xmx100g in= in2= out= out2= target=200 mindepth=3 threads=4 pigz=f unpigz=f

I could use many more than 4 threads, but I just wanted to see what happens.

Best
#35
Senior Member
Location: East Coast USA | Join Date: Feb 2008 | Posts: 7,083
Can you test these two options, "gunzip=f bf2=f", and report what happens?
#36
Member
Location: Germany | Join Date: Jan 2015 | Posts: 29
Sorry. No change.
#37
Senior Member
Location: East Coast USA | Join Date: Feb 2008 | Posts: 7,083
Will have to wait on @Brian.
For reference: how many cores and how much memory are available on this system? What is the size of the dataset?
#38
Member
Location: Germany | Join Date: Jan 2015 | Posts: 29
40 cores: Intel(R) Xeon(R) CPU E7-8860 @ 2.27GHz
1 TB RAM
Dataset: 2 x 75 million 125bp reads

Thanks!
#39
Junior Member
Location: US | Join Date: Jan 2016 | Posts: 1
Brian, thank you so much for the excellent tools!
Is it possible to say at what level the error correction can distinguish between sequencing errors and heterogeneity in the source sample? For example, if the source was a 500bp PCR product and 2% of the molecules had a substitution at base 100, would BBNorm flag that as an error? Is there an approximate percent heterogeneity at any particular base that serves as the dividing line between 'error' and 'SNP'?

Thanks!
#40
Junior Member
Location: Sydney, AU | Join Date: Oct 2016 | Posts: 1
If you are using Oracle's JVM (or perhaps others too), what you're seeing as excess CPU consumption from bbnorm might actually stem from garbage collection within the JVM. This really depends on the application's behaviour.

There has been a lot of work on the performance of garbage collectors in Java, and there are a few to choose between. As a quick validation test, you could try insisting on the single-threaded collector by adding the following option to the java invocation inside the bbnorm.sh script. (Sorry, there doesn't seem to be a means of passing that in.)

Code:
-XX:+UseSerialGC

Alternatively, you could keep a parallel collector but cap the number of threads it is allowed to use:

Code:
-XX:ParallelGCThreads=4 -XX:ConcGCThreads=4

Lots of further information can be found in Oracle's documentation of VM options. Keep in mind that SerialGC means the program will likely halt briefly at GC events, so at best you should expect a runtime penalty if the parallel GC was already working quite hard.
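As an illustration, the edit inside bbnorm.sh might look something like the following (a hypothetical sketch; the actual java invocation, class name, and memory flag in the script will differ):

Code:
# before (paraphrased): java -ea -Xmx31g -cp "$CP" jgi.KmerNormalize "$@"
# after, forcing the single-threaded collector:
java -ea -Xmx31g -XX:+UseSerialGC -cp "$CP" jgi.KmerNormalize "$@"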