#41
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
I recommend using Tadpole for error correction now; it is substantially better than BBNorm because it uses exact kmer counts and algorithms designed to take advantage of those exact counts. I now use BBNorm only for normalization and for plotting kmer-frequency histograms of datasets too big to fit into memory, not for error correction.

I don't recommend doing error correction at all on data in which you hope to find rare SNPs. That said, by default BBNorm only calls a base an error if there is at least a 1:140 ratio of kmer counts between it and the adjacent kmers, so a 2% SNP should be safe. Tadpole, on the other hand, defaults to a 1:16 ratio for detecting errors, which is much more aggressive and would wipe out a 2% SNP. Why is it more aggressive? Well... I tried to optimize the parameters for the best SPAdes assemblies, and SPAdes seems to perform best with fairly aggressive error correction. You can change that threshold, though.
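If it helps, a minimal Tadpole correction command looks something like this (a sketch with hypothetical filenames; mode=correct selects error correction, and the detection-ratio settings can be tuned - check the tadpole.sh help output for the exact flag names in your version):

Code:
# Basic Tadpole error correction; reads.fq / corrected.fq are placeholder names
tadpole.sh in=reads.fq out=corrected.fq mode=correct k=31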
#42
Member
Location: Germany
Join Date: Oct 2014
Posts: 16
Hi,
I want to preferentially assemble the genome of a low-abundance community member from a metagenome, so I am interested in the partitioning option of BBNorm. I have some questions about choosing the best parameters, though:

- For the other BBNorm workflows (normalization, filtering, error correction) you recommend the "prefilter" option. Is it also advisable for the partitioning workflow? (This option is used in most of the example usages of BBNorm in the documentation EXCEPT the partitioning workflow.)

- From the description, I assumed that giving the "outlow", "outmid" and "outhigh" arguments would override the usual normalization workflow, and that ALL reads would be grouped into one of these categories. However, the preliminary output of BBNorm states that a "target depth" of 100 and a "min depth" of 5 are being applied. Does that mean all reads below a coverage of five will be discarded? Do I need to adjust the "mindepth" parameter as well?

- Our job-submission pipeline requires specifying a maximum RAM usage for every script started. However, BBNorm keeps exceeding this value, which leads to termination of the job. I kept increasing the memory limit of BBNorm using the "-Xmx" argument up to 200G, but BBNorm always exceeds the allotted limit (even with the "prefilter" option above). Do I have to consider any additional memory requirements of the script beyond the "-Xmx" limit? How would I determine how much memory is needed? (The dataset consists of about 84,547,019 read pairs; loglog.sh calculated a "Cardinality" of 5,373,179,884, but I do not know how to interpret this value.)

Thanks for any suggestions.
#43
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
Whether or not to use "prefilter" depends on how much memory you have rather than on the workflow. It makes BBNorm take roughly twice as long, but it increases accuracy when the dataset is very large relative to memory. There is no penalty for using it apart from the longer runtime - it always increases accuracy - but the gain is trivial if you have plenty of memory. So if you have lots of RAM or a small dataset, you don't need it.
In your case the dataset has approximately 5 billion unique kmers, which is what the output of loglog.sh means.

As for BBNorm's memory use: -Xmx is a Java flag that specifies how much heap memory Java will use. This is most, but not all, of the memory your job will use - there is some overhead. Normally BBNorm will auto-detect how much memory is available, and everything should be fine without you specifying -Xmx, but that depends on the job manager and system configuration. If you do manually specify memory with -Xmx, it must be lower than the amount you request from the scheduler, not higher. I recommend about 84% on our cluster, but it depends on the system. So if you submit a job requesting 100G, set -Xmx84g. If that still gets killed by the scheduler, decrease -Xmx rather than increasing it.

For 5 billion unique kmers, I recommend using the prefilter flag. The overall command would be something like:

bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=10 highbindepth=80

Even though BBNorm will mention "target depth" and "min depth", those values will not affect your outputs - they only apply to reads sent to the "out=" stream (which you did not specify), not to reads sent to "outlow=" and so forth. Sorry, it's a little confusing.
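To make the memory pairing concrete, a sketch of a submission script might look like the following (SLURM syntax assumed, filenames hypothetical, applying the ~84% rule of thumb from above):

Code:
#!/bin/bash
#SBATCH --mem=100G    # memory requested from the scheduler
# Give Java ~84% of the allocation so the JVM overhead still fits underneath it.
bbnorm.sh -Xmx84g prefilter=t in=reads.fq \
    outlow=low.fq outmid=mid.fq outhigh=high.fq \
    passes=1 lowbindepth=10 highbindepth=80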
#44
Member
Location: Germany
Join Date: Feb 2016
Posts: 40
Do you have a paper or something similar that explains the algorithm behind BBNorm?
#45
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
I've described the algorithm in some detail in /bbmap/docs/guides/BBNormGuide.txt. I also wrote this a while back:
Quote:

#46
Member
Location: Germany
Join Date: Oct 2014
Posts: 16
@Brian Bushnell
Thanks a lot. Now BBNorm completed successfully.
#47
Member
Location: Thessaloniki, Greece
Join Date: Jul 2018
Posts: 12
Hello! I have RNA-Seq data that I am processing with BBTools for adapter trimming, quality trimming, contaminant filtering, and error correction. I want to use these reads to support a de novo genome assembly with ABySS, which has this ability. Should I normalize them? What target number should I set in BBNorm, and how do I calculate the right value?

Last edited by kokyriakidis; 07-28-2018 at 01:38 AM.
#48
Junior Member
Location: Singapore
Join Date: May 2016
Posts: 9
Hello there,
I have performed metagenomic sequencing of my samples on an Illumina HiSeq. How do I determine the target coverage value to use? Thanks in advance.
#49
Junior Member
Location: Israel
Join Date: Nov 2015
Posts: 2
Hello,
Thanks for the great software! I want to use BBNorm to normalize single-end RNAseq library reads. I'm running it on my university cluster (Linux). This is the exception message I get:

Code:
Exception in thread "Thread-175" java.lang.AssertionError: NB501112:39:HKM3VBGXY:4:11401:7344:1020 1:N:0:CCAGTT 1 -1 + -1 -1 1000000000000000000 1 0 0 CTTTACATCAAATCCTAATGTAGTTACAGGTGATTCAATTAATCTATCACCTAATGATTGTGAACGTTG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE/EEE . 35 . . null
        at stream.ReadStreamByteWriter.writeFastq(ReadStreamByteWriter.java:460)
        at stream.ReadStreamByteWriter.processJobs(ReadStreamByteWriter.java:97)
        at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:42)
        at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:28)

This is the script I used:

Code:
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 24
#SBATCH -p hive7d
#SBATCH -J ecc_bbnormSE                     # Job name
#SBATCH --mem=128000
#SBATCH --time=2-23:00:00                   # Runtime in D-HH:MM:SS
#SBATCH --mail-type=ALL                     # Type of email notification - BEGIN,END,FAIL,ALL
#SBATCH --mail-user=mayabritstein@gmail.com # Email to send notifications to

. /etc/profile.d/modules.sh
module purge
module load java/jre1.8.0
export PATH=/data/home/steindler/mbritstei/programs/anaconda2/bin:$PATH
source activate bbtools

fastq_DATA=/data/home/steindler/mbritstei/Petrosia_transcriptomes/transcriptome_assembly/All_Single_end

bbnorm.sh in=$fastq_DATA/petrosia_SE.fastq out= out2=petrosia_SE_normalized_ecc.fastq target=100 min=5 ecc prefilter

source deactivate

Can you please tell me what is wrong?
Thanks!!
Maya
#50
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 7,083
@Maya: Can you explicitly add -Xmx128g (is that what you are requesting?) and threads=24 to your bbnorm.sh command?
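For example, something along these lines (an untested sketch; it also collapses the output into a single "out=", which relates to the next point):

Code:
bbnorm.sh -Xmx128g threads=24 in=$fastq_DATA/petrosia_SE.fastq \
    out=petrosia_SE_normalized_ecc.fastq target=100 min=5 ecc prefilter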
I also don't understand this part:

in=$fastq_DATA/petrosia_SE.fastq out= out2=petrosia_SE_normalized_ecc.fastq

There should be just one "out="?

Also, normalizing RNAseq data is not appropriate: you are going to lose vital count information. See the section on "When not to normalize" in the BBNorm guide.

Last edited by GenoMax; 01-27-2019 at 05:38 AM.
#51
Junior Member
Location: Israel
Join Date: Nov 2015
Posts: 2
Thanks @GenoMax
I did not see that "out=" there... I will try again, also with the suggested flags. I'm using the normalization just for assembly, not for quantification.
#52
Junior Member
Location: Perú
Join Date: Jul 2019
Posts: 6
Can BBNorm also be used to normalize the coverage of PacBio reads? Thank you in advance!

Edited: I just read that it can indeed be used on PacBio reads, but it doesn't perform error correction - that's fine for me.

Last edited by silverfox; 04-10-2020 at 04:00 PM.
#53
Junior Member
Location: Perú
Join Date: Jul 2019
Posts: 6
Hi Brian and everyone! I'm using BBNorm but I keep running into a problem.
I ran the command:

$ bbnorm.sh -Xmx64g t=18 in=pt.raw.fastq out=pt.raw.normalized.fq target=90 mindepth=2

Everything looked good, but when the process ended I realized the file pt.raw.normalized.fq was empty.

Edited: I just ran the following command:

$ bbnorm.sh in=pt.raw.fastq out=pt.raw.normalized3.fq target=90 min=2

But at the end, my pt.raw.normalized3.fq file was still empty, like before. T-T I think the problem could be here - in the second pass it says:

Code:
Made hash table:        hashes = 3      mem = 65.66 GB      cells = 35.25B      used = 0.000%
Estimated unique kmers:     0
Table creation time:        17.804 seconds.
Started output threads.
Table read time:        0.012 seconds.      0.00 kb/sec
Total reads in:         0       NaN% Kept
Total bases in:         0       NaN% Kept
Error reads in:         0       NaN%
Error type 1:           0       NaN%
Error type 2:           0       NaN%
Error type 3:           0       NaN%
Total kmers counted:    0

Thanks a lot in advance!

Editing: I just found one of your comments:

Quote:
Can I not reduce the kmer length? (default=31)

Last edited by silverfox; 04-12-2020 at 07:25 PM.
#54
Member
Location: Germany
Join Date: Mar 2013
Posts: 44
Hi all,
Does anyone have experience with processing ancient DNA using BBTools? My question is how short the kmer length for BBNorm should be, given the issues with highly fragmented DNA. Thanks in advance!