![]() |
BBMap (aligner for DNA/RNAseq) is now open-source and available for download.
BBMap is now available here:
https://sourceforge.net/projects/bbmap/ Links to other BBTools forum threads: BBSplit (Binning reads based on mapping to multiple references at once) BBDuk and Seal (Decontamination, filtering, adapter-trimming, kmer-masking, and alignment-free expression quantification) BBNorm (Normalization, error-correction, kmer frequency histograms, and genome size estimation) BBMerge (Paired read overlap merging and insert size calculation) Reformat (Read reformatting, deinterleaving, subsampling, etc) CalcUniqueness (Library uniqueness/saturation plots) RemoveHuman (Filtering out reads from human, or some other specific organism, with zero false positives removals) Repair (Re-pairing reads that got out of order, based on their names) CalcTrueQuality (Recalibrating quality scores of reads) Tadpole (Assembler, error-correction, read extension) KmerCompressor (Kmer set generation and set operations) Clumpify (Increases compression of gzipped/bzipped fastq files) A thread with answers to frequently asked questions about BBTools, collated by GenoMax, is here. BBMap/BBTools are now open source. Please try it out - it's a 3MB download, and written in pure Java, so installation is trivial - just unzip and run. Handles all sequencing platforms (Illumina, PacBio, 454, Sanger, Nanopore, etc) except Solid colorspace, which I removed to simplify the code. A Powerpoint comparison of performance (speed, memory, sensitivity, specificity) on various genomes, compared to bwa, bowtie2, gsnap, smalt: https://drive.google.com/file/d/0B3l...it?usp=sharing ...but in summary, BBMap is similar in speed to bwa, with much better sensitivity and specificity than any other aligner I've compared it to. It uses more memory than Burrows-Wheeler-based aligners, but in exchange, the indexing speed is many times faster. How to use There is documentation in the docs folder and displayed by shellscripts when run with no arguments. But for example: bbmap.sh ref=ecoli.fa ...will build an index and write it to the present directory bbmap.sh in=reads.fq out=mapped.sam ...will map to the indexed reference bbmap.sh in1=reads1.fq in2=reads2.fq out=mapped.sam ref=ecoli.fa nodisk ...will build an index in memory and map paired reads to it in a single command If your OS does not support shellscripts, replace 'bbmap.sh' like this: java -Xmx23g -cp /path/to/current align2.BBMap in=reads.fq out=mapped.sam ...where /path/to/current is the location of the 'current' directory, and -Xmx23g specifies the amount of memory to use. This should be set to about 85% of physical memory (the symbols 'm' or 'g' specify megs or gigs), or more, depending on your virtual memory configuration. Human reference requires around 21 GB; generally, references need around 7 bytes per base pair, and a minimum of 500 MB at default settings. However, there is a reduced memory mode ('usemodulo') that only needs half as much memory. The shellscripts are just wrappers that display usage information and set the -Xmx parameter. Please ask if you encounter any problems or need help! And there are other neat tools too, for error correction, normalization, depth-binning, reference-based binning, contaminant filtering, adapter trimming, optimal quality trimming, reformatting files, paired-read merging, deduplication of assemblies, and histogram generation for things like kmer depth and insert size. NOTE: BBMap (and all related tools) shellscripts will try to autodetect memory, but may fail (resulting in the jvm failing to start or running out of memory), depending on the system configuration. This can be overridden by adding the -Xmx30g flag to the parameter list of the shellscript (adjusted for your computer's physical memory) and it will be passed to java. |
Dear Brian,
thanks a lot for posting your aligner package. Thanks to the PacBio option it is not strictly a short read aligner - but for which max read read length would you recommend it (e.g are merged 500 bp Miseq reads OK for the short read version; are PacBio reads longer than the 6 kb mentioned for version 25 allowed for the latest version)? Would you recommend any specific settings for the alignment of RNA seq reads? |
Quote:
Unfortunately, the limit is still 6kb. I am working on increasing that but it may take a while. For PacBio reads, you should use the script mapPacBio8k.sh (which despite the name only goes up to 6k), and the PacBio reads should be in fasta format. Reads in fasta format that are longer than 6000bp will be automatically broken into 6000bp pieces and their names appended with "_1", "_2", and so forth. But the vast majority of the PacBio reads will be under 6kbp and thus won't be fragmented. For any Illumina reads, including merged Miseq, use bbmap.sh. It handles reads up to 500bp. The only exception is if you want to map Illumina reads to PacBio reads (for error correction), in which case you should use mapPacBio. bbmap.sh (which calls BBMap.class) has mutation penalties tuned for Illumina reads, while mapPacBio8k.sh (which calls BBMapPacBio.class) has mutation penalties tuned for PacBio reads and a longer maximum read length, and is slower. You can still use mapPacBio8k.sh for any really long reads, though, like when mapping one assembly to another. For RNA-seq (using Illumina reads), I would recommend a command line like this: bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam maxindel=100000 xstag=firststrand intronlen=10 ambig=random This will look for introns up to 100kbp long, and flag every gap longer than 10bp with cigar symbol 'N' (skipped) rather than 'D' (deletion). Ambiguously-mapped reads will go to a random top-scoring site, which is generally best for quantification. XS tags, needed by Cufflinks, will be produced according to the "first strand" library protocol (if you don't know what protocol is used, then set it to xstag=unstranded). The 'maxindel' flag can be increased or decreased as needed. Many plants and fungi tend to have very short introns mostly under 400bp, while human average is much higher and has a few that are even longer than 100kbp. It should be set to above the size of the largest introns you expect, but higher values decrease accuracy, so I don't recommend setting it above 500000 or so. The default is 16000 which is fine for the plants and fungi I've worked with but if you are working with vertebrates then 200000 would probably be appropriate. |
Thanks a lot for the advice!
I will give it a try. |
Quote:
|
Brian: Trying out BBMap and I am having problems indexing the reference. My species is Pig which has the following Fasta entry lines in the genome.fa file:
Quote:
Quote:
|
Westerman,
Sorry for the confusion. The files are called "chrom" for legacy reasons, and they don't correspond to chromosomes anymore. I should rename them "chunks". Originally, BBMap wrote one chromosome per chrom file, which is fine for complete reference genomes. But de-novo assemblies can have thousands or millions of scaffolds, so now I bundle them into chunks. Everything is there, and when you produce a sam file, you'll see that reads are mapped to all of the different input chromosomes. |
Quote:
Quote:
Tophat accepted_hits.bam Quote:
|
By the way, BBMap's default mode is "ambig=best", meaning that any ambiguously-mapped reads will go to the first top-scoring location (ambiguous hits will be annotated "XT:A:R" rather than "XT:A:U", so you will know which ones are ambiguous). This is fine for variant-calling, but if you are doing RNA-seq expression quantification, it's usually better to set "ambig=random" or "ambig=all". Otherwise, if there are two almost-identical genes, all the reads will map to the first one.
Also, for vertebrate RNA-seq, I suggest these settings: xstag=us generate XS tags, needed by cufflinks, according to "unstranded" protocol, or if you don't know the protocol. Alternatives are "ss" (second strand) and "fs" (first strand). If you DO know the protocol, please set it as ss or fs. intronlen=10 All deletions up to this length will be annotated as 'D' in the cigar string; longer ones will be annotated as 'N' (meaning skipped). This may or may not matter to downstream tools. For DNA mapping you don't need this flag. maxindel=200000 The longest deletion (i.e., intron) that it will look for. May still find longer ones. Around 98% of human introns are shorter than 100kbp. If you do not set this, it will default to 16kbp, which is too short for vertebrate RNA-seq (though lots of plants and fungi have tiny introns of around 200bp). I'll add this recommendation to the readme for the next release. |
Thanks for the recommendations, especially the ambig= one. The help for bbmap.sh is ... confusing, at least at first glance.
One thing I think is working but giving me an improper warning message is the -Xmx parameter. From starting like so: Quote:
Quote:
|
Hmm, that's interesting, thanks for noting it. It's because I first try to detect memory, then parse the -Xmx override. I will reverse the order :)
I'll try to improve the help information, so any concrete suggestions/complaints would be very helpful. |
As for help, the major problem is the formatting on screens with 80 columns. Wrapping is hard to read plus there is too much information for a quick reference. I suggest trying to limit the help messages to 80 columns and then direct the user to a longer formatted README or extended help. I know that the current help says to look at the README.
Also parsing '-h' on the command line would be nice. While displaying help when no commands are given is good it is also useful to be able to use '-h'. As an example of the first paragraph, at the moment the help includes: Code:
maxsites=5 Maximum number of total alignments to print per read. Only relevant when secondary=t. Code:
maxsites=5 Max. number of total alignments to print |
Hi,
I do get the similar warnings and an exits on a system with more than 200 GB of RAM available (using the Java 6 version). "This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G. If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool." "Invalid maximum heap size: -Xmx30G The specified size exceeds the maximum representable size. Could not create the Java virtual machine." |
Quote:
That means that you have a 32-bit version of Java installed. You need a 64-bit version, which you can get here. Westerman, Thanks for your feedback; I'll simplify the help menus for the next version. -Brian |
Hi Brian,
thanks a lot for the advice. I did now change the java environments several times (via aliases to the the latest JREs) and tried to use both the your java6 and java7 versions and I always end up with the same messages as before. Our OS is old (CentOS 5.9) , as is our default java (it is 64-bit nevertheless (build 1.6.0_16-b01)). Thus, it is very possible that our configuration is outdated. I will try to somehow set java "ulimits". Quote:
|
luc,
That's unfortunate. If you download and unzip a 64-bit version of java 7, then path directly to the java executable rather than relying on the default java of your environment, it should work. Otherwise, the limit is something like 1600MB (specified by the flag -Xmx1600m) if I recall correctly, which is plenty for small references like microbes and fungi. -Brian |
Has anybody experience with running bbmap on clusters?
Hello everybody,
I would like to know, if anyone has (good) experience with running bbmap alignments on clusters, which require to submit jobs via a scheduler? Because I can successfully start/finish an alignment on the university's cluster when started directly from the terminal (which is prohibited, but who cares on Sundays at 3am?), but the exactly same command written to Portable Batch System (PBS) job script and submitted via qsub to Torque/Maui scheduler fails: Code:
Exception in thread "Thread-5" java.lang.NullPointerException Since I would like to extent my alignment times beyond weekend nights :rolleyes:, I would be happy for suggestions. |
Quote:
|
Yes, I set the threads=8 option and also a memory limit of 32g for bbmap.
However I realised that I initially wrote the job script in a way that the scheduler could assign random distributed cores to the job. When I limited that only to cores of one particular node, then everything runs smoothly. So no problem with bbmap, only with me using the batch system correctly :confused:. |
problems with bbmap switches
Hi Brian,
Very decent mapper that you wrote, and supergreat that it is finally available . I was playing around with bbmap, sooo many cool features and an impressive speed! But I could not figure out a couple of things: 1) How to point to directories with path= I would like to be able to use a set of indices that I created previously and that are stored in a specific 'databases' path mounted on all nodes of our cluster. But I did not get this to work setting path= during the mapping process because it changes where bbmap searches for the reads. My second trial was to call bbmap in this folder of the references and set path to the folder of the reads, but then I always got an error that the read file was not found. The only thing that worked for me is calling bbmap from a folder which also includes the /ref folder, but this means copying both reads and refs accross the filesystem wherever I need them. We were mostly using bowtie2 up to now and in bowtie2 I can point to absolute paths for references, reads and outputs. Would be cool to be able to handle files in bbmap similarly 2) using outm I am mapping shotgun metagenomic illumina paired end reads to references that are gene databases. I was expecting to get different output for out= and outm= but the files produced are identical. I would expect to see some read pairs where only one of the reads maps to the gene database and the other not, and as far as I understood out= gives me the mapping pairs and outm= gives me the mapping pairs and the single mapping reads with their pairs. 3) how can I get sorted unmapped pairs written to a file? While outm1=reads.f.fq and outm2=reads.r.fq gives me the mapped pairs, outu always writes everything to a single file (no outu2= possible) 4) I was trying to limit the insert sizes allowed in the paired end mapping with e.g. pairlen=1000, but the output still reported exactly the same mappings with insert sizes way higher, often in the multi kb range for PE-reads. This also greatly affects the average insert size reported... What am I doing wrong? It would also be very cool to get the standard deviation put out, as well as the median. One can calculate these things from the very useful histogram files that inserthistogram=file outputs, but that is not as convenient. Keep up the good work Harald |
All times are GMT -8. The time now is 07:57 PM. |
Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.