SEQanswers (
-   Bioinformatics (
-   -   Yes .. BBMap can do that! (

GenoMax 05-01-2015 08:57 AM

Yes .. BBMap can do that!
BBMap suite has become an essential part of the bioinformatics tool arsenal for many ...

Thank you Brian!

Over the past year Brian has doled out specific advice on how to use a component from BBMap suite in standard (and not-so-standard) ways in various threads on SeqAnswers. I thought it would be useful to try and collect some "special" use cases in one thread so they could be found in one location.

Feel free to add to this thread things I have missed. I have tried to categorize the list based on the BBMap programs. That classification is imperfect.

There are common options for most BBMap suite programs and depending on the file extension the input/output format is automatically chosen/set.

Note: For most programs in BBTools, you can add the parameter "config=foo.txt". Every line in "foo.txt" will be added as if it were a command-line parameter, regardless of whitespace, or the length of the file. This may be convenient if you use a certain set of parameters for different set of analyses.


  • Mapping Nanopore reads has a length cap of 6kbp. Reads longer than this will be broken into 6kbp pieces and mapped independently.


$ -Xmx20g k=7 in=reads.fastq ref=reference.fa maxlen=1000 minlen=200 idtag ow int=f qin=33 out=mapped1.sam minratio=0.15 ignorequality slow ordered maxindel1=40 maxindel2=400

The "maxlen" flag shreds them to a max length of 1000; you can set that up to 6000. But I found 1000 gave a higher mapping rate.

  • Using Paired-end and single-end reads at the same time

BBMap itself can only run single-ended or paired-ended in a single run, but it has a wrapper that can accomplish it, like this:


$ in1=read1.fq,singletons.fq in2=read2.fq,null out=mapped.sam append
This will write all the reads to the same output file but only print the headers once. I have not tried that for bam output, only sam output

Note about alignment stats: For paired reads, you can find the total percent mapped by adding the read 1 percent (where it says "mapped: N%") and read 2 percent, then dividing by 2. The different columns tell you the count/percent of each event. Considering the cigar strings from alignment, "Match Rate" is the number of symbols indicating a reference match (=) and error rate is the number indicating substitution, insertion, or deletion (X, I, D).

  • Exact matches when mapping small reads (e.g. miRNA)

When mapping small RNA's with BBMap use the following flags to report only perfect matches.


ambig=all vslow perfectmode maxsites=1000
It should be very fast in that mode (despite the vslow flag). Vslow mainly removes masking of low-complexity repetitive kmers, which is not usually a problem but can be with extremely short sequences like microRNAs.
  • Important note about BBMap alignments

BBMap is always nondeterministic when run in paired-end mode with multiple threads, because the insert-size average is calculated on a per-thread basis, which affects mapping; and which reads are assigned to which thread is nondeterministic. The only way to avoid that would be to restrict it to a single thread (threads=1), or map the reads as single-ended and then fix pairing afterward:

Code: in=reads.fq outu=unmapped.fq int=f in=unmapped.fq out=paired.fq fint outs=singletons.fq

In this case you'd want to only keep the paired output.

BBSplit is based on BBMap, so it is also nondeterministic in paired mode with multiple threads. BBDuk and Seal (which can be used similarly to BBSplit) are always deterministic.

  • Count k-mers/find unknown primers


$ in=reads.fq out=trimmed.fq ftr=19
This will trim all but the first 20 bases (all bases after position 19, zero-based).


$ in=trimmed.fq out=counts.txt fastadump=f mincount=10 k=20 rcomp=f
This will generate a file containing the counts of all 20-mers that occurred at least 10 times, in a 2-column format that is easy to sort in Excel.



...etc. If the primers are 20bp long, they should be pretty obvious.
  • Convert SAM format from 1.4 to 1.3 (required for many programs)


$ in=reads.sam out=out.sam sam=1.3
  • Removing N basecalls

You can use BBDuk or Reformat with "qtrim=rl trimq=1". That will only trim trailing and leading bases with Q-score below 1, which means Q0, which means N (in either fasta or fastq format). The BBMap package automatically changes q-scores of Ns that are above 0 to 0 and called bases with q-scores below 2 to 2, since occasionally some Illumina software versions produces odd things like a handful of Q0 called bases or Ns with Q>0, neither of which make any sense in the Phred scale.
  • Sampling reads


$ in=reads.fq out=sampled.fq sample=3000

To sample 10% of the reads: in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1

or more concisely: in=reads#.fq out=sampled#.fq samplerate=0.1

and for exact sampling: in=reads#.fq out=sampled#.fq samplereadstarget=100k

  • Changing fasta headers

Remove anything after the first space in fasta header.

Code: in=sequences.fasta out=renamed.fasta trd
"trd" stands for "trim read description" and will truncate everything after the first whitespace.
  • Extract reads from a sam file


$ in=reads.sam out=reads.fastq
  • Verify pairing and optionally de-interleave the reads


$ in=reads.fastq verifypairing
  • Verify pairing if the reads are in separate files


$ in1=r1.fq in2=r2.fq vpair
If that completes successfully and says the reads were correctly paired, then you can simply de-interleave reads into two files like this:


$ in=reads.fastq out1=r1.fastq out2=r2.fastq
  • Base quality histograms


$ in=reads.fq qchist=qchist.txt
That stands for "quality count histogram".
  • Filter SAM/BAM file by read length


$ in=x.sam out=y.sam minlength=50 maxlength=200
  • Filter SAM/BAM file to detect/filter spliced reads


$ in=mapped.bam out=filtered.bam maxdellen=50
You can set "maxdellen" to whatever length deletion event you consider the minimum to signify splicing, which depends on the organism.
  • "Re-pair" out-of-order reads from paired-end data files


$ in1=r1.fq.gz in2=r2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outsingle=singletons.fq.gz

BBMerge now has a new flag - "outa" or "outadapter". This allows you to automatically detect the adapter sequence of reads with short insert sizes, in case you don't know what adapters were used. It works like this:


$ in=reads.fq outa=adapters.fa reads=1m
Of course, it will only work for paired reads! The output fasta file will look like this:



If you have multiplexed things with different barcodes in the adapters, the part with the barcode will show up as Ns, like this:


Note: For BBMerge with micro-RNA, you need to add the flag mininsert=17. The default is 35, which is too long for micro-RNA libraries.
  • Identifying adapters
If you have paired reads, and enough of the reads have inserts shorter than read length, you can identify adapter sequences with BBMerge, like this (they will be printed to adapters.fa):


$ in1=r1.fq in2=r2.fq outa=adapters.fa


Note: BBDuk is strictly deterministic on a per-read basis, however it does by default reorder the reads when run multithreaded. You can add the flag "ordered" to keep output reads in the same order as input reads
  • Finding reads with a specific sequence at the beginning of read


$ -Xmx1g in=reads.fq outm=matched.fq outu=unmatched.fq restrictleft=25 k=25 literal=AAAAACCCCCTTTTTGGGGGAAAAA
In this case, all reads starting with "AAAAACCCCCTTTTTGGGGGAAAAA" will end up in "matched.fq" and all other reads will end up in "unmatched.fq". Specifically, the command means "look for 25-mers in the leftmost 25 bp of the read", which will require an exact prefix match, though you can relax that if you want.

So you could bin all the reads with your known sequence, then look at the remaining reads to see what they have in common. You can do the same thing with the tail of the read using "restrictright" instead, though you can't use both restrictions at the same time.


$ in=reads.fq outm=matched.fq literal=NNNNNNCCCCGGGGGTTTTTAAAAA k=25 copyundefined
With the "copyundefined" flag, a copy of each reference sequence will be made representing every valid combination of defined letter. So instead of increasing memory or time use by 6^75, it only increases them by 4^6 or 4096 which is completely reasonable, but it only allows substitutions at predefined locations. You can use the "copyundefined", "hdist", and "qhdist" flags together for a lot of flexibility - for example, hdist=2 qhdist=1 and 3 Ns in the reference would allow a hamming distance of 6 with much lower resource requirements than hdist=6. Just be sure to give BBDuk as much memory as possible.
  • Removing illumina adapters (if exact adapters not known)

If you're not sure which adapters are used, you can add "ref=truseq.fa.gz,truseq_rna.fa.gz,nextera.fa.gz" and get them all (this will increase the amount of overtrimming, though it should still be negligible).
  • Removing illumina control sequences/phiX reads

Code: in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality
  • Identify certain reads that contain a specific sequence

$ in=reads.fq out=unmatched.fq outm=matched.fq literal=ACGTACGTACGTACGTAC k=18 mm=f hdist=2
Make sure "k" is set to the exact length of the sequence. "hdist" controls the number of substitutions allowed. "outm" gets the reads that match. By default this also looks for the reverse-complement; you can disable that with "rcomp=f".
  • Extract sequences that share kmers with your sequences with BBDuk

$ in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

  • Extract sequences that contain N's with BBDuk
Code: in=reads.fq out=readsWithoutNs.fq outm=readsWithNs.fq maxns=0
If you have, say, 100bp reads and only want to separate reads containing all 100 Ns, change that to "maxns=99".

General notes for

BBDuk can operate in one of 4 kmer-matching modes:
Right-trimming (ktrim=r), left-trimming (ktrim=l), masking (ktrim=n), and filtering (default). But it can only do one at a time because all kmers are stored in a single table. It can still do non-kmer-based operations such as quality trimming at the same time.

BBDuk2 can do all 4 kmer operations at once and is designed for integration into automated pipelines where you do contaminant removal and adapter-trimming in a single pass to minimize filesystem I/O. Personally, I never use BBDuk2 from the command line. Both have identical capabilities and functionality otherwise, but the syntax is different.

  • Generate random reads in various formats


$ ref=genome.fasta out=reads.fq len=100 reads=10000
You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of snps, insertions, deletions, and Ns per read, either exactly or probabilistically; the lengths of these events is individually customizable, the quality values can alternately be set to allow errors to be generated on the basis of quality; there's a PacBio error model; and all of the reads are annotated with their genomic origin, so you will know the correct answer when mapping.

Bear in mind that 50% of the reads are going to be generated from the plus strand and 50% from the minus strand. So, either a read will match the reference perfectly, OR its reverse-complement will match perfectly.

You can generate the same set of reads with and without SNPs by fixing the seed to a positive number, like this:


$ maxsnps=0 adderrors=false out=perfect.fastq reads=1000 minlength=18 maxlength=55 seed=5

$ maxsnps=2 snprate=1 adderrors=false out=2snps.fastq reads=1000 minlength=18 maxlength=55 seed=5

[As of BBmap v. 36.59] gains the ability to simulate metagenomes.

coverage=X will automatically set "reads" to a level that will give X average coverage (decimal point is allowed).

metagenome will assign each scaffold a random exponential variable, which decides the probability that a read be generated from that scaffold. So, if you concatenate together 20 bacterial genomes, you can run randomreads and get a metagenomic-like distribution. It could also be used for RNA-seq when using a transcriptome reference.

The coverage is decided on a per-reference-sequence level, so if a bacterial assembly has more than one contig, you may want to glue them together first with before concatenating them with the other references.

  • Simulate a jump library

You can simulate a 4000bp jump library from your existing data like this.


$ cat assembly1.fa assembly2.fa > combined.fa
$ ref=combined.fa
$ reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz



$ in=ref.fasta out=reads.fastq length=200
The difference is that RandomReads will make reads in a random order from random locations, ensuring flat coverage on average, but it won't ensure 100% coverage unless you generate many fold depth. Shred, on the other hand, gives you exactly 1x depth and exactly 100% coverage (and is not capable of modelling errors). So, the use-cases are different.
  • Demultiplex fastq files when the tag is present in the fastq read header (illumina)


$ in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...

"Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.

  • Plotting the length distribution of reads

$ in=file out=histogram.txt bin=10 max=80000
That will plot the result in bins of size 10, with everything above 80k placed in the same bin. The defaults are set for relatively short sequences so if they are many megabases long you may need to add the flag "-Xmx8g" and increase "max=" to something much higher.

Alternatively, if these are assemblies and you're interested in continuity information (L50, N50, etc), you can run stats on each or statswrapper on all of them:

Code: in=file

Code: in=file,file,file,file…

By default, "filterbyname" discards reads with names in your name list, and keeps the rest. To include them and discard the others, do this:


$ in=003.fastq out=filter003.fq names=names003.txt include=t

If you only know the number(s) of the fasta/fastq record(s) in a file (records start at 0) then you can use the following command to extract those reads in a new file.


$ in=<file> id=<number,number,number...> out=<file>
The first read (or pair) has ID 0, the second read (or pair) has ID 1, etc.

in=<file> Specify the input file, or stdin.
out=<file> Specify the output file, or stdout.
id= Comma delimited list of numbers or ranges, in any order.
For example: id=5,93,17-31,8,0,12-13
  • Splits a sam file into forward and reverse reads

Code: mapped.sam plus.sam minus.sam unmapped.sam in=plus.sam out=plus.fq in=minus.sam out=minus.fq rcomp


BBSplit now has the ability to output paired reads in dual files using the # symbol. For example:


$ ref=x.fa,y.fa in1=read1.fq in2=read2.fq basename=o%_#.fq
will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq

You can use the # symbol for input also, like "in=read#.fq", and it will get expanded into 1 and 2.

Added feature: One can specify a directory for the "ref=" argument. If anything in the list is a directory, it will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. Reason this is useful is to use BBSplit is to have it split input into one output file per reference file.

NOTE: 1 By default BBSplit uses fairly strict mapping parameters; you can get the same sensitivity as BBMap by adding the flags "minid=0.76 maxindel=16k minhits=1". With those parameters it is extremely sensitive.

NOTE: 2 BBSplit has different ambiguity settings for dealing with reads that map to multiple genomes. In any case, if the alignment score is higher to one genome than another, it will be associated with that genome only (this considers the combined scores of read pairs - pairs are always kept together). But when a read or pair has two identically-scoring mapping locations, on different genomes, the behavior is controlled by the "ambig2" flag - "ambig2=toss" will discard the read, "all" will send it to all output files, and "split" will send it to a separate file for ambiguously-mapped reads (one per genome to which it maps).

NOTE: 3 Zero-count lines are suppressed by default, but they should be printed if you include the flag "nzo=f" (nonzeroonly=false).

NOTE: 4 BBSplit needs multiple reference files as input; one per organism, or one for target and another for everything else. It only outputs one file per reference file., on the other hand, which is similar, can use a single concatenated file, as it (by default) will output one file per reference sequence within a concatenated set of references.
  • To generate transcript coverage stats


$ in=mapped.sam normcov=normcoverage.txt normb=20 stats=stats.txt
That will generate coverage per transcript, with 20 lines per transcript, each line showing the coverage for that fraction of the transcript. "stats" will contain other information like the fraction of bases in each transcript that was covered.
  • To calculate physical coverage stats (region covered by paired-end reads)
BBMap has a "physcov" flag that allows it to report physical rather than sequenced coverage. It can be used directly in BBMap, or with pileup, if you already have a sam file. For example:


$ in=mapped.sam covstats=coverage.txt
  • Calculating coverage of the genome

Program will take sam or bam, sorted or unsorted.


$ in=mapped.sam out=stats.txt hist=histogram.txt
stats.txt will contain the average depth and percent covered of each reference sequence; the histogram will contain the exact number of bases with a each coverage level. You can also get per-base coverage or binned coverage if you want to plot the coverage. It also generates median and standard deviation, and so forth.

It's also possible to generate coverage directly from BBMap, without an intermediate sam file, like this:


$ in=reads.fq ref=reference.fasta nodisk covstats=stats.txt covhist=histogram.txt
We use this a lot in situations where all you care about is coverage distributions, which is somewhat common in metagenome assemblies. It also supports most of the flags that supports, though the syntax is slightly different to prevent collisions. In each case you can see all the possible flags by running the shellscript with no arguments.
  • To bin aligned reads


$ in=mapped.sam out=stats.txt bincov=coverage.txt binsize=1000
That will give coverage within each bin. For read density regardless of read length, add the "startcov=t" flag.


Dedupe ensures that there is at most one copy of any input sequence, optionally allowing contaminants (substrings) to be removed, and a variable hamming or edit distance to be specified. Usage:


$ in=assembly1.fa,assembly2.fa out=merged.fa
That will absorb exact duplicates and containments. You can use "hdist" and "edist" flags to allow mismatches, or get a complete list of flags by running the shellscript with no arguments.

Dedupe will merge assemblies, but it will not produce consensus sequences or join overlapping reads; it only removes sequences that are fully contained within other sequences (allowing the specified number of mismatches or edits).

Dedupe can remove duplicate reads from multiple files simultaneously, if they are comma-delimited (e.g. in=file1.fastq,file2.fastq,file3.fastq). And if you set the flag "uniqueonly=t" then ALL copies of duplicate reads will be removed, as opposed to the default behavior of leaving one copy of duplicate reads.

However, it does not care which file a read came from; in other words, it can't remove only reads that are duplicates across multiple files but leave the ones that are duplicates within a file. That can still be accomplished, though, like this:

1) Run dedupe on each sample individually, so now there are at most 1 copy of a read per sample.
2) Run dedupe again on all of the samples together, with "uniqueonly=t". The only remaining duplicate reads will be the ones duplicated between samples, so that's all that will be removed.

  • Generate ROC curves from any aligner

[*]index the reference


$ ref=reference.fasta

[*]Generate random reads


$ reads=100000 length=100 out=synth.fastq maxq=35 midq=25 minq=15
[*]Map to produce a sam file

...substitute this command with the appropriate one from your aligner of choice

$ in=synth.fq out=mapped.sam
[*]Generate ROC curve


$ in=mapped.sam reads=100000
  • Calculate heterozygous rate for sequence data


$ in=reads.fq khist=histogram.txt peaks=peaks.txt
You can examine the histogram manually, or use the "peaks" file which tells you the number of unique kmers in each peak on the histogram. For a diploid, the first peak will be the het peak, the second will be the homozygous peak, and the rest will be repeat peaks. The peak caller is not perfect, though, so particularly with noisy data I would only rely on it for the first two peaks, and try to quantify the higher-order peaks manually if you need to (which you generally don't).

  • Compare mapped reads between two files

To see how many mapped reads (can be mapped concordant or discordant, doesn't matter) are shared between the two alignment files and how many mapped reads are unique to one file or the other.


$ in=file1.sam out=mapped1.sam mappedonly
$ in=file2.sam out=mapped2.sam mappedonly

That gets you the mapped reads only. Then:


$ in=mapped1.sam names=mapped2.sam out=shared.sam include=t
...which gets you the set intersection;


$ in=mapped1.sam names=mapped2.sam out=only1.sam include=f
$ in=mapped2.sam names=mapped1.sam out=only2.sam include=f

...which get you the set subtractions.


$ in=old.fasta out=new.fasta
That will rename the reads as 1, 2, 3, 4, ... 222.

You can also give a custom prefix if you want. The input has to be text format, not .doc.

  • Generating “fake” paired end reads from a single end read file


$ in=reads.fastq out1=r1.fastq out2=r2.fastq length=100
That will generate fake pairs from the input file, with whatever length you want (maximum of input read length). We use it in some cases for generating a fake LMP library for scaffolding from a set of contigs. Read 1 will be from the left end, and read 2 will be reverse-complemented and from the right end; both will retain the correct original qualities. And " /1" " /2" will be suffixed after the read name.

  • Generate random reads


$ ref=genome.fasta out=reads.fq len=100 reads=10000
"seed=-1" will use a random seed; any other value will use that specific number as the seed

You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of snps, insertions, deletions, and Ns per read, either exactly or probabilistically; the lengths of these events is individually customizable, the quality values can alternately be set to allow errors to be generated on the basis of quality; there's a PacBio error model; and all of the reads are annotated with their genomic origin, so you will know the correct answer when mapping.

  • Generate saturation curves to assess sequencing depth


$ in=reads.fq out=histogram.txt
It works by pulling kmers from each input read, and testing whether it has been seen before, then storing it in a table.

The bottom line, "first", tracks whether the first kmer of the read has been seen before (independent of whether it is read 1 or read 2).

The top line, "pair", indicates whether a combined kmer from both read 1 and read 2 has been seen before. The other lines are generally safe to ignore but they track other things, like read1- or read2-specific data, and random kmers versus the first kmer.

It plots a point every X reads (configurable, default 25000).

In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful.

In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.


In light of the quality-score issues with the NextSeq platform, and the possibility of future Illumina platforms (HiSeq 3000 and 4000) also using quantized quality scores, I developed it for recalibrating the scores to ensure accuracy and restore the full range of values.


BBMap is designed to find the best mapping, and heuristics will cause it to ignore mappings that are valid but substantially worse. Therefore, I made a different version of it, BBMapSkimmer, which is designed to find all of the mappings above a certain threshold. The shellscript is and the usage is similar to or For primers, which I assume will be short, you may wish to use a lower than default K of, say, 10 or 11, and add the "slow" flag.

-------------------------------------------------------------- and

Quoted from Brian's response directly.

I also wrote another pair of programs specifically for working with primer pairs, and will forcibly align a primer sequence (or a set of primer sequences) against a set of reference sequences to find the single best matching location per reference sequence - in other words, if you have 3 primers and 100 ref sequences, it will output a sam file with exactly 100 alignments - one per ref sequence, using the primer sequence that matched best. Of course you can also just run it with 1 primer sequence.

So you run msa twice - once for the left primer, and once for the right primer - and generate 2 sam files. Then you feed those into, which will create a new fasta file containing the sequence between the primers, for each reference sequence. We used these programs to synthetically cut V4 out of full-length 16S sequences.

I should say, though, that the primer sites identified are based on the normal BBMap scoring, which is not necessarily the same as where the primers would bind naturally, though with highly conserved regions there should be no difference.


Identify type of Q-score encoding in sequence files


$ in=seq.fq.gz
sanger    fastq    gz    interleaved    150bp


Newest member of BBTools. Identify constituent k-mers.


Find all k-mers for a given sequence.

$ in=reads.fq out=kmers.txt k=4 count=t display=999
Will produce output that looks like


MISEQ05:239:000000000-A74HF:1:2110:14788:23085        ATGA=8        ATGC=6        GTCA=6        AAAT=5        AAGC=5        AATG=5        AGCA=5        ATAA=5        ATTA=5        CAAA=5        CATA=5        CATC=5        CTGC=5        AACC=4        AACG=4        AAGA=4        ACAT=4        ACCA=4        AGAA=4        ATCA=4        ATGG=4        CAAG=4        CCAA=4        CCTC=4        CTCA=4        CTGA=4        CTTC=4        GAGC=4        GGTA=4        GTAA=4        GTTA=4        AAAA=3        AAAC=3        AAGT=3        ACCG=3        ACGG=3        ACTG=3        AGAT=3        AGCT=3        AGGA=3        AGTA=3        AGTC=3        CAGC=3        CATG=3        CGAG=3        CGGA=3        CGTC=3        CTAA=3        CTCC=3        CTTA=3        GAAA=3        GACA=3        GACC=3        GAGA=3        GCAA=3        GGAC=3        TCAA=3        TGCA=3        AAAG=2        AACA=2        AATA=2        AATC=2        ACAA=2        ACCC=2        ACCT=2        ACGA=2        ACGC=2        AGAC=2        AGCG=2        AGGC=2        CAAC=2        CAGG=2        CCGC=2        GCCA=2        GCTA=2        GGAA=2        GGCA=2        TAAA=2        TAGA=2        TCCA=2        TGAA=2        AAGG=1        AATT=1        ACGT=1        AGAG=1        AGCC=1        AGGG=1        ATAC=1        ATAG=1        ATTG=1        CACA=1        CACG=1        CAGA=1        CCAC=1        CCCA=1        CCGA=1        CCTA=1        CGAC=1        CGCA=1        CGCC=1        CGCG=1        CGTA=1        CTAC=1        GAAC=1        GCGA=1        GCGC=1        GTAC=1        GTGA=1        TTAA=1

Simulate multiple mutants from a known reference (e.g. E. coli).


$ in=e_coli.fasta out=mutant.fasta id=99
$ ref=mutant.fasta out=reads.fq.gz reads=5m length=150 paired adderrors

That will create a mutant version of E.coli with 99% identity to the original, and then generate 5 million simulated read pairs from the new genome. You can repeat this multiple times; each mutant will be different.


One can partition a large dataset with into smaller subsets (example below splits data into 8 chunks).

Code: in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8

If you are concerned about file size and want the files to be as small as possible, give Clumpify a try. It can reduce filesize by around 30% losslessly by reordering the reads. I've found that this also typically accelerates subsequent analysis pipelines by a similar factor (up to 30%). Usage:

Code: in=reads.fastq.gz out=clumped.fastq.gz
Code: in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz out1=clumped_R1.fastq.gz out2=clumped_R2.fastq.gz
  • can now mark/remove sequence duplicates (optical/PCR/otherwise) from NGS data

This does NOT require alignments so it should prove more useful compared to Picard MarkDuplicates. Relevant options for command are listed below.


dedupe=f optical=f (default)
Nothing happens with regards to duplicates.

dedupe=t optical=f
All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.

dedupe=f optical=t
Nothing happens.

dedupe=t optical=t

Only optical duplicates (those with an X or Y coordinate within dist) are detected.  All copies except one are removed for each duplicate.
The allduplicates flag makes all copies of duplicates removed, rather than leaving a single copy.  But like optical, it has no effect unless dedupe=t.

Note: If you set "dupedist" to anything greater than 0, "optical" gets enabled automatically.


Fuse will automatically reverse-complement read 2. Pad (N) amount can be adjusted as necessary. This will for example create a full size amplicon that can be used for alignments.

Code: in1=r1.fq in2=r2.fq pad=130 out=fused.fq fusepairs

Following command will produce 10 output files with an equal number of sequences and no duplication.

Code: in=X.fa out=X%.fa ways=10

SNPsaurus 05-01-2015 09:10 AM

I feel that the best strategy to solve some very difficult problems out there in the world, like proving P = NP or creating a true artificial intelligence, would be to couch the problem in terms of read mapping and then ask Brian to add that feature to BBMap.

Brian Bushnell 05-01-2015 09:14 AM

Much thanks, GenoMax! I have added a link to this on the Sourceforge page.

@SNPsaurus - I have no immediate plans to add artificial intelligence to BBMap, but that's a good suggestion; I'll add it to my todo list.

westerman 05-01-2015 10:04 AM

Perhaps Brian is an 'artificial intelligence' or an 'alien intelligence' who takes pity on us mere mortals to provide excellent tools for us to work with?

Richard Finney 05-01-2015 01:06 PM

The BB map home page ( )says ...
Short read aligner for DNA and RNA-seq data.

Is there a special mode or feature that makes it good for RNA reads?
Does it handle long splices especially well ?

Brian Bushnell 05-01-2015 01:12 PM

Yes, that's correct. BBMap handles very long deletions or splices, making it useful for both RNA-seq or situations like fast-neutron bombardment to delete entire genes. It finds splices denovo (does not use a gene annotation file) and allows multiple introns/deletions per read. The defaults are set for plant and fungal introns, which are short; when mapping vertebrate RNA-seq data, I normally use the flag maxindel=200000 (default is 16000). It also supports the necessary flags for Cufflinks custom tags (xstag=firststrand, etc).

colindaven 05-06-2015 02:40 AM

Hi Brian,
I wouldn't say plant introns are as short as 16000. We have observed reliable introns as large as 50kbp in one species (confirmed by spliced alignments of multiple RNA-seq datasets) and 150kbp in another (maize).

Great to hear BBmap works well on RNA-seq reads and auto detects splice sites though!

Brian Bushnell 05-06-2015 08:26 AM

Oh, that's good to know. My exposure to plant RNA-seq is mainly Chlamy and Arabidopsis - and you know what they say about assumptions.

sdriscoll 05-13-2015 01:17 PM

Hey Brian,
A feature many of the bench scientists I work with would love is an automated grant proposal generator. Any chance of something like that in future updates? The input could be some images/figures and a basic subject...

Brian Bushnell 05-13-2015 01:20 PM

I was thinking about an automatic paper generator, which would be more useful to me. But I guess grant proposals are even more standardized. I'll see if I can roll that into Reformat.

sdriscoll 05-13-2015 01:25 PM

Excellent. My PI will be stoked.

Zapages 05-13-2015 01:57 PM

Hi Brian,

Thank you for creating these awesome applications. I was wondering if you have any plans to create an OLC aligner or something that improves Dedupe so that it can merge multiple assemblies together while taking care of any duplicate sequences based on a certain or user specified percentage. Then finally it outputs a consensus sequence based on a user specified percent similarity and some other variables.

I believe this will be very helpful with working with viral or bacterial de-novo genome (DNA-Based) Assemblies. Some ideas. :)

Brian Bushnell 05-13-2015 02:24 PM

I would also really like such a tool! Yes, I DO have plans to do something like that, but I am not sure about the timing, since I have had these plans for almost a year now. But hopefully I will make it this year, as Dedupe is really close to being an assembler, just not quite there.

fdts 05-19-2015 03:11 AM

quality filtering
Do you recommend quality trimming before mapping using BBMap? I am talking illumina 100bp PE.
I have been searching the forum for a while but could not find a hint.


GenoMax 05-19-2015 03:52 AM


Originally Posted by fdts (Post 172776)
Do you recommend quality trimming before mapping using BBMap? I am talking illumina 100bp PE.
I have been searching the forum for a while but could not find a hint.


Unless you have a lot of data that has Q scores of 10 or less you can get away without quality trimming.

Brian Bushnell 05-19-2015 09:29 AM

To expand upon that, BBMap has the options for internally quality-trimming (and untrimming) reads, but I only use them in special cases. For example, BBMap is used a lot for filtering out contamination. To ensure high specificity, I only want to remove reads that map to contaminant sequences with high identity; but I don't want the identity calculation to include low-quality bases. So, I map with "qtrim=rl trimq=15 untrim" or similar, which will trim before mapping but then output the untrimmed read.

For normal cases such as variant calling or coverage calculation, don't quality-trim unless you have really low quality data. I would not expect that to be the case for a 100bp library.

fdts 05-20-2015 12:27 AM

Aah, I see. Thanks to both of you. Would these flags work with BBDuk then?

Data quality is decent. I will map to a de novo metagenome assembly generated from the same data. I am kind of hopeful to be able to do a little variant calling after metagenome binning and mapping. Given this situation, would you still rather not quality trim?


Brian Bushnell 05-20-2015 09:27 AM

Those flags work in BBDuk as well, except for "untrim", which is unique to BBMap.

A good variant caller will take the quality into account when weighing possible variations. Quality-trimming before mapping gives the mapper less information, and thus a higher probability of ambiguously-mapped or incorrectly-mapped reads (though adapter-trimming before mapping is always good). Even if the trimmed bases are low quality - say, 50 bases at Q13 - that's still an expected 45 matches and 5 mismatches, which can greatly increase mapping confidence in a repetitive area, and thus improve the ability to call the variations from the high-quality part of the read.

On the other hand, quality-trimming before assembly is often a good idea, though the exact threshold depends on the assembler and dataset.

duane.hassane 06-11-2015 04:58 AM

Is there a way of efficiently using the BBMap tools to cut variable length and variable sequence primer pairs from amplicon sequencing data either pre-alignment (fastq) or post-alignment (BAM)? I am thinking in regards to RainDance and Fluidigm data where there are hundreds of amplicons, many of which overlap, each with different forward/reverse primers.

Brian Bushnell 06-11-2015 08:47 AM

BBTools has is a pair of programs for cutting (and outputting to a file) the sequence between primer pairs, but not for cutting out primers themselves. Is that what you are after?

The usage is like this:

Assume your set of primers for one end is AAAAAA and AAAAAT, and for the other end is GGGGGG and GGGCCC: in=amplicons.fq out=primer1.sam literal=AAAAAA,AAAAAT in=amplicons.fq out=primer2.sam literal=GGGGGG,GGGCCC

That will find the optimal alignment for the optimal primer, and output one line per amplicon (twice). Then run this: in=amplicons.fq out=middle.fq sam1=primer1.sam sam2=primer2.sam

"middle.fq" will contain the sequence between the primers, one per amplicon, using the best alignment. I designed this to cut V4 regions from full-length 16s.

Notes: stands for MultiStateAligner, not for Multiple Sequence Alignment. All sequences can be in any orientation (forward or reverse complement).

Please let me know if that doesn't address your question; I may have misunderstood it.

GenoMax 06-11-2015 09:24 AM

I think @duane.hassane is asking to remove primers from amplicons with a common core sequence but with variable (length and/or sequence?) primers on the ends of that core sequence for each group.

duane.hassane 06-11-2015 09:44 AM

Thanks. To be more clear, I am after removing the primers themselves (either from the fastq or BAM) and preserving the intervening sequence only. In this use case, there are hundreds of different primer pairs with known mapping locations (because they are used for targeted enrichment of different exons). Thus, for each primer pair, the intervening sequence (target being enriched) is different.

luc 06-11-2015 09:58 AM

Have you tried to remove these primer sequences with bbduk?

duane.hassane 06-11-2015 10:18 AM

Yes, bbduk with restrictright and/or restrictleft set to constrain the search to the ends where primers are located. This is actually extremely fast and efficient even with hundreds of primers. Unfortunately, we have encountered some specificity issues with this approach (over- and under-trimming) leading to small gaps in coverage ... with no one set of parameters being optimal for all amplicons.

Brian Bushnell 06-11-2015 10:41 AM

Hmm, I don't currently have a better solution than BBDuk or the msa/cutprimers pair. You might get a better result if you first bin the amplicons by which primers they use, though, so that then you can trim them only using the correct primers using aggressive settings, rather than with all possible primers. You can try using Seal for that. For example: in=amplicons.fq ref=primer1.fa pattern=out%.fq k=(whatever)

That will produce one file per sequence in the primer file, with all the reads that best match it (meaning, share the most kmers). For that purpose you should only use the left primers or right primers as reference, not both at the same time. Or, ideally, the reference would look like this:

Primer pairs: {(AAAT, GGCC), (TTTT, GGGG)}

Reference file:


...etc, where each primer pair is concatenated together with a single N between the sequences.

duane.hassane 06-11-2015 10:47 AM

Ok. Will take a look at this approach. Thank you!

fdts 07-01-2015 05:50 AM

Just a comment, not sure if this is the place:
It would be helpful for downstream analysis if the refstat files that can be output using bbsplit produced 0s (zeros) for references that have no reads mapped to them instead of omitting data for these references. I was working on generating a table for visualization when I noticed missing data.

Brian Bushnell 07-01-2015 09:38 AM

Zero-count lines are suppressed by default, but they should be printed if you include the flag "nzo=f" (nonzeroonly=false). That's not documented in BBSplit's shellscript, just BBMap's; I'll add it to the documentation.

fdts 07-01-2015 11:39 PM

Wonderful, thanks!

Thowell 07-11-2015 06:35 PM

I have been using bbmap (and other bbtools) for some of my analyses, and so far they are a great set of tools! Unfortunately I have come across a problem, and some preliminary searching hasn't turned up an answer.

I am trying to build bbmap into a galaxy workflow, using it to remove reads that map to a reference of a contaminant (E.coli in this case). The problem is that I just want the unmapped reads in fastq format, but because galaxy names everything with the .dat extension, I cannot tell bbmap to output in fastq format and it defaults to sam. Is there another way to specify the output format?

I'm sure I could convert the sam back to fastq, but the picard tools sam to fastq tool is throwing an error saying that there are unpaired reads in the sam. I know I can probably do this another way, but it seems silly to use another tool when bbmap can already output in the format I want!

I'm also confused as to why there would be unpaired reads, since the bbmap documentation for "outu" says:

Write only unmapped reads to this file. Does not include unmapped paired reads with a mapped mate.
Sounds to me like all the reads should be paired?

Thank you in advance, and thanks for the great tools!

Brian Bushnell 07-11-2015 07:26 PM

Thanks for noting this. Some tools (like reformat) support the "extin" and "extout" flags which let you override the default, so you could do this: in=file.dat out=file2.dat extin=.sam extout=.fq

But, BBMap doesn't support that right now. I'll add it. And I don't particularly recommend sam -> fastq conversion because the names change, since in sam format read 1 and read 2 must have identical names, whereas in fastq format they will typically have "/1" and "/2" or similar to differentiate them. Though you can do that conversion if you want.

I have not used Galaxy and don't know what's possible, but until I make this change, I would suggest one of these:

1) Use BBDuk for this filtering; its default output format is fastq and it's probably faster than BBMap anyway in this case. The syntax is very similar. On the command line, it would be something like " in=reads.fq outu=clean.fq ref=ecoli.fasta".

2) Tell BBMap "outu=stdout.fq" and pipe that to a file, if Galaxy supports pipes.

As for your question about pairing, the normal behavior in paired-mapping mode is:

"out=" will get everything.
"outm=" will get all pairs in which either of the reads mapped to the reference.
"outu=" will get all pairs in which neither read mapped to the reference.

For BBDuk, it's slightly different but essentially the same:
"out=" is the same as "outu=".
"outu", aka "out", will get all pairs in which neither had a kmer match to the reference.
"outm" will get all pairs in which either had a kmer match to the reference.
For BBDuk, this behavior can be changed with the "reib" (removeIfEitherBad) flag. The assumption of that flag's name is that the reference is contaminants being filtered against, so the default "reib=true" means any pair where either matches the contaminant is removed.

So, for both tools, if the input data is paired, the output data will also be paired - pairs are always kept together in all streams.

Thowell 07-11-2015 08:32 PM

Thank you for the quick reply Brian.

I was able to get things working with a pipe. I'm guessing the reads have to be interleaved with this method, but that will work fine for me until you can implement the alternate output flag.

Thanks again!

mendezg 09-16-2015 08:32 AM

How would you like us to cite bbduk in papers?

GenoMax 09-18-2015 04:23 PM


Originally Posted by mendezg (Post 180699)
How would you like us to cite bbduk in papers?

In the past Brian has asked people to link to the project site on source forge.

Brian Bushnell 09-20-2015 07:18 PM

Hi - sorry I somehow missed this question! Yes, as Genomax stated, please just cite it something like this (altered according the format of the journal):

"BBDuk - Bushnell B. -"

vmikk 09-24-2015 04:42 AM

Is it possible to use to cut the sequence AND to preserve the primer sites around?

Brian Bushnell 09-24-2015 08:57 AM

Not currently... but I'll plan to add a flag for that.

Brian Bushnell 09-28-2015 09:56 AM

I added the "include" flag to cutprimers. Default is "include=f". If you set "include=t" the primers will be retained for the output.

vmikk 09-28-2015 10:37 PM

Hello Brian! Thanks a lot for the implementation of this feature!
Meanwhile I thought to modify sam files from, but the out of the box functionality is much more convenient!
Thanks again!

dkainer 09-30-2015 06:36 PM


is there a way with the BB Suite to demultiplex paired-end reads based on inline barcodes, like Flexbar does?

I can see it can be done one barcode at a time by outputting matching reads based on the first 6 left bases. But can it be done in one command to demultiplex for multiple barcodes?


Brian Bushnell 09-30-2015 06:52 PM

It is almost possible to do this with Seal, which outputs reads into bins based on kmer matching. in=reads.fq pattern=%.fq k=6 restrictleft=6 mm=f ref=barcodes.fa rcomp=f

That would require a file "barcodes.fa" like this:

etc., with one fasta entry per barcode, so the output reads would be in file AACTGA.fq and so forth. This is sort of a common request, so maybe I will make it unnecessary to provide a fasta file of the barcodes. Does that matter to you either way?

However, BBDuk has the flags "skipr1" and "skipr2", which allow it to only do kmer operations on one read or the other. Seal currently lacks this, but it's essential for processing inline barcodes. I'll add it for the next release.

dkainer 09-30-2015 07:04 PM

i hadn't noticed the Seal command. Thanks for responding so fast!

So i assume that if I were to input paired-end reads to Seal with a barcodes.fa as the ref, it would try and match the barcodes in both the R1 and R2 reads? Hence the need for skipr1 and skipr2...?

Additionally, would seal let you left trim off the barcode bases from the R1 read?

Brian Bushnell 09-30-2015 07:21 PM


Originally Posted by dkainer (Post 181680)
i hadn't noticed the Seal command. Thanks for responding so fast!

So i assume that if I were to input paired-end reads to Seal with a barcodes.fa as the ref, it would try and match the barcodes in both the R1 and R2 reads? Hence the need for skipr1 and skipr2...?

That's correct.


Additionally, would seal let you left trim off the barcode bases from the R1 read?
Yes, it has a flag "ftl" (forcetrimleft) for doing that... "ftl=6" would remove the first 6 bases of all reads. Unfortunately it would do that for both read 1 and read 2. So... if you have reads in 2 files, that's fine; you just process the read1 file with "ftl=6". If they are interleaved it's more complicated - you'd have to split them first (for example, in=reads.fq out=read#.fq). I'll consider adding that the ability to only do all operations on left or right reads... it seems useful.

habib 01-24-2016 05:59 PM

Hi Brian,

I produced the following graph using for my 100bp PE Illumina reads. Could you help me interpret the graph please? What is the difference between raw_count and unique_kmers?

westerman 01-25-2016 07:58 AM

Try dividing the raw count by the depth and see that the result equals unique_kmers. That might give you a clue as to what everything means.

habib 01-25-2016 05:02 PM

Thank you westerman.

So, each unique kmers would be present in average, at 'coverage depth' many times.

How about several peaks that I observed after depth 1 (which is error kmers)? There are small peak at depth 23, bigger at 46 and highest at 83. Do they indicate that my genome is heterozygous, diploid?

Brian Bushnell 01-25-2016 07:33 PM


Originally Posted by habib (Post 188171)
Thank you westerman.

So, each unique kmers would be present in average, at 'coverage depth' many times.

How about several peaks that I observed after depth 1 (which is error kmers)? There are small peak at depth 23, bigger at 46 and highest at 83. Do they indicate that my genome is heterozygous, diploid?

It's always difficult to determine exactly what this means. With only a primary peak at 46 and smaller at 23, that indicates a heterozygous diploid. However, the peak at 83 could be a lot of things. Looking at the "unique_kmers column, the first two peaks are actually at 21 and 42, so a 3rd peak at around 84 probably indicates a tetraploid. It could also be a contaminant or an organelle such as mitochondria or chloroplast, but organelles would usually have higher coverage. It could also be 2-copy repeats in the genome. But I suspect it's tetraploid.

habib 01-25-2016 10:55 PM

Thank Brian,

actually there is also another peak at around 1000 coverage, which, as you suggest could be the organelle genomic sequences (I did not include all the data points in my previous post).

With a possibility of tetraploid, I think I am a bit in trouble in how to get a good assembly of this genome...

DNA Sorcerer 02-18-2016 11:10 AM

Hi Brian,

I tried to run bbmap but got an error, and before going into debugging wanted to see if you can tell me if I have a general configuration issue in the custer (e.g. java), so I know what to tell the sysadmin. Thanks a lot in advance.

My line is: CLARKSCV1.2.2-b/bbmap/ ref=contigs.fasta nodisk in1=scratch/s_3_1_sequence.fastq in2=scratch/s_3_2_sequence.fastq covstats=omgen.coverage

And the error message is:
Hello from inside a Grid Engine job running on cl339
Job beginning at Thu Feb 18 16:30:54 NST 2016
Job ending at Thu Feb 18 16:30:54 NST 2016
java -Djava.library.path=/home/cslamovi/CLARKSCV1.2.2-b/bbmap/jni/ -ea -Xmx1310m -cp /home/cslamovi/CLARKSCV1.2.2-b/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=contigs.fasta nodisk in1=scratch/s_3_1_sequence.fastq in2=scratch/s_3_2_sequence.fastq covstats=omgen.coverage
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(
at Method)
at java.lang.ClassLoader.loadClass(
at sun.misc.Launcher$AppClassLoader.loadClass(
at java.lang.ClassLoader.loadClass(
at java.lang.ClassLoader.loadClassInternal(

GenoMax 02-18-2016 11:25 AM

@DNA_sorcerer: @Brian would be along later today with an official answer but first thing to check is what version of java is running on your node/cluster.

Post output of


$  java -version
If I remember this right, @Brian only validates BBMap suite against java v.1.7 and 1.8.

You also may be missing a leading "/" in you file paths (scratch/s_3_1_sequence.fastq) unless "scratch" directory is in the directory from where you are running your command.

DNA Sorcerer 02-18-2016 11:40 AM

Ops, I had forgotten to load java module before. I did now, bbmap run for a couple of minutes and stopped. This is the output:


Hello from inside a Grid Engine job running on cl157
Job beginning at Thu Feb 18 17:03:05 NST 2016
Job ending at Thu Feb 18 17:03:05 NST 2016
java -Djava.library.path=/home/cslamovi/CLARKSCV1.2.2-b/bbmap/jni/ -ea -Xmx1310m -cp /home/cslamovi/CLARKSCV1.2.2-b/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=contigs.fasta nodisk in1=scratch/s_3_1_sequence.fastq in2=scratch/s_3_2_sequence.fastq covstats=omgen.coverage
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=contigs.fasta, nodisk, in1=scratch/s_3_1_sequence.fastq, in2=scratch/s_3_2_sequence.fastq, covstats=omgen.coverage]

BBMap version 35.82
Retaining first best site only for ambiguous mappings.
No output file.
Executing dna.FastaToChromArrays2 [contigs.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]

Set genScaffoldInfo=true
Set genome to 1

Loaded Reference: 0.335 seconds.
Loading index for chunk 1-1, build 1
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index: 8.592 seconds.
Executing jgi.CoveragePileup [covhist=null, covstats=omgen.coverage, basecov=null, bincov=null, physcov=false, 32bit=false, nzo=false, twocolumn=false, secondary=false, covminscaf=0, ksb=true, binsize=1000, startcov=false, strandedcov=false, rpkm=null, normcov=null, normcovo=null, in1=scratch/s_3_1_sequence.fastq, in2=scratch/s_3_2_sequence.fastq]

Set NONZERO_ONLY to false
Set TWOCOLUMN to false
Set USE_BITSETS to true
Analyzed Index: 6.904 seconds.
Changed from ASCII-33 to ASCII-64 on input quality 97 while prescanning.
Cleared Memory: 1.798 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 1 mapping thread.

GenoMax 02-18-2016 11:46 AM

Has the grid engine job completed?

Most likely not. After printing the message above BBMap starts loading "genome" index into memory before it starts doing alignments, so be patient .. as long as the job is still "running".

If the job completed then look in the grid engine error file and let us know what is there.

Brian Bushnell 02-18-2016 06:52 PM

Also, note this line:


Started 1 mapping thread.
This means it is running using only 1 thread. Depending on your input size, it could take a while! If that's intentional, then it's fine. But BBMap is fully multithreaded so normally I request all available slots on a node when scheduling, and tell BBMap to use all of them with the flag "threads=16" (for example). It will normally autodetect the number of available processors and use all of them, but on SGE/UGE (which I think you are using), if the NSLOTS environment variable is set, it will cap the max threads at that value (so for example on a 16-core machine, if you request 1 slot, such that "echo $NSLOTS" returns 1, then unless you manually specify the number of threads it will only use 1, to ensure fairness).

I recommend you ssh into the node it's running on and run "top -c", then look for a java process and ensure it is using CPU resources (should be at 100%). You can also examine the output file; it should gradually be growing.

DNA Sorcerer 02-19-2016 06:42 AM

Thanks Brian,
I got it run on 16 threads, but after a little while I get these sort of errors:


Started 16 mapping threads.
Exception in thread "Thread-7" java.lang.AssertionError
at stream.Read.expectedErrors(
at stream.Read.avgQualityByProbability(
at stream.Read.avgQualityByProbability(
at stream.Read.avgQuality(
at align2.AbstractMapThread.quickMap(
at align2.BBMapThread.processReadPair(
Detecting finished threads: 0Exception in thread "Thread-13" java.lang.AssertionError
at stream.Read.expectedErrors(
at stream.Read.avgQualityByProbability(
at stream.Read.avgQualityByProbability(
at stream.Read.avgQuality(
at align2.AbstractMapThread.quickMap(
at align2.BBMapThread.processReadPair(
Exception in thread "Thread-12" java.lang.AssertionError
at stream.Read.expectedErrors(
at stream.Read.avgQualityByProbability(
at stream.Read.avgQualityByProbability(
at stream.Read.avgQuality(
at align2.AbstractMapThread.quickMap(
at align2.BBMapThread.processReadPair(
Before this run the program suggested using ignorebadquality, which I did for this run.

GenoMax 02-19-2016 07:02 AM

Do you know how many cores there are on the node that you are running this on (16 was an example, you may have more or less available on the hardware you have access to)? Are you providing the necessary request for using all available cores on a node in your SGE command line?

Having a multi-threaded job spread out across physical nodes would not be advisable and may lead to the kind of errors you are seeing above.

DNA Sorcerer 02-19-2016 07:06 AM


Originally Posted by GenoMax (Post 189566)
Having a multi-threaded job spread out across physical nodes would not be advisable and may lead to the kind of errors you are seeing above.

Thanks, I'll take a closer look at the cluster's manual and try again.

GenoMax 02-19-2016 07:12 AM

You don't have to use all of the cores/threads available on a physical server (some clusters may be setup in a way that require you to reserve exclusive access to a node to do that).

I generally find that threads=6 works plenty fast (BBMap is well written). You may run into some I/O limits unless you have a high performance file system that goes along with your HPC cluster.

Brian Bushnell 02-19-2016 06:04 PM

Your error message indicates a problem with the quality scores (admittedly, I should improve the error message from that assertion, but I've never seen it happen before!). Specifically, there are some N bases (no-calls) that have a quality of greater than 0. This could mean that you have an ASCII-64 file being processed as ASCII-33, or it could mean... well, something else, like bug in the base-calling software.

If the problem is just Ns with nonzero quality scores, you can fix the files by running them through Reformat (this command assumes the file is ASCII-33/Sanger): -da ibq qin=33 in=scratch/s_3_#_sequence.fastq out=fixed_#.fq

Do you happen to know what the quality encoding is of these files? For example, what platform are they from, and how old are they?

DNA Sorcerer 02-22-2016 08:49 AM

Thank Brian,

You are probably right in your suspicion. The read files are old, probably 2011 or so, and were originally .txt and I changed to .fastq. Would this have confused BBmap?

GenoMax 02-22-2016 09:09 AM


Originally Posted by DNA Sorcerer (Post 189719)
Thank Brian,

You are probably right in your suspicion. The read files are old, probably 2011 or so, and were originally .txt and I changed to .fastq. Would this have confused BBmap?

Historically there has been more than one format for Q-scores for fastq files so you need to let BBMap know which score scale to use (most new data is in Sanger fastq format). More on the encoding here:

@Brian also has a program to identify the correct encoding for your files. That example is in post # 1 of this thread.


$ in=seq.fq.gz

DNA Sorcerer 02-22-2016 12:22 PM

testformat says: illumina fastq raw single-ended 108bp

As far as I remember this was a HiSeq run.

I tried the reformat line suggested by Brian but the process stops after a while with errors. Apparently short of memory. Will try to improve that and try again.

Brian Bushnell 02-22-2016 08:16 PM

Hmm... can you post the errors? Reformat by default uses very little memory, which is all that should be needed for a correctly-formatted file containing reads. It will run out of memory if you use it without increasing the default memory allocation on extremely long sequences (over tens of megabases) such as the human genome. It will never run out of memory on a correctly-formatted Illumina fastq file.

So, it would also be helpful if you could post the results of "head" (the first 10 lines of the file).

DNA Sorcerer 02-23-2016 04:51 AM

See below. I run it for only one fo the files because doing both would go over my storage quota.


java -da -Xmx200m -cp /home/cslamovi/CLARKSCV1.2.2-b/bbmap/current/ jgi.ReformatReads -da ibq qin=33 in=scratch/s_3_1_sequence.fastq out=scratch/fixed_1.fq
Executing jgi.ReformatReads [-da, ibq, qin=33, in=scratch/s_3_1_sequence.fastq, out=scratch/fixed_1.fq]

Input is being processed as unpaired
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(
at fileIO.ByteFile1.fillBuffer(
at fileIO.ByteFile1.nextLine(
at stream.FASTQ.toReadList(
at stream.FastqReadInputStream.fillBuffer(
at stream.FastqReadInputStream.nextList(
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(
at stream.ConcurrentGenericReadInputStream$
Input: 32775200 reads 3539721600 bases
Output: 32775200 reads (100.00%) 3539721600 bases (100.00%)

Time: 807.191 seconds.
Reads Processed: 32775k 40.60k reads/sec
Bases Processed: 3539m 4.39m bases/sec
Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt.
at jgi.ReformatReads.process(
at jgi.ReformatReads.main(

GenoMax 02-23-2016 05:21 AM

Looks like you only ran this with 200MB of RAM. Can you try with -Xmx2g?

How old is this data BTW (in years)?

Brian Bushnell 02-23-2016 08:20 PM

Reformat should never run out of memory with the default settings and short (<200kbp) reads. I think the input file is corrupt, and should be re-downloaded. The corruption probably occurs somewhere around the 32.77 millionth read, but it's hard to be sure...

FridaJoh 03-09-2016 01:25 AM


I came across this when searching for a way to demultiplex non-overlapping paired end reads that were sequenced using combinatorial barcodes. I don't suppose there is a way of doing that somehow using seal (or other tools?).


Originally Posted by Brian Bushnell (Post 181679)
It is almost possible to do this with Seal, which outputs reads into bins based on kmer matching. in=reads.fq pattern=%.fq k=6 restrictleft=6 mm=f ref=barcodes.fa rcomp=f

That would require a file "barcodes.fa" like this:

etc., with one fasta entry per barcode, so the output reads would be in file AACTGA.fq and so forth. This is sort of a common request, so maybe I will make it unnecessary to provide a fasta file of the barcodes. Does that matter to you either way?

However, BBDuk has the flags "skipr1" and "skipr2", which allow it to only do kmer operations on one read or the other. Seal currently lacks this, but it's essential for processing inline barcodes. I'll add it for the next release.

mcauchy 03-09-2016 04:30 AM

Newbie here! I have unzipped and untared bbmap but it wont run any commands. I have a Linux virtual box in windows 10. Am I missing some software to use BBMap?

GenoMax 03-09-2016 04:39 AM


Originally Posted by mcauchy (Post 190505)
Newbie here! I have unzipped and untared bbmap but it wont run any commands. I have a Linux virtual box in windows 10. Am I missing some software to use BBMap?

What do you mean "it won't run any commands"? Can you see the shell scripts in the "bbmap" folder. Try the following command and see if it produces help output on screen after you change to bbmap directory.


$ ./

mcauchy 03-09-2016 04:57 AM

What I mean is I run:
$ ./ in1=/media/sf_D_DRIVE/champagnefastqs/srr1290816_1.fastq in2=/media/sf_D_DRIVE/champagnefastqs/srr1290816_2.fastq out1=fixed1.fq out2=fixed2.fq outsingle=single.fq

...and get:
java -ea -Xmx-211m -cp /media/sf_D_DRIVE/bbmap/current/ jgi.SplitPairsAndSingles rp in1=/media/sf_D_DRIVE/champagnefastqs/srr1290816_1.fastq in2=/media/sf_D_DRIVE/champagnefastqs/srr1290816_2.fastq out1=fixed1.fq out2=fixed2.fq outsingle=single.fq
Invalid maximum heap size: -Xmx-211m
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

GenoMax 03-09-2016 05:00 AM

How much memory have you allocated to the VM? You should at least have 2+ GB to have enough available for programs to run.

mcauchy 03-09-2016 05:51 AM

I have allocated 2.9Gb, which is all I have to give. It seems that is not enough. Thanks for your help.

GenoMax 03-09-2016 06:12 AM


Originally Posted by mcauchy (Post 190518)
I have allocated 2.9Gb, which is all I have to give. It seems that is not enough. Thanks for your help.

That may be true but in case BBMap was not able to allocate RAM correctly can you try running the command as follows:


$ ./ -Xmx2g in1=/media/sf_D_DRIVE/champagnefastqs/srr1290816_1.fastq in2=/media/sf_D_DRIVE/champagnefastqs/srr1290816_2.fastq out1=fixed1.fq out2=fixed2.fq outsingle=single.fq

westerman 03-09-2016 08:49 AM

Also it seems to me that '-Xmx-211m' is odd. Why the negative 211? I am not sure that makes a difference but it might.

mcauchy 03-09-2016 09:40 AM

Didn't run for very long....

$ ./ -Xmx2g in1=/media/sf_D_DRIVE/champagnefastqs/srr1290816_1.fastq in2=/media/sf_D_DRIVE/champagnefastqs/srr1290816_2.fastq out1=fixed1.fq out2=fixed2.fq outsingle=single.fq

java -ea -Xmx2g -cp /media/sf_D_DRIVE/bbmap/current/ jgi.SplitPairsAndSingles rp -Xmx2g in1=/media/sf_D_DRIVE/champagnefastqs/srr1290816_1.fastq in2=/media/sf_D_DRIVE/champagnefastqs/srr1290816_2.fastq out1=fixed1.fq out2=fixed2.fq outsingle=single.fq
Executing jgi.SplitPairsAndSingles [rp, -Xmx2g, in1=/media/sf_D_DRIVE/champagnefastqs/srr1290816_1.fastq, in2=/media/sf_D_DRIVE/champagnefastqs/srr1290816_2.fastq, out1=fixed1.fq, out2=fixed2.fq, outsingle=single.fq]

Set INTERLEAVED to false
Started output stream.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(
at java.util.HashMap.addEntry(
at java.util.LinkedHashMap.addEntry(
at java.util.HashMap.put(
at jgi.SplitPairsAndSingles.process3_repair(
at jgi.SplitPairsAndSingles.process2(
at jgi.SplitPairsAndSingles.process(
at jgi.SplitPairsAndSingles.main(

GenoMax 03-09-2016 09:43 AM

How about setting the VM aside and running BBMap directly on windows 10. How much RAM is there on the machine? BBMap is written in java and will run there but you would need to take into account windows versions of the command line usage for BBMap.

mcauchy 03-09-2016 02:34 PM

I'm not sure what the syntax would be in DOS. I've tried ./, / and

Could you tell me what the command would be?

GenoMax 03-09-2016 02:51 PM


Originally Posted by mcauchy (Post 190568)
I'm not sure what the syntax would be in DOS. I've tried ./, / and

Could you tell me what the command would be?

This information is in the BBMap thread. Here is how (note: you need to put the right path to the "current" directory on your machine and the space between that and align2.BBMap):


c:\> java -Xmx3g -cp c:\path_to\current align2.BBMap in=reads.fq out=mapped.sam

Brian Bushnell 03-11-2016 06:25 PM

Hi mcauchy,

Were you able to resolve this? The shellscripts ( do not work in Windows, so the full syntax is needed. If you are still having trouble, please tell me the location of (which is probably something like C:\something\bbmap\current\


GenoMax 03-12-2016 03:16 AM

@Brian: How much memory does need?

Brian Bushnell 03-12-2016 07:22 AM


Originally Posted by GenoMax (Post 190737)
@Brian: How much memory does need?

That depends. If you have an interleaved file and the interleaving was broken because some reads were discarded, you can run it with the flag "fixinterleaving" and it only needs a trivial amount of memory (at any given time at most 2 reads need to be remembered).

For an arbitrarily disordered file or pair of files, in the worst case, it would store all reads in memory, so the amount of memory needed would be somewhat greater than the size of the uncompressed files.

But in the common case of a pair of files that are ordered correctly but some reads were deleted in each file without removing their mate, the amount of memory needed is proportional to the number of singleton reads.

GenoMax 03-12-2016 01:59 PM

@mcauchy: Consider @Brian's reply above when choosing a memory setting.

If you know that your files fall in the second category then you may want to re-do the trimming with a paired-end aware trimmer (so the reads don't get out of sync in first place) e.g. use from BBMap. That option will not require a lot of RAM and would prove convenient.

jazz710 03-14-2016 06:49 AM

Kmer Counting
Hi Brian, tried to send you a private message, but I guess you're popular:

"Brian Bushnell has exceeded their stored private messages quota and cannot accept further messages until they clear some space."


New odd question for you Brian: I'm doing some kmer counting with I'm running a kmer series on 100bp DNA-Seq reads from k=21 to k=55. Two things I noticed:

1) After k=31, the program only calculates kmer counts for even numbers. The output reads (for the example of k=55): "K was changed from 55 to 54", and it did that for every odd value of k from 33 to 55

2) The unique kmer rises as the value of k rises, but only to a point. In my two samples, it seems like after k=31 that as k increases, the number of unique kmers decreases. How can this be? Am I missing something conceptual about how kmer counts should be expected?


EDIT: Added kmer graph

GenoMax 03-14-2016 07:01 AM

@jazz710: is only designed to accept a max of k=31. I am not sure what it is doing once you go past that point.

On a different note: @Brian now works at google and continues to support BBTools in spare time (don't worry, BBTools are going to remain supported/be developed, just at a slower pace than what we were used to last year). So look for an official answer from him late tonight.

Note: People can still contact Brian via his JGI/LBL email address. That email address remains valid and can be found in the help menu for any BBTools program. Same constraint about time applies to this method as well.

jazz710 03-14-2016 07:05 AM

Thanks GenoMax!

Brian Bushnell 03-16-2016 09:14 PM

Let me clarify this:

As GenoMax stated, KmerCountExact originally only supported K=1 to 31. It now supports unlimited kmer lengths, with the provision that K=32 to 62 must be multiples of 2, 63 to 93 must be multiples of 3, and generally (N-1)*31+1 to (N)*31 must be multiples of N, which makes reverse-complementing much faster.

It is expected that the kmer counts form an arc like that with a peak at some specific K. As K increases, the number of unique kmers increases in general as longer repeats are spanned. But as K approaches read length, the number of unique kmers will drop to zero because reads will contain fewer kmers. So there is some peak between the zeros on each end, and thus, the graph basically looks as expected.

What should never happen is the discontinuity between K=31 and K=32, which is clearly an artifact of the program. I will investigate that this weekend; in the meantime, could you tell me the exact command you used, and version of BBTools? It's possible that this is due to the way kmers are filtered to exclude error kmers, which can be set in the command line, but I'll find out.

Sorry about my inbox; I've cleared some space!

jazz710 03-18-2016 06:34 AM

They were just basic commands: -k <21-55> in=<File1> in2=<File2> mincount=2 prefilter=2

The graph included in my post above is the output from the Unique kmers for each value of K. I should note that while both of my species showed the same 'jump' it wasn't at k=31 each time. I trashed most of that output because I went another way with that part of the analysis, but if it would help, I could always re-run it quickly to get you the specific output.

jazz710 03-24-2016 07:38 AM

Cross Species Mapping
Hello Geno and Brian,

I have seen BBMap referenced on a handful of forums as a good option for mapping short DNA-seq reads onto a closely related species. Bowtie et al. can't seem to handle very many SNPs.

Looking at the BBMap documentation, I'm not exactly sure what parameters I should tweak to loosen the mapping conditions. I was hoping there would simply be an hdist/edist option that I could set to 4 or 5 or so but I don't see that directly stated.

Are there a set of parameters you would recommend tweaking to get optimal mapping across species (given the obvious biological limitations)?


EDIT: Would it be editfilter=4 or 5? That would allow for 4 or 5 SNPs or indels, correct?

Brian Bushnell 03-25-2016 10:47 AM

Hi Bob,

BBMap does not have an hdist/edist option because it evaluates alignments using a scoring system based on evolutionary probability. For example, two adjacent deletions are much more likely than two non-adjacent deletions. So these are considered as "events", and BBMap makes alignments optimizing the most probable layout of events that would lead to the resulting read. For example, a 100bp read containing a 100bp deletion event could be aligned with 70 matches, and 30 bases in which 1/4 of them are matches and 3/4 of them are mismatches. Or, it could be aligned as 70 matches and 30 bases clipped (which is typical). Or, it could be aligned as 70 matches, a 100bp deletion, and 30 matches. Which is what BBMap does by default.

Every alignment in BBMap has a score, and scores below the minimum are discarded. You can specify that with, for example, "minid=70", which will discard alignments with identity below roughly 70% (it is "roughly" because BBMap discards things based on probability rather than percent identity, which is an inferior metric. It still offers filters like "idfilter" for situations in which percent identity is an important metric, but I don't recommend using them for real research; they are just for publications in which people need to state "I excluded any alignment with more than X of property Y". It also offers specific filters to discard reads with more than X SNPs, or Y insertions, or Z deletions, or whatever. But again, I do not think these are useful in your case because you want cross-species alignments, which are best modeled by evolutionary distance, which BBMap was designed for.

You can increase BBMap's sensitivity using various means.

For preprocessing:

1) Adapter-trim your reads prior to alignment, using BBDuk.
2) Quality-trim your reads to ~Q10, if they have low-quality tails, also with BBDuk.
3) Error-correct reads (if you have sufficient coverage, typically >10x) using Tadpole.

For mapping:

1) Set the "slow" flag, which greatly increases sensitivity: in=x.fq out=y.fq slow or the "vslow" flag, which increases it even more.
2) Set sensitivity very high: "vslow minid=0 k=11 maxindel=200"

Each of those flags is a bit different. Maxindel is default 16000 but for inter-species mapping reducing it increases sensitivity. You can further reduce k down to, say, 9, but it greatly affects speed and values below 10 start to cause issues in repetitive areas. The default is 13. k is basically the seed length - no alignments will be considered unless the read shares k consecutive exact matches with the reference. For cross-species mapping, using "slow" or "vslow", and varying the parameters maxindel, k, and minid are the best ways of adjusting sensitivity.

jazz710 03-30-2016 05:32 AM

Thank you, as always for your thorough response! New question time:

I'm running a bbduk contamination removal and with normal settings, and the run takes ~30m-45m (it's a big dataset). However, when I set hdist=1 (rather than 0) now the job has been running for 14h and is still only to this point:

BBDuk version 35.82
Set threads to 40
Memory: max=823202m, free=788842m, used=34360m

I know hdist affects the time of the run, but is that amount what you would expect?

Best and many thanks,


GenoMax 03-30-2016 05:47 AM

That sounds unusually long (especially if the original job took only 45 min). Are the trimmed output files still increasing in size? Are you using more than one thread for this job?

jazz710 03-30-2016 05:51 AM

There is no output yet...

40 threads, 800G memory.

GenoMax 03-30-2016 05:59 AM

I have a feeling the job is hung. Is this under a job scheduler? In my experience, bbduk starts writing output even when running under a scheduler right away.

jazz710 03-30-2016 06:11 AM

It was submitted to a server with a scheduler, yes.

It's just odd that it's the exact same code and dataset, but the only difference is hdist=1...same set up with hdist=0 runs fine.

GenoMax 03-30-2016 06:22 AM

Are you able to ssh to the node this is job is on and see if the bbduk is running (something like top -H)?

jazz710 03-30-2016 06:25 AM

GenoMax 03-30-2016 06:34 AM

The java process (should correspond to bbduk) appears to be sleeping (at least in this screenshot). Does it change back to R periodically or is it staying in S?

@Brian had talked about memory requirements with hdist=1 before. It may not be applicable in your case.

jazz710 03-30-2016 06:37 AM

Based on a minute of observation, it appears to be sleeping, however the %CPU and %MEM change as I re-issue 'top'.

jazz710 03-30-2016 06:43 AM

Based on Brian's comments, I bet I have run out of memory. My reference is very large, and a 3*K increase in memory would put me over my server capacity. I will try to re-run with a smaller reference.

GenoMax 03-30-2016 06:45 AM

Unless your scheduler is buffering the entire output in some temp location you should have seen trimmed files right away (have you ever looked before if a job does that)? Did you try top -H to see all ~40 java threads?

Based on your last comment that must be the case.

susanklein 04-11-2016 07:40 PM

reads mapped per gene

can you suggest how to get raw reads mapped per gene from bbmap?



All times are GMT -8. The time now is 12:32 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.