View Single Post
Old 04-15-2014, 10:46 AM   #21
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Originally Posted by HGV View Post
Hi Brian,
Very decent mapper that you wrote, and supergreat that it is finally

I was playing around with bbmap, sooo many cool features and an impressive
Thanks! Glad you like it.

But I could not figure out a couple of things:
1) How to point to directories with path=
I would like to be able to use a set of indices that I created previously
and that are stored in a specific 'databases' path mounted on all nodes of
our cluster. But I did not get this to work setting path= during the
mapping process because it changes where bbmap searches for the reads. My
second trial was to call bbmap in this folder of the references and set
path to the folder of the reads, but then I always got an error that the
read file was not found. The only thing that worked for me is calling bbmap
from a folder which also includes the /ref folder, but this means copying
both reads and refs accross the filesystem wherever I need them. We were mostly using bowtie2 up to now and in bowtie2 I can point to absolute paths for references, reads and outputs. Would be cool to be able to handle files in
bbmap similarly
BBMap can use absolute paths for all of those, but the syntax is a bit different. Let's say you have 2 builds of the human genome, HG18 and HG19, at /fastas/hg18.fa and /fastas/hg19.fa, reads at /data/reads.fq, and you want to write output to /output/. There are 3 ways you could handle this.

Version 1: Don't write an index to disk. This usually makes the startup slower, but is more I/O efficient. It's the simplest and best if you are only mapping to a reference once (like when evaluating multiple assemblies). in=/data/reads.fq ref=/fastas/hg18.fa out=/output/mapped18.sam nodisk in=/data/reads.fq ref=/fastas/hg19.fa out=/output/mapped19.sam nodisk

Version 2: Write each index to its own directory. ref=/fastas/hg18.fa path=/human/hg18/ ref=/fastas/hg19.fa path=/human/hg19/ in=/data/reads.fq path=/human/hg18/ out=/output/mapped18.sam in=/data/reads.fq path=/human/hg19/ out=/output/mapped19.sam

Version 3: Write indices to the same directory, but specify a build number. ref=/fastas/hg18.fa path=/human/ build=18 ref=/fastas/hg19.fa path=/human/ build=19 in=/data/reads.fq path=/human/ build=18 out=/output/mapped18.sam in=/data/reads.fq path=/human/ build=19 out=/output/mapped19.sam

If you don't specify a build number, the default is always 1.

2) using outm
I am mapping shotgun metagenomic illumina paired end reads to references
that are gene databases. I was expecting to get different output for out=
and outm= but the files produced are identical. I would expect to see some
read pairs where only one of the reads maps to the gene database and the
other not, and as far as I understood out= gives me the mapping pairs and
outm= gives me the mapping pairs and the single mapping reads with their
For paired reads, outm captures all reads that are mapped OR have a mate that is mapped. You can change this behavior with the flag "po" which stands for "pairedonly". The default is "po=f". If you set "po=t" then reads will be unmapped unless they are paired. So, if only one read maps, or if they both map but are not in a valid pairing configuration, they will go to "outu" instead of "outm". "out" will get both of them no matter what.
3) how can I get sorted unmapped pairs written to a file?
While outm1=reads.f.fq and outm2=reads.r.fq gives me the mapped pairs, outu always writes everything to a single file (no outu2= possible)
Are you sure? It works for me... though it is not mentioned in the documentation. I'll clarify that. in=reads.fq outu1=r1.fq outu2=r2.fq
produces 2 files, r1.fq and r2.fq
You can only specify output streams 1 and 2 for fasta or fastq output, though, not sam. Also - by default, whenever you output paired reads to a single file, they will come out interleaved (read 1, read 2, read 1, read 2, etc). You can split them back into 2 files with reformat: in=reads.fq out1=read1.fq out2=read2.fq
4) I was trying to limit the insert sizes allowed in the paired end mapping with e.g. pairlen=1000, but the output still reported exactly the same mappings with insert sizes way higher, often in the multi kb range for PE-reads. This also greatly affects the average insert size reported... What am I doing wrong? It would also be very cool to get the standard deviation put out, as well as the median. One can calculate these things from the very useful histogram files that inserthistogram=file outputs, but that is not as convenient.
Actually, I only track the average insert size, not the median, because that's the easiest, though I admit the median and SD would be more useful. I'll add that, since I'm tracking the information anyway (if you specify inserthistogram).

As for what you noticed with the pairlen flag, I had not noticed that - sounds like a possible bug. I will look into it.

Keep up the good work
I will. Thanks for the feedback!
Brian Bushnell is offline   Reply With Quote