![]() |
BBMap (aligner for DNA/RNAseq) is now open-source and available for download.
BBMap is now available here:
https://sourceforge.net/projects/bbmap/ Links to other BBTools forum threads: BBSplit (Binning reads based on mapping to multiple references at once) BBDuk and Seal (Decontamination, filtering, adapter-trimming, kmer-masking, and alignment-free expression quantification) BBNorm (Normalization, error-correction, kmer frequency histograms, and genome size estimation) BBMerge (Paired read overlap merging and insert size calculation) Reformat (Read reformatting, deinterleaving, subsampling, etc) CalcUniqueness (Library uniqueness/saturation plots) RemoveHuman (Filtering out reads from human, or some other specific organism, with zero false positives removals) Repair (Re-pairing reads that got out of order, based on their names) CalcTrueQuality (Recalibrating quality scores of reads) Tadpole (Assembler, error-correction, read extension) KmerCompressor (Kmer set generation and set operations) Clumpify (Increases compression of gzipped/bzipped fastq files) A thread with answers to frequently asked questions about BBTools, collated by GenoMax, is here. BBMap/BBTools are now open source. Please try it out - it's a 3MB download, and written in pure Java, so installation is trivial - just unzip and run. Handles all sequencing platforms (Illumina, PacBio, 454, Sanger, Nanopore, etc) except Solid colorspace, which I removed to simplify the code. A Powerpoint comparison of performance (speed, memory, sensitivity, specificity) on various genomes, compared to bwa, bowtie2, gsnap, smalt: https://drive.google.com/file/d/0B3l...it?usp=sharing ...but in summary, BBMap is similar in speed to bwa, with much better sensitivity and specificity than any other aligner I've compared it to. It uses more memory than Burrows-Wheeler-based aligners, but in exchange, the indexing speed is many times faster. How to use There is documentation in the docs folder and displayed by shellscripts when run with no arguments. But for example: bbmap.sh ref=ecoli.fa ...will build an index and write it to the present directory bbmap.sh in=reads.fq out=mapped.sam ...will map to the indexed reference bbmap.sh in1=reads1.fq in2=reads2.fq out=mapped.sam ref=ecoli.fa nodisk ...will build an index in memory and map paired reads to it in a single command If your OS does not support shellscripts, replace 'bbmap.sh' like this: java -Xmx23g -cp /path/to/current align2.BBMap in=reads.fq out=mapped.sam ...where /path/to/current is the location of the 'current' directory, and -Xmx23g specifies the amount of memory to use. This should be set to about 85% of physical memory (the symbols 'm' or 'g' specify megs or gigs), or more, depending on your virtual memory configuration. Human reference requires around 21 GB; generally, references need around 7 bytes per base pair, and a minimum of 500 MB at default settings. However, there is a reduced memory mode ('usemodulo') that only needs half as much memory. The shellscripts are just wrappers that display usage information and set the -Xmx parameter. Please ask if you encounter any problems or need help! And there are other neat tools too, for error correction, normalization, depth-binning, reference-based binning, contaminant filtering, adapter trimming, optimal quality trimming, reformatting files, paired-read merging, deduplication of assemblies, and histogram generation for things like kmer depth and insert size. NOTE: BBMap (and all related tools) shellscripts will try to autodetect memory, but may fail (resulting in the jvm failing to start or running out of memory), depending on the system configuration. This can be overridden by adding the -Xmx30g flag to the parameter list of the shellscript (adjusted for your computer's physical memory) and it will be passed to java. |
Dear Brian,
thanks a lot for posting your aligner package. Thanks to the PacBio option it is not strictly a short read aligner - but for which max read read length would you recommend it (e.g are merged 500 bp Miseq reads OK for the short read version; are PacBio reads longer than the 6 kb mentioned for version 25 allowed for the latest version)? Would you recommend any specific settings for the alignment of RNA seq reads? |
Quote:
Unfortunately, the limit is still 6kb. I am working on increasing that but it may take a while. For PacBio reads, you should use the script mapPacBio8k.sh (which despite the name only goes up to 6k), and the PacBio reads should be in fasta format. Reads in fasta format that are longer than 6000bp will be automatically broken into 6000bp pieces and their names appended with "_1", "_2", and so forth. But the vast majority of the PacBio reads will be under 6kbp and thus won't be fragmented. For any Illumina reads, including merged Miseq, use bbmap.sh. It handles reads up to 500bp. The only exception is if you want to map Illumina reads to PacBio reads (for error correction), in which case you should use mapPacBio. bbmap.sh (which calls BBMap.class) has mutation penalties tuned for Illumina reads, while mapPacBio8k.sh (which calls BBMapPacBio.class) has mutation penalties tuned for PacBio reads and a longer maximum read length, and is slower. You can still use mapPacBio8k.sh for any really long reads, though, like when mapping one assembly to another. For RNA-seq (using Illumina reads), I would recommend a command line like this: bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam maxindel=100000 xstag=firststrand intronlen=10 ambig=random This will look for introns up to 100kbp long, and flag every gap longer than 10bp with cigar symbol 'N' (skipped) rather than 'D' (deletion). Ambiguously-mapped reads will go to a random top-scoring site, which is generally best for quantification. XS tags, needed by Cufflinks, will be produced according to the "first strand" library protocol (if you don't know what protocol is used, then set it to xstag=unstranded). The 'maxindel' flag can be increased or decreased as needed. Many plants and fungi tend to have very short introns mostly under 400bp, while human average is much higher and has a few that are even longer than 100kbp. It should be set to above the size of the largest introns you expect, but higher values decrease accuracy, so I don't recommend setting it above 500000 or so. The default is 16000 which is fine for the plants and fungi I've worked with but if you are working with vertebrates then 200000 would probably be appropriate. |
Thanks a lot for the advice!
I will give it a try. |
Quote:
|
Brian: Trying out BBMap and I am having problems indexing the reference. My species is Pig which has the following Fasta entry lines in the genome.fa file:
Quote:
Quote:
|
Westerman,
Sorry for the confusion. The files are called "chrom" for legacy reasons, and they don't correspond to chromosomes anymore. I should rename them "chunks". Originally, BBMap wrote one chromosome per chrom file, which is fine for complete reference genomes. But de-novo assemblies can have thousands or millions of scaffolds, so now I bundle them into chunks. Everything is there, and when you produce a sam file, you'll see that reads are mapped to all of the different input chromosomes. |
Quote:
Quote:
Tophat accepted_hits.bam Quote:
|
By the way, BBMap's default mode is "ambig=best", meaning that any ambiguously-mapped reads will go to the first top-scoring location (ambiguous hits will be annotated "XT:A:R" rather than "XT:A:U", so you will know which ones are ambiguous). This is fine for variant-calling, but if you are doing RNA-seq expression quantification, it's usually better to set "ambig=random" or "ambig=all". Otherwise, if there are two almost-identical genes, all the reads will map to the first one.
Also, for vertebrate RNA-seq, I suggest these settings: xstag=us generate XS tags, needed by cufflinks, according to "unstranded" protocol, or if you don't know the protocol. Alternatives are "ss" (second strand) and "fs" (first strand). If you DO know the protocol, please set it as ss or fs. intronlen=10 All deletions up to this length will be annotated as 'D' in the cigar string; longer ones will be annotated as 'N' (meaning skipped). This may or may not matter to downstream tools. For DNA mapping you don't need this flag. maxindel=200000 The longest deletion (i.e., intron) that it will look for. May still find longer ones. Around 98% of human introns are shorter than 100kbp. If you do not set this, it will default to 16kbp, which is too short for vertebrate RNA-seq (though lots of plants and fungi have tiny introns of around 200bp). I'll add this recommendation to the readme for the next release. |
Thanks for the recommendations, especially the ambig= one. The help for bbmap.sh is ... confusing, at least at first glance.
One thing I think is working but giving me an improper warning message is the -Xmx parameter. From starting like so: Quote:
Quote:
|
Hmm, that's interesting, thanks for noting it. It's because I first try to detect memory, then parse the -Xmx override. I will reverse the order :)
I'll try to improve the help information, so any concrete suggestions/complaints would be very helpful. |
As for help, the major problem is the formatting on screens with 80 columns. Wrapping is hard to read plus there is too much information for a quick reference. I suggest trying to limit the help messages to 80 columns and then direct the user to a longer formatted README or extended help. I know that the current help says to look at the README.
Also parsing '-h' on the command line would be nice. While displaying help when no commands are given is good it is also useful to be able to use '-h'. As an example of the first paragraph, at the moment the help includes: Code:
maxsites=5 Maximum number of total alignments to print per read. Only relevant when secondary=t. Code:
maxsites=5 Max. number of total alignments to print |
Hi,
I do get the similar warnings and an exits on a system with more than 200 GB of RAM available (using the Java 6 version). "This system does not have ulimit set, so max memory cannot be determined. Attempting to use 4G. If this fails, please set ulimit or run this program qsubbed or from a qlogin session on Genepool." "Invalid maximum heap size: -Xmx30G The specified size exceeds the maximum representable size. Could not create the Java virtual machine." |
Quote:
That means that you have a 32-bit version of Java installed. You need a 64-bit version, which you can get here. Westerman, Thanks for your feedback; I'll simplify the help menus for the next version. -Brian |
Hi Brian,
thanks a lot for the advice. I did now change the java environments several times (via aliases to the the latest JREs) and tried to use both the your java6 and java7 versions and I always end up with the same messages as before. Our OS is old (CentOS 5.9) , as is our default java (it is 64-bit nevertheless (build 1.6.0_16-b01)). Thus, it is very possible that our configuration is outdated. I will try to somehow set java "ulimits". Quote:
|
luc,
That's unfortunate. If you download and unzip a 64-bit version of java 7, then path directly to the java executable rather than relying on the default java of your environment, it should work. Otherwise, the limit is something like 1600MB (specified by the flag -Xmx1600m) if I recall correctly, which is plenty for small references like microbes and fungi. -Brian |
Has anybody experience with running bbmap on clusters?
Hello everybody,
I would like to know, if anyone has (good) experience with running bbmap alignments on clusters, which require to submit jobs via a scheduler? Because I can successfully start/finish an alignment on the university's cluster when started directly from the terminal (which is prohibited, but who cares on Sundays at 3am?), but the exactly same command written to Portable Batch System (PBS) job script and submitted via qsub to Torque/Maui scheduler fails: Code:
Exception in thread "Thread-5" java.lang.NullPointerException Since I would like to extent my alignment times beyond weekend nights :rolleyes:, I would be happy for suggestions. |
Quote:
|
Yes, I set the threads=8 option and also a memory limit of 32g for bbmap.
However I realised that I initially wrote the job script in a way that the scheduler could assign random distributed cores to the job. When I limited that only to cores of one particular node, then everything runs smoothly. So no problem with bbmap, only with me using the batch system correctly :confused:. |
problems with bbmap switches
Hi Brian,
Very decent mapper that you wrote, and supergreat that it is finally available . I was playing around with bbmap, sooo many cool features and an impressive speed! But I could not figure out a couple of things: 1) How to point to directories with path= I would like to be able to use a set of indices that I created previously and that are stored in a specific 'databases' path mounted on all nodes of our cluster. But I did not get this to work setting path= during the mapping process because it changes where bbmap searches for the reads. My second trial was to call bbmap in this folder of the references and set path to the folder of the reads, but then I always got an error that the read file was not found. The only thing that worked for me is calling bbmap from a folder which also includes the /ref folder, but this means copying both reads and refs accross the filesystem wherever I need them. We were mostly using bowtie2 up to now and in bowtie2 I can point to absolute paths for references, reads and outputs. Would be cool to be able to handle files in bbmap similarly 2) using outm I am mapping shotgun metagenomic illumina paired end reads to references that are gene databases. I was expecting to get different output for out= and outm= but the files produced are identical. I would expect to see some read pairs where only one of the reads maps to the gene database and the other not, and as far as I understood out= gives me the mapping pairs and outm= gives me the mapping pairs and the single mapping reads with their pairs. 3) how can I get sorted unmapped pairs written to a file? While outm1=reads.f.fq and outm2=reads.r.fq gives me the mapped pairs, outu always writes everything to a single file (no outu2= possible) 4) I was trying to limit the insert sizes allowed in the paired end mapping with e.g. pairlen=1000, but the output still reported exactly the same mappings with insert sizes way higher, often in the multi kb range for PE-reads. This also greatly affects the average insert size reported... What am I doing wrong? It would also be very cool to get the standard deviation put out, as well as the median. One can calculate these things from the very useful histogram files that inserthistogram=file outputs, but that is not as convenient. Keep up the good work Harald |
Quote:
Quote:
Version 1: Don't write an index to disk. This usually makes the startup slower, but is more I/O efficient. It's the simplest and best if you are only mapping to a reference once (like when evaluating multiple assemblies). bbmap.sh in=/data/reads.fq ref=/fastas/hg18.fa out=/output/mapped18.sam nodisk bbmap.sh in=/data/reads.fq ref=/fastas/hg19.fa out=/output/mapped19.sam nodisk Version 2: Write each index to its own directory. bbmap.sh ref=/fastas/hg18.fa path=/human/hg18/ bbmap.sh ref=/fastas/hg19.fa path=/human/hg19/ bbmap.sh in=/data/reads.fq path=/human/hg18/ out=/output/mapped18.sam bbmap.sh in=/data/reads.fq path=/human/hg19/ out=/output/mapped19.sam Version 3: Write indices to the same directory, but specify a build number. bbmap.sh ref=/fastas/hg18.fa path=/human/ build=18 bbmap.sh ref=/fastas/hg19.fa path=/human/ build=19 bbmap.sh in=/data/reads.fq path=/human/ build=18 out=/output/mapped18.sam bbmap.sh in=/data/reads.fq path=/human/ build=19 out=/output/mapped19.sam If you don't specify a build number, the default is always 1. Quote:
Quote:
bbmap.sh in=reads.fq outu1=r1.fq outu2=r2.fq produces 2 files, r1.fq and r2.fq You can only specify output streams 1 and 2 for fasta or fastq output, though, not sam. Also - by default, whenever you output paired reads to a single file, they will come out interleaved (read 1, read 2, read 1, read 2, etc). You can split them back into 2 files with reformat: reformat.sh in=reads.fq out1=read1.fq out2=read2.fq Quote:
As for what you noticed with the pairlen flag, I had not noticed that - sounds like a possible bug. I will look into it. Quote:
|
I have uploaded a new version of BBTools, v32.06. Some improvements:
Shellscripts: Much better memory detection; should be able to correctly detect free physical and virtual memory in Linux so you won't have to set the -Xmx flag. All programs: Added append flag; allows you to append to existing output files rather than overwriting. BBMap (bbmap.sh): Now reports insert size median and standard deviation (as long as you have the "ihist" flag set). Added 'qahist' flag, which outputs the observed quality of bases for each claimed quality value. Works for both indel and substitution error models. Added 'bhist' flag, which outputs the frequency of ACGTN by base position. Reformat and BBDuk.sh also support 'bhist' and 'qhist' flags. 'pairlen' flag now works correctly (for capping the max allowed insert size). IUPAC reference codes are now converted to N prior to mapping. Statistics are now reported both by number of reads and number of bases. Added BBWrap (bbwrap.sh), a wrapper for BBMap that allows you to map multiple sets of reads while only loading the index once (important for huge indexes and small read sets). The mapped reads can be written all to the same output file, or to one output file per input file. Slightly increased accuracy. PacBio reads can now be mapped in fastq format. BBMerge (bbmerge.sh): Now supports dual output files for unmerged reads, 'outu1' and 'outu2'. Improved accuracy. BBDuk (bbduk.sh): Much faster when used without any reference (e.g. for quality trimming). RandomReads: Much better simulated PacBio reads. Thanks to everyone at seqanswers who has given me feedback! |
Quote:
Thanks! |
Hi Brian,
I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though. Quote:
|
Quote:
The new memory-detection system should work on Linux/UGE clusters (it works here). The shellscript calculates the system's free virtual memory, free physical memory, and 'ulimit -v'. Then it will invoke Java with the minimum of those 3 values, and thus, hopefully, will never fail initialization, and never use more RAM than is allowed. Although here, at least, it's impossible to use more RAM than allowed because our cluster will detect that and kill the process. Our clusters use UGE (formerly known as SGE) and set 'ulimit' according to the qsub memory request. I don't know if that is universal or just a local policy, but at least here, users should be able to use the shellscripts without specifying -Xmx, and they should work. I'd be happy if someone elsewhere tried it without setting -Xmx and reported what happens. Threads is a little more difficult. For now, unless you have exclusive access to a node, I recommend that you specify the number of threads allowed (e.g. t=4 if you reserved 4 slots). There is an environment variable "NSLOTS" that apparently gets set by UGE, which I will parse in my next release, and default to using the minimum of the system's reported logical cores and NSLOTS. But it appears to not always be correct, and furthermore, hyperthreading makes the matter more confusing (BBMap does not benefit much from hyperthreading, but many of the other BBTools do, like BBNorm). Also - there are basically 4 types of programs in BBTools, resource-wise: 1) Super-heavyweight: Requests all available threads and memory by default, but can be capped with the '-Xmx' and 't' flags. That's because all are both multithreaded and require an amount of memory dependent on the input, which the shellscript can't calculate. This includes bbmap.sh, bbsplit.sh, bbwrap.sh, mapPacBio.sh, mapPacBio8k.sh, bbnorm.sh, ecc.sh, khist.sh, bbduk.sh, countkmersexact.sh, and dedupe.sh. 2) Heavyweight: Requests all available RAM but only one primary thread. Pipeline-multithreading (input thread, processing thread, output thread) cannot be disabled, so the 't' flag has no effect unless you are doing multithreaded compression using pigz. Includes pileup.sh, randomreads.sh, bbcountunique.sh, and bbmask.sh. 3) Midweight: Requests all available threads by default, but can be capped with the 't' flag; needs little memory, so the -Xmx flag is hard-coded at an appropriate value that should work for all inputs (ranging from 100m to 400m) and not affected by available memory, though you can still override it. This includes bbmerge.sh. 4) Lightweight: Needs minimal memory and only one primary thread, so again the -Xmx flag is hard-coded to a low value (at most 200m) rather than dependent on autodetection. Again, 't' flag has no effect unless you are doing multithreaded compression using pigz. This includes reformat.sh, stats.sh, statswrapper.sh, readlength.sh, bbest.sh, bbsplitpairs.sh, gradesam.sh, translate6frames.sh, and samtoroc.sh. Note that if you have pigz installed, you can accelerate all BBTools performance on gzipped input or output using the "unpigz" and "pigz" flags (which can be set to 't' or 'f'). pigz allows multithreaded .gz compression and decompression, and is both faster and more efficient (even with only a single thread) than Java or gzip. So with pigz installed, even a mostly singlethreaded program like reformat.sh can (and by default, will) use all all available threads if you write gzipped output: reformat.sh in=read1.fq.gz in2=read2.fq.gz out=interleaved.fasta.gz zl=8 ..will compress the output to gzip compression level 8. If pigz is installed, it will use pigz for both compression and decompression instead of Java, resulting in decompression using around 1.5 cores per input file and compression using all allowed cores. You can prevent this with the 't=1' flag (to cap compression at 1 thread) or 'unpigz=f' and 'pigz=f' (to disable decompression/compression in pigz processes). Due to some bugs in UGE (which can kill processes that fork in certain circumstances), pigz is by default set to false in high-memory processes and true in low-memory processes. If you have not tried pigz, I highly recommend it! It's great and will become increasingly important as core counts increase. It's a drop-in replacement for gzip - same command line, 100% inter-compatible, but multithreaded and more efficient per thread. The only downside is increased memory usage, but it's not bad - around 12 MB per thread. Compression (for genomic data) is within ~0.1% of gzip at the same compression level. As I mentioned, decompression uses at most 1.5 cores in my tests, while compression can use all cores. |
Quote:
Sorry about that; randomreads is currently not documented. I will try to remedy that tomorrow, and let you know when it's done. |
Thanks a lot Brian,
BBMap does have a lot of different tools. |
Quote:
BTW: The latest code for BBMap (32.06) seems to have shell script files in DOS format (wasn't the case before on previous versions). |
Quote:
Quote:
By the way - I forgot to mention it, but I added the phiX reference and truseq adapters to the /resources/ directory, for use with BBDuk. |
Quote:
I released a new version (32.14) which now has RandomReads documented (run randomreads.sh with no arguments). By default it will use an Illumina error model, which assigns qualities in a pattern that looks roughly like the average quality histogram of Illumina reads (peaking at about position 20, and sloping down toward the head and tail). Then some bases will by randomly changed based on the quality score. To generate Illumina-like reads from e.coli: randomreads.sh ref=ecoli.fa reads=100000 length=150 maxq=40 midq=30 minq=10 out=synth.fq It can also generate paired reads if desired. To generate PacBio-like reads: randomreads.sh ref=ecoli.fa reads=100000 minlength=200 maxlength=4000 pacbio pbminrate=0.13 pbmaxrate=0.17 out=synth.fq That will add substitutions and single-base-pair indels with per-base probability ranging from 0.13 to 0.17. There are also a lot of other options for adding specific numbers or lengths of insertions, deletions, substitutions, and Ns, in the help menu. |
Thanks a lot for the randomreads instructions Brian!
|
Quote:
Because I interpetreted the Cufflinks Doc in a way, that it will use the xmtag to lower the read weight by the amount of sites it maps to and therefore decided to set ambig=all. I therefore chose to set for my RNAseq: Code:
maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=T |
Thias,
For analysis of vertebrate RNA-seq data with Cufflinks, I do recommend all of those parameters: maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=t Although "ambig=all" or "ambig=random" would both be OK. The "intronlen", "xstag" and "xmtag" are things I added specifically for Cufflinks compatibility (or the tuxedo package in general) so they are not needed if you do downstream analysis with other tools. The only ones that affect the actual mapping are "ambig" and "maxindel". xstag and xmtag just add optional tags, and intronlen changes the "D" symbol in cigar strings to "N" for deletions longer than the specified length. There's no reason why any of them should be necessary, but for whatever reason, Cufflinks needs them. Note that if you your data is strand-specific, and you know which protocol it was, then it's better to set "xstag=firststrand" or "xstag=secondstrand" than unstranded. |
Thanks Brian for your response almost at warp speed.
Unfortunately am still a bit lost - probably also due to my ignorance regarding the detail SAM Format Specification. This is, how I assumed the setting takes effect:
I admit, I could easily have checked, whether ambig=random really does not report secondary alignments, but I was pretty confident to have understood the logic - until I read your post this afternoon. Thanks also for emphasizing the xstag's importance. However the data is indeed unstranded unpaired phred64-age low-coverage dusty archaeological heritage from a previous PhD student :rolleyes:. |
Quote:
|
Quote:
Quote:
Quote:
|
Dear Brian,
I tested bbmap successfully on a small dataset and so far I'm very happy with the results :) However, I'm running into some memory problems using my real dataset: My RAM looks currently as follows: Quote:
Quote:
|
Quote:
Normally, "Reducing threads from 16 to 15 due to low system memory." only occurs if you are right at the edge of the memory limits. It would be helpful to know: 1) The amount of physical memory available (it looks like 64GB, is that correct?) 2) The size of the reference you are using. 3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh 4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space". Actually, the entire standard error output would be helpful. You can reduce the amount of memory BBMap needs by reducing the number of threads (e.g. "t=8" to drop it to 8 threads). But that normally will only help with pacbio modes which need much more memory per thread (100s of megs), not with standard bbmap which only uses around 25MB per thread. Assuming you do have 64GB, I would try with the flag -Xmx58g and see if that works. But regardless, it would help if you could post the above details. Thanks, Brian |
Quote:
it was indeed increasing Xmx to 58g which did the trick. The threads (t) parameter, however, didn't help at all. So, when I run with -Xmx58g everything works fine (independent of t=8 or t=16 or leaving t=auto). When I run with -Xmx30g and t=auto the prog quits with the following error: concerning your questions: 1) The amount of physical memory available (it looks like 64GB, is that correct?) yes 2) The size of the reference you are using. its the human genome assembly 19 from UCSC (~3.1Gb) 3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh mapPacBio8k.sh 4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap see quotes below Quote:
Quote:
|
With bbmap.sh, HG19 needs about 23 GB. I had never tested HG19 with mapPacBio8k.sh but it looks like I need to refine the memory prediction. Thanks for the report!
|
Dear Brian,
I have some questions about the optional fields (TAG:TYPE:VALUE stuff) that can be added in the SAM file by bbmap. Take the following multiple-mapped read as example Quote:
Concerning XS: I run bbmap using -xstag=unstranded as I read it somewhere in this thread and I want to use cufflinks afterwards. I thought that the '+' and '-' as values refer to the strand in this field and I also have '+' and '-' values set in different reads. If this is true, where is the difference to -xstag=true? Concerning YS: This stores the end position of a read? So its pos + the alignment length of the read? Concerning YI: The identity is simple the number of (mismatches + indels) / read-length? Thanks a lot in advance for your help and this amazing amount of possibilities :D! |
Dear Brian,
I have yet another issue ;) Looking through the readme I saw that the standard state for stoptag and idtag is true. However, looking at my sam file I couldn't find any YS or YI fields. I therefore tried to set the parameter explicitly in the start command, but this resulted in an error: Quote:
|
WhatSoEver,
For the first problem, try setting the flag "sam=1.4". By default, cigar strings are presented in Sam 1.3 format, which uses the 'M' symbol. The does not mean match - it means match OR mismatch! Sam format 1.4 fixed this bad design decision by using '=' for match and 'X' for mismatch, but by default I output sam format 1.3 because many downstream tools (such as old versions of Samtools) cannot process format 1.4. The lastest Samtools can, though. If you set "sam=1.4" then it will be more obvious how many substitutions the reads have. The NM tag tells you the edit distance, but BBMap does not determine score directly from edit distance; it uses affine transforms. So 4 consecutive substitutions will be penalized less than 4 non-adjacent substitutions, for example. Please note that neither XM nor XS are defined in the sam spec; both are bowtie/tophat flags, and XM particularly is poorly defined. For that, I output the number of alignments with scores EXACTLY equal to the top score, even though when you set "ambig=all" I will output sites with scores that are within a certain threshold of the top score. So there can be more alignments printed than the value of XM. For the XS tag, you can set it to "firststrand", "secondstrand", or "unstranded". But internally, "true" is equivalent to "firststand" which is equivalent to "unstranded"; all produce identical output. "secondstrand" produces the opposite sign of "firststrand". For firststrand/unstranded, if read 1 maps to the plus strand, it gets "XS:A:+"; for secondstrand, it gets "XT:A:-". Read 2 always gets the opposite of read 1. The YS tag is the stop position - specifically, the reference location to which the rightmost base maps. So yes, it is pos+alignment length. The YI tag is a bit more complex. Identity can be calculated various ways; unfortunately, many ways overly penalize long deletions. So actually it is: (matches)/(matches+substitutions+Ns+insertions+f(deletions)). If f(deletions) was just the number of deletions, then it would be matches/(total number of cigar operations). But then a 100bp read with 50bp match, 400bp deletion, 50bp match would get an identity of 100/(100+400) = 20%. 20% identity implies worse than even the alignment of two completely random sequences (which is expected to yield over 25%), so it's misleading, because a read with two nearby 50bp consecutive matches to another string is actually a very good alignment that could not have arisen by chance. So, I actually take the square root of the length of each individual deletion event. In this case, you would get: (100)/(100+sqrt(400)) = 100/120= 83.3% identity. If there were 50 individual single-base-pair deletions, however, the identity would be: (100)/(100+50*sqrt(1)) = 100/150 = 66.7% identity, which is lower, because it's more likely to have arisen by chance. This method of identity calculation is not perfect, of course, and incompatible with most other methods, but there is no single standard of which I am aware, and I think the results of this method are more informative than others, which are biased toward edit distance rather than modelling biological events. For the crash bug, that's my fault for not testing the interaction between "ambig=all" and "stoptag"; they don't work together. I'll fix that. In the mean time, you can set "saa=f" (secondaryalignmentasterisks=false) which will circumvent the problem. Thanks for finding that bug, and I'm glad you're finding BBMap useful! One suggestion, though - if you are doing RNA-seq on organisms with introns, I suggest you set the "intronlength" and "maxindel" flags. I normally use "intronlength=10". This does not change mapping at all, it just changes cigar strings so that deletions longer than 10bp will be reported using the symbol 'N' instead of 'D'. That may make a difference to some downstream processing tools, if they only interpret 'N' as intron. As for maxindel, introns shorter than maxindel will not be searched for. In plants or fungus, I suggest setting this to at least 16000 (that's the default for bbmap, but the pacbio modes have a much shorter default of 100); for vertebrates, I suggest 200000. I'm not really sure about invertebrates. But try to set it at around the 99th percentile of intron sizes in that organism, if you have some idea of what that might be. The lower the better (faster, more accurate) as long as it encompasses the vast majority of introns. |
Thanks Brian, that helps a lot! Many thanks!! :)
There are just two little things I'd like to clearify: --------------------------------------------------------------------------------------- Quote:
Like in the following example from my data: Quote:
Quote:
Quote:
And one last question out of interest as I worked in fungal genomics before I changed to humans: Why would you suggest a maxindel of 16000 for fungi? From what I analyzed (Comparative genomics of 200+ asco- and basidiomycetes) they are rarely longer than 100bp and I never saw one longer than 3kb (though I'm atm not sure if this was even mitochondrial). |
WhatSoEver,
If you have a cigar string like ths: 20=1I20=1D20=25D20=20I20= ...I would calculate: (20+20+20+20+20) matches / (20+1+20+root(1)+20+25+20+root(25)+20) = (100)/(20+1+20+1+20+25+20+5+20) = (100)/(132)=75.75% I know that's asymmetric - if there's a deletion, aligning the reference to the read would give a different identity than the read to the reference - but there are a couple reasons for that. First, you can get deletions of unbounded length in cigar strings, but insertions can never be longer than the read length, so it is (according to my logic) important to correct for deletions but not really anything else. Second, insertions occur INSTEAD of matches, while deletions occur IN ADDITION to matches. So if a 100bp read has 1 bp match and 99bp insertion, does it make sense to call that 1/(1+root(99)) = about 10% identity? Not really. But with a 100bp deletion in the middle, you'd have 100/110 = about 90% identity, which is maybe a bit high, but it seems reasonable for an alignment with two nearby 50bp consecutive matches, which can't arise by chance. Third, it seems to me that long deletions are more common than long insertions, but that could be discovery bias. Lastly, decreasing the identity penalty of insertions would cause a particular problem. Let's say you have a chimeric 200bp read, 100bp from somewhere in chromosome 1 and 100bp from chromosome 2. You could map it in two places, with cigars string of 100=100I and 100I100=, or 100=100X and 100X100=. My current formula would give that 50% identity for either location, whether it used the X or I symbol for the chimeric bases. But if you take the root of the insertions, then you get 100/(100+10) = 91% identity to two different sequences that have nothing in common, which does not make sense. It could be that calculating identity this way is not a good idea; I may put in a flag to remove the sqrt from deletions, but it seems to me that it gives a more useful answer in terms of short-read alignment when doing recruitment studies, which is what I actually put the idtag in for. Quote:
Quote:
|
Hey Brian,
i just briefly checked BBMap and have some questions. 1) I am working, among others, with RNA-Seq data of a cyanobacterium. Is it possible that BBMap performs a spliced alignments for my RNA-Seq data? In a small run i did not observed spliced alignments but nevertheless i was wondering about it. If so, can i prevent this case by setting 'intronlen' to 0? 2) Is there a way to apply soft- or hard-clipping? For example bwa performs for a read soft-clipping leading to 94M6S where BBMap leads to 100M although the last 6 bases are partially mismatches and therefor i want them to be clipped. 3) Is there a way to determine the maximal number of mismatches? 4) For CRAC and PAR-Clip sequencing data i want only 1 respectively 2 mismatches/indels. So when i want at most 2 indels i have to set maxindel= 2 and strictmaxindel = true. Am i right? 5) Does BBMap reports only the best alignment (or best alignments if alignments with equal quality are present) or is there a way to report also alignments that are not as good as the best alignment? Would this be the maxsites option? Thanks for your nice support! Mchicken |
Mchicken,
1) 'intronlen' does not actually affect alignment, just the way splices are reported - deletions shorter than 'intronlen' will have a 'D' symbol in the cigar string, while longer ones will have an 'N' symbol. To prevent gapped alignments, you should set 'maxindel=0', which will prevent them from being looked for, though some may still be found (or 'strictmaxindel=0' to absolutely ban all alignments with indels). But prokaryotes do have some self-splicing; these are typically very short, like 20bp or so, I think. So you may want to set 'maxindel=100' or something like that, which will prevent long ones. 2) Soft-clipping is applied by default if a read goes off the end of a scaffold. You can also force it on alignments like the one you describe with the flag 'local=t'. Also, note that if you use the flag 'sam=1.4' then cigar strings will be generated with '= for match and 'X' for mismatch, instead of 'M' for both. 3) You can use a flag like 'minid=98' to prevent alignments with lower than approximately 98% identity, which would be 2 mismatches for a 100bp read, but substitutions, deletions, and insertions are all scored differently, and whether they are contiguous or scattered also affects the score, so this is not exact. BBMap does not have any way to ensure that the best alignment with at most X mismatches is returned, like Bowtie 1 can do, because the scoring is fundamentally different. 4) "maxindel" actually controls the length of individual indels, not the number of them, which is not controlled. There's also a "maxindel2" flag that controls the sum of the length of all indels, which by default is set to double maxindel. So if you did this: maxindel=10 maxindel2=15 strictmaxindel ...then individual indels could be up to 10bp; the sum of the length of indels could be up to 15bp; and any alignment with a single indel longer than 10bp would be banned. But there is no way to limit the total number of indel events. 5) By default the best alignment is reported, and if there are multiple equally-scoring alignments, the first one (in genomic coordinates) is reported, but it is marked as ambiguous (XT:A:R tag). This corresponds to the setting 'ambig=best'. You can alternatively set 'ambig=random' or 'ambig=all' (which is what you are looking for). 'maxsites' will limit the total number of alignments displayed in 'ambig=all' mode, but won't have any effect in default mode. Thanks for using BBMap! -Brian |
Escaping spaces using the shell script
Hi Brian,
I can't seem to work out how to pass a filepath with spaces in it to BBMap using the shell script. Nothing that I normally would use seems to work... e.g. Code:
~/bbmap/bbmap.sh ref="../Reference\ genome/myref.fasta" build=2 in1=./Reads1.fastq.gz in2=./Reads2.fastq.gz out=./mapped.sam Code:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter: genome/myref.fasta Thanks! |
Have you pre-built the index for "myref.fasta"?
Code:
$ bbmap.sh ref=/path_to_your_genome.fa |
On second reading of your post, it appears that you have a space in your directory name and you are asking about escaping that? Let me test.
Edit: Standard ways of escpaing spaces in shell script are not working but directly running the java command (escaping the space) is working. Brian should be able to provide the final answer. |
@GenoMax,
Thanks, yes your second post is exactly what I was referring to. I can't seem to find a way to escape the spaces in a directory path (but I too successfully called it directly rather than through the shell script). |
simono,
In Windows, you can handle that with quotes, e.g. ref="/a/path with spaces/x.fa" In Linux... I don't know how to do it easily. Perhaps the easiest way would be to make a symlink that doesn't have any spaces. Alternatively, as in this article, you could export the path as an environment variable. -Brian |
Hello Brian,
I need someone to give me a broad hint. I have already successfully indexed some eukaryotic genomes with BBMAP. But one PhD student in our lab has sequenced a custom shRNA library and I just wanted to use my usual aligner and just lower k-mer size. But it claims that it cannot read the FASTA with the reference sequences, although it is there. It there a lower limit of scaffold sizes (or a special parameter which I didn't set) which may cause BBMAP to abort with that error or is really the FASTA file somehow corrupted? Code:
java -da -Xmx32g -cp /home/z/zepper/software/bbmap/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 ref=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index build=1 threads=4 -Xmx32g midpad=2000 k=5 minscaf=5 |
Quote:
Since you have pre-built the BBMap index you can just use path= to point to the indexes. |
@GenoMax: Thanks for your answer, but unfortunately this can't be the problem.
The folder in ref points to an empty directory, because I do not want to include the shRNA sequences to my regular set of indexed genomes. However the syntax with ref= and path= is exactly the syntax I used to successfully index the eukaryotic genomes before. (I used the standard k-mer size 13 and a midpad size of 200000 for the genomes though). |
Thias,
That specific error is triggered by: File f=new File(reference); if(!f.exists() || !f.isFile() || !f.canRead()){ So according to Java, either it does not exist, is not a file, or cannot be read. Perhaps the permissions are wrong, or the metadata on the file is wrong? It triggered before even looking inside the file, so the data is probably not corrupted. You could technically do this: cat /mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/shRNAs.fa | java -da -Xmx32g -cp /home/z/zepper/software/bbmap/current/ align2.BBMap build=1 overwrite=true match=long fastareadlen=500 ref=stdin.fa path=/mnt/HPC/zepper-tmp/Pia/reference/sh_sequences/bbmap_index build=1 threads=4 -Xmx32g midpad=2000 k=5 minscaf=5 ...which would circumvent Java trying to read the file. But I'd like to hear if you discover anything odd about the file permissions or metadata (like if it claims to be a directory, or something). |
Dear Brian,
I'm coming with another problem (bug?), I cannot find an answer to. I run a RNAseq mapping with bbmap and got afterwards the following results: Quote:
Quote:
As this is quite different to your program output (except for the unmapped reads), I'm wondering were the mistake is? Cheers, W |
W,
I use the "XT:A:R" tag to indicate that a read is ambiguously mapped and XT:A:U to indicate unambiguously mapped. If you say "ambig=all" then the threshold for printing secondary alignments is more liberal than the threshold for deciding an alignment is ambiguous. So, even "unambiguously" aligned reads will sometimes have secondary sites printed - the best one, and others that are "decent" even though they are sufficiently worse than the best site that I classify the best site as unambiguous. So if you add up all of the primary alignments with XT:A:U, you should get the same numbers as the program reports. Sorry that it's a custom tag (that is also used by bwa), but there's nothing in the official flag bits or official tags to indicate whether an alignment is ambiguous, and the SAM spec doesn't say anything about the presence of secondary alignments indicating ambiguity. It's really a subjective judgement. |
Sorry, but I don't get it completely.
Quote:
So in my case I have ~16K reads which are "unambiguous" according to your definition but have seconday alignments? Quote:
Quote:
Or is it just my german misinterpretation of the strictness in the word ambiguity? :confused: ;) But being aware of that also helps, thanks again Brian for a fast and helpful response. I will now stop my work on mapper-evaluation for my projects and focus on the next steps - so I won't bother you with additional things in the next time :D |
Hello Brian
I am trying to compare some mapping tools, and I am also using BBMap I execute 4 tests using PE readed preprocessed with bbduck, cutadapt, fastx and trimmomatic respectively With the reads preprocessed using bbduck and trimmomatic the program is able to finish without any error When I run the reads preprocessed with cutadapt or fastx bbmap crashes... do you have any idea why is this happening? Here is the parameters I used and the stde output: overwrite=true fastareadlen=500 -Xmx200g threads=96 outputunmapped=f sam=1.4 BBMap version 32.15 Set overwrite to true Set threads to 96 Set DONT_OUTPUT_UNMAPPED_READS to true Retaining first best site only for ambiguous mappings. Set genome to 1 Loaded Reference: 9.674 seconds. Loading index for chunk 1-7, build 1 Generated Index: 15.609 seconds. Analyzed Index: 22.798 seconds. Started output stream: 0.345 seconds. Started output stream: 0.014 seconds. Cleared Memory: 0.146 seconds. Processing reads in paired-ended mode. Started read stream. Started 96 mapping threads. Exception in thread "Thread-110" java.lang.IndexOutOfBoundsException: Index: 172, Size: 172 at java.util.ArrayList.rangeCheck(ArrayList.java:571) at java.util.ArrayList.get(ArrayList.java:349) at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:484) at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:361) at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:205) at java.lang.Thread.run(Thread.java:679) |
Salvatore,
That error is because there are unequal numbers of reads in the read1 and read2 files (I'll add a better error message that explains it). This happens when a read-trimming tools throws away one read in a pair because it's too short, but not the other. This also ruins the pairing order, so mapping paired reads will no longer work correctly. BBDuk and Trimmomatic both keep paired reads together correctly. This is a known problem with fastx; you basically should not use it with paired reads. I think cutadapt is able to handle pairs correctly when you use the "-p" flag, but I'm not sure. You can re-pair the broken files with my repair.sh script, in the latest release of BBTools (32.27), like this: repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq |
Quote:
1) how high the best score is 2) the difference there is between the best score and the second best score 3) how many additional sites are within a certain range of the best score It's not a closed-form equation, and it's completely subjective (that's what I meant by subjective, not the SAM spec); I optimized it empirically by iteratively tweaking it to minimize false-positives and maximize true-positives. And that alone decides whether or not I classify a read as "ambiguous". When "ambig=all" is chosen, all sites with a score of at least 95% of the best score are printed, whether or not the best site is considered ambiguous. This is a bit different. I think I'll change it to make this number a parameter, and perhaps only print secondary sites for reads that are considered ambiguous, which would make more sense. Quote:
|
thank you
Quote:
Your explanation was as always very simple and clear I could trimm in PE mode using Cutadapt and then was able to map the reads Nothing to do for Fastx thanks again Salvatore |
Brian,
I'm wondering if you can point or provide me with some recommended parameters for running shRNAseq datasets on BBMap. My initial efforts using BBMap using the default parameters did yield a sam file, but when I converted it to a sorted bam files and attempted to use it on IGV, it said "does not contain any sequence names which match the current genome." I'm pretty sure I'm overlooking something. I did create the indices with the miRBase mature.fa file. That seemed to work fine, but noticed in the output that said: No index available; generating from reference genome: Any help would be appreciated. |
coryfunk,
IGV is very sensitive to the names of scaffolds; they must be identical in the fasta used for mapping and the fasta used by IGV. When I encounter that error, "does not contain any sequence names which match the current genome", it's usually because I'm using the wrong reference. so if, for example, you are mapping to a fasta of RNAs and trying to display the results in IGV against the full genome, that won't work. The methodology would be something like this: index: bbmap.sh ref=mature.fa map: bbmap.sh in=reads.fq out=mapped.sam bs=script.sh translate to bam: sh script.sh (that creates and runs a shellscript that creates a sorted indexed bam file from the sam output, if samtools is installed, though of course you can do it manually too) Now, run IGV and choose "load genome" and point it to the same fasta file that you used for indexing, then load the bam file. If it still does not work, look at the header of the sam file, and verify that the names of the scaffolds in IGV are the same as the names in the sam header. Or, actually, that's what you should do first. As for specifics about good settings to use for shRNAseq, I have no experience with it. Normally for RNAseq I suggest setting the "maxindel" flag to an appropriate value for the organism's expected intron length (say, 200000 for vertebrates or 8000 for fungi and plants with short introns). But if shRNAseq involves only short, unspliced RNAs, then you can just go with the defaults. -Brian |
rel. 33.04 fails
Hi Brian
I constantly update bbmap and now ran into problems when using the newest release 33.04 - it performs the mapping, but then fails during the Results statistics reporting, when 32.23 reports perfectly fine for exactly the same command: Quote:
Cheers Harald |
Harald,
Thanks for finding that! I added a new histogram, "qahist" (quality accuracy histogram), which correlates the claimed quality of bases with their actual quality as measured by error rate. But if you use the "qhist" flag and not the "qahist" flag, that crash occurred because an array was not getting initialized. It's fixed now; I'll upload the fixed version later today. -Brian |
I've uploaded the fixed version (33.08) to sourceforge.
|
Repair.sh
1 Attachment(s)
Hello,
I got the same error saying my files had a different number of reads and tried the suggestion of running repair.sh per this thread's reply and the documentation within the program. However, when I run repair.sh, it tells me to there are different number of reads and to run repair.sh. I don't understand why it is telling me to run the program I am trying to run. Any ideas? |
Quote:
cat file1.fq file2.fq | repair.sh in=stdin.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq overwrite=t |
Hi Brian,
I am trying to create a tool for identification of viral genetic material in a sample. It's supposed to work for RNA viruses, so RNA is reversed transcribed and the resulting DNA is sequenced. I end up with fastq files, where majority of reads comes from the host, but a significant minority comes from virus. I want to map the reads to a database of known RNA viral sequences, and I'm considering using BBMap for this purpose. I need a mapper which would be able to do local as well as end-to-end alignment... I tried bowtie2 at first, but the local version didn't work properly and I didn't manage to find anyone who'd know what's wrong, so I want to try another mapper now. The files with reads can be quite large. Do you think BBMap is a good tool for this type of purpose? Also, I looked through mapping parameters in the README file, but couldn't find seed length. Is there any reason for that? Thank you very much. |
Andrej,
BBMap should work well for this purpose, as it can handle very low identity alignments which are typical of cross-species mapping. You can adjust the seed length with the "k" parameter. The default is 13, and I don't recommend going below about 7 as it becomes exponentially slower with a shorter seed length (speed is inversely proportional to 4^k). For high sensitivity, you might try something like this: bbmap.sh in=reads.fq out=mapped.sam ref=virus.fa k=11 minratio=0.1 maxindel=200 slow ...or drop k even lower. But you still may not get any hits to your viral database unless it contains a relative of the virus in question. If you have an assembly of the host genome, you can also remove reads that map to it like this, to reduce the volume of data you're dealing with: bbmap.sh in=reads.fq outu=unmapped.fq ref=host.fa -Brian |
Hi Brian,
I have some questions about the optional histogram tables: 1) mhist: It displays per base matches/sequencing errors, doesn't it? What is included in the "Other" column? How are multiple insertions counted? Assume the following: Code:
12345--6789 I have different read length from 30bp - 1.2kbp. Is the error rate somehow adjusted to read length? 2) qhist: Assuming that Read_linear is the average base quality, what is Read_log? 3) ihist: This one is always empty?! 4) scafStats: Ambiguous reads are counted multiple times, aren't they? This means the overall ambiguousMB is in fact much smaller, isn't it? Thanks! |
Quote:
Quote:
Quote:
Code:
pos 123456789 Quote:
Quote:
Quote:
You can, alternately, use the 'lhist' flag to just plot the read lengths, regardless of mapping. Quote:
Quote:
|
pairlen= broken in 33.41
Hi Brian
I was using pairlen=1200 to limit insert sizes for paired end mapping and with the newest version 33.41 bbmap reports unkown option, while in 33.40b it used to work. What has happened? Cheers Harald |
Harald,
Thanks for noting that! I accidentally deleted the line that parsed that command. I've uploaded the fixed version, 33.42. On another topic, I'd like to announce that a second developer, Jon Rood, has begun porting certain aspects of BBTools over to C using JNI calls. The currently ported classes are BBMerge, BBMap, and Dedupe. This is optional (you can enable it at runtime with the 'usejni' flag), and the output is identical, but there is a substantial speedup: BBMap: +30% BBMerge: +60% Dedupe: up to +200% (when allowing an edit distance) If you are interested in a free speed increase, instructions for compiling the C code for OS X or Linux are in /bbmap/jni/README.txt |
Hello,
I hope this is the right place to post questions related to BBMap, but since the last reply wasn't too long ago.... I've been using BBMap to map paired-end Fastq reads where the headers have been renamed for downstream analysis ("_1" and "_2" have been added to header names for forward and reverse reads respectively). When I look at the SAM file from the mapping, the forward and reverse reads that map have nonzero POS field, but the PNEXT fields are always zero. Is this caused by my editing the read names? Bowtie2 doesn't have the same problem, and assembling with SPAdes and IDBA-UD worked normally with the edited read names. Example of the SAM entries for a read pair: Quote:
Brandon |
Brandon,
Those were not recognized as paired. BBMap recognizes only the normal Illumina naming schemes: "* /1" and "* 1:" If those reads are interleaved in a single file, use the "int=t" flag which will force BBMap to recognized them as being interleaved. |
bbmap hitstats - unambiguous Hits
Hi Brian
I was looking at the hitstats files and I realized that the %unambiguousReads can add up to more than 100%, and similarly the unambiguousReads count can be higher than the total input reads... How can that be? Cheers Harald |
Harald,
Thanks for noticing that. It works correctly for single-ended reads, but it appears that improper pairs (where one read maps to one scaffold, and the other maps to a different scaffold) are double-incrementing the counts on both scaffolds. I'll fix that in the next release. -Brian |
Thanks for the quick reply, Brian!
|
Quote:
Quote:
|
Hi,
I'm developing a tool for analysis of sequence reads from viral genetic material, and mapping to reference viral genomes is part of the process. First version used bowtie2, but now I'm trying to make a new version with better user flexibility, more options etc. And I'd like to use bbmap, as it seems to be the best for this purpose. The user manual to bbmap seems more straightforward, and also people I talked to who used both mappers told me bbmap is generally better. Now, I did some tests, e.g. I ran bbmap on made up sequences to see how it would perform. I created a fastq file with one read, and two fasta files as references. One fasta file had only one sequence similar to the read, while the other fasta file had 4 such sequences (one of which was the same as the one in the other file, and this one was most similar). Naturally, the read was always mapped to the sequence with highest similarity. The mapping quality was the same in the 2 cases. According to that, I can say that mapping quality is completely independent from other sequences in the reference, it only depends on the read and the particular reference sequence to which it was mapped. I'm not sure, though, so I wanted to ask whether this assumption is actually true. Thanks a lot. |
The mapping quality is dependent on the other reference sequences, but only if they are within some threshold of similarity (roughly 6 edits) to the best site. If you copy the same reference sequence twice in the fasta file, the read will map ambiguously and get a score of 3 or less. If you add one or two edits to one copy, the read will map to the unedited one but get a reduced score. But if they differ by, say, 10 edits, then the best mapping location will not get any score penalty. The penalty is also influenced by the number of alternative sites; for example, if there are 5 sites that are each 2 edits worse than the best site, that will give a greater score penalty than if there is only one alternative site.
|
pileup.sh inconsistent with samtools pileup
Hi Brian,
I tried using the "fastaorf" function in pileup.sh, to look at the read depth for a bunch of ORFs predicted with Prodigal. The input is in Prodigal's output format as specified. However, I get per-orf coverage results (the depthSum field) that are inconsistent with the output from samtools mpileup. Briefly: I produced a pileup file with samtools mpileup and for each orf, simply summed the read depth (4th column of the pileup file) for each position that falls within the orf. I double-checked this by converting the Prodigal output to a BED feature table, and used bedtools multicov and the original BAM file to produce a per-feature read depth. This gives per-feature read depths which are not identical to what I got by summing the depths but roughly a multiple (i.e. plotting the depths per orf from both methods against each other gives an approximately linear relationship). The output from pileup.sh, on the other hand, doesn't give anything close to a linear relationship. Is there something different in how pileup.sh calculates the per-orf coverage? Thanks a lot, Brandon |
Brandon,
I did introduce a bug recently when I added support for tracking only read start positions rather than total coverage, which manifested in some situations. It's fixed now and I just uploaded the fixed version (33.57). Would you mind downloading that and confirming whether it works correctly? Thanks! |
Hi Brian,
Thanks for your reply. Unfortunately the output seems to be the same. Should I send you the output from pileup.sh vs. the output from bedtools so that you can see what I mean? Best, Brandon Quote:
|
Quote:
|
Hi Brian,
I would be interested in knowing when you intend to publish BBmap in a paper, can you enlighten us? Cheers |
I hope to have it submitted by the end of this year.
-Brian |
Hi Brian,
Thank you for previous answer through email which I am posting: "Daniel,I am trying to find a value in mapping summary to would correspond to bowtie2's '% overall alignment rate' Also, which BBMap parameters should I tweak if I want to accurately want to reproduce bowtie2 results obtained using "-k 1 --fast-local" ? |
Daniel,
Quote:
Code:
Pairing data: pct reads num reads pct bases num bases "fast local maxindel=10" |
de-interleave fastq outtput
Can BBMap de-interleave the fastq output for reads 1 and reads 2?
Ie: if I Use "pairedonly=f outu=${B}.noplant.fastq.gz", can BBMap put the unmapped forward reads ${B}_R1.noplant.fastq.gz and the unmapped reverse reads in ${B}_R2.noplant.fastq.gz. Or should I just de-interleave the reads as a post process? |
Daniel,
You can set "outu1=" and "outu2=" or "outm1=" and "outm2=" for output in paired files. |
Issues with BBMap
Hi
I am a newbie to NGS data analysis- basically a bench scientist trying to do analysis. I have sequenced human tissue samples on the PacBio RSII and am trying to align to human genome using the bbmap tool. As mentioned in the forum, I am trying to use the MapPacBio.sh script for the alignment. Previously I was getting an error where it could not find the class align2.MapPacBio . I hard coded the path in the file and then Here is the new error I am getting: java -Djava.library.path=/opt/compsci/bbmap/33.73/jni/ -ea -Xmx51798m -cp BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/data/*****/PacBio/****/consensus_isoforms.fasta out=/data/***/pac88 ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa Error: Could not find or load main class build=1 Java version is "1.7.0_51" OpenJDK Runtime Environment (rhel-2.4.4.1.el6_5-x86_64 u51-b02) OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode) Could you please help? Thanks Prashant |
@prashant_singh. I am hardly an expert on BBTools. Brian will probably log in soon -- he is on USA west coast time as far as I know -- and so will give a better answer. But I happened to start up a mapPacBio run just as your post came in. Comparing your Java to my Java reveals:
Yours ... Code:
java Code:
java You mentioned: Quote:
|
That's correct, and yes, I'm on West Coast time :) The command line should probably be:
java -ea -Xmx51798m -cp /opt/compsci/bbmap/33.73/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/data/*****/PacBio/****/consensus_isoforms.fasta out=/data/***/pac88.sam ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa Note both of the things in bold: "/opt/compsci/bbmap/33.73/current/" is probably correct, but maybe not, depending on how it was installed. I'm guessing that somewhere under "/opt/compsci/bbmap/33.73/" is a subdirectory named "current" - that is what you want to point it to. For "out=/data/***/pac88.sam", I just added the ".sam" suffix. All of the programs in BBTools are extension-sensitive, so if you don't include an extension, it will just output the default format (which in this case is sam, but it's always best to specify it). |
I apologize if this has been covered already but I've just downloaded and run bbmap a few times (seems great so far and I like the control provided via the available command line options) and I have a quick question. When building the index is it possible for bbmap to ignore parts of the reference names past the first blank space? For example I'm using the Ensembl human release and the FASTA reference names look something like this:
>1 dna_sm:chromosome chromosome:GRCh37:1:1:249250621:1 REF Currently when I run alignments on this FASTA file that entire name shows up in the SAM file while instead I should only see "1". Other aligners (STAR, bowtie, bwa, etc) throw out everything after the first space when indexing leaving just the main reference name. This is particularly important for using aligners as part of a gene expression pipeline using something like eXpress since some transcriptome extraction tools (programs that parse a GTF and extract transcripts from an accompanying genome FASTA file) usually name the transcripts as >TRANSCRIPT_ID GENE_SYMBOL. eXpress will throw out everything after the first space however if the alignments contain the full reference names then no alignments are matched up to transcripts in the output. |
Yes, I put in an option for that because some programs require it. The flag is trd=t (or trimreaddescriptions=true), which will truncate all names (both read names and reference names) at the first whitespace symbol.
|
rad. thanks for that.
|
All times are GMT -8. The time now is 09:12 AM. |
Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.