Version 1: Don't write an index to disk. This usually makes startup slower, but it is more I/O efficient. It's the simplest and best approach if you are only mapping to a reference once (like when evaluating multiple assemblies).

Code:
bbmap.sh in=/data/reads.fq ref=/fastas/hg18.fa out=/output/mapped18.sam nodisk
bbmap.sh in=/data/reads.fq ref=/fastas/hg19.fa out=/output/mapped19.sam nodisk

Version 2: Write each index to its own directory.

Code:
bbmap.sh ref=/fastas/hg18.fa path=/human/hg18/
bbmap.sh ref=/fastas/hg19.fa path=/human/hg19/
bbmap.sh in=/data/reads.fq path=/human/hg18/ out=/output/mapped18.sam
bbmap.sh in=/data/reads.fq path=/human/hg19/ out=/output/mapped19.sam

Version 3: Write indices to the same directory, but specify a build number.

Code:
bbmap.sh ref=/fastas/hg18.fa path=/human/ build=18
bbmap.sh ref=/fastas/hg19.fa path=/human/ build=19
bbmap.sh in=/data/reads.fq path=/human/ build=18 out=/output/mapped18.sam
bbmap.sh in=/data/reads.fq path=/human/ build=19 out=/output/mapped19.sam

If you don't specify a build number, the default is always 1.
Code:
bbmap.sh in=reads.fq outu1=r1.fq outu2=r2.fq

...produces 2 files, r1.fq and r2.fq. You can only specify output streams 1 and 2 for fasta or fastq output, though, not sam. Also, by default, whenever you output paired reads to a single file, they will come out interleaved (read 1, read 2, read 1, read 2, etc.). You can split them back into 2 files with Reformat:

Code:
reformat.sh in=reads.fq out1=read1.fq out2=read2.fq
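Under the hood, de-interleaving a 4-line-record FASTQ is simple enough to sketch in plain shell (this is not BBTools code; reformat.sh is the supported way, and the filenames here are made up for illustration):

```shell
# Create a tiny interleaved FASTQ: read 1 and read 2 alternate,
# with 4 lines per record (header, sequence, '+', qualities).
cat > reads.fq <<'EOF'
@r1/1
ACGT
+
IIII
@r1/2
TGCA
+
IIII
EOF

# De-interleave: odd-numbered records go to read1.fq, even to read2.fq.
awk '{ if (int((NR-1)/4) % 2 == 0) print > "read1.fq"; else print > "read2.fq" }' reads.fq
```

This relies on every record being exactly 4 lines, which is why a tool like reformat.sh (which actually parses the records) is safer on real data.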
As for what you noticed with the pairlen flag, I had not noticed that - sounds like a possible bug. I will look into it.
I have uploaded a new version of BBTools, v32.06. Some improvements:
Shellscripts: Much better memory detection; should be able to correctly detect free physical and virtual memory in Linux, so you won't have to set the -Xmx flag.

All programs: Added 'append' flag, which allows you to append to existing output files rather than overwriting.

BBMap (bbmap.sh):
- Now reports insert size median and standard deviation (as long as you have the "ihist" flag set).
- Added 'qahist' flag, which outputs the observed quality of bases for each claimed quality value. Works for both indel and substitution error models.
- Added 'bhist' flag, which outputs the frequency of ACGTN by base position. Reformat and BBDuk also support the 'bhist' and 'qhist' flags.
- 'pairlen' flag now works correctly (for capping the max allowed insert size).
- IUPAC reference codes are now converted to N prior to mapping.
- Statistics are now reported both by number of reads and number of bases.
- Added BBWrap (bbwrap.sh), a wrapper for BBMap that allows you to map multiple sets of reads while only loading the index once (important for huge indexes and small read sets). The mapped reads can be written all to the same output file, or to one output file per input file.
- Slightly increased accuracy.
- PacBio reads can now be mapped in fastq format.

BBMerge (bbmerge.sh): Now supports dual output files for unmerged reads, 'outu1' and 'outu2'. Improved accuracy.

BBDuk (bbduk.sh): Much faster when used without any reference (e.g. for quality trimming).

RandomReads: Much better simulated PacBio reads.

Thanks to everyone at seqanswers who has given me feedback!
Thanks!
Hi Brian,
I can't seem to find any documentation on how to apply specific error models when simulating Illumina or PacBio reads. Perhaps I am misunderstanding something, though.
The new memory-detection system should work on Linux/UGE clusters (it works here). The shellscript calculates the system's free virtual memory, free physical memory, and 'ulimit -v', then invokes Java with the minimum of those 3 values; thus, hopefully, it will never fail initialization and never use more RAM than is allowed. Here, at least, it's impossible to use more RAM than allowed anyway, because our cluster will detect that and kill the process. Our clusters use UGE (formerly known as SGE) and set 'ulimit' according to the qsub memory request. I don't know if that is universal or just a local policy, but at least here, users should be able to use the shellscripts without specifying -Xmx, and they should work. I'd be happy if someone elsewhere tried it without setting -Xmx and reported what happens.

Threads are a little more difficult. For now, unless you have exclusive access to a node, I recommend that you specify the number of threads allowed (e.g. t=4 if you reserved 4 slots). There is an environment variable "NSLOTS" that apparently gets set by UGE, which I will parse in my next release, defaulting to the minimum of the system's reported logical cores and NSLOTS. But it appears not to always be correct, and furthermore, hyperthreading makes the matter more confusing (BBMap does not benefit much from hyperthreading, but many of the other BBTools do, like BBNorm).

Also, there are basically 4 types of programs in BBTools, resource-wise:

1) Super-heavyweight: Requests all available threads and memory by default, but can be capped with the '-Xmx' and 't' flags. All of these are multithreaded and require an amount of memory dependent on the input, which the shellscript can't calculate. This includes bbmap.sh, bbsplit.sh, bbwrap.sh, mapPacBio.sh, mapPacBio8k.sh, bbnorm.sh, ecc.sh, khist.sh, bbduk.sh, countkmersexact.sh, and dedupe.sh.

2) Heavyweight: Requests all available RAM but only one primary thread. Pipeline multithreading (input thread, processing thread, output thread) cannot be disabled, so the 't' flag has no effect unless you are doing multithreaded compression using pigz. This includes pileup.sh, randomreads.sh, bbcountunique.sh, and bbmask.sh.

3) Midweight: Requests all available threads by default, but can be capped with the 't' flag. These need little memory, so the -Xmx flag is hard-coded at an appropriate value that should work for all inputs (ranging from 100m to 400m) and is not affected by available memory, though you can still override it. This includes bbmerge.sh.

4) Lightweight: Needs minimal memory and only one primary thread, so again the -Xmx flag is hard-coded to a low value (at most 200m) rather than dependent on autodetection. Again, the 't' flag has no effect unless you are doing multithreaded compression using pigz. This includes reformat.sh, stats.sh, statswrapper.sh, readlength.sh, bbest.sh, bbsplitpairs.sh, gradesam.sh, translate6frames.sh, and samtoroc.sh.

Note that if you have pigz installed, you can accelerate all BBTools performance on gzipped input or output using the "unpigz" and "pigz" flags (which can be set to 't' or 'f'). pigz allows multithreaded .gz compression and decompression, and is both faster and more efficient (even with only a single thread) than Java or gzip. So with pigz installed, even a mostly singlethreaded program like reformat.sh can (and by default, will) use all available threads if you write gzipped output:

Code:
reformat.sh in=read1.fq.gz in2=read2.fq.gz out=interleaved.fasta.gz zl=8

...will compress the output to gzip compression level 8. If pigz is installed, it will be used for both compression and decompression instead of Java, resulting in decompression using around 1.5 cores per input file and compression using all allowed cores. You can prevent this with the 't=1' flag (to cap compression at 1 thread) or 'unpigz=f' and 'pigz=f' (to disable decompression/compression in pigz processes).

Due to some bugs in UGE (which can kill processes that fork in certain circumstances), pigz is by default set to false in high-memory processes and true in low-memory processes. If you have not tried pigz, I highly recommend it! It's great, and it will become increasingly important as core counts increase. It's a drop-in replacement for gzip - same command line, 100% inter-compatible, but multithreaded and more efficient per thread. The only downside is increased memory usage, but it's not bad - around 12 MB per thread. Compression (for genomic data) is within ~0.1% of gzip at the same compression level. As I mentioned, decompression uses at most 1.5 cores in my tests, while compression can use all cores.
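The gzip/pigz interchangeability mentioned above is easy to demonstrate: compress with whichever of the two is installed and decompress with plain gunzip. A minimal sketch (the gzip branch is just a fallback for machines without pigz; '-8' mirrors the zl=8 example):

```shell
# Prefer pigz (multithreaded, drop-in gzip replacement) when available;
# otherwise fall back to single-threaded gzip at the same level.
if command -v pigz >/dev/null 2>&1; then
  COMPRESS="pigz -p 4 -8"
else
  COMPRESS="gzip -8"
fi

# Round-trip a small record to show the two formats are inter-compatible:
result=$(printf 'ACGTACGT' | $COMPRESS | gunzip -c)
echo "$result"
```

Because the on-disk format is identical, anything compressed with pigz can be decompressed by gzip and vice versa, which is what makes it a safe drop-in for BBTools' .gz handling.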
Sorry about that; randomreads is currently not documented. I will try to remedy that tomorrow, and let you know when it's done.
Thanks a lot Brian,
BBMap does have a lot of different tools.
BTW: The latest code for BBMap (32.06) seems to have shell script files in DOS format (this wasn't the case in previous versions).
By the way - I forgot to mention it, but I added the phiX reference and truseq adapters to the /resources/ directory, for use with BBDuk.
I released a new version (32.14) which now has RandomReads documented (run randomreads.sh with no arguments). By default it will use an Illumina error model, which assigns qualities in a pattern that looks roughly like the average quality histogram of Illumina reads (peaking at about position 20 and sloping down toward the head and tail). Then some bases will be randomly changed, based on the quality score. To generate Illumina-like reads from E. coli:

Code:
randomreads.sh ref=ecoli.fa reads=100000 length=150 maxq=40 midq=30 minq=10 out=synth.fq

It can also generate paired reads if desired. To generate PacBio-like reads:

Code:
randomreads.sh ref=ecoli.fa reads=100000 minlength=200 maxlength=4000 pacbio pbminrate=0.13 pbmaxrate=0.17 out=synth.fq

That will add substitutions and single-base-pair indels with per-base probability ranging from 0.13 to 0.17. There are also a lot of other options in the help menu for adding specific numbers or lengths of insertions, deletions, substitutions, and Ns.
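The quality-based mutation step described above rests on the standard Phred relationship between a quality score Q and the per-base error probability, P = 10^(-Q/10). A quick awk illustration (not BBTools code, just the arithmetic):

```shell
# Phred Q -> per-base error probability: Q10 = 10%, Q20 = 1%, Q30 = 0.1%.
for q in 10 20 30 40; do
  awk -v q="$q" 'BEGIN { printf "Q%d -> %.4f\n", q, 10^(-q/10) }'
done
```

So in the Illumina model, bases near position 20 (highest quality) are mutated rarely, while the lower-quality head and tail bases are mutated much more often.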
Thanks a lot for the randomreads instructions Brian!
Because I interpreted the Cufflinks documentation to mean that it will use the XM tag to down-weight a read by the number of sites it maps to, I decided to set ambig=all. For my RNA-seq I therefore chose: Code:
maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=T
Thias,
For analysis of vertebrate RNA-seq data with Cufflinks, I do recommend all of those parameters:

Code:
maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=t

...although "ambig=all" or "ambig=random" would both be OK. The "intronlen", "xstag" and "xmtag" flags are things I added specifically for Cufflinks compatibility (or the Tuxedo package in general), so they are not needed if you do downstream analysis with other tools. The only ones that affect the actual mapping are "ambig" and "maxindel"; xstag and xmtag just add optional tags, and intronlen changes the "D" symbol in cigar strings to "N" for deletions longer than the specified length. There's no reason why any of them should be necessary, but for whatever reason, Cufflinks needs them. Note that if your data is strand-specific and you know which protocol was used, then it's better to set "xstag=firststrand" or "xstag=secondstrand" than unstranded.
Thanks Brian for your response almost at warp speed.
Unfortunately I am still a bit lost - probably also due to my ignorance regarding the details of the SAM format specification. This is how I assumed the setting would take effect:
I admit, I could easily have checked whether ambig=random really does not report secondary alignments, but I was pretty confident I had understood the logic - until I read your post this afternoon. Thanks also for emphasizing the xstag's importance. However, the data is indeed unstranded, unpaired, phred64-age, low-coverage, dusty archaeological heritage from a previous PhD student :rolleyes:.
Dear Brian,
I tested bbmap successfully on a small dataset and so far I'm very happy with the results :) However, I'm running into some memory problems using my real dataset. My RAM currently looks as follows:
Normally, "Reducing threads from 16 to 15 due to low system memory" only occurs if you are right at the edge of the memory limits. It would be helpful to know:

1) The amount of physical memory available (it looks like 64GB, is that correct?)
2) The size of the reference you are using.
3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh.
4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space". Actually, the entire standard error output would be helpful.

You can reduce the amount of memory BBMap needs by reducing the number of threads (e.g. "t=8" to drop it to 8 threads). But that normally only helps with the PacBio modes, which need much more memory per thread (hundreds of megabytes), not with standard bbmap, which only uses around 25MB per thread. Assuming you do have 64GB, I would try the flag -Xmx58g and see if that works. But regardless, it would help if you could post the above details.

Thanks,
Brian
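For anyone curious what automatic -Xmx selection can look like, here is a rough sketch in plain shell (Linux only; the 85% headroom factor and the use of MemAvailable are illustrative assumptions, not BBMap's actual constants, and the real shellscripts also consider free virtual memory):

```shell
# Read available physical memory from /proc/meminfo (kB, needs a modern
# Linux kernel for the MemAvailable field), cap it by 'ulimit -v' when a
# limit is set, then leave some headroom for the JVM itself before
# emitting an -Xmx value in megabytes.
phys_kb=$(awk '/MemAvailable/ {print $2; exit}' /proc/meminfo)
vlimit_kb=$(ulimit -v)
[ "$vlimit_kb" = "unlimited" ] && vlimit_kb=$phys_kb
mem_kb=$(( phys_kb < vlimit_kb ? phys_kb : vlimit_kb ))
xmx_mb=$(( mem_kb * 85 / 100 / 1024 ))
echo "-Xmx${xmx_mb}m"
```

On a UGE node where qsub sets 'ulimit -v' from the memory request, the ulimit branch is what keeps the JVM inside the job's allocation.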
It was indeed increasing -Xmx to 58g which did the trick. The threads (t) parameter, however, didn't help at all. So, when I run with -Xmx58g everything works fine (independent of t=8, t=16, or leaving t=auto). When I run with -Xmx30g and t=auto the program quits with the following error:

Concerning your questions:
1) The amount of physical memory available (it looks like 64GB, is that correct?) - yes
2) The size of the reference you are using. - it's the human genome assembly 19 from UCSC (~3.1 Gbp)
3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh - mapPacBio8k.sh
4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space" - see quotes below
With bbmap.sh, HG19 needs about 23 GB. I had never tested HG19 with mapPacBio8k.sh but it looks like I need to refine the memory prediction. Thanks for the report!