SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 03:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 10:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 07:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 04:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 02:10 AM

Reply
 
Thread Tools
Old 04-15-2014, 10:46 AM   #21
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by HGV View Post
Hi Brian,
Very decent mapper that you wrote, and supergreat that it is finally
available.

I was playing around with bbmap, sooo many cool features and an impressive
speed!
Thanks! Glad you like it.

Quote:
But I could not figure out a couple of things:
1) How to point to directories with path=
I would like to be able to use a set of indices that I created previously
and that are stored in a specific 'databases' path mounted on all nodes of
our cluster. But I did not get this to work setting path= during the
mapping process because it changes where bbmap searches for the reads. My
second trial was to call bbmap in this folder of the references and set
path to the folder of the reads, but then I always got an error that the
read file was not found. The only thing that worked for me is calling bbmap
from a folder which also includes the /ref folder, but this means copying
both reads and refs accross the filesystem wherever I need them. We were mostly using bowtie2 up to now and in bowtie2 I can point to absolute paths for references, reads and outputs. Would be cool to be able to handle files in
bbmap similarly
BBMap can use absolute paths for all of those, but the syntax is a bit different. Let's say you have 2 builds of the human genome, HG18 and HG19, at /fastas/hg18.fa and /fastas/hg19.fa, reads at /data/reads.fq, and you want to write output to /output/. There are 3 ways you could handle this.

Version 1: Don't write an index to disk. This usually makes the startup slower, but is more I/O efficient. It's the simplest and best if you are only mapping to a reference once (like when evaluating multiple assemblies).
bbmap.sh in=/data/reads.fq ref=/fastas/hg18.fa out=/output/mapped18.sam nodisk
bbmap.sh in=/data/reads.fq ref=/fastas/hg19.fa out=/output/mapped19.sam nodisk


Version 2: Write each index to its own directory.
bbmap.sh ref=/fastas/hg18.fa path=/human/hg18/
bbmap.sh ref=/fastas/hg19.fa path=/human/hg19/

bbmap.sh in=/data/reads.fq path=/human/hg18/ out=/output/mapped18.sam
bbmap.sh in=/data/reads.fq path=/human/hg19/ out=/output/mapped19.sam


Version 3: Write indices to the same directory, but specify a build number.
bbmap.sh ref=/fastas/hg18.fa path=/human/ build=18
bbmap.sh ref=/fastas/hg19.fa path=/human/ build=19

bbmap.sh in=/data/reads.fq path=/human/ build=18 out=/output/mapped18.sam
bbmap.sh in=/data/reads.fq path=/human/ build=19 out=/output/mapped19.sam


If you don't specify a build number, the default is always 1.

Quote:
2) using outm
I am mapping shotgun metagenomic illumina paired end reads to references
that are gene databases. I was expecting to get different output for out=
and outm= but the files produced are identical. I would expect to see some
read pairs where only one of the reads maps to the gene database and the
other not, and as far as I understood out= gives me the mapping pairs and
outm= gives me the mapping pairs and the single mapping reads with their
pairs.
For paired reads, outm captures all reads that are mapped OR have a mate that is mapped. You can change this behavior with the flag "po" which stands for "pairedonly". The default is "po=f". If you set "po=t" then reads will be unmapped unless they are paired. So, if only one read maps, or if they both map but are not in a valid pairing configuration, they will go to "outu" instead of "outm". "out" will get both of them no matter what.
Quote:
3) how can I get sorted unmapped pairs written to a file?
While outm1=reads.f.fq and outm2=reads.r.fq gives me the mapped pairs, outu always writes everything to a single file (no outu2= possible)
Are you sure? It works for me... though it is not mentioned in the documentation. I'll clarify that.
bbmap.sh in=reads.fq outu1=r1.fq outu2=r2.fq
produces 2 files, r1.fq and r2.fq
You can only specify output streams 1 and 2 for fasta or fastq output, though, not sam. Also - by default, whenever you output paired reads to a single file, they will come out interleaved (read 1, read 2, read 1, read 2, etc). You can split them back into 2 files with reformat:
reformat.sh in=reads.fq out1=read1.fq out2=read2.fq
Quote:
4) I was trying to limit the insert sizes allowed in the paired end mapping with e.g. pairlen=1000, but the output still reported exactly the same mappings with insert sizes way higher, often in the multi kb range for PE-reads. This also greatly affects the average insert size reported... What am I doing wrong? It would also be very cool to get the standard deviation put out, as well as the median. One can calculate these things from the very useful histogram files that inserthistogram=file outputs, but that is not as convenient.
Actually, I only track the average insert size, not the median, because that's the easiest, though I admit the median and SD would be more useful. I'll add that, since I'm tracking the information anyway (if you specify inserthistogram).

As for what you noticed with the pairlen flag, I had not noticed that - sounds like a possible bug. I will look into it.

Quote:
Keep up the good work
Harald
I will. Thanks for the feedback!
Brian Bushnell is offline   Reply With Quote
Old 04-16-2014, 05:02 PM   #22
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I have uploaded a new version of BBTools, v32.06. Some improvements:


Shellscripts:

Much better memory detection; should be able to correctly detect free physical and virtual memory in Linux so you won't have to set the -Xmx flag.


All programs:


Added append flag; allows you to append to existing output files rather than overwriting.


BBMap (bbmap.sh):

Now reports insert size median and standard deviation (as long as you have the "ihist" flag set).

Added 'qahist' flag, which outputs the observed quality of bases for each claimed quality value. Works for both indel and substitution error models.

Added 'bhist' flag, which outputs the frequency of ACGTN by base position. Reformat and BBDuk.sh also support 'bhist' and 'qhist' flags.

'pairlen' flag now works correctly (for capping the max allowed insert size).

IUPAC reference codes are now converted to N prior to mapping.

Statistics are now reported both by number of reads and number of bases.

Added BBWrap (bbwrap.sh), a wrapper for BBMap that allows you to map multiple sets of reads while only loading the index once (important for huge indexes and small read sets). The mapped reads can be written all to the same output file, or to one output file per input file.

Slightly increased accuracy.

PacBio reads can now be mapped in fastq format.


BBMerge (bbmerge.sh):


Now supports dual output files for unmerged reads, 'outu1' and 'outu2'.

Improved accuracy.


BBDuk (bbduk.sh):

Much faster when used without any reference (e.g. for quality trimming).


RandomReads:


Much better simulated PacBio reads.


Thanks to everyone at seqanswers who has given me feedback!
Brian Bushnell is offline   Reply With Quote
Old 04-16-2014, 05:33 PM   #23
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,579
Default

Quote:
Originally Posted by Brian Bushnell View Post

Shellscripts:

Much better memory detection; should be able to correctly detect free physical and virtual memory in Linux so you won't have to set the -Xmx flag.
Brian: I am not nitpicking but many of us run BBMap on clusters via queuing systems so setting that flag (and the threads=n) is critical (to prevent BBMap from overstepping the queue resource limits).

Thanks!
GenoMax is offline   Reply With Quote
Old 04-16-2014, 07:08 PM   #24
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 308
Default

Hi Brian,

I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.

Quote:
Originally Posted by Brian Bushnell View Post
I have uploaded a new version of

RandomReads:


Much better simulated PacBio reads.
!
luc is offline   Reply With Quote
Old 04-16-2014, 07:14 PM   #25
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by GenoMax View Post
Brian: I am not nitpicking but many of us run BBMap on clusters via queuing systems so setting that flag (and the threads=n) is critical (to prevent BBMap from overstepping the queue resource limits).

Thanks!
For clusters, I recommend setting the -Xmx flag to indicate the amount of memory you requested and 'threads' (or 't') to the number of slots you have requested. We (JGI) run on clusters too, but most of the nodes are exclusive-only, so all hardware resources are available to the user.

The new memory-detection system should work on Linux/UGE clusters (it works here). The shellscript calculates the system's free virtual memory, free physical memory, and 'ulimit -v'. Then it will invoke Java with the minimum of those 3 values, and thus, hopefully, will never fail initialization, and never use more RAM than is allowed. Although here, at least, it's impossible to use more RAM than allowed because our cluster will detect that and kill the process.

Our clusters use UGE (formerly known as SGE) and set 'ulimit' according to the qsub memory request. I don't know if that is universal or just a local policy, but at least here, users should be able to use the shellscripts without specifying -Xmx, and they should work. I'd be happy if someone elsewhere tried it without setting -Xmx and reported what happens.

Threads is a little more difficult. For now, unless you have exclusive access to a node, I recommend that you specify the number of threads allowed (e.g. t=4 if you reserved 4 slots). There is an environment variable "NSLOTS" that apparently gets set by UGE, which I will parse in my next release, and default to using the minimum of the system's reported logical cores and NSLOTS. But it appears to not always be correct, and furthermore, hyperthreading makes the matter more confusing (BBMap does not benefit much from hyperthreading, but many of the other BBTools do, like BBNorm).

Also - there are basically 4 types of programs in BBTools, resource-wise:

1) Super-heavyweight:
Requests all available threads and memory by default, but can be capped with the '-Xmx' and 't' flags. That's because all are both multithreaded and require an amount of memory dependent on the input, which the shellscript can't calculate. This includes bbmap.sh, bbsplit.sh, bbwrap.sh, mapPacBio.sh, mapPacBio8k.sh, bbnorm.sh, ecc.sh, khist.sh, bbduk.sh, countkmersexact.sh, and dedupe.sh.

2) Heavyweight:
Requests all available RAM but only one primary thread. Pipeline-multithreading (input thread, processing thread, output thread) cannot be disabled, so the 't' flag has no effect unless you are doing multithreaded compression using pigz. Includes pileup.sh, randomreads.sh, bbcountunique.sh, and bbmask.sh.

3) Midweight:
Requests all available threads by default, but can be capped with the 't' flag; needs little memory, so the -Xmx flag is hard-coded at an appropriate value that should work for all inputs (ranging from 100m to 400m) and not affected by available memory, though you can still override it. This includes bbmerge.sh.

4) Lightweight:
Needs minimal memory and only one primary thread, so again the -Xmx flag is hard-coded to a low value (at most 200m) rather than dependent on autodetection. Again, 't' flag has no effect unless you are doing multithreaded compression using pigz. This includes reformat.sh, stats.sh, statswrapper.sh, readlength.sh, bbest.sh, bbsplitpairs.sh, gradesam.sh, translate6frames.sh, and samtoroc.sh.

Note that if you have pigz installed, you can accelerate all BBTools performance on gzipped input or output using the "unpigz" and "pigz" flags (which can be set to 't' or 'f'). pigz allows multithreaded .gz compression and decompression, and is both faster and more efficient (even with only a single thread) than Java or gzip. So with pigz installed, even a mostly singlethreaded program like reformat.sh can (and by default, will) use all all available threads if you write gzipped output:

reformat.sh in=read1.fq.gz in2=read2.fq.gz out=interleaved.fasta.gz zl=8

..will compress the output to gzip compression level 8. If pigz is installed, it will use pigz for both compression and decompression instead of Java, resulting in decompression using around 1.5 cores per input file and compression using all allowed cores. You can prevent this with the 't=1' flag (to cap compression at 1 thread) or 'unpigz=f' and 'pigz=f' (to disable decompression/compression in pigz processes).

Due to some bugs in UGE (which can kill processes that fork in certain circumstances), pigz is by default set to false in high-memory processes and true in low-memory processes.

If you have not tried pigz, I highly recommend it! It's great and will become increasingly important as core counts increase. It's a drop-in replacement for gzip - same command line, 100% inter-compatible, but multithreaded and more efficient per thread. The only downside is increased memory usage, but it's not bad - around 12 MB per thread. Compression (for genomic data) is within ~0.1% of gzip at the same compression level. As I mentioned, decompression uses at most 1.5 cores in my tests, while compression can use all cores.

Last edited by Brian Bushnell; 04-17-2014 at 09:46 AM.
Brian Bushnell is offline   Reply With Quote
Old 04-16-2014, 07:45 PM   #26
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by luc View Post
Hi Brian,

I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.
Luc,

Sorry about that; randomreads is currently not documented. I will try to remedy that tomorrow, and let you know when it's done.
Brian Bushnell is offline   Reply With Quote
Old 04-16-2014, 10:15 PM   #27
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 308
Default

Thanks a lot Brian,

BBMap does have a lot of different tools.
luc is offline   Reply With Quote
Old 04-17-2014, 07:51 AM   #28
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,579
Default

Quote:
Originally Posted by Brian Bushnell View Post
Our clusters use UGE (formerly known as SGE) and set 'ulimit' according to the qsub memory request. I don't know if that is universal or just a local policy, but at least here, users should be able to use the shellscripts without specifying -Xmx, and they should work. I'd be happy if someone elsewhere tried it without setting -Xmx and reported what happens.
With LSF, not setting the -Xmx flag is leading to default allocation of 4G (at least on our cluster, things may be set up differently at other locations) which results in Java running out of heap space. I am also not using exclusive nodes option so far and it seems to be working fine.

BTW: The latest code for BBMap (32.06) seems to have shell script files in DOS format (wasn't the case before on previous versions).
GenoMax is offline   Reply With Quote
Old 04-17-2014, 09:42 AM   #29
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by GenoMax View Post
With LSF, not setting the -Xmx flag is leading to default allocation of 4G (at least on our cluster, things may be set up differently at other locations) which results in Java running out of heap space. I am also not using exclusive nodes option so far and it seems to be working fine.
Hmm, thanks for testing. Sounds like LSF is setting the ulimit too low, probably for 1 slot no matter how many slots you reserve, I'm guessing.

Quote:
BTW: The latest code for BBMap (32.06) seems to have shell script files in DOS format (wasn't the case before on previous versions).
Thanks for noticing that; I've uploaded a new version 32.07 which fixes it. They work fine for me but some versions of bash can't process DOS encoding.

By the way - I forgot to mention it, but I added the phiX reference and truseq adapters to the /resources/ directory, for use with BBDuk.
Brian Bushnell is offline   Reply With Quote
Old 04-24-2014, 10:17 AM   #30
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by luc View Post
Hi Brian,

I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.
Luc,

I released a new version (32.14) which now has RandomReads documented (run randomreads.sh with no arguments).

By default it will use an Illumina error model, which assigns qualities in a pattern that looks roughly like the average quality histogram of Illumina reads (peaking at about position 20, and sloping down toward the head and tail). Then some bases will by randomly changed based on the quality score. To generate Illumina-like reads from e.coli:
randomreads.sh ref=ecoli.fa reads=100000 length=150 maxq=40 midq=30 minq=10 out=synth.fq
It can also generate paired reads if desired.

To generate PacBio-like reads:
randomreads.sh ref=ecoli.fa reads=100000 minlength=200 maxlength=4000 pacbio pbminrate=0.13 pbmaxrate=0.17 out=synth.fq

That will add substitutions and single-base-pair indels with per-base probability ranging from 0.13 to 0.17.

There are also a lot of other options for adding specific numbers or lengths of insertions, deletions, substitutions, and Ns, in the help menu.
Brian Bushnell is offline   Reply With Quote
Old 04-24-2014, 03:58 PM   #31
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 308
Default

Thanks a lot for the randomreads instructions Brian!
luc is offline   Reply With Quote
Old 04-26-2014, 09:22 AM   #32
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 42
Default

Quote:
Originally Posted by Brian Bushnell View Post
bbmap.sh in1=read1.fq.gz in2=read2.fq.gz ref=zv9.fa out=mapped.sam maxindel=200000
@ Brian: Is maxindel the only bbmap parameter you recommend to change for mapping reads for subsequent Cufflinks analysis?

Because I interpetreted the Cufflinks Doc in a way, that it will use the xmtag to lower the read weight by the amount of sites it maps to and therefore decided to set ambig=all.

I therefore chose to set for my RNAseq:

Code:
maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=T
Was I misled?
Thias is offline   Reply With Quote
Old 04-26-2014, 10:11 AM   #33
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Thias,

For analysis of vertebrate RNA-seq data with Cufflinks, I do recommend all of those parameters:
maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=t

Although "ambig=all" or "ambig=random" would both be OK. The "intronlen", "xstag" and "xmtag" are things I added specifically for Cufflinks compatibility (or the tuxedo package in general) so they are not needed if you do downstream analysis with other tools. The only ones that affect the actual mapping are "ambig" and "maxindel". xstag and xmtag just add optional tags, and intronlen changes the "D" symbol in cigar strings to "N" for deletions longer than the specified length. There's no reason why any of them should be necessary, but for whatever reason, Cufflinks needs them.

Note that if you your data is strand-specific, and you know which protocol it was, then it's better to set "xstag=firststrand" or "xstag=secondstrand" than unstranded.

Last edited by Brian Bushnell; 04-26-2014 at 10:13 AM.
Brian Bushnell is offline   Reply With Quote
Old 04-26-2014, 02:41 PM   #34
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 42
Wink

Thanks Brian for your response almost at warp speed.

Unfortunately am still a bit lost - probably also due to my ignorance regarding the detail SAM Format Specification. This is, how I assumed the setting takes effect:
  • ambig=all:
    All best alignments are reported. All but one are flagged with the secondary alignment bit (0x100) and the xmtag (an integer ?) additionally reports the number of best alignments for the tuxedo package.
  • ambig=random:
    From the best alignments, one is randomly chosen. This alignment is reported as if there would not have been any secondary alignments. Only the xmtag > 1 indicates, that there would have been other best alignments as well, but the coordinates of those alignments are not present in the output.

I admit, I could easily have checked, whether ambig=random really does not report secondary alignments, but I was pretty confident to have understood the logic - until I read your post this afternoon.

Thanks also for emphasizing the xstag's importance. However the data is indeed unstranded unpaired phred64-age low-coverage dusty archaeological heritage from a previous PhD student .
Thias is offline   Reply With Quote
Old 04-26-2014, 06:36 PM   #35
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by Thias View Post
Thanks Brian for your response almost at warp speed.

Unfortunately am still a bit lost - probably also due to my ignorance regarding the detail SAM Format Specification. This is, how I assumed the setting takes effect:
  • ambig=all:
    All best alignments are reported. All but one are flagged with the secondary alignment bit (0x100) and the xmtag (an integer ?) additionally reports the number of best alignments for the tuxedo package.
  • ambig=random:
    From the best alignments, one is randomly chosen. This alignment is reported as if there would not have been any secondary alignments. Only the xmtag > 1 indicates, that there would have been other best alignments as well, but the coordinates of those alignments are not present in the output.

I admit, I could easily have checked, whether ambig=random really does not report secondary alignments, but I was pretty confident to have understood the logic - until I read your post this afternoon.

Thanks also for emphasizing the xstag's importance. However the data is indeed unstranded unpaired phred64-age low-coverage dusty archaeological heritage from a previous PhD student .
That sounds correct. ambig=all reports a randomly selected primary site and all other equal-scoring sites as secondary. ambig=random only reports a primary site, selected at random from top sites. If you use "random" then possibly (probably?) you shouldn't include the xmtag, as that might confuse cufflinks. Also, ambiguously-mapped reads get the "XT:A:R" tag. Note that any tag that starts with "X" is not part of the official sam specification.
Brian Bushnell is offline   Reply With Quote
Old 04-28-2014, 01:30 AM   #36
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 42
Default

Quote:
Originally Posted by Brian Bushnell View Post
If you use "random" then possibly (probably?) you shouldn't include the xmtag, as that might confuse cufflinks.
This in in accordance with my gut feeling, since I interpet the Cufflinks docs in the way, that the algorithm adjusts the read weight based on the xmtag:

Quote:
By default, Cufflinks will uniformly divide each multi-mapped read to all of the positions it maps to. In other words, a read mapping to 10 positions will count as 10% of a read at each position.
However Cufflinks in a later stage also needs the secondary alignment positions for the fine grained abundance estimation:

Quote:
Cufflinks will then re-estimate the abundances dividing each multi-mapped read probabalistically based on the initial abundance estimation of the genes it maps to
But thanks Brian for working this out!

Last edited by Thias; 04-28-2014 at 01:31 AM. Reason: Typo
Thias is offline   Reply With Quote
Old 05-07-2014, 12:57 AM   #37
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Dear Brian,

I tested bbmap successfully on a small dataset and so far I'm very happy with the results
However, I'm running into some memory problems using my real dataset:


My RAM looks currently as follows:
Quote:
total used free shared buffers cached
Mem: 66066316 42861240 23205076 0 3700 42478768
When I now run bbmap with different -Xmx setting or also without the parameter, it always quits with:

Quote:
Max Memory = 28633 MB (or other, depending on the value I used)
Available Memory = 10830 MB (ranging from 8Gb - 12Gb)
Reducing threads from 16 to 15 due to low system memory.
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
To me it looks like a bug within the memory detection or am I missing something?
WhatsOEver is offline   Reply With Quote
Old 05-07-2014, 10:11 AM   #38
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by WhatsOEver View Post
Dear Brian,

I tested bbmap successfully on a small dataset and so far I'm very happy with the results
However, I'm running into some memory problems using my real dataset:

...

To me it looks like a bug within the memory detection or am I missing something?
WhatsOEver,

Normally, "Reducing threads from 16 to 15 due to low system memory." only occurs if you are right at the edge of the memory limits. It would be helpful to know:
1) The amount of physical memory available (it looks like 64GB, is that correct?)
2) The size of the reference you are using.
3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh
4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space". Actually, the entire standard error output would be helpful.

You can reduce the amount of memory BBMap needs by reducing the number of threads (e.g. "t=8" to drop it to 8 threads). But that normally will only help with pacbio modes which need much more memory per thread (100s of megs), not with standard bbmap which only uses around 25MB per thread.

Assuming you do have 64GB, I would try with the flag -Xmx58g and see if that works. But regardless, it would help if you could post the above details.

Thanks,
Brian
Brian Bushnell is offline   Reply With Quote
Old 05-08-2014, 08:29 AM   #39
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Quote:
Originally Posted by Brian Bushnell View Post
WhatsOEver,

Normally, "Reducing threads from 16 to 15 due to low system memory." only occurs if you are right at the edge of the memory limits. It would be helpful to know:
1) The amount of physical memory available (it looks like 64GB, is that correct?)
2) The size of the reference you are using.
3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh
4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space". Actually, the entire standard error output would be helpful.

You can reduce the amount of memory BBMap needs by reducing the number of threads (e.g. "t=8" to drop it to 8 threads). But that normally will only help with pacbio modes which need much more memory per thread (100s of megs), not with standard bbmap which only uses around 25MB per thread.

Assuming you do have 64GB, I would try with the flag -Xmx58g and see if that works. But regardless, it would help if you could post the above details.

Thanks,
Brian
Hi Brian,

it was indeed increasing Xmx to 58g which did the trick. The threads (t) parameter, however, didn't help at all. So, when I run with -Xmx58g everything works fine (independent of t=8 or t=16 or leaving t=auto). When I run with -Xmx30g and t=auto the prog quits with the following error:

concerning your questions:
1) The amount of physical memory available (it looks like 64GB, is that correct?)
yes
2) The size of the reference you are using.
its the human genome assembly 19 from UCSC (~3.1Gb)
3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh
mapPacBio8k.sh
4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap
see quotes below

Quote:
java -ea -Xmx30g -cp /projects/bbmap/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 k=14 in=./121128_1.fastq out=/projects/testData/bbmap_121128_1_mapped.sam xstag=unstranded xmtag=t ambig=all maxindel=200000 qin=33 -Xmx30g
Executing align2.BBMapPacBio [build=1, overwrite=true, minratio=0.40, fastareadlen=6000, ambiguous=best, minscaf=100, startpad=10000, stoppad=10000, midpad=6000, k=14, in=./121128_1.fastq, out=/projects/testData/bbmap_121128_1_mapped.sam, xstag=unstranded, xmtag=t, ambig=all, maxindel=200000, qin=33, maxindel=150000, -Xmx30g]

BBMap version 32.14
Set overwrite to true
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.400
Set fastq input ASCII offset to 33
Retaining all best sites for ambiguous mappings.
Set genome to 1

Loaded Reference: 6.143 seconds.
Loading index for chunk 1-7, build 1
Generated Index: 9.693 seconds.
Analyzed Index: 24.885 seconds.
Started output stream: 0.038 seconds.
Cleared Memory: 0.530 seconds.

Max Memory = 28633 MB
Available Memory = 10826 MB
Reducing threads from 16 to 15 due to low system memory.
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at align2.MultiStateAligner9PacBio.<init>(MultiStateAligner9PacBio.java:73)
at align2.MSA.makeMSA(MSA.java:43)
at align2.AbstractMapThread.<init>(AbstractMapThread.java:130)
at align2.BBMapThreadPacBio.<init>(BBMapThreadPacBio.java:100)
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:378)
at align2.BBMapPacBio.main(BBMapPacBio.java:33)
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6

**************************************************************************

Warning! 15 mapping threads did not terminate normally.
Please check the error log; the output may be corrupt or incomplete.

**************************************************************************


Exception in thread "main" java.lang.RuntimeException: Aborting due to prior error.
at align2.AbstractMapper.abort(AbstractMapper.java:69)
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:381)
at align2.BBMapPacBio.main(BBMapPacBio.java:33)
When I run it with -Xmx30g and t=8 I get the same heap space error, although its not complaining that there is not enough memory available

Quote:
java -ea -Xmx30g -cp /projects/bbmap/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 k=14 t=8 in=./121128_1.fastq out=/projects/testData/bbmap_121128_1_mapped.sam xstag=unstranded xmtag=t ambig=all maxindel=200000 qin=33 -Xmx30g
Executing align2.BBMapPacBio [build=1, overwrite=true, minratio=0.40, fastareadlen=6000, ambiguous=best, minscaf=100, startpad=10000, stoppad=10000, midpad=6000, k=14, t=8, in=./121128_1.fastq, out=/projects/testData/bbmap_121128_1_mapped.sam, xstag=unstranded, xmtag=t, ambig=all, maxindel=200000, qin=33, maxindel=150000, -Xmx30g]

BBMap version 32.14
Set overwrite to true
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.400
Set threads to 8
Set fastq input ASCII offset to 33
Retaining all best sites for ambiguous mappings.
Set genome to 1

Loaded Reference: 5.773 seconds.
Loading index for chunk 1-7, build 1
Generated Index: 9.148 seconds.
Analyzed Index: 27.428 seconds.
Started output stream: 0.015 seconds.
Cleared Memory: 0.511 seconds.
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at align2.MultiStateAligner9PacBio.<init>(MultiStateAligner9PacBio.java:73)
at align2.MSA.makeMSA(MSA.java:43)
at align2.AbstractMapThread.<init>(AbstractMapThread.java:130)
at align2.BBMapThreadPacBio.<init>(BBMapThreadPacBio.java:100)
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:378)
at align2.BBMapPacBio.main(BBMapPacBio.java:33)
Detecting finished threads: 0, 1, 2, 3, 4, 5

**************************************************************************

Warning! 8 mapping threads did not terminate normally.
Please check the error log; the output may be corrupt or incomplete.

**************************************************************************


Exception in thread "main" java.lang.RuntimeException: Aborting due to prior error.
at align2.AbstractMapper.abort(AbstractMapper.java:69)
at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:381)
at align2.BBMapPacBio.main(BBMapPacBio.java:33)
WhatsOEver is offline   Reply With Quote
Old 05-08-2014, 09:44 AM   #40
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

With bbmap.sh, HG19 needs about 23 GB. I had never tested HG19 with mapPacBio8k.sh but it looks like I need to refine the memory prediction. Thanks for the report!
Brian Bushnell is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:58 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO