Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Hi Brian,

    thanks a lot for the advice. I did now change the java environments several times (via aliases to the the latest JREs) and tried to use both the your java6 and java7 versions and I always end up with the same messages as before. Our OS is old (CentOS 5.9) , as is our default java (it is 64-bit nevertheless (build 1.6.0_16-b01)). Thus, it is very possible that our configuration is outdated.
    I will try to somehow set java "ulimits".


    Originally posted by Brian Bushnell View Post
    luc,

    That means that you have a 32-bit version of Java installed. You need a 64-bit version, which you can get here.

    Westerman,

    Thanks for your feedback; I'll simplify the help menus for the next version.

    -Brian
    Last edited by luc; 03-10-2014, 11:10 AM.

    Comment


    • #17
      luc,

      That's unfortunate. If you download and unzip a 64-bit version of java 7, then path directly to the java executable rather than relying on the default java of your environment, it should work. Otherwise, the limit is something like 1600MB (specified by the flag -Xmx1600m) if I recall correctly, which is plenty for small references like microbes and fungi.

      -Brian

      Comment


      • #18
        Has anybody experience with running bbmap on clusters?

        Hello everybody,

        I would like to know, if anyone has (good) experience with running bbmap alignments on clusters, which require to submit jobs via a scheduler?

        Because I can successfully start/finish an alignment on the university's cluster when started directly from the terminal (which is prohibited, but who cares on Sundays at 3am?), but the exactly same command written to Portable Batch System (PBS) job script and submitted via qsub to Torque/Maui scheduler fails:

        Code:
        Exception in thread "Thread-5" java.lang.NullPointerException
             at align2.Block.<init>(Block.java:33)
             at align2.Block.read(Block.java:156)
             at align2.IndexMaker4$BlockMaker.makeArrays(IndexMaker4.java:139)
             at align2.IndexMaker4$BlockMaker.run(IndexMaker4.java:127)
        Generated Index:    3.816 seconds.
        Exception in thread "main" java.lang.NullPointerException
             at align2.BBIndex.analyzeIndex(BBIndex.java:116)
             at align2.BBMap.loadIndex(BBMap.java:389)
             at align2.BBMap.main(BBMap.java:30)
        I guess bbmap recognizes that there are more cores available on the system, than I am allowing for the job via the job script and tries to use them nevertheless?

        Since I would like to extent my alignment times beyond weekend nights , I would be happy for suggestions.

        Comment


        • #19
          Originally posted by Thias View Post
          I guess bbmap recognizes that there are more cores available on the system, than I am allowing for the job via the job script and tries to use them nevertheless?
          Are you explicitly indicating number of threads to use for the bbmap command by using the "threads=x" option (BTW: this option is set to "auto" by default so unless you specify a number bbmap will try to use all cores)?

          Comment


          • #20
            Yes, I set the threads=8 option and also a memory limit of 32g for bbmap.

            However I realised that I initially wrote the job script in a way that the scheduler could assign random distributed cores to the job. When I limited that only to cores of one particular node, then everything runs smoothly. So no problem with bbmap, only with me using the batch system correctly .
            Last edited by Thias; 03-25-2014, 07:12 AM.

            Comment


            • #21
              problems with bbmap switches

              Hi Brian,
              Very decent mapper that you wrote, and supergreat that it is finally
              available .

              I was playing around with bbmap, sooo many cool features and an impressive
              speed! But I could not figure out a couple of things:
              1) How to point to directories with path=
              I would like to be able to use a set of indices that I created previously
              and that are stored in a specific 'databases' path mounted on all nodes of
              our cluster. But I did not get this to work setting path= during the
              mapping process because it changes where bbmap searches for the reads. My
              second trial was to call bbmap in this folder of the references and set
              path to the folder of the reads, but then I always got an error that the
              read file was not found. The only thing that worked for me is calling bbmap
              from a folder which also includes the /ref folder, but this means copying
              both reads and refs accross the filesystem wherever I need them. We were mostly using bowtie2 up to now and in bowtie2 I can point to absolute paths for references, reads and outputs. Would be cool to be able to handle files in
              bbmap similarly
              2) using outm
              I am mapping shotgun metagenomic illumina paired end reads to references
              that are gene databases. I was expecting to get different output for out=
              and outm= but the files produced are identical. I would expect to see some
              read pairs where only one of the reads maps to the gene database and the
              other not, and as far as I understood out= gives me the mapping pairs and
              outm= gives me the mapping pairs and the single mapping reads with their
              pairs.
              3) how can I get sorted unmapped pairs written to a file?
              While outm1=reads.f.fq and outm2=reads.r.fq gives me the mapped pairs, outu always writes everything to a single file (no outu2= possible)
              4) I was trying to limit the insert sizes allowed in the paired end mapping with e.g. pairlen=1000, but the output still reported exactly the same mappings with insert sizes way higher, often in the multi kb range for PE-reads. This also greatly affects the average insert size reported... What am I doing wrong? It would also be very cool to get the standard deviation put out, as well as the median. One can calculate these things from the very useful histogram files that inserthistogram=file outputs, but that is not as convenient.

              Keep up the good work
              Harald

              Comment


              • #22
                Originally posted by HGV View Post
                Hi Brian,
                Very decent mapper that you wrote, and supergreat that it is finally
                available.

                I was playing around with bbmap, sooo many cool features and an impressive
                speed!
                Thanks! Glad you like it.

                But I could not figure out a couple of things:
                1) How to point to directories with path=
                I would like to be able to use a set of indices that I created previously
                and that are stored in a specific 'databases' path mounted on all nodes of
                our cluster. But I did not get this to work setting path= during the
                mapping process because it changes where bbmap searches for the reads. My
                second trial was to call bbmap in this folder of the references and set
                path to the folder of the reads, but then I always got an error that the
                read file was not found. The only thing that worked for me is calling bbmap
                from a folder which also includes the /ref folder, but this means copying
                both reads and refs accross the filesystem wherever I need them. We were mostly using bowtie2 up to now and in bowtie2 I can point to absolute paths for references, reads and outputs. Would be cool to be able to handle files in
                bbmap similarly
                BBMap can use absolute paths for all of those, but the syntax is a bit different. Let's say you have 2 builds of the human genome, HG18 and HG19, at /fastas/hg18.fa and /fastas/hg19.fa, reads at /data/reads.fq, and you want to write output to /output/. There are 3 ways you could handle this.

                Version 1: Don't write an index to disk. This usually makes the startup slower, but is more I/O efficient. It's the simplest and best if you are only mapping to a reference once (like when evaluating multiple assemblies).
                bbmap.sh in=/data/reads.fq ref=/fastas/hg18.fa out=/output/mapped18.sam nodisk
                bbmap.sh in=/data/reads.fq ref=/fastas/hg19.fa out=/output/mapped19.sam nodisk


                Version 2: Write each index to its own directory.
                bbmap.sh ref=/fastas/hg18.fa path=/human/hg18/
                bbmap.sh ref=/fastas/hg19.fa path=/human/hg19/

                bbmap.sh in=/data/reads.fq path=/human/hg18/ out=/output/mapped18.sam
                bbmap.sh in=/data/reads.fq path=/human/hg19/ out=/output/mapped19.sam


                Version 3: Write indices to the same directory, but specify a build number.
                bbmap.sh ref=/fastas/hg18.fa path=/human/ build=18
                bbmap.sh ref=/fastas/hg19.fa path=/human/ build=19

                bbmap.sh in=/data/reads.fq path=/human/ build=18 out=/output/mapped18.sam
                bbmap.sh in=/data/reads.fq path=/human/ build=19 out=/output/mapped19.sam


                If you don't specify a build number, the default is always 1.

                2) using outm
                I am mapping shotgun metagenomic illumina paired end reads to references
                that are gene databases. I was expecting to get different output for out=
                and outm= but the files produced are identical. I would expect to see some
                read pairs where only one of the reads maps to the gene database and the
                other not, and as far as I understood out= gives me the mapping pairs and
                outm= gives me the mapping pairs and the single mapping reads with their
                pairs.
                For paired reads, outm captures all reads that are mapped OR have a mate that is mapped. You can change this behavior with the flag "po" which stands for "pairedonly". The default is "po=f". If you set "po=t" then reads will be unmapped unless they are paired. So, if only one read maps, or if they both map but are not in a valid pairing configuration, they will go to "outu" instead of "outm". "out" will get both of them no matter what.
                3) how can I get sorted unmapped pairs written to a file?
                While outm1=reads.f.fq and outm2=reads.r.fq gives me the mapped pairs, outu always writes everything to a single file (no outu2= possible)
                Are you sure? It works for me... though it is not mentioned in the documentation. I'll clarify that.
                bbmap.sh in=reads.fq outu1=r1.fq outu2=r2.fq
                produces 2 files, r1.fq and r2.fq
                You can only specify output streams 1 and 2 for fasta or fastq output, though, not sam. Also - by default, whenever you output paired reads to a single file, they will come out interleaved (read 1, read 2, read 1, read 2, etc). You can split them back into 2 files with reformat:
                reformat.sh in=reads.fq out1=read1.fq out2=read2.fq
                4) I was trying to limit the insert sizes allowed in the paired end mapping with e.g. pairlen=1000, but the output still reported exactly the same mappings with insert sizes way higher, often in the multi kb range for PE-reads. This also greatly affects the average insert size reported... What am I doing wrong? It would also be very cool to get the standard deviation put out, as well as the median. One can calculate these things from the very useful histogram files that inserthistogram=file outputs, but that is not as convenient.
                Actually, I only track the average insert size, not the median, because that's the easiest, though I admit the median and SD would be more useful. I'll add that, since I'm tracking the information anyway (if you specify inserthistogram).

                As for what you noticed with the pairlen flag, I had not noticed that - sounds like a possible bug. I will look into it.

                Keep up the good work
                Harald
                I will. Thanks for the feedback!

                Comment


                • #23
                  I have uploaded a new version of BBTools, v32.06. Some improvements:


                  Shellscripts:

                  Much better memory detection; should be able to correctly detect free physical and virtual memory in Linux so you won't have to set the -Xmx flag.


                  All programs:


                  Added append flag; allows you to append to existing output files rather than overwriting.


                  BBMap (bbmap.sh):

                  Now reports insert size median and standard deviation (as long as you have the "ihist" flag set).

                  Added 'qahist' flag, which outputs the observed quality of bases for each claimed quality value. Works for both indel and substitution error models.

                  Added 'bhist' flag, which outputs the frequency of ACGTN by base position. Reformat and BBDuk.sh also support 'bhist' and 'qhist' flags.

                  'pairlen' flag now works correctly (for capping the max allowed insert size).

                  IUPAC reference codes are now converted to N prior to mapping.

                  Statistics are now reported both by number of reads and number of bases.

                  Added BBWrap (bbwrap.sh), a wrapper for BBMap that allows you to map multiple sets of reads while only loading the index once (important for huge indexes and small read sets). The mapped reads can be written all to the same output file, or to one output file per input file.

                  Slightly increased accuracy.

                  PacBio reads can now be mapped in fastq format.


                  BBMerge (bbmerge.sh):


                  Now supports dual output files for unmerged reads, 'outu1' and 'outu2'.

                  Improved accuracy.


                  BBDuk (bbduk.sh):

                  Much faster when used without any reference (e.g. for quality trimming).


                  RandomReads:


                  Much better simulated PacBio reads.


                  Thanks to everyone at seqanswers who has given me feedback!

                  Comment


                  • #24
                    Originally posted by Brian Bushnell View Post

                    Shellscripts:

                    Much better memory detection; should be able to correctly detect free physical and virtual memory in Linux so you won't have to set the -Xmx flag.
                    Brian: I am not nitpicking but many of us run BBMap on clusters via queuing systems so setting that flag (and the threads=n) is critical (to prevent BBMap from overstepping the queue resource limits).

                    Thanks!

                    Comment


                    • #25
                      Hi Brian,

                      I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.

                      Originally posted by Brian Bushnell View Post
                      I have uploaded a new version of

                      RandomReads:


                      Much better simulated PacBio reads.
                      !

                      Comment


                      • #26
                        Originally posted by GenoMax View Post
                        Brian: I am not nitpicking but many of us run BBMap on clusters via queuing systems so setting that flag (and the threads=n) is critical (to prevent BBMap from overstepping the queue resource limits).

                        Thanks!
                        For clusters, I recommend setting the -Xmx flag to indicate the amount of memory you requested and 'threads' (or 't') to the number of slots you have requested. We (JGI) run on clusters too, but most of the nodes are exclusive-only, so all hardware resources are available to the user.

                        The new memory-detection system should work on Linux/UGE clusters (it works here). The shellscript calculates the system's free virtual memory, free physical memory, and 'ulimit -v'. Then it will invoke Java with the minimum of those 3 values, and thus, hopefully, will never fail initialization, and never use more RAM than is allowed. Although here, at least, it's impossible to use more RAM than allowed because our cluster will detect that and kill the process.

                        Our clusters use UGE (formerly known as SGE) and set 'ulimit' according to the qsub memory request. I don't know if that is universal or just a local policy, but at least here, users should be able to use the shellscripts without specifying -Xmx, and they should work. I'd be happy if someone elsewhere tried it without setting -Xmx and reported what happens.

                        Threads is a little more difficult. For now, unless you have exclusive access to a node, I recommend that you specify the number of threads allowed (e.g. t=4 if you reserved 4 slots). There is an environment variable "NSLOTS" that apparently gets set by UGE, which I will parse in my next release, and default to using the minimum of the system's reported logical cores and NSLOTS. But it appears to not always be correct, and furthermore, hyperthreading makes the matter more confusing (BBMap does not benefit much from hyperthreading, but many of the other BBTools do, like BBNorm).

                        Also - there are basically 4 types of programs in BBTools, resource-wise:

                        1) Super-heavyweight:
                        Requests all available threads and memory by default, but can be capped with the '-Xmx' and 't' flags. That's because all are both multithreaded and require an amount of memory dependent on the input, which the shellscript can't calculate. This includes bbmap.sh, bbsplit.sh, bbwrap.sh, mapPacBio.sh, mapPacBio8k.sh, bbnorm.sh, ecc.sh, khist.sh, bbduk.sh, countkmersexact.sh, and dedupe.sh.

                        2) Heavyweight:
                        Requests all available RAM but only one primary thread. Pipeline-multithreading (input thread, processing thread, output thread) cannot be disabled, so the 't' flag has no effect unless you are doing multithreaded compression using pigz. Includes pileup.sh, randomreads.sh, bbcountunique.sh, and bbmask.sh.

                        3) Midweight:
                        Requests all available threads by default, but can be capped with the 't' flag; needs little memory, so the -Xmx flag is hard-coded at an appropriate value that should work for all inputs (ranging from 100m to 400m) and not affected by available memory, though you can still override it. This includes bbmerge.sh.

                        4) Lightweight:
                        Needs minimal memory and only one primary thread, so again the -Xmx flag is hard-coded to a low value (at most 200m) rather than dependent on autodetection. Again, 't' flag has no effect unless you are doing multithreaded compression using pigz. This includes reformat.sh, stats.sh, statswrapper.sh, readlength.sh, bbest.sh, bbsplitpairs.sh, gradesam.sh, translate6frames.sh, and samtoroc.sh.

                        Note that if you have pigz installed, you can accelerate all BBTools performance on gzipped input or output using the "unpigz" and "pigz" flags (which can be set to 't' or 'f'). pigz allows multithreaded .gz compression and decompression, and is both faster and more efficient (even with only a single thread) than Java or gzip. So with pigz installed, even a mostly singlethreaded program like reformat.sh can (and by default, will) use all all available threads if you write gzipped output:

                        reformat.sh in=read1.fq.gz in2=read2.fq.gz out=interleaved.fasta.gz zl=8

                        ..will compress the output to gzip compression level 8. If pigz is installed, it will use pigz for both compression and decompression instead of Java, resulting in decompression using around 1.5 cores per input file and compression using all allowed cores. You can prevent this with the 't=1' flag (to cap compression at 1 thread) or 'unpigz=f' and 'pigz=f' (to disable decompression/compression in pigz processes).

                        Due to some bugs in UGE (which can kill processes that fork in certain circumstances), pigz is by default set to false in high-memory processes and true in low-memory processes.

                        If you have not tried pigz, I highly recommend it! It's great and will become increasingly important as core counts increase. It's a drop-in replacement for gzip - same command line, 100% inter-compatible, but multithreaded and more efficient per thread. The only downside is increased memory usage, but it's not bad - around 12 MB per thread. Compression (for genomic data) is within ~0.1% of gzip at the same compression level. As I mentioned, decompression uses at most 1.5 cores in my tests, while compression can use all cores.
                        Last edited by Brian Bushnell; 04-17-2014, 08:46 AM.

                        Comment


                        • #27
                          Originally posted by luc View Post
                          Hi Brian,

                          I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.
                          Luc,

                          Sorry about that; randomreads is currently not documented. I will try to remedy that tomorrow, and let you know when it's done.

                          Comment


                          • #28
                            Thanks a lot Brian,

                            BBMap does have a lot of different tools.

                            Comment


                            • #29
                              Originally posted by Brian Bushnell View Post
                              Our clusters use UGE (formerly known as SGE) and set 'ulimit' according to the qsub memory request. I don't know if that is universal or just a local policy, but at least here, users should be able to use the shellscripts without specifying -Xmx, and they should work. I'd be happy if someone elsewhere tried it without setting -Xmx and reported what happens.
                              With LSF, not setting the -Xmx flag is leading to default allocation of 4G (at least on our cluster, things may be set up differently at other locations) which results in Java running out of heap space. I am also not using exclusive nodes option so far and it seems to be working fine.

                              BTW: The latest code for BBMap (32.06) seems to have shell script files in DOS format (wasn't the case before on previous versions).

                              Comment


                              • #30
                                Originally posted by GenoMax View Post
                                With LSF, not setting the -Xmx flag is leading to default allocation of 4G (at least on our cluster, things may be set up differently at other locations) which results in Java running out of heap space. I am also not using exclusive nodes option so far and it seems to be working fine.
                                Hmm, thanks for testing. Sounds like LSF is setting the ulimit too low, probably for 1 slot no matter how many slots you reserve, I'm guessing.

                                BTW: The latest code for BBMap (32.06) seems to have shell script files in DOS format (wasn't the case before on previous versions).
                                Thanks for noticing that; I've uploaded a new version 32.07 which fixes it. They work fine for me but some versions of bash can't process DOS encoding.

                                By the way - I forgot to mention it, but I added the phiX reference and truseq adapters to the /resources/ directory, for use with BBDuk.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM
                                • seqadmin
                                  The Impact of AI in Genomic Medicine
                                  by seqadmin



                                  Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                  02-26-2024, 02:07 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-14-2024, 06:13 AM
                                0 responses
                                34 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-08-2024, 08:03 AM
                                0 responses
                                72 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-07-2024, 08:13 AM
                                0 responses
                                81 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-06-2024, 09:51 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X