
  • Gap5 1.2.0 release (assembly viewer & editor)

    I've just made the latest Gap5 release on sourceforge, as prebuilt linux binaries (32-bit and 64-bit intel binaries only currently). Source is there too, but it's warty and a pain to build right now.

    Since the previous release I've put a lot of work into making the file format compact, so that it's now typically smaller than the equivalent BAM file while still retaining editing capability. (Although editing hasn't been rigorously tested yet and there are still many things to add.)

    I've made a start at adding library analysis and supporting annotations. They're in the file format now and have minimalistic visualisation interfaces to test they're working, but these will get fleshed out during later 1.2.x releases.

    James

    Overview

    Gap5 is ultimately the replacement for Gap4 in that it aims to be a sequence assembly viewer and editor for finishing experiments. As such it provides tools for comparing, joining and breaking contigs as well as the smaller details of individual base editing.

    It is designed to be compact in file size, and generally very low in CPU and memory usage. In CPU and memory use it's typically comparable to MapView or samtools tview. In file size (it still needs its own format) it's usually slightly smaller than BAM format.

    Right now it is very much work in progress. As far as a viewer goes it's already a useful tool in its own right, but as an editor there are still lots of missing features. It's likely there are some major bugs in the editor too as it's not had a lot of testing yet.

    I'd recommend using the "-ro" flag to gap5 (read-only mode) unless you really do need to be editing too.

    To get started, firstly you need a Gap5 database. It cannot read your old Gap4 ones. You can construct new Gap5 databases out of ACE, MAQ, BAM or BAF format files, or convert your old Gap4 databases via caftools and the supplied caf2baf script. For example:

    cd /nfs/repository/d0022/bE171C14/
    gap2caf -project BE171C14 -version 0 | caf2baf > /tmp/BE171C14.baf
    cd /tmp
    tg_index -p -B -o BE171C14 BE171C14.baf
    gap5 BE171C14

    It's worth having an idea of the depth of your data too. If you have a very shallow assembly, try using (for example) the "-z 256k" option to tg_index to speed up processing and reduce file size. See below for the full details.

    tg_index

    This converts various alignment file formats into a gap5 database file (or a pair of files, in fact). The input formats currently supported are maq (short/long), bam, ace, baf, and some old "aln" text format. I have a caf2baf conversion tool if people need it too, but CAF is not natively supported by tg_index.

    Usage:
    tg_index -o dbname [options] -format_code input_filename

    format_code is one of

    -b
    BAM
    -m
    MAQ short
    -M
    MAQ long
    -A
    ACE
    -B
    BAF
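As a small illustration of the format codes listed above, here is a sketch of a helper that picks the flag from the input filename. The function names and the extension-to-format mapping are assumptions made for this example and are not part of tg_index; in particular, ".map" is taken to mean MAQ short, so MAQ long files still need -M given by hand.

```shell
# Map an input filename to a tg_index format flag.
# The flags are the ones listed above; the file extensions are an
# assumption made for this sketch.
tg_format_flag() {
    case "$1" in
        *.bam) echo "-b" ;;
        *.map) echo "-m" ;;   # assumed MAQ short; use -M for MAQ long
        *.ace) echo "-A" ;;
        *.baf) echo "-B" ;;
        *) echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) the tg_index command for a database name
# and an input file, following the usage line above.
tg_index_cmd() {
    flag=$(tg_format_flag "$2") || return 1
    echo "tg_index -o $1 $flag $2"
}
```

Calling `tg_index_cmd rmdup_g5 rmdup.map` prints the command so it can be checked before being run.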

    In addition to this there are a variety of other options:

    -a
    Append mode. With this the database is appended to instead of overwritten.

    -n
    Requests that new contigs are made when appending, even if they match the names of existing contigs. (By default it'll merge data into the same contigs, but if padding is different then this will cause issues.)

    -p, -P
    Turn on (or off) read-pairing. This is on by default, but it uses up memory to identify the pairs (by name). If you know you have single-ended data then using -P will speed up indexing and save memory.

    -T
    Requests building a B+Tree of sequence names. This permits random access governed by a name rather than position, eg to jump specifically to sequence "foo" in the Gap5 editor. The index isn't built by default as it's rather slow. (I have plans on improving this though.)

    -o 'db_name'
    Specifies the output database name is to be db_name.

    -z 'size'
    This governs the bin size for the range-query binning system. By default 'size' is 4k, but it's worth increasing this if your coverage is very low. Ideally you want a few thousand sequences per bin to strike a happy balance between speed and I/O efficiency.

    So a typical example usage may be:
    tg_index -z 64k -o rmdup_g5 -m rmdup.map
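To make the -z advice concrete, here is a rough sketch that estimates a bin size from the number of reads and the total contig length. It rests on several assumptions that the description above does not confirm: that the bin size is measured in bases, that "4k" means 4096, that sizes are best kept to powers of two, and that 2000 counts as "a few thousand" sequences per bin. The function name is hypothetical.

```shell
# suggest_bin_size READS TOTAL_LENGTH [TARGET_PER_BIN]
# Prints a value for "tg_index -z", e.g. "256k".
suggest_bin_size() {
    reads=$1
    length=$2
    target=${3:-2000}            # assumed meaning of "a few thousand"
    [ "$reads" -gt 0 ] || { echo "need at least one read" >&2; return 1; }

    # Bases a bin must span to hold about $target sequences.
    want=$(( target * length / reads ))

    # Start at the default (4k, assumed to be 4096) and double
    # until the bin is wide enough.
    size=4096
    while [ "$size" -lt "$want" ]; do
        size=$(( size * 2 ))
    done
    echo "$(( size / 1024 ))k"
}
```

For a shallow assembly of one million reads over 100 Mb, `suggest_bin_size 1000000 100000000` suggests 256k; a deep assembly stays at the 4k default.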

    gap5

    This is the actual viewer or editor. The main displays you'll want to familiarise yourself with are the Contig List, Contig Editor and Template Displays.

    Initially you may (or may not, depending on how many there are) see the "contig selector" window. Note that this is currently bugged when the total contig length goes beyond 2Gb - eg whole human alignments. It's probably worth using the Contig List window instead in this case. You can forcibly turn on or turn off displaying the contig selector at startup using -csel and -no_csel command line options.

    -ro
    Use this command line option to disable editing abilities. It opens the file in read-only mode guaranteeing that you cannot change the data.
    -csel
    Forces the contig selector to be shown at startup
    -no_csel
    Forces the contig selector to not be shown at startup.
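The options above can be combined in a small launcher. The sketch below only prints the command it would run. It assumes that "2Gb" means 2^31 bases, that options go before the database name, and that you already know the total contig length; the function name is made up for this example.

```shell
# gap5_cmd DBNAME TOTAL_CONTIG_LENGTH
# Prints a gap5 command line that opens the database read-only and
# skips the contig selector for very large assemblies.
gap5_cmd() {
    db=$1
    total=$2
    opts="-ro"                   # read-only, as recommended above
    # Contig selector is reported as buggy beyond 2Gb of total
    # contig length (threshold assumed to be 2^31 bases).
    if [ "$total" -gt 2147483648 ]; then
        opts="$opts -no_csel"
    fi
    echo "gap5 $opts $db"
}
```

With a whole human alignment, `gap5_cmd human 3100000000` prints `gap5 -ro -no_csel human`, after which the Contig List window can be used instead.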

    Downloads

    The executables are distributed via sourceforge at:

    A fully developed set of DNA sequence assembly (Gap4 and Gap5), editing and analysis tools (Spin) for Unix, Linux, MacOSX and MS Windows.


    Code, for those that really care, is also there via:

    The world's largest development and download repository of Open Source code and applications



    Screenshots:



    This shows a graphical overview of a mixed assembly. The colours indicate mapping quality and/or template status (single ended, paired but spanning contigs). The Y position indicates the insert size, which is why the solexa and capillary libraries are clearly distinguishable in this plot.




    An example of the contig editor. This is a mix of 454 and capillary data made by MIRA. The MIRA tags are visible here as the coloured fragments.




    Another editor screenshot, showing grey scales for base quality and mapping quality (in the "names" panel to the left, now just an ascii art representation of the alignments). Also shown are a couple traces for capillary sequences as this is from a mixed capillary/solexa assembly. It can show 454 traces too, and in theory solexa ones but we're no longer keeping processed trace data here (only raw).

    James
    Last edited by jkbonfield; 06-11-2009, 12:40 AM.

  • #2
    Hi James,

    I managed to compile caftools and io_lib, searched for Caftools.pm, and finally failed to index a BAF file created by caf2baf.

    I have a CAF written by MIRA assembler which I have converted.

    Now I just wanted to index,

    $ tg_index -p -B -o proj_gap5 proj.baf
    tg_index.bin: /scratch/local2/SVEN/software_test_installs/gap5/gap5-1.2.0-linux-x86_64/lib/linux-x86_64-binaries/libpng12.so.0: no version information available (required by /scratch/local2/SVEN/software_test_installs/gap5/gap5-1.2.0-linux-x86_64/lib/linux-x86_64-binaries/libtk_utils.so)

    g_index: Short Read Alignment Indexer, version 1.1.3

    Author: James Bonfield ([email protected])
    2007-2009, Wellcome Trust Sanger Institute

    Loading proj.baf...
    2%Resizing HacheTable tg_cache to 4096
    2*Aborted


    Any idea where to look or what to do?

    Thanks,
    Sven

    Comment


    • #3
      You can ignore the libpng whinge - I'm currently working on improving the staden package dependencies and build environment (about time!) and that's something which should go away.

      The abort though is more worrying. I'm assuming this assembly has a lot of contigs (based on the comment about increasing tg_cache - that's to do with the number of objects held in memory), but I've tested a variety of assemblies so far.

      Is there a smaller .baf file you could send me perhaps that recreates the error (to sanger.ac.uk addr)? Obviously it doesn't get as far as 3% into the file.

      Also, what linux OS & release are you using please?

      James

      Comment


      • #4
        The bug has been found and fixed (I hope) in the CVS tree. Many thanks to Sven for providing some test data to trigger the issue. (Specifically it occurs when using BAF files with trace names that do not share a common prefix with the reading names.)

        I'll build a new version on Monday most likely, also including an updated caf2baf script. In searching for it I found a few other oddities when compiling with full optimisation which I'm fixing at the same time. The 1.2.0 release was accidentally built with full debugging and unoptimised code, although it's still sufficiently fast that it's not desperately noticeable.

        Sorry for the error.

        James

        Comment


        • #5
          I've now built binaries for 1.2.1 too and placed on sourceforge at:

          A fully developed set of DNA sequence assembly (Gap4 and Gap5), editing and analysis tools (Spin) for Unix, Linux, MacOSX and MS Windows.


          For those that downloaded the previous version - there's little change except for:

          1) Bug fixed handling of trace names (only an issue sometimes, and only if importing from BAF or ACE).
          2) Improved caf2baf perl script
          3) Rebuilt everything with optimisation turned on, so it'll be slightly faster now. (The difference probably isn't as significant as you'd think as a lot of time was spent in already optimised third party libraries such as zlib.)

          James

          Comment


          • #6
            Brilliant, we have been awaiting Gap5 since it was marketed so heavily by Illumina when they began marketing the Genome Analyser (before the software was ready of course).
            I am running a 64 bit Ubuntu 9.04 machine and I managed to get the tg_index working by creating symbolic links for the libssl libraries which by default are 0.9.8, not the 0.9.7 required by the script.
            I have run into problems with the gap5 script though and get the following error message:
            couldn't load file "libtgap.so": /gap5-1.2.1-linux-x86_64/linux-x86_64-bin/../lib/linux-x86_64-binaries/libtgap.so: undefined symbol: set_dna_lookup
            while executing
            "load libtgap.so g5"
            (file "/gap5-1.2.1-linux-x86_64/linux-x86_64-bin/../lib/gap5/gap.tcl" line 504)
            This one is beyond what I can fix. Any ideas?
            Cheers,
            Lesley

            Comment


            • #7
              Sorry to hear it's failing to load for you.

              Can you please try setting the STADEN_DEBUG environment variable and running again? It won't make it work, but it may indicate the cause of the problem.

              The set_dna_lookup symbol is in the libseq_utils.so library that is shipped with the program, in the gap5-1.2.1/lib/linux-x86_64-binaries directory. The gap5 wrapper script is meant to automatically set up LD_LIBRARY_PATH and similar to include this directory, but maybe it's still missing a dependency.

              (It will say something like 'couldn't load file "libiwidgets.so"', but that's normal as it doesn't exist for anyone. I'm expecting there is also some complaint about symbols in libseq_utils.so, judging by your error.)
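A complementary check, not suggested in the thread itself, is to ask the dynamic linker which dependencies it cannot resolve. The sketch below filters the output of the standard `ldd` tool; the function name is hypothetical, and the value given to STADEN_DEBUG in the usage note is an assumption, since only setting the variable is mentioned above.

```shell
# list_missing_libs: read `ldd` output on stdin and print the names
# of libraries reported as "not found".
list_missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Typical use would be `ldd lib/linux-x86_64-binaries/libgap5.so | list_missing_libs` from inside the unpacked gap5 directory, followed by `STADEN_DEBUG=1 gap5 dbname` to see which load step fails.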

              I built it using a rather aging Debian system (Sarge), due to our systems here being somewhat behind the time in OS releases, hence the issues with library versions. I can rebuild it on etch and make another test available if needed though, but I don't have access to Ubuntu currently.

              James

              PS. My current project though is to simplify the build system so people can just download the source and type "make" to get something working, but it's a non-trivial process due to some poorly supported dependencies right now.

              Comment


              • #8
                Hi James,
                I am getting the same error as Lesley... I am also on an Ubuntu box. Here is the debug output for the error

                load libitcl3.3.so =>
                load libitk3.3.so =>
                load libiwidgets.so => couldn't load file "libiwidgets.so": libiwidgets.so: cannot open shared object file: No such file or directory
                load libgap5.so => couldn't load file "libgap5.so": libg2c.so.0: cannot open shared object file: No such file or directory
                couldn't load file "libtgap.so": /home/rtewhey/genome/staden/linux-x86_64-bin/../lib/linux-x86_64-binaries/libtgap.so: undefined symbol: set_dna_lookup
                while executing
                "load libtgap.so g5"
                (file "/home/rtewhey/genome/staden/linux-x86_64-bin/../lib/gap5/gap.tcl" line 504)

                best


                Update:
                After installing libg2c0 package from the package manager and then http://packages.debian.org/lenny/amd...dc++5/download everything seems to be up and running on ubuntu.
                Last edited by torrey; 06-17-2009, 03:09 PM. Reason: Update

                Comment


                • #9
                  gap5 aborts

                  Hi,

                  I ran tg_index on a Consed ace file assembly of illumina reads & it appeared to work without any errors.

                  But when I open the converted gap5.aux file there is just the Refseq sequence in the contig & no sequences. When I try to edit contig I get this message.

                  Level 1: EditContig2 io=0xf731d0 .cedialog .cedialog.id
                  Thu 18 Jun 12:56:57 2009 signal_handler: Program terminated unexpectedly with signal 11.
                  Thu 18 Jun 12:56:57 2009 signal_handler: This is probably a bug.
                  Thu 18 Jun 12:56:57 2009 signal_handler: Please report all bug reports at https://sourceforge.net/projects/staden/
                  Aborted

                  can anyone help?

                  alig

                  Comment


                  • #10
                    Originally posted by torrey
                    load libgap5.so => couldn't load file "libgap5.so": libg2c.so.0: cannot open shared object file: No such file or directory
                    I'm rather surprised I still have a dependency on that, but the Gap5 build was derived from Gap4 so perhaps I still link against that library despite not using it. (It's the GNU FORTRAN 77 run-time library.)

                    Update:
                    After installing libg2c0 package from the package manager and then http://packages.debian.org/lenny/amd...dc++5/download everything seems to be up and running on ubuntu.
                    Ah, pretty much as I expected then. I'll try and reduce these dependencies, and more importantly document them too, for the next release. Ideal would be a proper .deb package, but that requires a lot more work and ultimately ends up needing umpteen variants for every linux OS out there which I'd rather avoid.

                    Thanks for the feedback.

                    James

                    Comment


                    • #11
                      Originally posted by alig
                      Hi,

                      I ran tg_index on a Consed ace file assembly of illumina reads & it appeared to work without any errors.

                      But when I open the converted gap5.aux file there is just the Refseq sequence in the contig & no sequences. When I try to edit contig I get this message.
                      Sorry to see that. I did recently spot and fix a bug involving long annotations (aka tags); anything longer than 1K could potentially cause a crash. The fix is in the CVS tree already and will make it into the 1.2.2 release whenever that happens.

                      However I doubt that's the case here. Gap5 currently doesn't have special support for reference sequences, so the only way they'd appear in the assembly is if there is a single very long sequence added to the assembly itself. In theory this should work just fine, but isn't something I've tested much.

                      Would it be possible to obtain a temporary copy of your data set to debug the program with?

                      Thanks,

                      James

                      Comment


                      • #12
                        Is there a Gap5 manual, or will there be one?

                        Comment


                        • #13
                          There's no manual at present. I'm trying to keep it similar to gap4 for usage, so the gap4 manual could be heavily pilfered for suitable documentation.

                          The main new display in gap5 though is the template display. Gap4 had it, but it was slow and very cluttered for anything but the smallest of contigs. I'll have to write docs from scratch for that, however it's still being heavily modified too.

                          Given the bad lack of documentation though, I'm happy to field questions. If I get too many it'll just encourage me to write a manual. :-)

                          James

                          Comment


                          • #14
                            Thanks for the quick reply. We still get errors as below:

                            gap5-1.2.1-linux-x86_64/linux-x86_64-bin/gap5 GiardiaV

                            load libitcl3.3.so =>
                            load libitk3.3.so =>
                            load libiwidgets.so => couldn't load file "libiwidgets.so": libiwidgets.so: cannot open shared object file: No such file or directory
                            load libgap5.so => couldn't load file "libgap5.so": libstdc++.so.5: cannot open shared object file: No such file or directory
                            couldn't load file "libtgap.so": /gap5-1.2.1-linux-x86_64/linux-x86_64-bin/../lib/linux-x86_64-binaries/libtgap.so: undefined symbol: set_dna_lookup
                            while executing
                            "load libtgap.so g5"
                            (file "/gap5-1.2.1-linux-x86_64/linux-x86_64-bin/../lib/gap5/gap.tcl" line 504)

                            I am running Ubuntu on a 64 bit machine and even installing the libg2c0 package did not work. The other issue is that Ubuntu installs libstdc++.so.6 and I am highly reluctant to change that because of the other software I am running.
                            I hope you can find the answer,
                            Cheers,
                            Lesley

                            Comment


                            • #15
                              load libiwidgets.so => couldn't load file "libiwidgets.so":
                              This line you can ignore - the file shouldn't exist anyway.

                              load libgap5.so => couldn't load file "libgap5.so": libstdc++.so.5: cannot open shared object file: No such file or directory

                              ...

                              I am running Ubuntu on a 64 bit machine and even installing the libg2c0 package did not work. The other issue is that Ubuntu installs libstdc++.so.6 and I am highly reluctant to change that because of the other software I am running.
                              You absolutely shouldn't replace the system libstdc++ version. The above error though is the one which is preventing gap5 from launching. A workaround is to find a copy of this library version from somewhere else and simply place it in the unpacked gap5-1.2.1/lib/linux-x86_64-binaries directory. It won't affect any system tools then, but gap5 will find it on start up.

                              The next build (done very shortly I hope) will solve this; gap5 actually doesn't yet use any C++ and I erroneously link against this library simply because gap4 previously did and I copied its build configuration without thinking. (The same applies for libg2c0 too.)
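The workaround of placing a private copy of the library inside the gap5 tree could be scripted roughly as follows. This is only a sketch: the function name is made up, and where you obtain libstdc++.so.5 from is left to you. The destination in the usage note is the library directory that appears in the error messages earlier in the thread.

```shell
# add_private_lib LIBFILE GAP5_LIB_DIR
# Copy a shared library into gap5's own library directory so that
# only gap5 sees it; system directories are never touched.
add_private_lib() {
    src=$1
    dest=$2
    [ -f "$src" ] || { echo "no such library: $src" >&2; return 1; }
    [ -d "$dest" ] || { echo "no such directory: $dest" >&2; return 1; }
    # Copy rather than symlink so the gap5 tree stays self-contained.
    cp -p "$src" "$dest/"
}
```

For example: `add_private_lib /some/where/libstdc++.so.5 gap5-1.2.1-linux-x86_64/lib/linux-x86_64-binaries`, where the source path is a placeholder.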

                              James

                              Comment
