SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Gap5 - new release (staden-2.0.0b6) jkbonfield Bioinformatics 3 08-13-2014 06:04 AM
bowtie assembly on RNA seq viewer jjk RNA Sequencing 1 09-30-2011 07:41 AM
Tablet assembly viewer valei Bioinformatics 2 03-07-2011 08:09 AM
velvet assembly viewer? mdachoh Bioinformatics 0 01-09-2010 06:20 AM
454 Assembly viewer pr0t3us 454 Pyrosequencing 2 11-06-2008 09:04 PM

Reply
 
Thread Tools
Old 06-11-2009, 01:34 AM   #1
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default Gap5 1.2.0 release (assembly viewer & editor)

I've just made the latest Gap5 release on sourceforge, as prebuilt linux binaries (32-bit and 64-bit intel binaries only currently). Source is there too, but it's warty and a pain to build right now.

Since the previous release I've put a lot of work into making the file format compact, so that it's now typically smaller than the equivalent BAM file while still retaining editing capability. (Although editing hasn't been rigorously tested yet and there are still many things to add.)

I've made a start at adding library analysis and supporting annotations. They're in the file format now and have minimalistic visualisation interfaces to test they're working, but these will get fleshed out during later 1.2.x releases.

James

Overview

Gap5 is ultimately the replacement for Gap4 in that it aims to be a sequence assembly viewer and editor for finishing experiments. As such it provides tools for comparing, joining and breaking contigs as well as the smaller details of individual base editing.

It is designed to be compact in file size, and generally very low in CPU and memory usage. In cpu/memory it's typically comparable to MapView or samtools tview. In file size (it still needs it's own format) it's usually slightly smaller than BAM format.

Right now it is very much work in progress. As far as a viewer goes it's already a useful tool in it's own right, but as an editor there's still lots of missing features. It's likely there are some major bugs in the editor too as it's not had a lot of testing yet.

I'd recommend using the "-ro" flag to gap5 (read-only mode) unless you really do need to be editing too.

To get started, firstly you need a Gap5 database. It cannot read your old Gap4 ones. You an construct new Gap5 databases out of ACE, MAQ, BAM or BAF format files, or convert your old Gap4 databases via caftools and the supplied caf2baf script. Eg:

cd /nfs/repository/d0022/bE171C14/
gap2caf -project BE171C14 -version 0 | caf2baf > /tmp/BE171C14.baf
cd /tmp
tg_index -p -B -o BE171C14 BE171C14.baf
gap5 BE171C14

It's worth having an idea of the depth of your data too. If you have a very shallow assembly, try using (for example) the "-z 256k" option to tg_index to speed up processing and reduce file size. See below for the full details.

tg_index

This converts various alignment file formats into a gap5 database file (or pair of files infact). The input formats currently supported are maq (short/long), bam, ace, baf, and some old "aln" text format. I have a caf2baf conversion tool if people need it too, but it's not natively supported by tg_index.

Usage:
tg_index -o dbname [options] -format_code input_filename
format_code is one of

-b
BAM
-m
MAQ short
-M
MAQ long
-A
ACE
-B
BAF
In additon to this there are a variety of other options:

-a
Append mode. With this the database is appended to instead of overwritten.
-n
Requests that new contigs are made when appending, even if they match the names of existing contigs. (By default it'll merge data into the same contigs, but if padding is different then this will cause issues.
-p, -P
turn on (or off) read-pairing. This is on by default, but it uses up memory to identify the pairs (by name). If you know you have single-ended data then using -P will speed up indexing and save memory.
-T
Requests building a B+Tree of sequence names. This permits random access governed by a name rather than position, eg to jump specifically to sequence "foo" in the Gao5 editor. The index isn't build by default as it's rather slow. (I have plans on improving this though.)
-o 'db_name'
Specifies the output database name is to be db_name.
-z 'size'
This governs the bin size for the range-query binning system. By default 'size' is 4k, but it's worth increasing this if your coverage is very low. Ideally you want a few thousand sequences per bin to strike a happy balance between speed and I/O efficiency.
So a typical example usage maybe:
tg_index -z 64k -o rmdup_g5 -m rmdup.map
gap5

This is the actual viewer or editor. The main displays you'll want to familiarise yourself with are the Contig List, Contig Editor and Template Displays.

Initially you may (or may not, depending on how many there are) see the "contig selector" window. Note that this is currently bugged when the total contig length goes beyond 2Gb - eg whole human alignments. It's probably worth using the Contig List window instead in this case. You can forcibly turn on or turn off displaying the contig selector at startup using -csel and -no_csel command line options.

-ro
Use this command line option to disable editing abilities. It opens the file in read-only mode guaranteeing that you cannot change the data.
-csel
Forces the contig selector to be shown at startup
-no_csel
Forces the contig selector to not be shown at startup.
Downloads

The executables are distributed via sourceforge at:

https://sourceforge.net/project/show...kage_id=317355

Code, for those that really care, is also there via:

http://staden.cvs.sourceforge.net/vi...aden/src/tgap/
http://staden.cvs.sourceforge.net/vi...taden/src/gap5

Screenshots:



This shows a graphical overview of a mixed assembly. The colours indicate mapping quality and/or template status (single ended, paired but spanning contigs). The Y status indicates the insert size - hence clearly seeing solexa vs capillary libraries in this plot.




An example of the contig editor. This is a mix of 454 and capillary data made by MIRA. The MIRA tags are visible here as the coloured fragments.




Another editor screenshot, showing grey scales for base quality and mapping quality (in the "names" panel to the left, now just an ascii art representation of the alignments). Also shown are a couple traces for capillary sequences as this is from a mixed capillary/solexa assembly. It can show 454 traces too, and in theory solexa ones but we're no longer keeping processed trace data here (only raw).

James

Last edited by jkbonfield; 06-11-2009 at 01:40 AM.
jkbonfield is offline   Reply With Quote
Old 06-12-2009, 05:08 AM   #2
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

Hi James,

i managed to compile caftools, io_lib, searched for Caftools.pm and finally fail to index a BAF file created by caf2baf.

I have a CAF written by MIRA assembler which I have converted.

Now I just wanted to index,

$ tg_index -p -B -o proj_gap5 proj.baf
tg_index.bin: /scratch/local2/SVEN/software_test_installs/gap5/gap5-1.2.0-linux-x86_64/lib/linux-x86_64-binaries/libpng12.so.0: no version information available (required by /scratch/local2/SVEN/software_test_installs/gap5/gap5-1.2.0-linux-x86_64/lib/linux-x86_64-binaries/libtk_utils.so)

g_index: Short Read Alignment Indexer, version 1.1.3

Author: James Bonfield (jkb@sanger.ac.uk)
2007-2009, Wellcome Trust Sanger Institute

Loading proj.baf...
2%Resizing HacheTable tg_cache to 4096
2*Aborted


Any idea where to look or what to do?

Thanks,
Sven
sklages is offline   Reply With Quote
Old 06-12-2009, 05:57 AM   #3
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

You can ignore the libpng whinge - I'm currently working on improving the staden package dependencies and build environment (about time!) and that's something which should go away.

The abort though is more worrying. I'm assuming this assembly has a lot of contigs (based on the comment about increasing tg_cache - that's to do with the number of objects held in memory), but I've tested a variety of assemblies so far.

Is there a smaller .baf file you could send me perhaps that recreates the error (to sanger.ac.uk addr)? Obviously it doesn't get as far as to 3% into the file.

Also, what linux OS & release are you using please?

James
jkbonfield is offline   Reply With Quote
Old 06-12-2009, 09:31 AM   #4
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

The bug has been found and fixed (I hope) in the CVS tree. Many thanks to Sven for providing some test data to trigger the issue. (Specifically it occurs when using BAF files with trace names that do not share a common prefix with the reading names.)

I'll build a new version on Monday most likely, also including an updated caf2baf script too. In searching for it I found a few other oddities when compiling with full optimisation which I'm fixing at the same time. The 1.2.0 release was accidentally built with full debugging and unoptimised code, although it's still sufficiently fast that it's not desparately noticable.

Sorry for the error.

James
jkbonfield is offline   Reply With Quote
Old 06-15-2009, 04:34 AM   #5
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

I've now built binaries for 1.2.1 too and placed on sourceforge at:

https://sourceforge.net/project/show...ease_id=689951

For those that downloaded the previous version - there's little change except for:

1) Bug fixed handling of trace names (only an issue sometimes, and only if importing from BAF or ACE).
2) Improved caf2baf perl script
3) Rebuilt everything with optimisation turned on, so it'll be slightly faster now. (The difference probably isn't as significant as you'd think as a lot of time was spent in already optimised third party libraries such as zlib.)

James
jkbonfield is offline   Reply With Quote
Old 06-16-2009, 03:35 PM   #6
Lesley
Junior Member
 
Location: New Zealand

Join Date: Jan 2008
Posts: 8
Default

Brilliant, we have been awaiting Gap5 since it was marketed so heavily by Illumina when they began marking the Genome Analyser (before the software was ready of course).
I am running a 64 bit Ubuntu 9.04 machine and I managed to get the tg_index working by creating symbolic links for the libssl libraries which by default are 0.9.8, not the 0.9.7 required by the script.
I have run into problems with the gap5 script though and get the following error message:
couldn't load file "libtgap.so": /gap5-1.2.1-linux-x86_64/linux-x86_64-bin/../lib/linux-x86_64-binaries/libtgap.so: undefined symbol: set_dna_lookup
while executing
"load libtgap.so g5"
(file "/gap5-1.2.1-linux-x86_64/linux-x86_64-bin/../lib/gap5/gap.tcl" line 504)
This one is beyond what I can fix. Any ideas?
Cheers,
Lesley
Lesley is offline   Reply With Quote
Old 06-17-2009, 01:23 AM   #7
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

Sorry to hear it's failing to load for you.

Can you please try setting the STADEN_DEBUG environment variable and running again? It won't make it work, but it may indicate the cause of the problem.

The set_dna_lookup symbol is in the libseq_utils.so library that is shipped with the program, in gap5-1.2.1/lib/linux-x86_64-binaries directory. The gap5 wrapper script it meant to automatically set up LD_LIBRARY_PATH and similar to include this directory, but maybe it's still missing a dependency.

(It will say soemthing like "couldn't load file "libiwidgets.so", but that's normal as it doesn't exist for anyone, but I'm expecting there is also some complaint about symbols in libseq_utils.so judging by your error.)

I built it using a rather aging Debian system (Sarge), due to our systems here being somewhat behind the time in OS releases, hence the issues with library versions. I can rebuild it on etch and make another test available if needed though, but I don't have access to Ubuntu currently.

James

PS. My current project though is to simplify the build system so people can just download the source and type "make" to get something working, but it's a non-trivial process due to some poorly supported dependencies right now.
jkbonfield is offline   Reply With Quote
Old 06-17-2009, 03:34 PM   #8
torrey
Junior Member
 
Location: California

Join Date: Mar 2009
Posts: 5
Default

Hi James,
I am getting the same error as Lesley... I am also on an Ubuntu box. Here is the debug output for the error

load libitcl3.3.so =>
load libitk3.3.so =>
load libiwidgets.so => couldn't load file "libiwidgets.so": libiwidgets.so: cannot open shared object file: No such file or directory
load libgap5.so => couldn't load file "libgap5.so": libg2c.so.0: cannot open shared object file: No such file or directory
couldn't load file "libtgap.so": /home/rtewhey/genome/staden/linux-x86_64-bin/../lib/linux-x86_64-binaries/libtgap.so: undefined symbol: set_dna_lookup
while executing
"load libtgap.so g5"
(file "/home/rtewhey/genome/staden/linux-x86_64-bin/../lib/gap5/gap.tcl" line 504)

best


Update:
After installing libg2c0 package from the package manager and then http://packages.debian.org/lenny/amd...dc++5/download everything seems to be up and running on ubuntu.

Last edited by torrey; 06-17-2009 at 04:09 PM. Reason: Update
torrey is offline   Reply With Quote
Old 06-17-2009, 09:56 PM   #9
alig
Member
 
Location: adelaide

Join Date: Sep 2008
Posts: 43
Default gap5 aborts

Hi,

I ran tg_index on a Consed ace file assembly of illumina reads & it appeared to work without any errors.

But when I open the converted gap5.aux file there is just the Refseq sequence in the contig & no sequences. When I try to edit contig I get this message.

Level 1: EditContig2 io=0xf731d0 .cedialog .cedialog.id
Thu 18 Jun 12:56:57 2009 signal_handler: Program terminated unexpectedly with signal 11.
Thu 18 Jun 12:56:57 2009 signal_handler: This is probably a bug.
Thu 18 Jun 12:56:57 2009 signal_handler: Please report all bug reports at https://sourceforge.net/projects/staden/
Aborted

can anyone help?

alig
alig is offline   Reply With Quote
Old 06-18-2009, 01:37 AM   #10
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

Quote:
Originally Posted by torrey View Post
load libgap5.so => couldn't load file "libgap5.so": libg2c.so.0: cannot open shared object file: No such file or directory
I'm rather suprised I still have a dependency on that, but the Gap5 build was derived from Gap4 so perhaps I still link against that library despite not using it. (It's the GNU FORTRAN 77 run-time library.)

Quote:
Update:
After installing libg2c0 package from the package manager and then http://packages.debian.org/lenny/amd...dc++5/download everything seems to be up and running on ubuntu.
Ah, pretty much as I expected then. I'll try and reduce these dependencies, and more importantly document them too, for the next release. Ideal would be a proper .deb package, but that requires a lot more work and ultimately ends up needing umpteen variants for every linux OS out there which I'd rather avoid.

Thanks for the feedback.

James
jkbonfield is offline   Reply With Quote
Old 06-18-2009, 01:41 AM   #11
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

Quote:
Originally Posted by alig View Post
Hi,

I ran tg_index on a Consed ace file assembly of illumina reads & it appeared to work without any errors.

But when I open the converted gap5.aux file there is just the Refseq sequence in the contig & no sequences. When I try to edit contig I get this message.
Sorry to see that. I did recently spot and fix a bug involving long annotations (aka tags); anything longer than 1K could potentially cause a crash. The fix is in the CVS tree already and will make it into the 1.2.2 release whenever that happens.

However I doubt that's the case here. Gap5 currently doesn't have special support for reference sequences, so the only way they'd appear in the assembly is if there is a single very long sequence added to the assembly itself. In theory this should work just fine, but isn't something I've tested much.

Would it be possible to obtain a temporary copy of your data set to debug the program with?

Thanks,

James
jkbonfield is offline   Reply With Quote
Old 06-18-2009, 04:47 AM   #12
coldturkey
Member
 
Location: Brisbane

Join Date: Nov 2008
Posts: 51
Default

Is there a Gap5 Manual? or will there be.
coldturkey is offline   Reply With Quote
Old 06-18-2009, 10:44 AM   #13
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

There's no manual at present. I'm trying to keep it similar to gap4 for usage, so the gap4 manual could be heavily pilfered for suitable documentation.

The main new display in gap5 though is the template display. Gap4 had it, but it was slow and very cluttered for anything but the smallest of contigs. I'll have to right docs from scratch for that, however it's still being heavily modified too.

Given the bad lack of documentation though, I'm happy to field questions. If I get too many it'll just encourage me to write a manual. :-)

James
jkbonfield is offline   Reply With Quote
Old 06-18-2009, 04:24 PM   #14
Lesley
Junior Member
 
Location: New Zealand

Join Date: Jan 2008
Posts: 8
Default

Thanks for the quick reply. We still get errors as below:

gap5-1.2.1-linux-x86_64/linux-x86_64-bin/gap5 GiardiaV

load libitcl3.3.so =>
load libitk3.3.so =>
load libiwidgets.so => couldn't load file "libiwidgets.so": libiwidgets.so: cannot open shared object file: No such file or directory
load libgap5.so => couldn't load file "libgap5.so": libstdc++.so.5: cannot open shared object file: No such file or directory
couldn't load file "libtgap.so": /gap5-1.2.1-linux-x86_64/linux-x86_64-bin/../lib/linux-x86_64-binaries/libtgap.so: undefined symbol: set_dna_lookup
while executing
"load libtgap.so g5"
(file "/gap5-1.2.1-linux-x86_64/linux-x86_64-bin/../lib/gap5/gap.tcl" line 504)

I am running Ubuntu on a 64 bit machine and even installing the libg2c0 package did not work. The other issue is that Ubuntu installs libstdc++.so.6 and I am highly reluctant to change that because of the other software I am running.
I hope you can find the answer,
Cheers,
Lesley
Lesley is offline   Reply With Quote
Old 06-19-2009, 01:12 AM   #15
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

Quote:
load libiwidgets.so => couldn't load file "libiwidgets.so":
This line you can ignore - the file shouldn't exist anyway.

Quote:
load libgap5.so => couldn't load file "libgap5.so": libstdc++.so.5: cannot open shared object file: No such file or directory

...

I am running Ubuntu on a 64 bit machine and even installing the libg2c0 package did not work. The other issue is that Ubuntu installs libstdc++.so.6 and I am highly reluctant to change that because of the other software I am running.
You absolutely shouldn't replace the systen libstdc++ version. The above error though is the one which is preventing gap5 from launching. A work around is to find a copy of this library version from somewhere else and simply place it in the unpackged gap5-1.2.1/lib/linux-x86_64-libraries directory. It won't affect any system tools then, but gap5 will find it on start up.

The next build (done very shortly I hope) will solve this; gap5 actually doesn't yet use any C++ and I erroneously link against this library simply because gap4 previously did and I copied it's build configuration without thinking. (The same applies for libg2c0 too.)

James
jkbonfield is offline   Reply With Quote
Old 06-19-2009, 01:46 AM   #16
coldturkey
Member
 
Location: Brisbane

Join Date: Nov 2008
Posts: 51
Default

I James,
Thanks for your quick reply. I have a question regarding the template display. Could you post the colour code for the reads in the display

regards

Brian
coldturkey is offline   Reply With Quote
Old 06-19-2009, 02:25 AM   #17
darked89
Member
 
Location: Barcelona, Spain

Join Date: Jun 2009
Posts: 36
Smile

re: libg2c0 on Ubuntu:

Quote:
Originally Posted by torrey View Post
Update:
After installing libg2c0 package from the package manager and then http://packages.debian.org/lenny/amd...dc++5/download everything seems to be up and running on ubuntu.
On Ubuntu 9.04 (workstation version on i386) simply installing libg2c0 works. Thanks for the tip

dk
darked89 is offline   Reply With Quote
Old 06-19-2009, 02:30 AM   #18
darked89
Member
 
Location: Barcelona, Spain

Join Date: Jun 2009
Posts: 36
Default

@jkbonfield

Dear James,

Would it be possible to also put some test database(s) with sequences from various sequencing platforms? Something small just to play with?

Best,

dk

Last edited by darked89; 06-19-2009 at 02:46 AM.
darked89 is offline   Reply With Quote
Old 06-19-2009, 06:45 AM   #19
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

Quote:
Originally Posted by coldturkey View Post
I James,
Thanks for your quick reply. I have a question regarding the template display. Could you post the colour code for the reads in the display
Certainly.

The grey scale levels indicate the mapping quality for the template, from dark grey (bad) to white (good). You can control whether it colours this using the average of each end sequence, the minimum of the two, or the maximum of the two by using the menu towards the top left. Additionally the Filter button allows you to filter out poor quality (or peversely good quality) mapping reads.

However this is only the base colour. It will be changed to one of the following if it fits into another category:

Blue: single ended read. Or more specifically, only *this* end is in the Gap5 database. (That doesn't necessarily mean it was a single ended library.)

Red: Inconsistent orientation. Ie instead of ---> <--- it finds ---> --->, <--- <--- or <--- --->. This needs some more work as certain library construction types (eg 454) will yield ---> ---> as standard. My plan here is to auto-analyse each library and attempt to work out the normal orientation. Then anything abnormal will be labelled as red.

Orange: paired, but the other end isn't visible. It's not actually possible with ease to tell whether the other end is in another contig (most likely) or whether it is a long way away in this contig - significantly off the edge of the screen in other words. The ">>Acc" button (accurate mode) attempts to work out this distinction correctly, but it's slow as it has to read more data in order to figure out the correct location.

The reason for this is that Gap5, being an editor, doesn't actually track *where* the other end is, only the record number. If it was found in the same range query then we know they form a pair, if not we require more searching to figure this out. The reason I don't cache such data is that it would mean contig joining or contig breaking suddenly requires millions of edits (to check whether the intra-contig or inter-contig spanning status needs changing).

Having said that, I just noticed the >>Acc button crashed Gap5 for me on one of my test databases. Oops!

For reads (2nd menu from left; set to "reads") the colours simply indicate forward vs reverse reads. It's not possible in all file formats though to distinguish which is which so sometimes it's simply a case of the "left" one being one colour while the "right" one is another. The main use of this is simply to see the extent that the physical sequence data extends along the insert.

Finally I'd recommend using the Y-spread function when in Template Size mode (left-most menubutton). This mode is the default template display view where the Y coordinate is governed by the size of the insert. Thus structural variations stand out. However you can't easily tell how many lines are stacked up on the same pixel row - it maybe an isolated template or it may be a whole stack of templates with very similar start/end positions. The Y-spread slider basically adds a bit of pseudo-randomness to the Y coordinate making it easier to distinguish between 1 template and lots of templates stacked up at the same coordinates.

James
jkbonfield is offline   Reply With Quote
Old 06-19-2009, 09:23 AM   #20
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

Quote:
Originally Posted by darked89 View Post
Would it be possible to also put some test database(s) with sequences from various sequencing platforms? Something small just to play with?
I through together something *very* small - a sanger human BAC clone (capillary sequencing) and a thin slice from mitochondrial genome out of one of the 1000genome submissions.

Code:
-rw-r--r--  1 badger hsg 5402975 Jun 19 16:59 DJ105K17.baf
-rw-r--r--  1 badger hsg 1566608 Jun 19 16:59 DJ105K17_g5
-rw-r--r--  1 badger hsg    8512 Jun 19 16:59 DJ105K17_g5.aux
-rw-r--r--  1 badger hsg  348101 Jun 19 16:59 mt.bam
-rw-r--r--  1 badger hsg  227824 Jun 19 16:59 mt_g5
-rw-r--r--  1 badger hsg    1120 Jun 19 16:59 mt_g5.aux
The files are in ftp://ftp.sanger.ac.uk/pub/badger/tmp/

They're produced using:

Code:
tg_index -o DJ105K17_g5 -B DJ105K17.baf
tg_index -o mt_g5 -b mt.bam
Obviously I could put bigger data sets there, but it's hard to find something the "right size". (I was looking for S.Suis - a nice 2Mb organism, but failed dismally to find our original test set for this and I don't want the hassle of figuring out what's publically released and what isn't.)

If I get a more appropriate data set I can place that there too. It'd be good to get some standard "moderate sized" test sets for developers to experiment with and compare/contrast tools on. Preferably with snps, structural variations, etc.

James
jkbonfield is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:26 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO