SEQanswers

Old 02-13-2012, 05:26 AM   #1
xApple
Member
 
Location: Behind you.

Join Date: Feb 2012
Posts: 12
Easy library to read and write genomic files

Hi guys !

I just finished work on a Python package that can read most of the genomic file formats, such as BED, WIG, GFF and BedGraph, in a simple and standard way. I made a post about it in /r/bioinformatics and several people told me I should also announce it in the seqanswers.com forum. So here goes !

It's free and open-source. It will read many different kinds of formats in the same simple syntax:

Code:
import track
with track.load('tracks/rp_genes.bed') as rp:
    data = rp.read('chr3')
You install it by typing:
Code:
 $ sudo easy_install track
Particular attention was paid to the documentation: http://xapple.github.com/track/

Please tell me what you think about it !
Old 02-13-2012, 05:44 AM   #2
ffinkernagel
Senior Member
 
Location: Marburg, Germany

Join Date: Oct 2009
Posts: 110

Nice work - especially on the documentation and the pythonicness of it.
I haven't looked at the code yet.

How do you implement the dual tuple/dict nature of the generated rows?

Do you read everything at once, or is this a memory-conservative interface?

May I suggest you make the 'numbering convention' more prominent? Right now it takes careful reading to determine a) whether this package converts all formats into one definite numbering scheme (thumbs up for doing so) and b) which one it is. I guess from the link that it's 0-based, end exclusive, just like Python sequences, though that's a lot of verbiage and pretty pictures to say 'just like Python: 0-based, end exclusive, length = end - start'.
Question 3 in the link makes no sense to me: the red image is not 4..8 end *exclusive*, no matter what you count.
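
For reference, the conversion under discussion fits in a few lines. This is only an illustrative sketch, not code from the track package: the function names `gff_to_zero_based` and `zero_based_to_gff` are made up here. It relies on GFF coordinates being 1-based with an inclusive end and BED coordinates being 0-based with an exclusive end.

```python
def gff_to_zero_based(start, end):
    """Convert 1-based, end-inclusive (GFF-style) coordinates
    to 0-based, end-exclusive (BED/Python-style) coordinates."""
    return start - 1, end


def zero_based_to_gff(start, end):
    """Inverse conversion: 0-based, end-exclusive to 1-based, end-inclusive."""
    return start + 1, end


# A feature covering the 5th through 8th bases in GFF terms.
start, end = gff_to_zero_based(5, 8)

# Under the 0-based, end-exclusive convention the length is a plain difference
# and the coordinates can be used directly as a Python slice.
length = end - start
seq = "ACGTACGTAC"
feature_seq = seq[start:end]

print(start, end, length, feature_seq)
```

With the converted coordinates, `end - start` and `len(seq[start:end])` always agree, which is why a length sum over features is only correct once every format has been normalised to the same scheme.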
Old 02-13-2012, 07:02 AM   #3
xApple
Member
 
Location: Behind you.

Join Date: Feb 2012
Posts: 12

Thanks ! If you want to look at the code, it's all free on github. To answer your questions:

1. The dual tuple/dict hybrid is something I would like to take further. One often needs such an object. There is an implementation in the collections module, but it wasn't designed with performance in mind. Performance usually isn't an issue, and Python is slow anyway. But this was one of the rare crucial parts that needed optimization. Indeed, others have thought the same, and I started with the sqlite3.Row object as it is found in the standard library. It's essentially the same kind of tuple/dict hybrid. But I thought it was buggy and ended up rolling my own Python C extension.

2. These two concepts are not mutually exclusive. By default it does both at the same time. It's memory-conservative unless you tell it explicitly to load everything into RAM by using "track.memory.read". If you just use "track.read", every time you load a file to do something with it, we parse it from end to end and create a file-based SQLite database. So parsing is only ever done once.

3. The numbering convention is important yes, but do you think the paragraph about it should go before the paragraph about, say, installation ? What I could do, is remove the link to the wiki page and integrate the text and the two images directly in the documentation page ?

4. Could you explain better what is unclear in question 3 ? I tried to be as clear as possible about the difference in numbering on the sugar or numbering on the phosphate.

5. The two images at the end are supposed to share the first <zeroness> characteristic and second <inclusiveness> characteristic. Hence, both are zero-based (instead of one-based) and both are end inclusive (instead of end exclusive). However they differ on the third characteristic and this produces a difference in results. So the red image is "4..8 end inclusive". And the green image is also "4..8 end inclusive".
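
To illustrate point 1, here is what a tuple/dict hybrid row looks like with the standard library's sqlite3.Row, the object mentioned as the starting point. This is a minimal sketch and not the package's own C extension; the table and column names (`features`, `start_pos`, `end_pos`, `name`) are assumptions made up for the example.

```python
import sqlite3

# In-memory database, just to obtain a row object to play with.
con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row  # rows now support index and name lookup

con.execute("CREATE TABLE features (start_pos INTEGER, end_pos INTEGER, name TEXT)")
con.execute("INSERT INTO features VALUES (?, ?, ?)", (4, 8, "geneA"))

row = con.execute("SELECT start_pos, end_pos, name FROM features").fetchone()

# Tuple-like access by position.
by_index = row[0]
# Dict-like access by column name.
by_name = row["start_pos"]

# The same row converts cleanly to either plain type.
as_tuple = tuple(row)
as_dict = dict(row)

print(by_index, by_name, as_tuple, as_dict)
```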
Old 02-13-2012, 07:56 AM   #4
ffinkernagel
Senior Member
 
Location: Marburg, Germany

Join Date: Oct 2009
Posts: 110

Re 1: I simply use dictionaries when iterating across the rows of my pyDataFrames and have been looking for a decent implementation myself to get rid of the overhead, so that's why I was wondering.

2. So basically, if I say with Track(...) as tp: for entry in tp.read(), the tp.read will load everything at once, right? What about the case where it is indeed an SQLite file?

3. I'd stick a mention of it into the quickstart, maybe right after the example that sums up gene lengths, since that length calculation is only correct under the conversion.

Otherwise, keep the link, just expand the paragraph to explicitly state the model used (and that it equals standard Python slicing). So not 'see link for what the library uses' but 'the library uses X, see link for a detailed explanation', which would save the user at least one click/load/read cycle.

4. I like the biologically motivated explanation, I love the images. I just don't see how 'n-based, in/exclusive' is underspecified. To me, '0 based, inclusive' clearly suggests the left model, while the right model is '0 based, exclusive'.
I mentally translate inclusive to 'closed' intervals in mathematics, and right-exclusive to a half-open interval, so 4..8 inclusive = [4..8], and that's 5 elements, and 4..8 exclusive is [4..8), and that's 4 elements...
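
The counting argument can be checked directly in Python. This is just a quick sketch of the two readings of "4..8", using plain `range`:

```python
# Closed interval [4..8]: both endpoints are included.
closed = list(range(4, 8 + 1))

# Half-open interval [4..8): the end is excluded, as in Python slicing.
half_open = list(range(4, 8))

print(len(closed), closed)
print(len(half_open), half_open)
```

The closed reading yields five positions and the half-open reading yields four, so the same pair of numbers describes features of different lengths depending on the convention.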
Old 02-13-2012, 08:04 PM   #5
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,283

Thanks for posting here, your fellow redditor!
Old 02-14-2012, 08:06 AM   #6
xApple
Member
 
Location: Behind you.

Join Date: Feb 2012
Posts: 12

1. You're welcome to use/steal/take from my implementation. I was thinking of making a generic object called a "tict()" instead of a "dict()" as a play on tictionary. It would be written in C and could be used alone or directly as an sqlite3 row factory.

2. Basically, yes. Although, actually it's the "track.load" that loads everything at once and not the "t.read". If it's an SQL file already, we don't do anything at all, we just wait till you try to do a "t.read" and execute the right "SELECT" statement behind the scenes.

3. I updated the documentation !

4. Yes, that's exactly the crux of the problem. The phrase "n-based, in/exclusive" is underspecified simply because it can mean different things to different people. Of course, in practice, my own informal statistics suggest that about 90% of people will agree on the same definition given only the phrase "zero-based, end exclusive". And this is the UCSC standard they usually end up applying with those meager instructions. But approximately 10% will instead number between nucleotides, and they will get different results.
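
To make point 2 concrete, here is a rough sketch of the "parse once into SQLite, then SELECT on read" idea using only the standard library. It is not the package's actual code: the helper names `load_bed_cached` and `read_chrom`, the `.sqlite` suffix and the table layout are all assumptions made up for this illustration, and only the first four BED columns are handled.

```python
import os
import sqlite3
import tempfile


def load_bed_cached(bed_path):
    """Hypothetical helper: parse a BED file into an SQLite file the first
    time it is seen, and simply reopen that SQLite file afterwards."""
    db_path = bed_path + ".sqlite"
    first_time = not os.path.exists(db_path)
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    if first_time:
        con.execute(
            "CREATE TABLE features "
            "(chrom TEXT, start_pos INTEGER, end_pos INTEGER, name TEXT)"
        )
        rows = []
        with open(bed_path) as handle:
            for line in handle:
                # Skip blank lines and header/comment lines.
                if not line.strip() or line.startswith(("track", "browser", "#")):
                    continue
                fields = line.rstrip("\n").split("\t")
                name = fields[3] if len(fields) > 3 else None
                rows.append((fields[0], int(fields[1]), int(fields[2]), name))
        con.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", rows)
        con.execute("CREATE INDEX idx_chrom ON features (chrom, start_pos)")
        con.commit()
    return con


def read_chrom(con, chrom):
    """Run the SELECT for one chromosome; nothing is re-parsed here."""
    query = (
        "SELECT start_pos, end_pos, name FROM features "
        "WHERE chrom = ? ORDER BY start_pos"
    )
    return con.execute(query, (chrom,)).fetchall()


# Small demonstration on a throwaway file.
workdir = tempfile.mkdtemp()
bed_path = os.path.join(workdir, "demo.bed")
with open(bed_path, "w") as handle:
    handle.write("chr3\t4\t8\tgeneA\nchr1\t0\t10\tgeneB\nchr3\t20\t25\tgeneC\n")

con = load_bed_cached(bed_path)
hits = read_chrom(con, "chr3")
print([tuple(row) for row in hits])
```

A real implementation would also need to cope with an interrupted first parse, which in this sketch would leave a partial SQLite file behind.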