  • Easy library to read and write genomic files

    Hi guys !

    I just finished work on a Python package that can read most genomic file formats, such as BED, WIG, GFF and BedGraph, in a simple and standard way. I made a post about it in /r/bioinformatics and several people told me I should also announce it in the seqanswers.com forum. So here goes !

    It's free and open-source. It will read many different kinds of formats in the same simple syntax:

    Code:
    import track
    with track.load('tracks/rp_genes.bed') as rp:
        data = rp.read('chr3')
    You install it by typing:
    Code:
     $ sudo easy_install track
    Particular attention was paid to the documentation: http://xapple.github.com/track/

    Please tell me what you think about it !

  • #2
    Nice work - especially on the documentation and the pythonicness of it.
    I haven't looked at the code yet.

    How do you implement the dual tuple/dict nature of the generated rows?

    Do you read everything at once, or is this a memory-conservative interface?

    May I suggest you make the 'numbering convention' more prominent - right now it requires careful reading to determine a) whether this package converts all formats into one definite numbering scheme (thumbs up for doing so) and b) which one it is. I guess from the link that it's 0 based, end exclusive, just like python sequences, though that's a lot of verbiage and pretty pictures to say 'just like python: 0 based, end exclusive, length = end - start'.
    Question 3 in the link makes no sense to me - the red image is not 4..8 end *exclusive*, no matter what you count.
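
    To make the convention concrete, here is a minimal sketch, assuming the 0-based, end-exclusive model that Python slices use. The sequence and coordinates are made up for illustration and do not come from the track package.

```python
# Hypothetical sequence and feature coordinates, for illustration only.
sequence = "ACGTACGTACGT"
start, end = 4, 8  # 0-based, end exclusive

# Slicing returns the bases at indices 4, 5, 6 and 7.
feature = sequence[start:end]

# Under this convention the length is simply end - start.
feature_length = end - start
assert len(feature) == feature_length
```

    No "+ 1" correction is needed for the length, which is the main practical appeal of this scheme.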

    • #3
      Thanks ! If you want to look at the code, it's all free on github. To answer your questions:

      1. The dual tuple/dict hybrid is something I would like to take further. One often needs such an object. There is an implementation in the collections module, but it wasn't designed with performance in mind. Performance usually isn't an issue, and Python is slow anyway. But this was one of the rare crucial parts that needed optimization. Indeed, others have thought the same, and I started with the sqlite3.Row object as it is found in the standard library. It's essentially the same kind of tuple/dict hybrid. But I thought it was buggy and ended up rolling my own Python C extension.

      2. These two concepts are not mutually exclusive. By default it does both at the same time. It's memory-conservative unless you tell it explicitly to load everything into RAM by using "track.memory.read". If you just use "track.read", every time you load a file to do something with it, we parse it from end to end and create a file-based SQLite database. So parsing is only ever done once.

      3. The numbering convention is important yes, but do you think the paragraph about it should go before the paragraph about, say, installation ? What I could do, is remove the link to the wiki page and integrate the text and the two images directly in the documentation page ?

      4. Could you explain better what is unclear in question 3 ? I tried to be as clear as possible about the difference in numbering on the sugar or numbering on the phosphate.

      5. The two images at the end are supposed to share the first <zeroness> characteristic and second <inclusiveness> characteristic. Hence, both are zero-based (instead of one-based) and both are end inclusive (instead of end exclusive). However they differ on the third characteristic and this produces a difference in results. So the red image is "4..8 end inclusive". And the green image is also "4..8 end inclusive".
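
      For readers unfamiliar with the tuple/dict hybrid mentioned in point 1, here is a minimal sketch using the standard library's sqlite3.Row. The table and column names are invented for illustration; this is not the track package's schema or its custom C extension.

```python
import sqlite3

# In-memory database with a made-up feature table (not the track schema).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows support both index and key access
conn.execute(
    "CREATE TABLE features (start_pos INTEGER, end_pos INTEGER, name TEXT)"
)
conn.execute("INSERT INTO features VALUES (?, ?, ?)", (4, 8, "geneA"))

row = conn.execute("SELECT start_pos, end_pos, name FROM features").fetchone()

# Tuple-style access by position.
by_index = (row[0], row[1], row[2])
# Dict-style access by column name.
by_key = (row["start_pos"], row["end_pos"], row["name"])
# Column names, in SELECT order.
columns = row.keys()
conn.close()
```

      The same row object answers both row[0] and row["start_pos"], which is the behaviour being discussed.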

      • #4
        Re 1: I simply use dictionaries when iterating across the rows of my pyDataFrames and have been looking for a decent implementation myself to get rid of the overhead, so that's why I was wondering.

        2. So basically, if I say with Track(...) as tp: for entry in tp.read(), the tp.read will load everything at once, right? What about the case where it is indeed an SQLite file?

        3. I'd stick a mention of it into the quickstart, maybe right after the example that sums up gene lengths, since that length calculation is only correct under the conversion.

        Otherwise, keep the link, just expand the paragraph to explicitly state the model used (and that it equals standard Python slicing). So not 'see link for what the library uses' but 'the library uses X, see link for a detailed explanation', which would save the user at least one click/load/read cycle.

        4. I like the biologically motivated explanation, I love the images. I just don't see how 'n-based, in/exclusive' is underspecified: to me '0 based, inclusive' clearly suggests the left model, while the right model is '0 based, exclusive'.
        I mentally translate inclusive to 'closed' intervals in mathematics, and right-exclusive to a half-open interval, so 4..8 inclusive = [4..8], and that's 5 elements, and 4..8 exclusive is [4..8), and that's 4 elements...
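
        The element counts above can be checked with a short sketch. The helper names are hypothetical and only illustrate the closed versus half-open reading of "4..8".

```python
def count_closed(start, end):
    """Number of integer positions in the closed interval [start, end]."""
    return end - start + 1


def count_half_open(start, end):
    """Number of integer positions in the half-open interval [start, end)."""
    return end - start


# Enumerate the positions explicitly to confirm the formulas.
closed_positions = list(range(4, 8 + 1))  # 4, 5, 6, 7, 8
half_open_positions = list(range(4, 8))   # 4, 5, 6, 7

assert len(closed_positions) == count_closed(4, 8)
assert len(half_open_positions) == count_half_open(4, 8)
```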

        • #5
          Thanks for posting here, your fellow redditor!

          • #6
            1. You're welcome to use/steal/take from my implementation. I was thinking of making a generic object called a "tict()" instead of a "dict()" as a play on tictionary. It would be written in C and could be used alone or directly as an sqlite3 row factory.

            2. Basically, yes. Although, actually it's the "track.load" that loads everything at once and not the "t.read". If it's an SQL file already, we don't do anything at all, we just wait till you try to do a "t.read" and execute the right "SELECT" statement behind the scenes.

            3. I updated the documentation !

            4. Yes that's the crux of the problem exactly. The phrase "n-based, in/exclusive" is underspecified simply because it can mean different things to different people. Of course, practically, my own informal statistics have revealed that about 90% of people will agree on the same definition given only the phrase "zero-based, end exclusive". And this is the UCSC standard they usually end up applying with those meager instructions. But approximately 10% will instead number between nucleotides, and they will get different results.
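
            The parse-once, query-later behaviour described in point 2 can be sketched roughly as follows. This is only an illustration under assumptions: the load and read functions, the table layout and the sample lines are all hypothetical and are not taken from the track source code.

```python
import sqlite3

# Made-up BED-like lines: chromosome, start, end, name.
bed_lines = [
    "chr1\t100\t200\tgeneA",
    "chr3\t50\t80\tgeneB",
    "chr3\t300\t450\tgeneC",
]


def load(lines):
    """Parse every line once and store the result in an SQLite database."""
    conn = sqlite3.connect(":memory:")  # a real cache would use a file path
    conn.execute(
        "CREATE TABLE features "
        "(chrom TEXT, start_pos INTEGER, end_pos INTEGER, name TEXT)"
    )
    rows = []
    for line in lines:
        chrom, start, end, name = line.split("\t")
        rows.append((chrom, int(start), int(end), name))
    conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", rows)
    return conn


def read(conn, chrom):
    """Lazily yield the features of one chromosome via a SELECT statement."""
    cursor = conn.execute(
        "SELECT start_pos, end_pos, name FROM features "
        "WHERE chrom = ? ORDER BY start_pos",
        (chrom,),
    )
    for row in cursor:
        yield row


conn = load(bed_lines)
chr3_features = list(read(conn, "chr3"))
conn.close()
```

            In this sketch the text is parsed only inside load, and read touches nothing but the database, so repeated reads never re-parse the original file.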
