  • Renaming reads within SFF files

    Hello,
    Does anybody have a script or software to rename reads within an SFF file?
    We are getting data back from providers that has been tagged/barcoded.
    We can split the data without a problem, but it would be nice if the ID of each read could be changed to reflect the source of that read.
    This would be especially handy for SNP discovery in unsequenced organisms.

    I have started to have a look at the SFF format, using the info on the NCBI website.
    I have already found that the iolib from Staden can read the data, so most of the work has already been done.
    I could do it myself, but if somebody already has an implementation, it would save me some time.

  • #2
    I have done this as a one off in python for some test SFF files - would that be of interest?



    • #3
      Yes, that would be very helpful.
      Would it be OK if I published it on a website somewhere if it works satisfactorily?



      • #4
        Do you care about the Roche XML manifest and/or the record index? If not, this makes life simpler (and if you want the index added later on, just put the SFF file through the Roche tool sfffile and it will generate the index).

        Do you know any Python? You would need a line or two of Python to do the renaming. Can you give a few examples of the old names and the desired new names?

        On re-reading your original question, I would guess the renaming could be based on the barcodes (i.e. you'll need to look at the called sequence). This would complicate things a little. If so, what do you do if the barcode isn't sequenced perfectly and does not match any of your expected barcodes?



        • #5
          I don't care about the index, one could reconstruct that easily with sfffile.
          I just had a look at iolib and sff_extract, and neither has any information about the manifest.
          I can see that it is placed just before the index, so I guess that one could just grab the bytes between the last read and the index location, and reuse the info as is.

          Did you find a description of the manifest block?

          My idea on renaming would be to let the user decide what they want.
          My idea now would be to add something to the 454 identifier, so it stays unique and it remains identifiable which run it came from.

          I'll have a look at the manifest block to see whether I can guess what the leading bytes mean.



          • #6
            Originally posted by jvhaarst
            I don't care about the index, one could reconstruct that easily with sfffile.
            Yes - that works fine I've found.
            Originally posted by jvhaarst
            I just had a look at iolib and sff_extract, and neither has any information about the manifest.
            It is undocumented as far as I know.
            Originally posted by jvhaarst
            I can see that it is placed just before the index, so I guess that one could just grab the bytes between the last read and the index location, and reuse the info as is.
            Yes you can do that. Roche SFF files with a manifest use the "SFF index block" to hold both an XML manifest, and an actual index block.
            Originally posted by jvhaarst
            Did you find a description of the manifest block?
            No - but a little reverse engineering shows the length of the XML string is given (so you know where it is, and where the following index data is), and the length of the index data.
            Originally posted by jvhaarst
            My idea on renaming would be to let the user decide what they want. My idea now would be to add something to the 454 identifier, so it stays unique and it remains identifiable which run it came from.
            Just adding the same text to every read identifier? Should be easy...
            Originally posted by jvhaarst
            I'll have a look at the manifest block to see whether I can guess what the leading bytes mean.
            I've told you what I think it means above - very simple, just two lengths.



            The above documentation (and the Roche 454 manual which has similar content) don't actually cover the index. All the specification lays down is the index starts with a four byte "magic number" (a format name) and a four byte version (typically a string). Thus different SFF index types can be distinguished by their first eight characters.
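
As a rough illustration of distinguishing index types by their first eight bytes, here is a minimal sketch. It assumes the fixed-size start of the common header as given in the public SFF description (magic ".sff", version, index offset, index length, then counts and sizes); the function name index_tag is made up for this example and is not part of any library.

```python
import io
import struct

# Fixed-size start of the SFF common header (all big-endian):
# magic, version, index offset, index length, number of reads,
# header length, key length, flows per read, flowgram format code.
HEADER_FMT = ">4s4sQIIHHHB"


def index_tag(handle):
    """Return the first eight bytes of the index block, or None if no index.

    index_tag is a hypothetical helper name used only for illustration.
    """
    handle.seek(0)
    raw = handle.read(struct.calcsize(HEADER_FMT))
    fields = struct.unpack(HEADER_FMT, raw)
    magic, _version, index_offset, index_length = fields[:4]
    if magic != b".sff":
        raise ValueError("Not an SFF file")
    # An offset of zero (or a block too short for the tag) means no index.
    if index_offset == 0 or index_length < 8:
        return None
    handle.seek(index_offset)
    return handle.read(8)
```

Called on a file opened in binary mode, this would return something like b".srt1.00", b".mft1.00" or b".hsh1.00" depending on which tool wrote the file.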

            I have only seen Roche SFF files with indexes starting ".srt1.00" (with no XML manifest) and more commonly ".mft1.00" (short for Manifest v1.00 is my guess). These both use the same index internally, working in base 255 so that 0xFF can be used as a separator character. As far as I know, neither of these index block formats is documented (although I have reverse engineered enough to understand most of the layout).
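
A minimal sketch of pulling the XML manifest out of a ".mft1.00" block, based only on the "two lengths" observation. The exact layout is an assumption from reverse engineering, not from any Roche documentation: the eight byte tag, then the XML length and the index data length as big-endian unsigned 32-bit integers (XML length first), then the XML, then the index data. The function name split_mft_block is hypothetical.

```python
import struct


def split_mft_block(block):
    """Split a '.mft1.00' index block into (xml_manifest, index_data).

    Assumed layout (reverse engineered, undocumented):
    8-byte tag, XML length, index data length, XML bytes, index bytes.
    """
    if block[:8] != b".mft1.00":
        raise ValueError("Not a .mft1.00 index block")
    # Two big-endian unsigned 32-bit lengths directly after the tag.
    xml_size, data_size = struct.unpack(">LL", block[8:16])
    xml = block[16:16 + xml_size]
    data = block[16 + xml_size:16 + xml_size + data_size]
    if len(xml) != xml_size or len(data) != data_size:
        raise ValueError("Index block is shorter than its stated lengths")
    return xml, data
```

Here block would be the bytes read from the index offset for the index length, both taken from the SFF header. Keeping the XML as raw bytes means it can be written back unchanged.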

            Looking at the Staden IO lib, their code knows about ".srt1.00" (454 sorted v1.00) and also supports ".hsh1.00" (hash table v1.00). They provide documentation of these hash tables too. I have no idea if these hash indexes are actually in widespread use or not.

            I'm working on support for SFF files in Biopython, including the indexes. This code is currently on GitHub and is not yet in the main trunk.


            Once it is (or if you are happy using my branch for a one off conversion), then this should work if you don't care about the Roche XML manifest:

            Code:
            from Bio import SeqIO

            def rename(record):
                """Return the record with a suffix added to its identifier."""
                record.id += "_and_a_suffix"
                return record

            # SFF is a binary format, so open both files in binary mode.
            # The with statement closes both handles, even on errors.
            with open("input.sff", "rb") as in_handle, open("output.sff", "wb") as out_handle:
                # Python generator expression, only one record in memory at a time:
                records = (rename(rec) for rec in SeqIO.parse(in_handle, "sff"))
                # This will not write the Roche XML manifest!
                SeqIO.write(records, out_handle, "sff")
            I can do something similar preserving the XML, but it requires going a little low level - not just using Biopython's SeqRecord based SeqIO system.



            • #7
              Great!
              This saves me (and probably others) a lot of time.
              Adding the index and the manifest shouldn't be that hard, should it?
              The index is probably just a sorted list of IDs, each with an address?



              • #8
                Originally posted by jvhaarst
                Great!
                This saves me (and probably others) a lot of time.
                Adding the index and the manifest shouldn't be that hard, should it?
                The index is probably just a sorted list of IDs, each with an address?
                It's not hard, but I haven't settled on my API yet, and I'm still hoping for more details about the XML manifest format and the index. The Roche index is an alphabetically sorted list of the names, storing each offset using base 255 (not 256), followed by a marker character (byte 0xFF, decimal 255).
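
To show why base 255 keeps 0xFF free as a marker, here is a small sketch of encoding and decoding an offset. The four digit width and the most-significant-digit-first order are assumptions made for illustration, and both function names are hypothetical. Every digit lies in the range 0 to 254, so byte 0xFF can never appear inside an encoded offset.

```python
def encode_base255(offset, width=4):
    """Encode a non-negative offset as `width` base 255 digits.

    Assumption: most significant digit first, fixed width of four.
    """
    digits = []
    for _ in range(width):
        offset, digit = divmod(offset, 255)
        digits.append(digit)
    if offset:
        raise ValueError("Offset too large for %d base 255 digits" % width)
    # Digits were collected least significant first, so reverse them.
    return bytes(reversed(digits))


def decode_base255(data):
    """Decode base 255 digits (most significant first) into an integer."""
    value = 0
    for byte in data:
        if byte == 0xFF:
            raise ValueError("0xFF is the marker byte, not a valid digit")
        value = value * 255 + byte
    return value
```

With four digits the largest offset that fits is 255**4 - 1, a little under 4 GB, which puts a ceiling on the file size such an index could address.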

                The short script above (using the current version of the Biopython branch referred to) will write a Roche style index with a dummy manifest. I would expect this to work as is when SFF support is merged into the main Biopython trunk.

                I could share an example which first extracts the original XML manifest, and saves that to the output file (along with the selected records and their new names and offsets). However, right now that requires calling "private" methods in my code, and such a script will probably go out of date shortly. If you are doing this as a one off, this might be fine, but I don't want to circulate an example which I expect to break soon (as I work on the Biopython SFF support).

                Note that one of the things recorded in the XML manifest is the <accession_prefix>, i.e. what all the reads are expected to start with. If you edit the SFF read names, but not this bit of the manifest, it may confuse the Roche tools. As the XML manifest is (to my knowledge) undocumented, the only safe options are not to write it, or to make the user/programmer calling Biopython decide this themselves.
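
One way to catch a mismatch before writing a file is to compare the new read names against the prefix in the manifest. This is only a sketch: it assumes the manifest is well-formed XML containing one or more <accession_prefix> elements somewhere in the tree (the rest of the manifest structure is undocumented), and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def accession_prefixes(manifest_xml):
    """Return every non-empty <accession_prefix> value in the manifest."""
    root = ET.fromstring(manifest_xml)
    # iter() searches the whole tree, so the nesting depth does not matter.
    return [el.text.strip() for el in root.iter("accession_prefix") if el.text]


def names_match_manifest(names, manifest_xml):
    """True if every read name starts with a prefix listed in the manifest."""
    prefixes = tuple(accession_prefixes(manifest_xml))
    if not prefixes:
        # No prefix recorded, so there is nothing for the names to violate.
        return True
    return all(name.startswith(prefixes) for name in names)
```

Under this check, appending a suffix to each read name passes, while prepending text to the name fails.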
                Last edited by maubp; 09-08-2009, 03:34 AM.



                • #9
                  For now, I think I will first run a test with the renamed reads.
                  I myself wouldn't change the start of the reads, because that would make it harder to see which run produced a read. This means that the accession_prefix can stay as it is.

                  For the future a version which can reuse the old manifest would be great.



                  • #10
                    The read names in 454 encode, among other things, the date and time, so collisions should be very unlikely. That said, I wrote a small utility to just add a serial number to all reads in a set of SFF files (also ignoring index and manifest etc) - it's available as part of the flower package (http://blog.malde.org/index.php/flower).
