  • johan
    Junior Member
    • Jun 2010
    • 4

    SFF Read names

    Does anyone here know how read names are assigned to reads in the SFF output from a 454 sequencing run? I have multiple reads with the same read name and almost (!) identical nucleotide sequences. Has anyone seen something like this, or does anyone know how the read names are assigned?
  • kmcarr
    Senior Member
    • May 2008
    • 1181

    #2
    Do you mean IDs that look like this?

    EBO6PME01EGNVK

    454 calls those unique accession numbers (uaccno). The first seven characters encode the start time of the run, the next two digits represent the region of the PicoTiterPlate that contained the reads, and the last five characters encode the X and Y coordinates of the read. I forget the exact encoding scheme, but I think it's some sort of 16-bit encoding of the epoch time and X-Y positions.

    These IDs are supposed to be universally unique, so you should not have multiple reads with the same ID. If you do, it most likely means that someone has altered the names.
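    A quick way to confirm whether a set of read names really contains repeats is to count them. This is only a minimal sketch: the function name `duplicated_ids` and the example IDs are made up for illustration, and it assumes the read names have already been pulled out of the SFF file into a list.

```python
from collections import Counter

def duplicated_ids(read_ids):
    """Return a dict of read ID -> count for every ID seen more than once."""
    counts = Counter(read_ids)
    return {rid: n for rid, n in counts.items() if n > 1}

# Hypothetical read names, for illustration only
ids = ["EBO6PME01EGNVK", "EBO6PME01A3XYZ", "EBO6PME01EGNVK"]
print(duplicated_ids(ids))
```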


    • johan
      Junior Member
      • Jun 2010
      • 4

      #3
      I mean exactly these IDs. And if you're correct, I should really start worrying about my non-unique IDs... Thanks a lot for the information!


      • Loris
        Junior Member
        • Dec 2009
        • 6

        #4
        Could someone have processed the original sff file in different ways (changed filters, trim points etc.), with the resultant files later being merged together?

        You could have a look at the manifest with sffinfo -m <filename> and see if there are any duplications.


        • kmcarr
          Senior Member
          • May 2008
          • 1181

          #5
          For those interested in all the gory details of what the Universal Accession Number means, I stumbled across the description in the Roche documentation "SW-Manual_Overview-FileFormats_Oct2009":

          2.3.7 454 “Universal” Accession Numbers
          The standard 454 read identifiers, used in Genome Sequencer FLX System data analysis software versions prior to 1.0.52 (early GS 20 System), have the format “rank_x_y” (as in 003048_1034_0651), where “rank” is a ranking of the well in a region by signal intensity, and “x” and “y” are the pixel location of the well’s center on the sequencing Run images. This identifier is guaranteed to be unique only within the context of a single sequencing Run, and may or may not be unique across specific sets of Runs.

          To allow for the combination of reads across larger data sets, a more unique accession number format has been developed. An accession in this format is a 14 character string, as in C3U5GWL01CBXT2, and consists of 4 components:
          C3U5GW - a six character encoding of the timestamp of the Run
          L - a randomizing “hash” character to enhance uniqueness
          01 - the region the read came from, as a two-digit number
          CBXT2 - a five character encoding of the X,Y location of the well

          The timestamp, hash character and X,Y location use a base-36 encoding (where values 0-25 are the letters ‘A’-‘Z’ and the values 26-35 are the digits ‘0’-‘9’). An accession thus consists only of letters and digits, and is case-insensitive.
          • The timestamp is encoded by computing a “total” value as shown below, then converting it into a base-36 string:
          total =
          (year - 2000) * 13 * 32 * 24 * 60 * 60 +
          month * 32 * 24 * 60 * 60 +
          day * 24 * 60 * 60 +
          hour * 60 * 60 +
          minute * 60 +
          second;
          As a result of this calculation, the first character of read accessions will always be a letter for Runs performed from now until 2038. The timestamp values are taken from the rigRunName found in the analysisParms.parse file in the specified analysis directory.

          This rigRunName is the R_... name that is generated by the instrument software, and is also used as the standard directory name for the Run. Thus, a Run whose name begins with R_2004_09_22_16_59_10_... generates C3U5GW as its encoded timestamp value.

          • Since two Runs may be started at the same second, an additional base-36 character is generated by hashing the full rigRunName to a base-31 number (the highest prime below 36), as in:

          Code:
           chval = 0; 
           for (s=rigRunName; *s; s++) { 
            chval += (int) *s; 
            chval %= 31; 
           } 
           ch = (chval < 26 ? 'A' + chval : '0' + chval - 26);
          • The X,Y location is encoded by computing a total value of “X * 4096 + Y” and encoding that as a five character, base-36 string.
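          The encoding steps quoted above can be sketched in Python as follows. This is my own port, not Roche's code; the function names are invented for illustration. The timestamp example from the manual (R_2004_09_22_16_59_10_...) does reproduce C3U5GW. The hash character cannot be checked the same way because the full rigRunName is not given in the manual excerpt.

```python
import string

# Alphabet from the quoted manual: values 0-25 -> 'A'-'Z', 26-35 -> '0'-'9'
ALPHABET = string.ascii_uppercase + string.digits

def to_base36(value, width):
    """Encode a non-negative integer as a fixed-width base-36 string."""
    chars = []
    for _ in range(width):
        value, rem = divmod(value, 36)
        chars.append(ALPHABET[rem])
    if value:
        raise ValueError("value does not fit in the requested width")
    return "".join(reversed(chars))

def encode_timestamp(year, month, day, hour, minute, second):
    """Six-character timestamp field, using the 'total' formula from the manual."""
    total = ((year - 2000) * 13 * 32 * 24 * 60 * 60
             + month * 32 * 24 * 60 * 60
             + day * 24 * 60 * 60
             + hour * 60 * 60
             + minute * 60
             + second)
    return to_base36(total, 6)

def hash_char(rig_run_name):
    """Python port of the C hashing snippet (assumes an ASCII run name)."""
    chval = 0
    for ch in rig_run_name:
        chval = (chval + ord(ch)) % 31
    return ALPHABET[chval]

def encode_xy(x, y):
    """Five-character well location field: X * 4096 + Y in base 36."""
    return to_base36(x * 4096 + y, 5)

print(encode_timestamp(2004, 9, 22, 16, 59, 10))  # C3U5GW
```

          Because the hash sum is reduced modulo 31, the hash character only ever takes 31 of the 36 possible symbols.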


          • johan
            Junior Member
            • Jun 2010
            • 4

            #6
            Thanks to all of you for your answers, information and suggestions. I have now talked to the bioinformatician who sent me the sequences, and it turned out that the problem was with the DNA barcodes for the different samples. Mismatches were allowed in these barcodes, which in a few instances led to the same accession number being coupled to more than one sequence. As the library was sent over as one file per barcode, the IDs looked unique until all sequences from the run were compared to each other, and that is when the problem appeared. It was solved by not allowing mismatches in the barcodes.
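            To illustrate how this kind of thing can happen, here is a small sketch. I don't know which demultiplexing tool was actually used, so the barcodes, the read and the function names below are all hypothetical. The point is only that when two barcodes differ at a single position, allowing one mismatch lets the same read match both samples.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def matching_samples(read, barcodes, max_mismatches):
    """Return every sample whose barcode matches the start of the read."""
    hits = []
    for sample, barcode in barcodes.items():
        prefix = read[:len(barcode)]
        if len(prefix) == len(barcode) and hamming(prefix, barcode) <= max_mismatches:
            hits.append(sample)
    return hits

# Hypothetical barcodes that differ at a single position
barcodes = {"sample_A": "ACGTAC", "sample_B": "ACGTAG"}
read = "ACGTACTTGGCA"

print(matching_samples(read, barcodes, max_mismatches=1))  # both samples match
print(matching_samples(read, barcodes, max_mismatches=0))  # only sample_A matches
```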


            • maasha
              Senior Member
              • Apr 2009
              • 153

              #7
              Does anyone know how to extract the X/Y coordinates from the name? Somehow sffinfo does this ...


              M


              • kmcarr
                Senior Member
                • May 2008
                • 1181

                #8
                Originally posted by maasha:
                Does anyone know how to extract the X/Y coordinates from the name? Somehow sffinfo does this ...


                M
                On this page, http://www.genome.ou.edu/informatics.html, the first script listed, 454_base36, will do what you want.
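                If you would rather see the arithmetic than run a script, the decoding can be sketched from the manual excerpt quoted earlier in this thread. This is my own reading of that description, not the code of 454_base36 or sffinfo, and the function names are invented. It reverses the base-36 fields and splits the last five characters into X and Y.

```python
import string

# Values 0-25 are 'A'-'Z', values 26-35 are '0'-'9', as in the manual excerpt
ALPHABET = string.ascii_uppercase + string.digits
VALUE = {ch: i for i, ch in enumerate(ALPHABET)}

def from_base36(text):
    """Decode a string in the 454 base-36 alphabet (case-insensitive)."""
    value = 0
    for ch in text.upper():
        value = value * 36 + VALUE[ch]
    return value

def decode_accession(acc):
    """Split a 14-character accession into timestamp, hash, region, X and Y."""
    if len(acc) != 14:
        raise ValueError("expected a 14-character accession")
    # Undo the timestamp packing, smallest unit first
    total = from_base36(acc[:6])
    total, second = divmod(total, 60)
    total, minute = divmod(total, 60)
    total, hour = divmod(total, 24)
    total, day = divmod(total, 32)
    year_offset, month = divmod(total, 13)
    # The well location was packed as X * 4096 + Y
    x, y = divmod(from_base36(acc[9:]), 4096)
    return {
        "timestamp": (2000 + year_offset, month, day, hour, minute, second),
        "hash": acc[6].upper(),
        "region": int(acc[7:9]),
        "x": x,
        "y": y,
    }

print(decode_accession("C3U5GWL01CBXT2"))
```

                For the accession used in the manual, C3U5GWL01CBXT2, this gives a run start of 2004-09-22 16:59:10, region 1, X = 838 and Y = 3960.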


                • maubp
                  Peter (Biopython etc)
                  • Jul 2009
                  • 1544

                  #9
                  Biopython 1.60 will include this too. Thanks kmcarr for that very informative post, and Jeff Hussmann who wrote the code.
