  • Metrix - A Server / client parser for Illumina (InterOp) run directories

    Hi,

    A while ago our core facility ran into problems obtaining run statistics for Illumina (HiSeq and MiSeq) runs.
    We wanted to integrate these statistics into GNomEx (http://sourceforge.net/projects/gnomex/), an open-source LIMS that I contribute to. We also needed to gather statistics on the command line without any difficult integration or scripting.

    With Metrix, I was able to obtain sequence statistics for all the available runs, with a variable scanning interval.

    Several features of Metrix:
    • Java 7.0 Watcher Service
    • Runs as a server daemon
    • The most used statistics (current cycle, (pre-)phasing, etc.) are gathered. Other metrics will be added in the future or can be implemented yourself.
    • All information of a single run is stored in an object
    • Collections of Run Summaries (Summary) can be stored in a SummaryCollection object
    • RunInfo.xml parsed and stored
    • Summary and SummaryCollection are accessible from a client interface in several formats (XML and POJO)
    • New, current and past runs are detected automatically and classified with their respective state.
    • Open source (GPL v.3) on GitHub
    • Server / Client communication is performed using a Metrix Command object where all details of the communication are stored.
    • Several client modes available:
      - Mode: "POLLED" - Command is executed every XYZ milliseconds (minimum of 10000 ms)
      - Mode: "GET" - Command is executed once


    Each run directory is assigned a status code (state).
    • [1] Running
    • [2] Finished
    • [3] Hanging with errors
    • [4] Flowcell needs turning
    • [5] Initializing
      - MiSeq runs take 5 cycles before InterOp statistics are detected and the state is set from 5 to 1.
      - HiSeq runs take 4 - 6 cycles before InterOp statistics are detected and the state is set from 5 to 1.
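
The state codes above map naturally onto a Java enum. This is only a minimal sketch: the `RunState` name, its constants and the `fromCode` helper are hypothetical and not taken from the Metrix source; only the numeric codes and descriptions come from the list above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enum; only the codes and descriptions come from the post.
enum RunState {
    RUNNING(1, "Running"),
    FINISHED(2, "Finished"),
    HANGING(3, "Hanging with errors"),
    TURN(4, "Flowcell needs turning"),
    INITIALIZING(5, "Initializing");

    // Lookup table from numeric code to enum constant.
    private static final Map<Integer, RunState> BY_CODE = new HashMap<>();
    static {
        for (RunState state : values()) {
            BY_CODE.put(state.code, state);
        }
    }

    private final int code;
    private final String description;

    RunState(int code, String description) {
        this.code = code;
        this.description = description;
    }

    int code() {
        return code;
    }

    String description() {
        return description;
    }

    // Translate a numeric state code into a constant, rejecting unknown codes.
    static RunState fromCode(int code) {
        RunState state = BY_CODE.get(code);
        if (state == null) {
            throw new IllegalArgumentException("Unknown run state code: " + code);
        }
        return state;
    }
}
```

A client could then call `RunState.fromCode(2)` on the `runState` value it receives instead of comparing bare integers.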


    Please note that this is the first version and does not yet support full parsing of all InterOp files.
    Also, I have left the client side 'open' so that users can implement it in whichever way they prefer.
    It's quite easy if you're familiar with Java!

    The repository is available on GitHub

    All feedback, bug reports and input are very welcome!

    Good luck!
    Bernd
    Last edited by Rhizosis; 02-22-2013, 01:12 AM.

  • #2
    This looks great! I've been struggling with getting MiSeq metrics out of the InterOp files (Illumina has been no help) for over a year. I'll give this a whirl soon.



    • #3
      Ahhhhhhhhhhwesome! Will try asap!



      • #4
        Hello

        Hello, my name is Eric, I just started a job at an HIV clinical testing lab in Vancouver, Canada. We purchased a MiSeq to develop tests for drug resistance in HIV+ individuals using deep sequencing. I'm supposed to help get this off the ground...

        We will likely need to upload quality data to our database. Metrix looks great for helping us do this.

        I've noticed a report from Illumina here: http://www.illumina.com/Documents/pr...operations.pdf on page 26-29 gives a description of the binary structure of these InterOp files. However, this document looks like it's HiSeq, not MiSeq, and I can't find similar documents for MiSeq. Do they have the same structure?

        UPDATE: I directly executed your QualityMetrics + LittleEndianInputStream code on a sample QMetricsOut.bin file we have and I'm getting output. But because I don't know the structure of these InterOp files, I can't validate if it worked or not... ;( here's some sample output

        1 202 2114
        Record: 1 Assigned: 0
        Record: 2 Assigned: 0
        Record: 3 Assigned: 0
        Record: 4 Assigned: 0
        Record: 5 Assigned: 0
        Record: 6 Assigned: 0
        Record: 7 Assigned: 0
        Record: 8 Assigned: 0
        Record: 9 Assigned: 0
        Record: 10 Assigned: 0
        Record: 11 Assigned: 0
        Record: 12 Assigned: 228
        Record: 13 Assigned: 110
        Record: 14 Assigned: 9893
        Record: 15 Assigned: 1791
        Record: 16 Assigned: 5314
        Record: 17 Assigned: 3048
        Record: 18 Assigned: 3326
        Record: 19 Assigned: 3101
        Record: 20 Assigned: 40
        Record: 21 Assigned: 0
        Record: 22 Assigned: 0
        Record: 23 Assigned: 0
        Record: 24 Assigned: 39
        Record: 25 Assigned: 6
        Record: 26 Assigned: 28
        Record: 27 Assigned: 1039
        Record: 28 Assigned: 50
        Record: 29 Assigned: 2068
        Record: 30 Assigned: 5081
        Record: 31 Assigned: 2586
        Record: 32 Assigned: 2301
        Record: 33 Assigned: 11376
        Record: 34 Assigned: 7843
        Record: 35 Assigned: 7224
        Record: 36 Assigned: 22598
        Record: 37 Assigned: 55197
        Record: 38 Assigned: 110514
        Record: 39 Assigned: 177584
        Record: 40 Assigned: 0
        Record: 41 Assigned: 0
        Record: 42 Assigned: 0

        Does this output look sane? (What -are- cluster scores anyways?)

        Sorry, and thanks so much for being patient. We're still setting up the machine, installing software, etc., so my real-world experience with this data is absolutely nil.

        UPDATE 2: I will see if I can load these files into Sequence Analysis Viewer (SAV) to see if these same values show up on the software.

        -Eric
        Last edited by emartin; 02-27-2013, 05:36 PM.



        • #5
          Originally posted by emartin
          Hello, my name is Eric, I just started a job at an HIV clinical testing lab in Vancouver, Canada. We purchased a MiSeq to develop tests for drug resistance in HIV+ individuals using deep sequencing. I'm supposed to help get this off the ground...

          We will likely need to upload quality data to our database. Metrix looks great for helping us do this.

          I've noticed a report from Illumina here: http://www.illumina.com/Documents/pr...operations.pdf on page 26-29 gives a description of the binary structure of these InterOp files. However, this document looks like it's HiSeq, not MiSeq, and I can't find similar documents for MiSeq. Do they have the same structure?

          UPDATE: I directly executed your QualityMetrics + LittleEndianInputStream code on a sample QMetricsOut.bin file we have and I'm getting output. But because I don't know the structure of these InterOp files, I can't validate if it worked or not... ;( here's some sample output

          1 202 2114
          Record: 1 Assigned: 0
          ..
          ..
          ..
          Record: 42 Assigned: 0

          Does this output look sane? (What -are- cluster scores anyways?)

          Sorry, and thanks so much for being patient. We're still setting up the machine, installing software, etc., so my real-world experience with this data is absolutely nil.

          UPDATE 2: I will see if I can load these files into Sequence Analysis Viewer (SAV) to see if these same values show up on the software.

          -Eric

          Hi Eric,

          Many thanks for putting Metrix through the testing stages. It's always difficult to develop a generic system that will function at every core facility's site.

          I'm glad to hear that you got some output.
          To answer your questions in order:

          However, this document looks like it's HiSeq, not MiSeq, and I can't find similar documents for MiSeq. Do they have the same structure?
          As far as I'm aware, I'm getting the same results from MiSeq InterOp files as from HiSeq InterOp files. So in essence the structure is the same.

          Your output
          The quality score data is simply the distribution of clusters over the quality scores Q1 through Q50, just as you see in the graphs on the MiSeq / HiSeq during run time or in the summary HTML files that (thus far) are generated by the HiSeq / MiSeq software.

          In your table, the record number is the Q-score. Across all measured statistics, Q39 should have the most clusters assigned to it.

          So in your case it looks perfectly fine (177,584 clusters assigned to score Q39).

          All the values are stored in a hierarchical manner so what you see is:

          Cycle --> Lane --> Tile --> QScore Value (Q1 - Q50) --> # Assigned clusters.
          So keep in mind that the raw values you are seeing are on a 'per tile' basis.
          The values you printed are from:

          Lane: 1
          Cycle: 202
          Tile: 2114

          Using these statistics you can also study outliers easier and possibly find reasons why certain clusters have a lower quality score. In other words, to find out if / why the Q-Score distribution is skewed.
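
To make the "does this look sane?" check concrete, here is a minimal sketch that summarises one Q-score histogram: it reports the Q-score with the most clusters and the fraction of clusters at or above a threshold. The class and method names are hypothetical and not part of Metrix; the only assumption about the data is that index `i` of the array holds the count for Q(i+1), as in the printed records.

```java
// Hypothetical helper; counts[i] is the number of clusters assigned to Q(i+1).
final class QScoreSummary {

    private QScoreSummary() {
    }

    // Returns the Q-score (1-based) with the highest cluster count.
    // Ties resolve to the lowest such Q-score.
    static int modalQScore(long[] counts) {
        int best = 0;
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > counts[best]) {
                best = i;
            }
        }
        return best + 1;
    }

    // Returns the fraction of clusters with a Q-score of at least minQ,
    // or 0.0 when the histogram is empty (for example a dropped-out tile).
    static double fractionAtOrAbove(long[] counts, int minQ) {
        long total = 0;
        long passing = 0;
        for (int i = 0; i < counts.length; i++) {
            total += counts[i];
            if (i + 1 >= minQ) {
                passing += counts[i];
            }
        }
        return total == 0 ? 0.0 : (double) passing / total;
    }
}
```

For a histogram like the one posted, `modalQScore` would return 39, and `fractionAtOrAbove(counts, 30)` gives a rough "%>=Q30"-style figure for that single tile and cycle.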

          I hope you can get everything up and running!
          Questions? Fire away.

          Bernd
          Last edited by Rhizosis; 02-28-2013, 12:03 AM.



          • #6
            Thanks for your response, Bernd! I am curious about some code you have in ExtractionMetrics.java ...

            public void outputData() {
            ...
            long dateTime = leis.readLong();
            ...
            }


            When I print this date out using your code, I'm getting a value such as -8588401683902830492, which I can reproduce in Perl (my native language) if I unpack it as a signed quad (64-bit). I also unpacked it as an unsigned quad and got a value of 9858342390026624502, which at least is positive, but I still don't know what time standard this is.

            Aren't dates typically the number of milliseconds since a certain date ("epoch")? If so, the negative value does not make sense to me. (It's also possible we are experiencing system-specific differences in interpretation - do you get different numbers when you execute locally?)

            UPDATE: I talked to developers at Illumina. This is a C# DateTime value, with the first 2 bits representing a "Kind" field, and the remaining 62 bits representing an integer value - "ticks" - the number of 100 ns intervals since midnight, Jan 1, 0001 (Gregorian calendar). With that in mind, we can probably figure out how to parse this date - if we want to.
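
Based on that description, a conversion could look roughly like the sketch below. It is not taken from Metrix: the class and method names are hypothetical, it needs Java 8 or later for `java.time`, and it treats the 62-bit tick count as if it were UTC, ignoring whatever adjustment the "Kind" bits imply for local times. The constant 621355968000000000 is the number of ticks between Jan 1, 0001 and Jan 1, 1970.

```java
import java.time.Instant;

// Hypothetical helper for the 64-bit value read by leis.readLong().
final class DotNetDateTime {

    // Lower 62 bits hold the tick count; upper 2 bits hold the Kind field.
    private static final long TICKS_MASK = 0x3FFFFFFFFFFFFFFFL;
    // Ticks (100 ns units) between 0001-01-01T00:00:00 and 1970-01-01T00:00:00.
    private static final long TICKS_AT_UNIX_EPOCH = 621355968000000000L;
    private static final long TICKS_PER_SECOND = 10_000_000L;

    private DotNetDateTime() {
    }

    // Extracts the 2-bit Kind field (0 to 3) from the raw value.
    static int kind(long raw) {
        return (int) (raw >>> 62);
    }

    // Extracts the 62-bit tick count from the raw value.
    static long ticks(long raw) {
        return raw & TICKS_MASK;
    }

    // Converts the tick count to an Instant, assuming the ticks are in UTC.
    static Instant toInstant(long raw) {
        long sinceEpoch = ticks(raw) - TICKS_AT_UNIX_EPOCH;
        long seconds = Math.floorDiv(sinceEpoch, TICKS_PER_SECOND);
        long nanos = Math.floorMod(sinceEpoch, TICKS_PER_SECOND) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }
}
```

Masking off the top two bits is also why the raw signed value can look negative: a set high bit belongs to the Kind field, not to the timestamp.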

            I was also wondering (and I REALLY appreciate you taking the time, should you choose to) whether you understand what's going on with ErrorMetrics. When the documentation says "Error rate", which errors is it really talking about? When it says "read with 1 error", what sort of error is it?

            Thanks...

            -Eric
            Last edited by emartin; 03-05-2013, 06:48 PM.



            • #7
              Hi Eric -

              I haven't extensively tested ExtractionMetrics.java yet. I'm afraid I just assumed that the value was the time in milliseconds since the epoch (January 1, 1970, 00:00:00 GMT), which is completely my mistake!
              Thank you for clearing this up with Illumina. I will convert these values accordingly and store them in the Summary object in the future.

              As I stated in my opening post, I haven't implemented (extracted, parsed and stored) all metrics yet. We started with the metrics we needed first; new metrics and a stability update for the server will come with the next update.

              To answer your question about ErrorMetrics: this (most likely; 99% sure!) is the error metric of the PhiX control (either a whole lane or just the PhiX added as a control in a lane). If I'm correct, error.htm files with these metrics were generated for runs that had been post-processed by ELAND.
              So in essence, ErrorMetrics will give you:
              - Number of perfect reads in PhiX
              - Number of reads with 1 error in PhiX
              - Number of reads with 2 errors in PhiX
              - Number of reads with 3 errors in PhiX
              - Number of reads with 4 errors in PhiX

              per cycle, per tile, per lane.

              Let me know if you would like to obtain a certain metric from the files, and I will put it on my list.
              Please note that I'm not 100% fluent in Java and the code could be written far more efficiently.

              Good luck,

              Bernd



              • #8
                Metrix screenshots?

                Just curious, do you have any screenshots of what the output of Metrix looks like? How are the statistics/metrics output or shown?

                Thanks!

                Kristie



                • #9
                  Hi Kristie

                  Thank you for your interest! There is no specific answer to your question because of the nature of Metrix itself. From the ground up it is meant to function in two ways.

                  1. As a server interface, where the server monitors an Illumina run directory, together with its counterpart, the client interface, which uses Command objects (POJO; Plain Old Java Object) to request specific parsed data from the server.

                  2. If you decide to dissect Metrix and use the parsing components outside the server interface, you can instantiate it with the InterOp directory as an argument and as of now you have to decide on your own output formatting.

                  The reason I offer option two is that there is no set format for this kind of data and I did not feel like creating yet another data structure. Most of the time people are only interested in certain metrics, and this way they can choose their own variables and output formatting.

                  If you would like to see specific features, or if you need help, please do leave a message.

                  Thanks!
                  Bernd



                  • #10
                    Thanks Bernd!

                    You mention that some features of Metrix are:

                    Collections of Run Summaries (Summary) can be stored in a SummaryCollection object
                    Summary and SummaryCollection are accessible from a client interface in several formats (XML and POJO)

                    Do these Run Summaries (Summary) have pre-defined sets of metrics? Is it anything like what is shown in SAV's "Summary" tab? Do you have an example of one of these Summaries in XML?

                    Thanks again,

                    Kristie



                    • #11
                      Hi Bernd. Thanks for your earlier response on PhiX controls.

                      Now that I can parse Q-metrics, I'm starting to pay attention to the Q-value distributions and noticed there isn't ANY data in Q1-Q11, even if I aggregate by tile/cycle. There are simply no scores in that range. (Or, in more detail: we see nothing from Q1-Q11, a tiny bump of 'mediocre' data from Q12-Q21, and most of the data in Q32-Q39, with nothing beyond Q39.)

                      Is this typical for runs on your end?

                      Thanks a lot.

                      -Eric
                      Last edited by emartin; 03-20-2013, 03:11 PM.



                      • #12
                        Hi Eric -

                        The data you posted earlier was the data of one specific tile.
                        The LittleEndianInputStream combined with the QMetrics parser only obtains the separate values for each Lane -> Cycle -> Tile.
                        As you might know it is possible that certain tiles might not perform optimally, sometimes not generating any results and sometimes giving a skewed distribution as a result of signals not being picked up (or not being there at all; on that tile).
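
For readers wondering why a LittleEndianInputStream is needed at all: Java's `DataInputStream` reads multi-byte values in big-endian order, so little-endian files need byte swapping. The sketch below shows the same idea with the standard `java.nio.ByteBuffer`; the class and method names are hypothetical and this is not the Metrix implementation, nor a description of the full InterOp record layout.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper showing little-endian reads of unsigned values.
final class LittleEndianReads {

    private LittleEndianReads() {
    }

    // Wraps raw bytes in a buffer that decodes values as little-endian.
    static ByteBuffer wrap(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Reads 2 bytes as an unsigned 16-bit value (for example a tile number).
    static int readUnsignedShort(ByteBuffer buffer) {
        return buffer.getShort() & 0xFFFF;
    }

    // Reads 4 bytes as an unsigned 32-bit value (for example a cluster count).
    static long readUnsignedInt(ByteBuffer buffer) {
        return buffer.getInt() & 0xFFFFFFFFL;
    }
}
```

As an illustration, the bytes `42 08` decode to 2114 and `B0 B5 02 00` decode to 177584 when read this way.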

                        So in essence, if I compare your Q1-Q50 data with a single tile from one of our runs, I would say that we get similar results. Like this:

                        Code:
                        1       14      1110
                        Record: 1       Assigned: 0
                        Record: 2       Assigned: 0
                        Record: 3       Assigned: 0
                        Record: 4       Assigned: 0
                        Record: 5       Assigned: 0
                        Record: 6       Assigned: 2
                        Record: 7       Assigned: 11
                        Record: 8       Assigned: 437
                        Record: 9       Assigned: 117
                        Record: 10      Assigned: 1142
                        Record: 11      Assigned: 318
                        Record: 12      Assigned: 9
                        Record: 13      Assigned: 0
                        Record: 14      Assigned: 0
                        Record: 15      Assigned: 37
                        Record: 16      Assigned: 981
                        Record: 17      Assigned: 1695
                        Record: 18      Assigned: 675
                        Record: 19      Assigned: 795
                        Record: 20      Assigned: 42
                        Record: 21      Assigned: 236
                        Record: 22      Assigned: 148
                        Record: 23      Assigned: 752
                        Record: 24      Assigned: 1503
                        Record: 25      Assigned: 3797
                        Record: 26      Assigned: 1533
                        Record: 27      Assigned: 7995
                        Record: 28      Assigned: 527
                        Record: 29      Assigned: 2503
                        Record: 30      Assigned: 6832
                        Record: 31      Assigned: 8034
                        Record: 32      Assigned: 12791
                        Record: 33      Assigned: 16312
                        Record: 34      Assigned: 16697
                        Record: 35      Assigned: 13527
                        Record: 36      Assigned: 45063
                        Record: 37      Assigned: 39125
                        Record: 38      Assigned: 105567
                        Record: 39      Assigned: 110522
                        Record: 40      Assigned: 234803
                        Record: 41      Assigned: 878176
                        Record: 42      Assigned: 0
                        Record: 43      Assigned: 0
                        Record: 44      Assigned: 0
                        Record: 45      Assigned: 0
                        Record: 46      Assigned: 0
                        Record: 47      Assigned: 0
                        Record: 48      Assigned: 0
                        Record: 49      Assigned: 0
                        Record: 50      Assigned: 0
                        Here you see a distribution with multiple peaks.
                        QScores 6 - 10 have a minuscule peak.
                        QScores 15 - 20 have a minuscule peak.
                        QScores 21 and higher contribute to the main peak in the data.

                        Now... As mentioned earlier, this is just data from a single tile. To get a better, more complete picture, the values from all tiles should be averaged per cycle. This will most likely give a smoother distribution.
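
A minimal sketch of that per-cycle aggregation is shown below. It sums the per-tile histograms for each cycle; dividing each total by the number of tiles would give the average, and the shape of the distribution is the same either way. The `QScoreAggregator` class is hypothetical and assumes each tile contributes an array of 50 counts (Q1 through Q50).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical accumulator; not part of Metrix.
final class QScoreAggregator {

    static final int NUM_QSCORES = 50;

    // Running totals per cycle; index i holds the count for Q(i+1).
    private final Map<Integer, long[]> totalsByCycle = new TreeMap<>();

    // Adds the histogram of one tile to the running total of its cycle.
    void add(int cycle, long[] tileCounts) {
        if (tileCounts.length != NUM_QSCORES) {
            throw new IllegalArgumentException("Expected " + NUM_QSCORES + " counts");
        }
        long[] totals = totalsByCycle.computeIfAbsent(cycle, c -> new long[NUM_QSCORES]);
        for (int i = 0; i < NUM_QSCORES; i++) {
            totals[i] += tileCounts[i];
        }
    }

    // Returns a copy of the summed histogram for a cycle (all zeros if unseen).
    long[] totalsFor(int cycle) {
        long[] totals = totalsByCycle.get(cycle);
        return totals == null ? new long[NUM_QSCORES] : totals.clone();
    }
}
```

Calling `add(cycle, counts)` once per parsed tile record and then inspecting `totalsFor(cycle)` gives the per-cycle picture described above.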

                        Using this data you can also immediately see if tiles have dropped out and could not be scanned in the HiSeq or MiSeq. You will be able to recognise this by either a low QScore skew in the distribution, or a very low number of clusters assigned to any QScore in general (0's for the majority of the QScores).

                        I hope this helps!
                        Bernd



                        • #13
                          Originally posted by kmjones96
                          Thanks Bernd!

                          You mention that some features of Metrix are:

                          Collections of Run Summaries (Summary) can be stored in a SummaryCollection object
                          Summary and SummaryCollection are accessible from a client interface in several formats (XML and POJO)

                          Do these Run Summaries (Summary) have pre-defined sets of metrics? Is it anything like what is shown in SAV's "Summary" tab? Do you have an example of one of these Summaries in xml?

                          Thanks again,

                          Kristie

                          Hi Kristie,

                          As of now the Summaries and SummaryCollections are only generated by the server side of Metrix. I haven't made a separate specific interface for this yet.

                          --------------------------
                          So the flow of Metrix is as follows:

                          1. Run Directories (RD) get scanned.
                          2. Each RD has its InterOp directory parsed (MetrixLogic.java; Line 51. function: processMetrics)
                          3. These metrics are stored in a Summary Java object, which is serialized (stored) in the SQL database.

                          ---------------------------

                          These steps are repeated as long as the server is running and new data is added to the metrics files.

                          These Summary objects can be retrieved by the client-side of Metrix. On line 56 of MetrixClient.java you can see that a Command object is created and populated and sent accordingly for the desired information:

                          Code:
                          			
                          // Set a value for command
                          sendCommand.setFormat("XML");
                          sendCommand.setState(2); // Only fetch finished runs.
                          sendCommand.setCommand("FETCH");
                          oos.writeObject(sendCommand);
                          oos.flush();
                          MetrixThread.java is the part of Metrix that handles the incoming requests of connected clients and thus reads the commands.

                          Code:
                          // Retrieve set and return object.
                          SummaryCollection sc = ds.getSummaryCollections();
                          Combined with the client code above, this retrieves all the finished runs (state 2) in XML format.
                          In SummaryCollection.java (nki/objects/SummaryCollection.java) you will find the generation of the XML string.
                          Please note that not all available metrics (such as QMetrics) have been entered yet. This will be more customisable in the future.
                          Most values described in the Summary.java object can be retrieved. However, please note that not all of them have been implemented.
                          If you would like to add more values to the generated XML, you can do so by adding them in the getSummaryCollectionAsXML function of SummaryCollection.

                          As it stands now the generated XML looks like this:

                          Code:
                          <SummaryCollection active="1" error="5" finished="115" init="2" turn="">
                             <Summary>
                                 <runId>121022_M00003_0002_ABCDE</runId>
                                 <runType>Single End</runType>
                                 <flowcellId>000000000-ABCDE</flowcellId>
                                 <runSide/>
                                 <runState>2</runState>
                                 <runPhase/>
                                 <lastUpdated>21/03/2013 10:21:33</lastUpdated>
                                 <runDate>121022</runDate>
                                 <currentCycle>51</currentCycle>
                                 <totalCycle>51</totalCycle>
                                 <instrument>M00003</instrument>
                             </Summary>
                             <Summary>
                                 <runId>111202_SN002_0140_C0374DCXX-RUN131</runId>
                                 <runType>Paired End</runType>
                                 <flowcellId>C0374DCXX</flowcellId>
                                 <runSide/>
                                 <runState>2</runState>
                                 <runPhase/>
                                 <lastUpdated>21/03/2013 10:21:34</lastUpdated>
                                 <runDate>111202</runDate>
                                 <currentCycle>107</currentCycle>
                                 <totalCycle>107</totalCycle>
                                 <instrument>SN002</instrument>
                             </Summary>
                             <Summary>
                                 <runId>120203_SN002_0149_C06WFACXX-RUN130</runId>
                                 <runType>Paired End</runType>
                                 <flowcellId>C06WFACXX</flowcellId>
                                 <runSide/>
                                 <runState>2</runState>
                                 <runPhase/>
                                 <lastUpdated>21/03/2013 10:21:34</lastUpdated>
                                 <runDate>120203</runDate>
                                 <currentCycle>159</currentCycle>
                                 <totalCycle>159</totalCycle>
                                 <instrument>SN002</instrument>
                             </Summary>
                          </SummaryCollection>
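
On the receiving side, XML of this shape can be read with the DOM parser that ships with the JDK. The sketch below is not part of Metrix: the `SummaryXmlReader` class is hypothetical, and it assumes only the element names visible in the example above (`Summary`, `runId`, `runState`).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical client-side reader for the SummaryCollection XML.
final class SummaryXmlReader {

    private SummaryXmlReader() {
    }

    // Returns the runId of every Summary whose runState is 2 (finished).
    static List<String> finishedRunIds(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList summaries = doc.getElementsByTagName("Summary");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < summaries.getLength(); i++) {
                Element summary = (Element) summaries.item(i);
                if ("2".equals(childText(summary, "runState"))) {
                    ids.add(childText(summary, "runId"));
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse summary XML", e);
        }
    }

    // Returns the trimmed text of the first matching child, or "" if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```

Other fields such as `currentCycle` or `instrument` could be read the same way by passing their tag names to `childText`.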
                          As for whether these summary metrics look anything like the SAV summary page: it's difficult to say, as I haven't run SAV in a long time. SAV uses the same data sources (InterOp files) as Metrix does, so theoretically the same graphs could be generated. As of now, however, the Summary object isn't fully populated yet.

                          I hope this answers your question.

                          Bernd



                          • #14
                            Thanks, this helps!

                            Kristie



                            • #15
                              Bump.
                              Several minor parsing updates.

                              Added:
                              • Phasing, prephasing, cluster density and cluster density passing filter metrics are now parsed, stored and added to the Summary object.
                              • Printing method that iterates over a QualityScores object
                              • QScore Distribution Map. To be used for graphs or information processing.


                              TODO:
                              • Add overridden toXML() and toString() methods to all custom objects
                              • Parse and store phasing / prephasing per read metrics
                              • Support for calculating QScore distribution vs total million reads.
                              • Support for calculating base intensity / cycle.
                              • Support for outputting cluster density / lane.
                              • Support for parsing corrected intensity metrics (CorrectedIntensityMetricsOut.bin)
                              • Add error metrics object.


                              You can find these changes in the Metrix repository.
                              Last edited by Rhizosis; 07-01-2013, 04:35 AM.
