#1 |
Member
Location: Amsterdam Join Date: Mar 2012
Posts: 41
|
![]()
Hi,
A while ago our core facility ran into problems obtaining run statistics for Illumina (HiSeq and MiSeq) runs. We wanted to integrate these statistics into GNomEx (http://sourceforge.net/projects/gnomex/), an open-source LIMS that I contribute to. We also needed to gather statistics on the command line without any difficult integration or scripting. The result is Metrix, with which I was able to obtain sequence statistics for all the available runs, with a variable scanning interval. Several features of Metrix:
Each run directory will get a status code (state) appointed.
Please note that this is the first version and does not yet support full parsing of all InterOp files. Also, I have left the client side 'open' so that users can implement it in whatever way they prefer. It's quite easy if you're familiar with Java! The repository is available on GitHub. All feedback, bug reports and input are very welcome!
Good luck!
Bernd
Last edited by Rhizosis; 02-22-2013 at 01:12 AM.
#2 |
Junior Member
Location: San Francisco Join Date: Mar 2009
Posts: 6
|
![]()
This looks great! I've been struggling to get MiSeq metrics out of the InterOp files (Illumina has been no help) for over a year. I'll give this a whirl soon.
|
#3 |
--Site Admin--
Location: SF Bay Area, CA, USA Join Date: Oct 2007
Posts: 1,358
|
![]()
Ahhhhhhhhhhwesome! Will try asap!
|
#4 |
Junior Member
Location: Vancouver, Canada Join Date: Feb 2013
Posts: 3
|
![]()
Hello, my name is Eric. I just started a job at an HIV clinical testing lab in Vancouver, Canada. We purchased a MiSeq to develop tests for drug resistance in HIV+ individuals using deep sequencing. I'm supposed to help get this off the ground...
We will likely need to upload quality data to our database, and Metrix looks great for helping us do this. I've noticed a report from Illumina here: http://www.illumina.com/Documents/pr...operations.pdf. Pages 26-29 describe the binary structure of these InterOp files. However, this document appears to cover the HiSeq, not the MiSeq, and I can't find similar documents for the MiSeq. Do they have the same structure?
UPDATE: I directly executed your QualityMetrics + LittleEndianInputStream code on a sample QMetricsOut.bin file we have, and I'm getting output. But because I don't know the structure of these InterOp files, I can't validate whether it worked or not... ;( Here's some sample output:
1 202 2114
Record: 1 Assigned: 0
Record: 2 Assigned: 0
Record: 3 Assigned: 0
Record: 4 Assigned: 0
Record: 5 Assigned: 0
Record: 6 Assigned: 0
Record: 7 Assigned: 0
Record: 8 Assigned: 0
Record: 9 Assigned: 0
Record: 10 Assigned: 0
Record: 11 Assigned: 0
Record: 12 Assigned: 228
Record: 13 Assigned: 110
Record: 14 Assigned: 9893
Record: 15 Assigned: 1791
Record: 16 Assigned: 5314
Record: 17 Assigned: 3048
Record: 18 Assigned: 3326
Record: 19 Assigned: 3101
Record: 20 Assigned: 40
Record: 21 Assigned: 0
Record: 22 Assigned: 0
Record: 23 Assigned: 0
Record: 24 Assigned: 39
Record: 25 Assigned: 6
Record: 26 Assigned: 28
Record: 27 Assigned: 1039
Record: 28 Assigned: 50
Record: 29 Assigned: 2068
Record: 30 Assigned: 5081
Record: 31 Assigned: 2586
Record: 32 Assigned: 2301
Record: 33 Assigned: 11376
Record: 34 Assigned: 7843
Record: 35 Assigned: 7224
Record: 36 Assigned: 22598
Record: 37 Assigned: 55197
Record: 38 Assigned: 110514
Record: 39 Assigned: 177584
Record: 40 Assigned: 0
Record: 41 Assigned: 0
Record: 42 Assigned: 0
Does this output look sane? (What -are- cluster scores anyway?) Sorry, and thanks so much for being patient. We're still setting up the machine, installing software, etc., so my real-world experience with this data is absolutely nil.
UPDATE 2: I will see if I can load these files into Sequence Analysis Viewer (SAV) to check whether the same values show up in the software.
-Eric
Last edited by emartin; 02-27-2013 at 05:36 PM.
#5 | |||
Member
Location: Amsterdam Join Date: Mar 2012
Posts: 41
|
![]() Quote:
Many thanks for putting Metrix through its paces. It's always difficult to develop a generic system that will function at every core facility's site. I'm glad to hear that you got some output. To answer your questions in order: Quote:
Quote:
In your table, 'Record' reflects the Q-score. Across all measured statistics, Q39 should have the most clusters assigned to it, so in your case it looks perfectly fine (177,584 clusters assigned score Q39).
All the values are stored in a hierarchical manner, so what you see is: Cycle --> Lane --> Tile --> QScore value (Q1 - Q50) --> number of assigned clusters. Keep in mind that the raw values you are seeing are on a per-tile basis. The values you printed are from:
Lane: 1
Cycle: 202
Tile: 2114
Using these statistics you can also study outliers more easily and possibly find out why certain clusters have a lower quality score. In other words, you can find out whether, and why, the Q-score distribution is skewed.
I hope you can get everything up and running! Questions? Fire away.
Bernd
Last edited by Rhizosis; 02-28-2013 at 12:03 AM.
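To make the per-tile record described above concrete, here is a minimal Java sketch of reading one such record from a little-endian buffer. It is not Metrix's own code. The field order and widths (three unsigned 16-bit values for lane, tile and cycle, followed by fifty unsigned 32-bit cluster counts for Q1 to Q50) are an assumption about the record layout, and the class and method names are hypothetical. The sketch covers a single record only and ignores any file header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: one per-tile Q-score record, not the Metrix implementation.
class QMetricsRecordSketch {
    static final int Q_BINS = 50; // Q1..Q50, as described in the post above
    // ASSUMPTION: 3 x uint16 (lane, tile, cycle) + 50 x uint32 cluster counts.
    static final int RECORD_BYTES = 3 * Short.BYTES + Q_BINS * Integer.BYTES;

    final int lane;
    final int tile;
    final int cycle;
    final long[] clusters = new long[Q_BINS]; // clusters[i] is the count for Q(i+1)

    private QMetricsRecordSketch(int lane, int tile, int cycle) {
        this.lane = lane;
        this.tile = tile;
        this.cycle = cycle;
    }

    /** Reads one record from the buffer's current position, little-endian. */
    static QMetricsRecordSketch parse(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        int lane = Short.toUnsignedInt(buf.getShort());
        int tile = Short.toUnsignedInt(buf.getShort());
        int cycle = Short.toUnsignedInt(buf.getShort());
        QMetricsRecordSketch r = new QMetricsRecordSketch(lane, tile, cycle);
        for (int i = 0; i < Q_BINS; i++) {
            r.clusters[i] = Integer.toUnsignedLong(buf.getInt());
        }
        return r;
    }

    /** Returns the Q-score (1-based) with the most clusters; ties go to the lowest. */
    int modalQScore() {
        int best = 0;
        for (int i = 1; i < Q_BINS; i++) {
            if (clusters[i] > clusters[best]) {
                best = i;
            }
        }
        return best + 1;
    }

    /** Builds a synthetic record using a few values quoted in the thread. */
    static ByteBuffer syntheticRecord() {
        ByteBuffer buf = ByteBuffer.allocate(RECORD_BYTES).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) 1);    // lane
        buf.putShort((short) 2114); // tile
        buf.putShort((short) 202);  // cycle
        for (int q = 1; q <= Q_BINS; q++) {
            buf.putInt(q == 38 ? 110514 : q == 39 ? 177584 : 0);
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        QMetricsRecordSketch r = parse(syntheticRecord());
        System.out.println("Lane " + r.lane + " Cycle " + r.cycle + " Tile " + r.tile
                + " modal Q" + r.modalQScore());
    }
}
```

Reading the counts as unsigned values avoids negative numbers for very large counts, which Java's signed `int` would otherwise produce.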
|||
#6 |
Junior Member
Location: Vancouver, Canada Join Date: Feb 2013
Posts: 3
|
![]()
Thanks for your response, Bernd! I am curious about some code you have in ExtractionMetrics.java:
public void outputData() {
    ...
    long dateTime = leis.readLong();
    ...
}
When I print this date out using your code, I get a value such as -8588401683902830492, which I can reproduce in Perl (my native language) if I unpack as a signed quad (64-bit). I also unpacked as an unsigned quad and I get a value of 9858342390026624502, which at least is positive, but I still don't know what time standard this is. Aren't dates typically the number of milliseconds since a certain date ("epoch")? If so, the negative value does not make sense to me. (It's also possible we are experiencing system-specific differences in interpretation. Do you get different numbers when you execute locally?)
UPDATE: I talked to developers at Illumina. This is a C# DateTime object, with the first 2 bits representing a "Kind" field and the remaining 62 bits representing an integer value, "ticks": the number of 100 ns intervals since Gregorian midnight, Jan 1, 0001. With that in mind, we can probably figure out how to parse this date, if we want to.
I was also wondering (and I REALLY appreciate you taking the time to do this, should you choose to) whether you understand what's going on with ErrorMetrics. When the documentation says "error rate", which errors is it really talking about? When it says "read with 1 error", what sort of error is it?
Thanks...
-Eric
Last edited by emartin; 03-05-2013 at 06:48 PM.
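The layout described in the UPDATE above (2 "Kind" bits, then 62 bits of 100 ns ticks counted from Jan 1, 0001) can be decoded in Java roughly as follows. This is a sketch under that description, not code from Metrix. The class and method names are hypothetical. The constant 621355968000000000 is the number of ticks between Jan 1, 0001 and the Unix epoch. Whether the ticks represent UTC or the instrument's local wall-clock time is not settled in this thread, so the result is returned as a zone-less `LocalDateTime`.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical sketch: decoding a serialized C# DateTime read as a 64-bit long.
class DotNetDateTimeSketch {
    private static final long TICKS_MASK = 0x3FFFFFFFFFFFFFFFL;           // low 62 bits
    private static final long TICKS_AT_UNIX_EPOCH = 621_355_968_000_000_000L;
    private static final long TICKS_PER_SECOND = 10_000_000L;             // 1 tick = 100 ns

    /** Top two bits: the "Kind" field (0, 1 or 2 in .NET's DateTimeKind). */
    static int kind(long raw) {
        return (int) (raw >>> 62);
    }

    /** Low 62 bits interpreted as ticks since 0001-01-01T00:00:00, zone-less. */
    static LocalDateTime toLocalDateTime(long raw) {
        long ticks = raw & TICKS_MASK;
        long sinceEpoch = ticks - TICKS_AT_UNIX_EPOCH;
        long seconds = Math.floorDiv(sinceEpoch, TICKS_PER_SECOND);
        int nanos = (int) (Math.floorMod(sinceEpoch, TICKS_PER_SECOND) * 100);
        // The UTC offset here only anchors the arithmetic; no zone is implied.
        return LocalDateTime.ofEpochSecond(seconds, nanos, ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        long raw = -8588401683902830492L; // signed value quoted in the post above
        System.out.println("kind=" + kind(raw) + " time=" + toLocalDateTime(raw));
    }
}
```

Applied to the signed value quoted above, this yields a Kind of 2 and a timestamp of 21 February 2013, 09:21:35, which fits the timeframe of the thread. Masking makes the sign of the raw `long` irrelevant, so no unsigned conversion is needed.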
#7 |
Member
Location: Amsterdam Join Date: Mar 2012
Posts: 41
|
![]()
Hi Eric -
I haven't extensively tested ExtractionMetrics.java yet. I'm afraid I just assumed that the value produced was the time in milliseconds since the epoch (January 1, 1970, 00:00:00 GMT), which is completely my mistake! Thank you for clearing this up with Illumina. I will convert these values accordingly and store them in the Summary object in the future.
As I stated in my opening post, I haven't implemented (extracted, parsed and stored) all metrics yet, because we initially needed only certain metrics. New metrics and a stability update for the server will come with the next update.
To answer your question about ErrorMetrics: this is (most likely; 99% sure!) the error metric of the PhiX control (either a whole lane or just the PhiX added as a control in a lane). If I'm correct, these error.htm metrics files were generated in runs that had been post-processed by ELAND. So in essence, ErrorMetrics will give you:
- Number of perfect reads in PhiX
- Number of reads with 1 error in PhiX
- Number of reads with 2 errors in PhiX
- Number of reads with 3 errors in PhiX
- Number of reads with 4 errors in PhiX
per cycle, per tile, per lane.
Let me know if you would like to obtain a certain metric from the files, and I will put it on my list. Please note that I'm not 100% fluent in Java, and the code could be written far more efficiently.
Good luck,
Bernd
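As an illustration of the "per cycle, per tile, per lane" counts listed above, the following Java sketch rolls per-tile PhiX read counts up to lane totals. The `Entry` type, its fields and the sample numbers are all hypothetical and chosen only for the example. They do not come from Metrix or from a real run.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: summing PhiX error counts per lane.
class ErrorMetricsSketch {
    static final int ERROR_BINS = 5; // reads with 0, 1, 2, 3 or 4 errors

    /** One per-cycle, per-tile, per-lane entry; readsByErrors[k] = reads with k errors. */
    static final class Entry {
        final int lane;
        final int tile;
        final int cycle;
        final long[] readsByErrors;

        Entry(int lane, int tile, int cycle, long... readsByErrors) {
            if (readsByErrors.length != ERROR_BINS) {
                throw new IllegalArgumentException("expected " + ERROR_BINS + " counts");
            }
            this.lane = lane;
            this.tile = tile;
            this.cycle = cycle;
            this.readsByErrors = readsByErrors;
        }
    }

    /** Adds up the five counts over all tiles and cycles of each lane. */
    static Map<Integer, long[]> sumPerLane(List<Entry> entries) {
        Map<Integer, long[]> totals = new TreeMap<>();
        for (Entry e : entries) {
            long[] t = totals.computeIfAbsent(e.lane, k -> new long[ERROR_BINS]);
            for (int i = 0; i < ERROR_BINS; i++) {
                t[i] += e.readsByErrors[i];
            }
        }
        return totals;
    }

    /** Made-up entries, for illustration only. */
    static List<Entry> sampleEntries() {
        return Arrays.asList(
                new Entry(1, 1101, 1, 900, 60, 25, 10, 5),
                new Entry(1, 1102, 1, 800, 120, 50, 20, 10),
                new Entry(2, 1101, 1, 500, 0, 0, 0, 0));
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, long[]> lane : sumPerLane(sampleEntries()).entrySet()) {
            System.out.println("Lane " + lane.getKey() + ": " + Arrays.toString(lane.getValue()));
        }
    }
}
```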
#8 |
Junior Member
Location: Gaithersburg, MD Join Date: Oct 2010
Posts: 3
|
![]()
Just curious, do you have any screenshots of what the output of Metrix looks like? How are the statistics/metrics output or shown?
Thanks! Kristie |
#9 |
Member
Location: Amsterdam Join Date: Mar 2012
Posts: 41
|
![]()
Hi Kristie
Thank you for your interest! There is no single answer to your question because of the nature of Metrix itself. From the ground up it is meant to function in two ways:
1. As a server interface, where the server monitors an Illumina run directory, together with its counterpart, the client interface, which can use certain Command objects (POJO; Plain Old Java Object) to request specific parsed data from the server.
2. If you decide to dissect Metrix and use the parsing components outside the server interface, you can instantiate them with the InterOp directory as an argument. As of now you have to decide on your own output formatting.
The reason I allowed for option two is that there is no set format for this kind of data, and I did not feel like creating yet another data structure. Most of the time people are only interested in certain metrics, and this way they can choose their own variables and output formatting. If you would like to see specific features, or if you need help, please do leave a message.
Thanks!
Bernd
#10 |
Junior Member
Location: Gaithersburg, MD Join Date: Oct 2010
Posts: 3
|
![]()
Thanks Bernd!
You mention that some features of Metrix are:
- Collections of Run Summaries (Summary) can be stored in a SummaryCollection object
- Summary and SummaryCollection are accessible from a client interface in several formats (XML and POJO)
Do these Run Summaries (Summary) have pre-defined sets of metrics? Is it anything like what is shown in SAV's "Summary" tab? Do you have an example of one of these Summaries in XML?
Thanks again,
Kristie
#11 |
Junior Member
Location: Vancouver, Canada Join Date: Feb 2013
Posts: 3
|
![]()
Hi Bernd. Thanks for your earlier response on PhiX controls.
Now that I can parse Q-metrics, I'm starting to pay attention to the Q-value distributions, and I noticed there isn't ANY data in Q1-Q11, even if I aggregate by tile/cycle. There are simply no scores in that range. (In more detail: we see nothing from Q1-Q11, a tiny bump of 'mediocre' data from Q12-Q21, and most of the data in Q32-Q39, with nothing beyond Q39.) Is this typical for runs on your end?
Thanks a lot.
-Eric
Last edited by emartin; 03-20-2013 at 04:11 PM.
#12 |
Member
Location: Amsterdam Join Date: Mar 2012
Posts: 41
|
![]()
Hi Eric -
The data you posted earlier was the data of one specific tile. The LittleEndianInputStream combined with the QMetrics parser only obtains the separate values for each Lane -> Cycle -> Tile. As you might know, certain tiles may not perform optimally, sometimes generating no results at all and sometimes giving a skewed distribution because signals were not picked up (or were not there at all on that tile). So in essence, if I compare your Q1-Q50 data for a single tile with ours, I would say that we get the same kind of results. Like this:
Code:
1 14 1110
Record: 1 Assigned: 0
Record: 2 Assigned: 0
Record: 3 Assigned: 0
Record: 4 Assigned: 0
Record: 5 Assigned: 0
Record: 6 Assigned: 2
Record: 7 Assigned: 11
Record: 8 Assigned: 437
Record: 9 Assigned: 117
Record: 10 Assigned: 1142
Record: 11 Assigned: 318
Record: 12 Assigned: 9
Record: 13 Assigned: 0
Record: 14 Assigned: 0
Record: 15 Assigned: 37
Record: 16 Assigned: 981
Record: 17 Assigned: 1695
Record: 18 Assigned: 675
Record: 19 Assigned: 795
Record: 20 Assigned: 42
Record: 21 Assigned: 236
Record: 22 Assigned: 148
Record: 23 Assigned: 752
Record: 24 Assigned: 1503
Record: 25 Assigned: 3797
Record: 26 Assigned: 1533
Record: 27 Assigned: 7995
Record: 28 Assigned: 527
Record: 29 Assigned: 2503
Record: 30 Assigned: 6832
Record: 31 Assigned: 8034
Record: 32 Assigned: 12791
Record: 33 Assigned: 16312
Record: 34 Assigned: 16697
Record: 35 Assigned: 13527
Record: 36 Assigned: 45063
Record: 37 Assigned: 39125
Record: 38 Assigned: 105567
Record: 39 Assigned: 110522
Record: 40 Assigned: 234803
Record: 41 Assigned: 878176
Record: 42 Assigned: 0
Record: 43 Assigned: 0
Record: 44 Assigned: 0
Record: 45 Assigned: 0
Record: 46 Assigned: 0
Record: 47 Assigned: 0
Record: 48 Assigned: 0
Record: 49 Assigned: 0
Record: 50 Assigned: 0
QScores 6 - 10 have a minuscule peak.
QScores 15 - 20 have a minuscule peak.
QScores 21 and higher contribute to the main peak in the data.
Now... as mentioned earlier, this is just data from a single tile. To get a better, more complete picture, the values from all tiles should be averaged per cycle. This will most likely give a smoother distribution. Using this data you can also see immediately whether tiles have dropped out and could not be scanned in the HiSeq or MiSeq. You can recognise this either by a skew towards low QScores in the distribution, or by a very low number of clusters assigned to any QScore in general (0's for the majority of the QScores).
I hope this helps!
Bernd
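The per-cycle aggregation and the dropped-tile check described in the post above could be sketched in Java as follows. This is an illustration under stated assumptions, not part of Metrix. It sums the bins across tiles rather than averaging them (the shape of the distribution is the same), the Q30 cutoff in the example is just a commonly used threshold, and the class name, helper names and sample counts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: combining per-tile Q-score histograms for one cycle.
class QHistogramSketch {
    static final int Q_BINS = 50; // index i holds the cluster count for Q(i+1)

    /** Element-wise sum of the per-tile histograms. */
    static long[] sumBins(List<long[]> tiles) {
        long[] total = new long[Q_BINS];
        for (long[] tile : tiles) {
            for (int i = 0; i < Q_BINS; i++) {
                total[i] += tile[i];
            }
        }
        return total;
    }

    /** Fraction of clusters with a Q-score of at least minQ (0.0 if there are none). */
    static double fractionAtOrAbove(long[] bins, int minQ) {
        long all = 0;
        long good = 0;
        for (int q = 1; q <= bins.length; q++) {
            all += bins[q - 1];
            if (q >= minQ) {
                good += bins[q - 1];
            }
        }
        return all == 0 ? 0.0 : (double) good / all;
    }

    /** Indexes of tiles whose total cluster count is below minClusters. */
    static List<Integer> sparseTiles(List<long[]> tiles, long minClusters) {
        List<Integer> sparse = new ArrayList<>();
        for (int i = 0; i < tiles.size(); i++) {
            if (Arrays.stream(tiles.get(i)).sum() < minClusters) {
                sparse.add(i);
            }
        }
        return sparse;
    }

    /** Builds a histogram from (qScore, count) pairs; all other bins stay 0. */
    static long[] bins(long... qThenCount) {
        long[] b = new long[Q_BINS];
        for (int i = 0; i + 1 < qThenCount.length; i += 2) {
            b[(int) qThenCount[i] - 1] = qThenCount[i + 1];
        }
        return b;
    }

    /** Three made-up tiles; the last one has no clusters at all. */
    static List<long[]> sampleTiles() {
        return Arrays.asList(bins(30, 10, 40, 30), bins(10, 10, 30, 50), bins());
    }

    public static void main(String[] args) {
        List<long[]> tiles = sampleTiles();
        long[] total = sumBins(tiles);
        System.out.println("Fraction >= Q30: " + fractionAtOrAbove(total, 30));
        System.out.println("Sparse tiles: " + sparseTiles(tiles, 1));
    }
}
```

A tile reported by `sparseTiles` corresponds to the "0's for the majority of the QScores" case mentioned above. A skew towards low Q-scores would show up as a low `fractionAtOrAbove` value for that tile.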
#13 | |
Member
Location: Amsterdam Join Date: Mar 2012
Posts: 41
|
![]() Quote:
As of now the Summaries and SummaryCollections are only generated by the server side of Metrix. I haven't made a separate specific interface for this yet.
--------------------------
So the flow of Metrix is as follows:
1. Run Directories (RD) get scanned.
2. Each RD has its InterOp directory parsed (MetrixLogic.java; line 51, function: processMetrics).
3. These metrics are stored in a Summary Java object, which is serialized (stored) in the SQL database.
---------------------------
These steps are repeated as long as the server is running and new data is added to the metrics files. The Summary objects can be retrieved by the client side of Metrix. On line 56 of MetrixClient.java you can see that a Command object is created, populated and sent to request the desired information:
Code:
// Set a value for command
sendCommand.setFormat("XML");
sendCommand.setState(2); // Only fetch finished runs.
sendCommand.setCommand("FETCH");
oos.writeObject(sendCommand);
oos.flush();
Code:
// Retrieve set and return object.
SummaryCollection sc = ds.getSummaryCollections();
In SummaryCollection.java (nki/objects/SummaryCollection.java) you will find the generation of the XML string. Please note that not all available metrics (such as QMetrics) have been entered yet. This will be more customisable in the future. Most values described in the Summary.java object can be retrieved, but please note that not all of them have been implemented. If you would like to add more values to the generated XML, you can do so by adding them in the getSummaryCollectionAsXML function of SummaryCollection. As it stands now, the generated XML looks like this:
Code:
<SummaryCollection active="1" error="5" finished="115" init="2" turn="">
  <Summary>
    <runId>121022_M00003_0002_ABCDE</runId>
    <runType>Single End</runType>
    <flowcellId>000000000-ABCDE</flowcellId>
    <runSide/>
    <runState>2</runState>
    <runPhase/>
    <lastUpdated>21/03/2013 10:21:33</lastUpdated>
    <runDate>121022</runDate>
    <currentCycle>51</currentCycle>
    <totalCycle>51</totalCycle>
    <instrument>M00003</instrument>
  </Summary>
  <Summary>
    <runId>111202_SN002_0140_C0374DCXX-RUN131</runId>
    <runType>Paired End</runType>
    <flowcellId>C0374DCXX</flowcellId>
    <runSide/>
    <runState>2</runState>
    <runPhase/>
    <lastUpdated>21/03/2013 10:21:34</lastUpdated>
    <runDate>111202</runDate>
    <currentCycle>107</currentCycle>
    <totalCycle>107</totalCycle>
    <instrument>SN002</instrument>
  </Summary>
  <Summary>
    <runId>120203_SN002_0149_C06WFACXX-RUN130</runId>
    <runType>Paired End</runType>
    <flowcellId>C06WFACXX</flowcellId>
    <runSide/>
    <runState>2</runState>
    <runPhase/>
    <lastUpdated>21/03/2013 10:21:34</lastUpdated>
    <runDate>120203</runDate>
    <currentCycle>159</currentCycle>
    <totalCycle>159</totalCycle>
    <instrument>SN002</instrument>
  </Summary>
</SummaryCollection>
I hope this answers your question.
Bernd
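On the receiving end, XML of the shape shown above can be read with the JDK's built-in DOM parser. The sketch below is not part of Metrix. The class and method names are hypothetical, and it relies only on the element names visible in the sample and on the code comment stating that state 2 marks finished runs. The second Summary in the sample string is an invented unfinished run, added so the filter has something to skip.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: reading run IDs out of a SummaryCollection XML string.
class SummaryXmlSketch {
    /** Returns the runId of every Summary whose runState is 2 (finished). */
    static List<String> finishedRunIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList summaries = doc.getElementsByTagName("Summary");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < summaries.getLength(); i++) {
                Element summary = (Element) summaries.item(i);
                String state = text(summary, "runState");
                if ("2".equals(state)) {
                    ids.add(text(summary, "runId"));
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse summary XML", e);
        }
    }

    /** Trimmed text of the first child element with the given tag, or "" if absent. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** First Summary is abridged from the post; the second is invented for the example. */
    static String sampleXml() {
        return "<SummaryCollection active=\"1\" error=\"5\" finished=\"115\" init=\"2\" turn=\"\">"
                + "<Summary><runId>121022_M00003_0002_ABCDE</runId>"
                + "<runState>2</runState><currentCycle>51</currentCycle></Summary>"
                + "<Summary><runId>HYPOTHETICAL_RUNNING_RUN</runId>"
                + "<runState>1</runState><currentCycle>12</currentCycle></Summary>"
                + "</SummaryCollection>";
    }

    public static void main(String[] args) {
        System.out.println(finishedRunIds(sampleXml()));
    }
}
```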
|
#14 |
Junior Member
Location: Gaithersburg, MD Join Date: Oct 2010
Posts: 3
|
![]()
Thanks, this helps!
Kristie |
#15 |
Member
Location: Amsterdam Join Date: Mar 2012
Posts: 41
|
![]()
Bump.
Several minor parsing updates. Added:
TODO:
You can find these changes at the repository Metrix. Last edited by Rhizosis; 07-01-2013 at 05:35 AM. |
#16 |
Member
Location: San Diego, CA Join Date: May 2013
Posts: 10
|
![]()
Hi all,
I'm from Illumina and over the past few months, I've built a package in R and some scripts in perl to accurately parse the binary InterOp files. I'm not sure if you've fixed the C# DateTime problem specified by emartin (in fact, I remember receiving a question from tech support about this, and I think those words in red came directly from my email response back!), but this is done correctly in the R and perl scripts. These are unsupported, which means tech support will not be able to help you with them, but we've tested these internally and I've received approval from my manager to share them with those that ask. So please PM me if these scripts would be useful for you. Cheers, mchen1 |
#17 |
Member
Location: Amsterdam Join Date: Mar 2012
Posts: 41
|
![]()
Hi Mchen1,
Many thanks. As you might imagine, the datetime isn't the most important metric to parse, but it would be nice to get it in there eventually. I haven't looked at that metric for quite a while, but I'll get around to it relatively soon. First I will integrate Metrix with our open-source academic LIMS, GNomEx.
Cheers,
Bernd
#18 |
Member
Location: United States of America Join Date: Mar 2011
Posts: 52
|
![]()
The CPAN module Bio::IlluminaSAV is a parser for Perl.
|
#19 |
Member
Location: Amsterdam Join Date: Mar 2012
Posts: 41
|
![]()
Dear all,
As the CommandProcessor module has been fully integrated, the parsing of the commands is much easier now. Several other things have changed. Added:
Todo:
Requested:
*Preparations for API integrations have been made in the CommandProcessor (https://github.com/NKI-GCF/Metrix/bl...Processor.java).
If you would like to see features added, please do let me know.
Bernd
Last edited by Rhizosis; 06-18-2013 at 02:20 AM.
#20 |
Member
Location: Amsterdam Join Date: Mar 2012
Posts: 41
|
![]()
A quick update:
Added:
Todo:
As always, do let me know if you would like to see a specific feature. Bernd |
Tags |
illumina, interop, metrix, parsing, server |