SEQanswers


Closed Thread
 
Old 05-24-2013, 08:46 AM   #1
mchen1
Member
 
Location: San Diego, CA

Join Date: May 2013
Posts: 10
Default Illumina InterOp Parsers (perl and R)

Hi All,

Since starting at Illumina recently, one of my first projects was to write sample scripts to parse the InterOp files. We receive requests for how to parse the binary files quite often and having working code is a bit more helpful than the previous solution of sending a document describing the binary formatting.

In any case, I've built a package in R that can import the InterOp files into an R session (for visualization) and can also write out the data to flat files. I also have perl scripts that just write out the data to flat files.

I wrote these a few months ago, and we've tested them internally and found them quite useful. After I came across a couple of other InterOp parsing threads here, I asked my manager for approval to send them out to people who need accurate InterOp parsing outside of SAV. If you have this need, please PM me and I will send you my email address, or email me directly via SEQanswers. With enough requests, I may even receive approval to post the code to GitHub.

Officially, these are *unsupported*, so please do not call tech support referring to these scripts. They will have no idea what you are talking about. Instead, email me with questions or bugs and I'll do my best to straighten you out.

Cheers,
mchen1

Last edited by mchen1; 05-24-2013 at 10:08 AM. Reason: enabled option to email me directly from seqanswers...
mchen1 is offline  
Old 05-28-2013, 10:08 AM   #2
mchen1
Member
 
Location: San Diego, CA

Join Date: May 2013
Posts: 10
Default

For those who have messaged me, thanks for the feedback! If anyone needs code in other languages, please let me know or put a request into this thread. I coded these scripts in R and perl because that is what I am familiar with, but with enough demand, we can develop a few more in other languages for release.
mchen1 is offline  
Old 06-07-2013, 08:25 PM   #3
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

The CPAN module Bio::IlluminaSAV is a parser for Perl.
earonesty is offline  
Old 06-12-2013, 09:59 AM   #4
mchen1
Member
 
Location: San Diego, CA

Join Date: May 2013
Posts: 10
Default

Hi earonesty,

The CPAN module looks to be nice, but I think our scripts meet different needs. There is certainly room for both. The IlluminaSAV module appears to parse the InterOp data into perl arrays. This would seem ideal for someone working in perl who may want to access the InterOp data for further manipulation. On a separate note (perhaps a feature request?), there is a new InterOp file called IndexMetricsOut.bin that describes index metrics and does not appear to be in the IlluminaSAV perl module documentation. If you need the binary format for this file in order to update your perl module, shoot me a PM or email with your email address, and I can send you our latest documentation.

The scripts I've written here are designed to simply convert InterOp data into flat files, and do so as efficiently as possible. I designed them for those wanting to parse the InterOp data for entry into a LIMS system, for example. Since most LIMS are custom-built, flat files seem to be the most universally accepted format for the data. The perl code is also sent without module packaging so that users can see how I parse the binary files in case they want to integrate the code into their own perl work. The perl code also comes without dependencies on other modules so it works out of the box with any modern perl installation (I personally dislike having to install perl modules, especially in the context of group IT policies). In any case, the goal of the two packages is the same, but it would seem our design parameters differ.
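
As a rough illustration of the binary-to-flat-file conversion described above, here is a minimal sketch in Python. It is not the perl or R code discussed in this thread. The record layout is an assumption made for the example (a 1-byte version, a 1-byte record length, then little-endian records of lane, tile, metric code and value); the real InterOp layouts differ per file and version and should be taken from Illumina's documentation. The names `read_records` and `write_flat_file` are hypothetical.

```python
import csv
import io
import struct

# Hypothetical record layout (an assumption, not taken from Illumina's spec):
# lane uint16, tile uint16, metric code uint16, value float32, little-endian.
RECORD = struct.Struct("<HHHf")
FIELDS = ("lane", "tile", "code", "value")


def read_records(stream):
    """Yield one tuple per fixed-size record in a binary stream."""
    header = stream.read(2)
    if len(header) < 2:
        raise ValueError("file too short to contain a header")
    # The version byte is read but not interpreted in this sketch.
    version, record_size = header[0], header[1]
    if record_size != RECORD.size:
        raise ValueError("unexpected record size %d" % record_size)
    while True:
        chunk = stream.read(record_size)
        if len(chunk) < record_size:
            break  # end of data (a trailing partial record is ignored)
        yield RECORD.unpack(chunk)


def write_flat_file(records, out):
    """Write records as comma-separated text with a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(records)


# Demo on synthetic bytes rather than a real run folder.
demo = (bytes([2, RECORD.size])
        + RECORD.pack(1, 1101, 100, 0.5)
        + RECORD.pack(1, 1102, 100, 0.25))
out = io.StringIO()
write_flat_file(read_records(io.BytesIO(demo)), out)
flat_text = out.getvalue()
print(flat_text)
```

Only the standard library is used, in keeping with the no-dependencies goal mentioned above.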

Hopefully this discussion can illuminate how the packages differ in case users are deciding between the two.

M
mchen1 is offline  
Old 06-18-2013, 10:51 PM   #5
Rhizosis
Member
 
Location: Amsterdam

Join Date: Mar 2012
Posts: 41
Default

Hi MChen,

Thank you for your previous PM regarding your parsing scripts. The IndexMetricsOut.bin statistics sound interesting to me as well. Would you be so kind as to send the latest Theory of Operation (I assume) documentation to the same email address?

Many thanks in advance,

B
Rhizosis is offline  
Old 07-08-2013, 06:18 AM   #6
Bardj
UB Buffalo Bioinformatics
 
Location: Buffalo NY

Join Date: Nov 2009
Posts: 26
Thumbs up Highly Recommended

I recently had the opportunity to use mchen1's script package and I highly recommend it. It is very straightforward, extremely easy to use, and well documented.

I was interested in parsing the InterOp folders for 50+ Illumina HiSeq runs and gathering the statistics to report our general performance over time, and I was able to do so quickly and accurately.
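
For anyone attempting a similar multi-run survey, the first step is usually locating every run folder that has an InterOp subfolder. The following Python sketch shows one way to do that; it is an illustration only, not part of mchen1's package. The directory names and the function `find_interop_dirs` are made up, and it assumes all run folders sit directly under one root directory.

```python
import os
import tempfile


def find_interop_dirs(root):
    """Return run folders directly under root that contain an InterOp subfolder."""
    runs = []
    for name in sorted(os.listdir(root)):
        run_dir = os.path.join(root, name)
        if os.path.isdir(os.path.join(run_dir, "InterOp")):
            runs.append(run_dir)
    return runs


# Demo on a temporary directory tree with made-up run names.
root = tempfile.mkdtemp()
for run in ("run_A", "run_B"):
    os.makedirs(os.path.join(root, run, "InterOp"))
os.makedirs(os.path.join(root, "not_a_run"))

found = [os.path.basename(p) for p in find_interop_dirs(root)]
print(found)
```

Each returned path can then be handed to whichever parser is in use.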

Thanks for sharing the code!
Bardj is offline  
Old 07-26-2013, 01:26 PM   #7
WhiteSeal
Member
 
Location: Netherlands

Join Date: Jul 2013
Posts: 13
Default

I would like to try the package for parsing InterOp files as well. Did I read correctly that there is an R version?
WhiteSeal is offline  
Old 07-31-2013, 11:38 AM   #8
iamh2o
Junior Member
 
Location: San Francisco

Join Date: Mar 2009
Posts: 6
Default

Reposting for a colleague of mine at InVitae who wrote an open source python parser for exactly this.

#######
Greetings all,

I work at InVitae and we just publicly released a library called Illuminate.

https://bitbucket.org/invitae/illuminate/

The purpose of Illuminate is to emulate the stats you see when you load a run data folder within Illumina SAV, providing programmatic access to these metrics for whatever purposes you may have -- data storage, analysis, automated machine monitoring, and so on.

This is completely free, open source software (MIT License) written in Python with the intent to be used, tested, and improved upon by the bioinformatics community.

Features:
Simple command-line tool you can use to quickly inspect a run.
Built to be easily integrated into other code.
Easily extensible even if you think you are "not much of a programmer".
Results standardized to pandas DataFrame objects (so if you know how to work in R, you can probably get up to speed quickly with this)

Here's an example of the smallest python script you could get away with using this tool.

Code:
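import illuminate

# Load a run folder; the path is a placeholder
myDataset = illuminate.InteropDataset('path/to/rundata/')

# print() with a single argument runs under both Python 2 and Python 3
print(myDataset.meta)
print(myDataset.IndexMetrics())
print(myDataset.TileMetrics())
print(myDataset.QualityMetrics())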
And here's an example of how you would use the command-line reporter to do the same thing:

Code:
python illuminate --meta --index --tile --quality /path/to/rundata
You can even have illuminate open up in an interactive iPython shell, where the dataset will be loaded up into an InteropDataset object for you:

Code:
python illuminate -i /path/to/rundata
Not all of the metrics objects are fully fleshed out yet, although all of the binary parsers are "feature complete" in that you can produce a data dictionary and a DataFrame from them.

I'm hoping that some of you fine folks can pipe up and let me know what might be useful to you -- or better, submit contributions, bug reports, and so on that will help Illuminate become as full-featured as it needs to be.

This library has been in our production pipeline for several months now, reporting on cluster density, quality, and yield so we can keep tabs on sequencing run quality in an automated fashion.

If you use it, or you have questions about it, please comment here and let me know!

Cheers,
Naomi
iamh2o is offline  
Old 09-16-2013, 11:06 AM   #9
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

Quote:
Originally Posted by mchen1 View Post
Hi earonesty,

The CPAN module looks to be nice, but I think our scripts meet different needs. There is certainly room for both. The IlluminaSAV module appears to parse the InterOp data into perl arrays. This would seem ideal for someone working in perl who may want to access the InterOp data for further manipulation. On a separate note (perhaps a feature request?), there is a new InterOp file called IndexMetricsOut.bin that describes index metrics and does not appear to be in the IlluminaSAV perl module documentation. If you need the binary format for this file in order to update your perl module, shoot me a PM or email with your email address, and I can send you our latest documentation.

The scripts I've written here are designed to simply convert InterOp data into flat files, and do so as efficiently as possible. I designed them for those wanting to parse the InterOp data for entry into a LIMS system, for example. Since most LIMS are custom-built, flat files seem to be the most universally accepted format for the data. The perl code is also sent without module packaging so that users can see how I parse the binary files in case they want to integrate the code into their own perl work. The perl code also comes without dependencies on other modules so it works out of the box with any modern perl installation (I personally dislike having to install perl modules, especially in the context of group IT policies). In any case, the goal of the two packages is the same, but it would seem our design parameters differ.

Hopefully this discussion can illuminate how the packages differ in case users are deciding between the two.

M
1. I would be interested in the IndexMetrics file documentation (erik at q32.com is fine)

2. I would also like to try out your code (same email)

3. The LibXML reader is for parsing the RunInfo.xml into a perl hash. Other than that the module is core. Somehow I thought it would be better just to do that right.

4. Extraction is fast because usually our apps don't need all the data... many programs are just looking for maximum values, etc. (Our LIMS only gets quantile scores per cycle for example.)
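
The selective extraction in point 4 can be sketched as a streaming reduction: read records one at a time and keep only a running summary instead of storing everything. This Python example is an assumption-laden illustration, not code from Bio::IlluminaSAV. The record layout (cycle as uint16, value as float32) and the function `max_per_cycle` are hypothetical.

```python
import io
import struct

# Hypothetical record layout: cycle uint16, value float32, little-endian.
RECORD = struct.Struct("<Hf")


def max_per_cycle(stream):
    """Return {cycle: maximum value} without keeping every record in memory."""
    best = {}
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # end of data
        cycle, value = RECORD.unpack(chunk)
        if cycle not in best or value > best[cycle]:
            best[cycle] = value
    return best


# Demo on synthetic records.
rows = [(1, 0.5), (1, 2.0), (2, 1.5), (2, 0.25)]
demo = b"".join(RECORD.pack(c, v) for c, v in rows)
summary = max_per_cycle(io.BytesIO(demo))
print(summary)
```

Memory use grows with the number of cycles rather than the number of records, which is the main benefit of this approach.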
earonesty is offline  
Old 09-16-2013, 11:49 AM   #10
mchen1
Member
 
Location: San Diego, CA

Join Date: May 2013
Posts: 10
Default

Erik, I've emailed the packages to your email address.

Regarding #4, it's nice that you are able to speed up data extraction by not parsing all the data. Many times this type of curation summarizes run quality well. The packages I send out have the goal of simply providing all of the data. This leaves it up to the user to decide on what numbers to input into their LIMS.

Thanks for your post.
mchen1 is offline  
Old 09-19-2014, 03:38 AM   #11
Florent
Junior Member
 
Location: Rennes, France

Join Date: Sep 2014
Posts: 2
Default

Hi Mchen

Are you still there?
I'm new here, and it seems that new members cannot send PMs.
So I am trying to contact you by replying to this old thread.

Your post and its comments about direct usage of interop files are very interesting and promising.

I'd like to try your parsers in R and Perl.
I briefly tested Bio::IlluminaSAV and Illuminate, but I would prefer a global dump of the InterOp data to integrate into my QC pipeline.

I hope you can help me and contact me (in PM or in this thread).

Regards
Florent is offline  
Old 09-19-2014, 06:30 AM   #12
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

"Global dump" is kind of ambiguous. What format do you want?

Bio::IlluminaSAV can be used to make a "dump" by using JSON or YAML or whatever, and then dumping each metric to a file.
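
The same one-file-per-metric dump can be sketched in Python for readers not working in perl. This is an illustration only: the `parsed` dictionary is a made-up stand-in for whatever a parser returns, and `dump_metrics` is a hypothetical helper, not part of Bio::IlluminaSAV or any other library.

```python
import json
import os
import tempfile


def dump_metrics(metrics, out_dir):
    """Write each metric to its own JSON file and return the paths written."""
    paths = []
    for name, rows in sorted(metrics.items()):
        path = os.path.join(out_dir, name + ".json")
        with open(path, "w") as handle:
            json.dump(rows, handle, indent=2, sort_keys=True)
        paths.append(path)
    return paths


# Made-up stand-in for parser output; real field names will differ.
parsed = {
    "tile": [{"lane": 1, "tile": 1101, "value": 0.5}],
    "quality": [{"lane": 1, "cycle": 1, "q30": 95}],
}

out_dir = tempfile.mkdtemp()
written = dump_metrics(parsed, out_dir)
print([os.path.basename(p) for p in written])
```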
earonesty is offline  
Old 09-19-2014, 10:52 AM   #13
mchen1
Member
 
Location: San Diego, CA

Join Date: May 2013
Posts: 10
Default

Yes, still here. I will PM you, Florent.
mchen1 is offline  
Old 09-19-2014, 12:44 PM   #14
Florent
Junior Member
 
Location: Rennes, France

Join Date: Sep 2014
Posts: 2
Default

Quote:
Originally Posted by earonesty View Post
"Global dump" is kind of ambiguous. What format do you want?

Bio::IlluminaSAV can be used to make a "dump" by using JSON or YAML or whatever, and then dumping each metric to a file.
I agree with you, it's ambiguous.
When I wrote my post, I didn't know exactly what kind of data (and format) I could obtain from these parsers.
I wanted to convert the InterOp files into non-binary files to get direct access to the data.

After half a day of work, I understand the InterOp files and Bio::IlluminaSAV better, and I have written some code.
I need to do more work, but I am getting data, and I plan to store it (maybe filtered) in XML files.
I'm still thinking about how to manage the data (keep all of it or only the useful part), depending on the next steps of my future quality control pipeline.

Thank you for your comment.
Florent is offline  
Old 09-24-2014, 11:12 AM   #15
mchen1
Member
 
Location: San Diego, CA

Join Date: May 2013
Posts: 10
Default

jwater, you sent me a PM asking for the InterOp parsers, but you have set your account to reject PMs, and you left no contact information or way for me to reply. I can't help you if I can't reach you.
mchen1 is offline  
Old 07-15-2015, 12:26 PM   #16
mchen1
Member
 
Location: San Diego, CA

Join Date: May 2013
Posts: 10
Default

Archana91, please either send me your email address, or turn on private messaging on SEQanswers. You sent me a note to ask for the scripts, but you have disabled private messaging and thus I have no way of contacting you.

For all other readers, it is best to send me an email via SEQanswers. You may also PM me, but make sure you have private messaging turned on. Or you can include your own email address in your message. If you don't do one of these, then I am unable to reply to you.

Thanks,
mchen1
mchen1 is offline  
Old 02-09-2016, 07:48 AM   #17
ploverso-pgdx
Junior Member
 
Location: Baltimore

Join Date: Feb 2016
Posts: 3
Default

mchen1, there is an R parser on Bioconductor which will probably serve your purpose:
https://www.bioconductor.org/package...html/savR.html
ploverso-pgdx is offline  
Old 02-19-2016, 10:24 AM   #18
mchen1
Member
 
Location: San Diego, CA

Join Date: May 2013
Posts: 10
Default Closing out this thread...

Illumina have now made available open-source libraries for parsing and extracting information from the InterOp files. Please see this thread: http://seqanswers.com/forums/showthread.php?t=66342

The libraries will be updated along with new releases of RTA software, and are backwards-compatible back to the GAs.

The parsers in this thread were only compatible with RTA versions 2.7 and below. Because of the availability of these open source libraries, I am retiring these parsers.

Cheers,
Menzies
mchen1 is offline  
Old 02-19-2016, 10:27 AM   #19
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,766
Default

I am going to close this thread based on the note in the last post by @mchen1.

No new posts can be added to this thread.

Last edited by GenoMax; 02-19-2016 at 10:30 AM.
GenoMax is offline  

Tags
illumina, interop, perl, rproject
