The research group I'm in has started working with NGS data -- Illumina reads for the moment, but we'll be working with AB SOLiD reads soon. We're in a university setting, trying to determine the best system setup for our work. We have at our disposal a department network of about 50 machines with a handful of network drives. Each machine has its own disk, but unlike the network drives, they are not backed up.
NGS data presents some interesting challenges for us: our initial runs on ~20GB worth of sequence files took 40 hours to process, generating up to ~280GB of output. We figure if we parallelize our jobs across 50 machines and use the local drives, we can reduce the run time to less than 1 hour and the output to about 6GB per machine.
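For what it's worth, the estimate above assumes near-linear scaling with an even split of work; a quick back-of-the-envelope sketch (the numbers are just the ones from this post, and real jobs rarely scale perfectly due to I/O and scheduling overhead):

```python
# Back-of-the-envelope check of the parallelization estimate above.
# Assumes the workload divides evenly and scales linearly across machines.

total_hours = 40         # observed single-run processing time
total_output_gb = 280    # observed total output size
machines = 50

hours_per_machine = total_hours / machines        # 0.8 h, i.e. under 1 hour
output_per_machine = total_output_gb / machines   # 5.6 GB, i.e. about 6 GB

print(f"~{hours_per_machine:.1f} h and ~{output_per_machine:.1f} GB per machine")
```

In practice the wall-clock gain will be somewhat less than 50x, since distributing input, merging results, and contention on shared storage all add overhead.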
Before we make a formal request to our sysadmins, I'm curious how other groups manage these large files. Do you have your own dedicated systems? Do you use a tool such as Hadoop to parallelize jobs? How much data do you typically work with, and how do you manage data from multiple sequencing runs?
I would appreciate any thoughts you care to share, especially if there are questions I should have asked, but didn't.