  • "Optimal" System Setup?

    The research group I'm in has started working with NGS data -- Illumina reads for the moment, but we'll be working with AB SOLiD reads soon. We're in a university setting, trying to determine the best system setup for our work. We have at our disposal a department network of about 50 machines with a handful of network drives. Each machine has its own disk, but unlike the network drives, they are not backed up.

    NGS data presents some interesting challenges for us: our initial runs on ~20GB of sequence files took 40 hours to process and generated up to ~280GB of output. We figure that if we parallelize our jobs across 50 machines and use the local drives, we can cut the run time to under 1 hour and the output to about 6GB per machine (a rough sketch of what we have in mind is at the end of this post).

    Before we make a formal request to our sysadmins, I'm curious how other groups manage these large files. Do you have your own dedicated systems? Do you use a tool such as Hadoop to parallelize jobs? How much data do you typically work with, and how do you manage data from multiple sequencing runs?

    I would appreciate any thoughts you care to share, especially if there are questions I should have asked, but didn't.
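
    For concreteness, here is a rough sketch of the fan-out we have in mind; the hostnames, paths, and "my_aligner" command are placeholders rather than our actual pipeline:

    #!/usr/bin/env python
    """Toy sketch: fan one run out to ~50 department machines over ssh, writing
    results to each machine's local (non-backed-up) disk. Hostnames, paths, and
    'my_aligner' are placeholders, not a real pipeline."""
    import subprocess

    HOSTS = ["node%02d" % i for i in range(1, 51)]                            # hypothetical machine names
    CHUNKS = ["/netdrive/run01/chunk_%02d.fastq" % i for i in range(1, 51)]   # pre-split input on a network drive
    SCRATCH = "/scratch/ngs"                                                  # local scratch directory

    procs = []
    for host, chunk in zip(HOSTS, CHUNKS):
        # run the (placeholder) aligner on each node, keeping the output local
        cmd = "mkdir -p %s && my_aligner %s > %s/$(basename %s).out" % (
            SCRATCH, chunk, SCRATCH, chunk)
        procs.append(subprocess.Popen(["ssh", host, cmd]))

    for p in procs:   # wait for every node; gathering the per-node outputs comes later
        p.wait()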

  • #2
    We generate both SOLiD and 454 data, so our data sizes are comparable to yours. We have two central file servers, one of 24 TB (raw) and the other of 48 TB (raw). These are RAIDed to provide data redundancy and security. The compute nodes all use the file servers, which can cause network congestion and high I/O loads on the servers. Also, while each of the compute nodes has scratch space, I have found that the local scratch space is often not large enough or has not been cleaned out properly, so I often use the central servers as scratch space, which adds to the I/O load. Backing up the raw data and the final analysis remains a problem. In other words, our solution is not ideal, but it works.

    A problem with many parallel programs is that, while the work can be split up and run in many places, pulling together the resulting file(s) is often done by a single processor. This can slow down the overall pipeline (a toy illustration of that gather step is at the end of this post).

    Hadoop may solve some of the distributed file problems. If you use it then please give us a report.

    As for multiple runs: with the TBs of space we have, I haven't yet run out of space, though I do have to clean up the temporary files after each run. In other words, ~50GB of raw data expands to ~200GB of analysis, of which maybe ~10GB is useful; in the end each project/run takes up ~60GB of space. I figure that once we get up to a couple hundred runs, and thus 6 TB or so, we will go looking for more space.
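
    To make that gather step concrete, it usually amounts to one process doing something like the following (the paths here are invented):

    #!/usr/bin/env python
    """Rough illustration of the single-process gather step: concatenating per-node
    result files into one. Paths are invented for the example."""
    import glob, shutil

    parts = sorted(glob.glob("/data/run01/node_outputs/out_*.txt"))   # per-node result files
    with open("/data/run01/merged_results.txt", "wb") as merged:
        for part in parts:
            with open(part, "rb") as fh:
                shutil.copyfileobj(fh, merged)   # sequential I/O on one machine: the bottleneck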

    • #3
      We work with Illumina data. Before Illumina introduced Real Time Analysis, the images had to be transferred and analyzed separately. We are talking about 0.7-1.5 TB of images per run, which in the end produces about 8-20 GB of raw sequences (which are then aligned).
      I don't know the SOLiD pipeline, but for Illumina there is a lot of I/O on small files, which in practice limits how much of the theoretical parallelization you actually get. On a 16-CPU server we could in principle run up to 32 (or 33) processing jobs in parallel, but I/O becomes the limiting step even at 16 jobs (we can see jobs sitting in the 'D' state, waiting for I/O; a quick way to check for this is sketched at the end of this post).
      I've upgraded the firmware of our disk arrays (HP MSA60, XFS-formatted) to see if we gain anything...
      About backup... AFAIK the cost of a run is less than the cost of 1 TB of backup. We only back up the raw sequences, and possibly some BAM files for ready-to-use alignments, on a separate file server. We keep the images/intensities/temporary analysis files on the "local" disks only until we need the space or until we are sure we won't have to base-call again.
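
      If you want to watch for this on your own machines, a quick-and-dirty check (Linux /proc only, nothing sequencer-specific) looks like this:

      #!/usr/bin/env python
      """Count processes in uninterruptible sleep ('D'), i.e. waiting on I/O.
      Linux-only; reads /proc directly. Just a monitoring helper."""
      import glob

      blocked = []
      for stat in glob.glob("/proc/[0-9]*/stat"):
          try:
              data = open(stat).read()
          except IOError:                  # the process exited between glob() and read()
              continue
          # the state field is the first token after the ')' that closes the command name
          state = data.rsplit(")", 1)[1].split()[0]
          if state == "D":
              blocked.append(stat.split("/")[2])   # keep the PID
      print("%d processes in 'D' state: %s" % (len(blocked), " ".join(blocked)))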

      • #4
        Originally posted by dawe
        About backup... AFAIK the cost of a run is less than the cost of 1 TB of backup. We only back up the raw sequences, and possibly some BAM files for ready-to-use alignments, on a separate file server.
        I hear this a lot and I suspect it is an urban myth started by Illumina to encourage labs to throw away their data so they will have to buy more reagents from Illumina to re-run the experiment.

        A Quantum SuperLoader3 with 1 LTO-4 drive and 16 tape slots is $4,611 (from CDW-G). LTO-4 tapes (800GB native, 1.6TB compressed, probably ~1.0TB real world) cost $50-60 each. The capital cost of the tape robot is less than one run, and the incremental cost of the tapes is negligible compared to the cost of an Illumina (or SOLiD) run (a back-of-the-envelope calculation is at the end of this post).

        We keep images (0.7-3.5TB per run) on tape for 60-90 days, just in case there is some question about the run. We keep intensity information (a few hundred GB per run) for 1 year. Base calls and alignment data (tens of GB per run) we will keep indefinitely.
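
        To spell the arithmetic out in one place, using the rough prices above (the 2 TB image set is just an example size):

        #!/usr/bin/env python
        """Back-of-the-envelope tape costs, using the rough figures quoted in this
        post rather than current list prices."""
        loader_cost = 4611.0     # Quantum SuperLoader3, 1 LTO-4 drive, 16 slots
        tape_cost = 55.0         # mid-point of the $50-60 LTO-4 tape price
        tape_capacity_tb = 1.0   # ~real-world capacity per tape (0.8 TB native)

        run_size_tb = 2.0        # e.g. the images from one run (0.7-3.5 TB)
        tapes_per_run = run_size_tb / tape_capacity_tb
        print("tapes to archive one run's images: %.0f ($%.0f)"
              % (tapes_per_run, tapes_per_run * tape_cost))
        print("cost per TB with the loader amortised over its 16 slots: $%.0f"
              % ((loader_cost + 16 * tape_cost) / (16 * tape_capacity_tb)))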

        • #5
          Originally posted by westerman
          Hadoop may solve some of the distributed file problems. If you use it then please give us a report.
          One of my lab mates has used Hadoop to parallelize BLAT alignments. I'm not familiar enough with it to give a report, but I'll see if I can get him to share his experiences with it.
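
          For what it's worth, my rough guess at the shape of it is below; this is not his actual code, and the paths and reference database are invented:

          #!/usr/bin/env python
          """Sketch of a Hadoop Streaming mapper for BLAT: each input line names one
          FASTA chunk to align. This is a guess at the approach, not my lab mate's code.
          It would be launched with something like:
            hadoop jar hadoop-streaming.jar -input chunk_list.txt -output blat_out \
                -mapper blat_mapper.py -file blat_mapper.py
          """
          import subprocess, sys

          DB = "/shared/genomes/hg18.2bit"   # hypothetical reference database

          for line in sys.stdin:
              chunk = line.strip()
              if not chunk:
                  continue
              psl = chunk + ".psl"
              # basic BLAT usage: blat <database> <query> <output.psl>
              subprocess.check_call(["blat", DB, chunk, psl])
              # emit a key/value pair so Hadoop records which chunk produced which PSL
              print("%s\t%s" % (chunk, psl))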
