  • Seeking suggestions/recommendations for storage I/O for NGS analysis

    I realize that this topic (of hardware recommendations) is a bit of an oldie-but-goodie; however, most of the threads I found are several years old and are likely of less value now, as technology has improved over time.

    Also, my question is more narrowly focused: I'm specifically interested in suggestions on I/O transfer speed, as it is generally acknowledged that this is one of the major bottlenecks for NGS analysis.

    To explain, my group is exploring developing a centralized bioinformatics compute & storage cluster to support several labs doing NGS analysis at our university. At this point, we have a decent idea of our compute requirements; however, as storage is the more expensive of the two, I'm seeking suggestions on I/O transfer speed, since this is one of the major cost factors for the drives we are considering.

    Currently, we're exploring a 3-4 tiered approach, where the tiers would be:
    1. (fastest) : storage local to the compute nodes, used solely for the temp files generated by GATK (see the sketch after this list)
    2. (faster) : NAS drive for recently sequenced samples (or those currently being (re-)analyzed)
    3. (slower) : secondary NAS drive for legacy files/samples, e.g. FASTQs for samples that already have BAMs generated
    4. (slowest): TBD if we include this, but optical/tape/other for archival purposes
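
    For tier 1, the practical knob is where GATK writes its temp files. A minimal Python sketch of what we have in mind (the scratch path, file names, and invocation are hypothetical, assuming a GATK 3-era Java jar):

        import os
        import subprocess

        # Hypothetical node-local scratch mount (tier 1).
        scratch = "/local/scratch/tmp"
        os.makedirs(scratch, exist_ok=True)

        # GATK 3.x runs as a Java jar; java.io.tmpdir controls where its
        # temporary files land. (GATK 4 later added a --tmp-dir option.)
        subprocess.run(
            ["java", f"-Djava.io.tmpdir={scratch}",
             "-jar", "GenomeAnalysisTK.jar",
             "-T", "HaplotypeCaller",
             "-R", "ref.fa", "-I", "sample.bam", "-o", "sample.vcf"],
            check=True,
        )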


    I've seen others suggest this framework, or something similar; however, it's not clear to us how fast each tier needs to be, given the current state of technology, circa December 2014. Currently, our IT staff has recommended an Isilon array from EMC; while it is a strong product, the cost is prohibitive, and we're curious whether anyone else can make recommendations.

    Any suggestions would be greatly appreciated.

  • #2
    Since we have a large central IT facility, I am fairly far removed from the actual hardware details at my institution -- we just funnel money their way and they do the purchasing/support. That said, we've had problems with central IT's Isilon (which serves a large swath of the University) -- occasionally it just goes out to lunch and all of our jobs go into an I/O wait state. Ugly when that happens. Recently, central IT purchased an "... enterprise-class GPFS storage solution ... built on Data Direct Networks' SFA12k storage platform ..." specifically for research computing. Our group has yet to be transferred over to the new system, so I cannot say how well it works compared to the Isilon, but it may be something to look into, although it too is probably cost-prohibitive.

    As for your 3-4 tiers, I get by with 3: (1) on-node local, (2) NAS [Isilon] for recent and legacy files, and (3) tape for backup/archival. I am not a fan of having separate NASes for 'current' and 'legacy' projects. It is interesting how often you'll think a project is long gone, only to see renewed interest in it years later.



    • #3
      We got an offer from SQREAM; they develop a fast GPU-based SQL server backed by Infinidat storage.
      They have a niche for biological data (mainly BAM files). It's costly but might be a long-term solution.
      Another 2 cents - I don't see a reason to keep FASTQ files once they're mapped; all the data is in the BAM file, and the reads can be regenerated from it (see the sketch below).
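
      To illustrate, a minimal Python sketch of regenerating paired-end FASTQ from a BAM with samtools (file names are hypothetical; for correct pairing the BAM should be name-collated first, e.g. with samtools collate):

          import subprocess

          # Recreate the reads from the alignments; in samtools 1.x this is
          # 'samtools fastq' (older releases called it 'bam2fq').
          subprocess.run(
              ["samtools", "fastq",
               "-1", "reads_1.fq", "-2", "reads_2.fq",
               "sample.bam"],
              check=True,
          )

      The one caveat is that this only works if the BAM really contains every read (unmapped ones included) with original base qualities, so check how the BAMs were produced before deleting the FASTQs.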



      • #4
        @pwaltman: Can you provide some additional information? How many users are you expecting, what size of storage are you looking at, and would this be connected to an HPC cluster or a small local cluster (# of cores/nodes)?

        I don't know what exact size of storage you looked at, but if you thought Isilon was cost-prohibitive at that level, then you are not going to have many other options. If I/O is a critical requirement (what minimum throughput do you need?), then only a small subset of vendors offer products in this space. When you balance cost against benefit (unless you have a big donor footing the bill for all this), the inconvenience your users would experience with a relatively slower storage solution -- in terms of additional time needed to complete jobs -- may not be that terrible.

        No vendor is perfect, and since this is an electromechanical device (you are not going to do all SSDs at this scale), you are going to experience a failure of some sort; it is just a matter of when. We have multiple Isilon arrays scattered across our sites and have generally had good luck with them (except for one hiccup on a large array, which may have been caused by a support contractor not following correct procedure when replacing some hardware, and which resulted in a multi-week downtime). To be fair, the array was over 90% full, and at the end of the day we probably lost only one file, which was remarkable.



        • #5
          Pardon the naive question, but if one has an SSD for all disk I/O, will that increase the speed of BLAST? I have a 12-core Mac Pro with 128 GB of RAM but some of my BLAST work takes more than a week to finish. If an SSD could cut that time in half, it would be worth the several hundred dollars for a 1 TB SSD purchase.



          • #6
            If you had a new Mac Pro (it sounds like you have the previous generation), then using an external Thunderbolt 2 enclosure (which in theory gives you 20 Gbps) would be the best option to maximize throughput.

            Are you using all cores for BLAST? Have you tried putting the database on a RAM disk to reduce the I/O dependence (see the sketch below)?
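
            Creating the RAM disk on OS X can be scripted; a minimal Python sketch (the size and volume name are arbitrary, and the BLAST paths in the comments are hypothetical):

                import subprocess

                # hdiutil takes the size in 512-byte sectors; 8 GiB here.
                sectors = 8 * 1024**3 // 512

                # Allocate an unmounted RAM-backed device...
                dev = subprocess.check_output(
                    ["hdiutil", "attach", "-nomount", f"ram://{sectors}"],
                    text=True,
                ).strip()

                # ...then format and mount it as /Volumes/RamDisk.
                subprocess.run(
                    ["diskutil", "erasevolume", "HFS+", "RamDisk", dev],
                    check=True,
                )

                # Copy the BLAST database over and point blastn at it, e.g.
                #   cp nt.* /Volumes/RamDisk/
                #   blastn -db /Volumes/RamDisk/nt -query queries.fa ...

            Note that the contents vanish on unmount or reboot, so the database has to be copied over again each time.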

            Internal SSDs are going to be limited by the bandwidth of the bus they sit on: SATA tops out at 6 Gbps (real-world speed will be less), while PCIe-attached storage can go considerably higher -- and if you don't have PCIe, SATA is your ceiling.

            Having said all that, you are going to reach a saturation point somewhere in your hardware that becomes the ultimate bottleneck. Finding a compute cluster, splitting your job into several files, and running them in parallel would be the most effective way to reduce total time (see the sketch below).
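
            For the splitting step, a minimal Python sketch (file names and chunk count are arbitrary; it reads the whole query file into memory, which is fine for typical query sets):

                from pathlib import Path

                def split_fasta(path, n_chunks):
                    """Round-robin FASTA records into n_chunks output files."""
                    records = Path(path).read_text().split("\n>")
                    # split() strips the '>' from all but the first record.
                    records = [r if r.startswith(">") else ">" + r
                               for r in records if r.strip()]
                    for i in range(n_chunks):
                        out = "\n".join(records[i::n_chunks]) + "\n"
                        Path(f"chunk_{i}.fa").write_text(out)

                split_fasta("queries.fa", 12)
                # Then run one BLAST process per chunk (e.g. one per core)
                # and concatenate the outputs afterwards.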



            • #7
              Thanks for your reply. Yes, it's a 2010 Mac Pro with SATA, not PCIe. The limiting factor seems to be the spinning hard drive (7200 rpm, 32 MB cache). I'd rather run all this on the Mac Pro for various reasons, so if an SSD would cut the time substantially (even by half), it would be a worthwhile investment.

              Edit - I'm not sure about the RAM disk. That would require copying all files over after each reboot?
              Last edited by Tony-S; 12-12-2014, 08:15 AM.



              • #8
                Can you comment on the two questions I asked about how you are running your BLAST?

                How many SATA ports do you have open? If you have more than one, then getting two or more SSDs (and using some sort of RAID) may increase throughput, but you will hit the limit of the SATA bus. Do you know whether you have 3 Gbps or 6 Gbps SATA? (See the sketch below for one way to check.)
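
                On OS X the SATA section of System Profiler reports the negotiated link speed; a minimal Python sketch that just wraps the command:

                    import subprocess

                    # SPSerialATADataType is the SATA section of the System
                    # Profiler report; its 'Link Speed' lines read 1.5, 3,
                    # or 6 gigabit per controller/drive.
                    report = subprocess.check_output(
                        ["system_profiler", "SPSerialATADataType"],
                        text=True,
                    )
                    for line in report.splitlines():
                        if "Link Speed" in line:
                            print(line.strip())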



                • #9
                  We have a 4-node Isilon setup, which performs quite well when configured properly (balancing access). No disks on the compute nodes (well, only for the OS, /tmp and swap), but plenty of RAM. Tape for backups and mid-term storage (simple regimes).

                  My experience is that you can gain a great deal simply by reducing the I/O load in software - many analysis pipelines aren't optimized for it. If you pipe more, skip writing temporary files (keep them in RAM or do piped sorts), and put more RAM in the nodes, you can go with cheaper storage (see the sketch below).
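
                  As an illustration of "pipe more", a minimal sketch of streaming an aligner straight into the sorter with no intermediate SAM on disk (tools and file names are hypothetical; assumes bwa and a samtools 1.x sort that accepts SAM on stdin - older samtools needs a 'samtools view -b -' stage in between):

                      import subprocess

                      # Align and stream SAM to stdout instead of a temp file.
                      bwa = subprocess.Popen(
                          ["bwa", "mem", "-t", "8",
                           "ref.fa", "reads_1.fq", "reads_2.fq"],
                          stdout=subprocess.PIPE,
                      )
                      # Sort from stdin ('-') and write the final BAM directly.
                      subprocess.run(
                          ["samtools", "sort", "-@", "4",
                           "-o", "sorted.bam", "-"],
                          stdin=bwa.stdout, check=True,
                      )
                      bwa.stdout.close()
                      if bwa.wait() != 0:
                          raise RuntimeError("bwa mem failed")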

                  Storage is expensive. However, when you buy it, also consider which solution is easier for you to manage - if you need to actively shuffle data among the tiers, you might end up with very poor I/O performance exactly where and when you need it.

                  If you have enough RAM to cache the BLAST databases, you won't gain much with SSDs. Most of BLAST is CPU time (unless you have way too little RAM and are swapping all the time). You can even warm the cache explicitly (see the sketch below).
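
                  A minimal sketch of warming the OS page cache by reading the database files once (the path is hypothetical; this only helps if the files fit in free memory):

                      from pathlib import Path

                      # Reading each database file once pulls it into the OS
                      # page cache; later BLAST runs then hit RAM, not disk.
                      for f in sorted(Path("/data/blastdb").glob("nt.*")):
                          with open(f, "rb") as fh:
                              while fh.read(1 << 20):  # 1 MiB at a time
                                  pass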

