  • elek-tron
    Junior Member
    • Aug 2012
    • 2

    Local Galaxy concept system: hardware spec questions

    Hi all,

    I have a couple of questions about the hardware requirements for a server that we intend to buy and use as a concept machine for NGS-related jobs. It will be used for developing tools and workflows (with Galaxy, of course) and as a platform for some "alpha" users, who need to learn to work with the NGS data they have just begun to generate.
    This concept phase is planned to last 1-2 years. During this time main memory and especially storage could be extended, the latter on a per-project basis. We will start with a small team of 3 people who support and develop Galaxy and the system according to the users' requirements, and the first group of users will bring in data, scientific questions and hands-on work on their own data. The main task (in terms of system load) will be sequence alignment (BLAST, mapping tools like BWA/Bowtie), followed perhaps by some experimental sequence clustering/de novo assembly for exome data. Variant detection in some form is also planned. Only active projects will be stored locally; data no longer in use will be stored elsewhere on the network.
    So much for the setting. The intended specs are:

    - dual-CPU mainboard
    - 256 GB RAM
    - 20-30 TB HDD @ RAID6 (data)
    - SSDs @ RAID5 (system, tmp)

    Due to funding limitations, RAM may have to be reduced to 128 GB. It is also still open whether the budget will cover the SSD set in RAID5; we may have to settle for just two SSDs in RAID1.

    What we are trying to find out is where, in the tasks described, the machine would run into bottlenecks. It is pretty clear that I/O is everything, even from a theoretical point of view. But we also observed that on a comparable machine (2x 3.33 GHz Intel 6-core, 100 GB RAM, 450 MB/s R/W to data RAID6).
    The question of questions, right at the start of configuring a system, is whether to go for an AMD or an Intel architecture. The former offers more cores (8-12) at a lower frequency (~2.4 GHz), the latter fewer cores (6) at a higher frequency (~3.3 GHz). According to the data sheets, the Intel CPUs are ~30% faster per core on integer operations and ~50% faster on floating point. The risk we see with the AMDs is, on the one hand, that the number of cores per socket could saturate the memory controller and, on the other hand, that jobs which cannot be parallelized, or only poorly, will take longer.
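One rough way to weigh the two options is Amdahl's law, which bounds the speedup of a job by its serial fraction. The sketch below is only an illustration under loud assumptions: 6 cores at a per-core speed of 1.3 (taken from the ~30% integer figure above) versus 12 cores at speed 1.0, hypothetical parallel fractions, and no memory-controller or I/O limits at all. The function name `relative_runtime` is made up for this example.

```python
def relative_runtime(parallel_fraction, cores, per_core_speed):
    """Amdahl's law: runtime relative to a single core of speed 1.0.

    The serial part runs on one core, the parallel part is split
    evenly across all cores; both scale with per-core speed.
    """
    serial = 1.0 - parallel_fraction
    return (serial + parallel_fraction / cores) / per_core_speed


# Hypothetical parallel fractions; real tools have to be measured.
for p in (0.5, 0.9, 0.99):
    fewer_fast = relative_runtime(p, cores=6, per_core_speed=1.3)
    more_slow = relative_runtime(p, cores=12, per_core_speed=1.0)
    winner = "6 fast cores" if fewer_fast < more_slow else "12 slower cores"
    print(f"p={p:.2f}: 6x1.3 -> {fewer_fast:.3f}, 12x1.0 -> {more_slow:.3f} ({winner})")
```

Under these assumptions the higher core count only pays off once a job is highly parallel (a parallel fraction of roughly 0.84 or more); poorly parallelized jobs favour the faster cores, which matches the second risk described above.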

    To bring all this to some distinct questions (don't feel forced to answer all of them):

    1. Using the described bioinformatics software: where are the potential system bottlenecks? (connections between CPUs, RAM, HDDs)

    2. What is the expected ratio of integer to floating-point calculations loading the CPU cores?

    3. Regarding the architectural differences (strengths, weaknesses): Would an AMD- or an Intel-System be more suitable?

    4. How much I/O (read and write) can be expected at the memory controllers? Which tasks are most I/O intensive (regarding RAM and/or HDDs)?

    5. Roughly divided into mapping and clustering jobs: how much main memory can a single job be expected to require (given e.g. Illumina exome data, 50x coverage)? As far as I know, mapping should need around 4 GB and clustering much more (possibly high double digits).

    6. HDD access (R/W) is mainly in bigger blocks instead of masses of short operations - correct?

    All these questions are a bit rough and improvised (yes, it IS a bit chaotic at the moment, sorry for that), but any clue on even a single question would help. "Unfortunately" we got the money to order our own hardware unexpectedly quickly, and we are now forced to act. We want to make as few cardinal errors as possible...

    Thanks a lot in advance,

    Sebastian
  • xied75
    Senior Member
    • Feb 2012
    • 129

    #2
    3. AMD/Intel: go for the larger core count (Intel Hyper-Threading does not give you real cores).

    6. HDD access: what matters is whether it is sequential or random (block size and queue depth are minor factors). On a mechanical hard drive the head needs time to move from one track to another, so even a top-range HDD can only do about 100+ random seeks per second; multiply that by 4k/8k/64k and you get the throughput.
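To make that multiplication concrete, here is a minimal sketch of the arithmetic. It assumes exactly 100 random operations per second (the "100+" above rounded down) and treats 4k/8k/64k as KiB; the function name is made up for illustration.

```python
def random_throughput_mib_s(iops, block_size_kib):
    """Throughput of purely random access: operations/s times bytes per operation."""
    return iops * block_size_kib / 1024.0


# Block sizes from the post, at an assumed 100 random seeks per second.
for block_kib in (4, 8, 64):
    mib_s = random_throughput_mib_s(100, block_kib)
    print(f"{block_kib:>2} KiB blocks: {mib_s:.2f} MiB/s")
```

Even with 64 KiB blocks a single drive ends up at a few MiB/s for random access, far below the 450 MB/s R/W figure quoted for the RAID6 in the first post.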

    BWA/SAMtools/GATK are all designed for sequential access (i.e. streaming).

    SSDs might change this situation, so that random access becomes as fast as sequential. You could consider using some SSDs as a staging/fast scratch area: slow main storage -> SSD (run programs, hold temp files, hold anything that needs a random access pattern) -> write back to main storage. Personally, I would do a RAID 0 with the SSDs.
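The staging idea could look roughly like the sketch below. It is an assumption-laden illustration, not a Galaxy feature: `run_staged`, the `job` callable and the paths passed in are all hypothetical, and a real setup would also need error handling and checks for free scratch space.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(src, dest, scratch_root, job):
    """Copy src to fast scratch, run job(staged_in, staged_out) there,
    then copy the result back to dest on main storage."""
    src, dest = Path(src), Path(dest)
    # The temporary work directory is removed again when the block exits.
    with tempfile.TemporaryDirectory(dir=scratch_root) as work:
        staged_in = Path(work) / src.name
        staged_out = Path(work) / ("out_" + dest.name)
        shutil.copy2(src, staged_in)    # slow main storage -> SSD scratch
        job(staged_in, staged_out)      # temp files and random I/O stay on scratch
        shutil.copy2(staged_out, dest)  # write the result back to main storage
    return dest
```

Here `job` would wrap whatever tool is run (for example a call to a mapper), reading `staged_in` and writing `staged_out`; only the two sequential copies touch the slow array.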


    • maubp
      Peter (Biopython etc)
      • Jul 2009
      • 1544

      #3
      Duplicate email thread on the galaxy-devel mailing list:


      • elek-tron
        Junior Member
        • Aug 2012
        • 2

        #4
        Originally posted by maubp
        Duplicate email thread on the galaxy-devel mailing list:
        http://lists.bx.psu.edu/pipermail/ga...st/010759.html
        Yes, thanks, I should have mentioned that.
        I posted in both the forum and the dev list because I don't expect the forum members and the dev-list subscribers to be 100% identical...

        Sorry for any inconvenience...
