  • Need help with NGS hardware upgrade

    Hi there,
    I know this question comes up every now and then and is ultimately hard to answer, but we have no sysadmin at hand with enough NGS experience, and we need to spend some money.

    We need to / would like to upgrade our NGS throughput quite significantly. Currently, the only suitable sequencer for us seems to be the new NovaSeq, because: HiSeqs will not be delivered anymore from the middle of the year, so support for their chemistries will probably also stop sooner rather than later. PacBio seems a little risky since Roche stopped its support. Genia/Oxford Nanopore are no real options as they are still in some kind of alpha/beta stage.

    We want to sequence something in the range of 40 human genomes per month at 30x, so we will have something like 20TB of data to process per month. Because variant calling is probably the computationally most expensive part, there is no real need to consider anything else here (transcriptome, methylome, etc.), is there? So the main question would be: what kind of infrastructure do we need for this? Is a cluster really required here, or would something with a lower maintenance demand also suffice? We also have the possibility of using the HPC at the local university occasionally, so we could perform the computationally heaviest tasks there and do the rest on our local "whatever". Is this realistic, or are we going to spend more time sending data around than analyzing it?
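
    To sanity-check that 20TB figure, here is my back-of-the-envelope math (the bytes-per-base factors are rough assumptions, not measured values):

    Code:
    # Rough per-sample and monthly footprint for 30x human WGS.
    # Bytes-per-base factors are ballpark assumptions.
    GENOME_BASES = 3.2e9          # haploid human genome
    COVERAGE = 30
    SAMPLES_PER_MONTH = 40

    bases = GENOME_BASES * COVERAGE            # ~96 Gbases per sample
    fastq_gz = bases * 0.45 / 1e12             # gzipped FASTQ, TB
    bam = bases * 1.0 / 1e12                   # sorted BAM, TB
    scratch = bases * 3.5 / 1e12               # pipeline intermediates, TB

    per_sample = fastq_gz + bam + scratch      # ~0.48 TB per sample
    print(f"per sample: {per_sample:.2f} TB")
    print(f"per month : {per_sample * SAMPLES_PER_MONTH:.0f} TB")   # ~19 TB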

    Any ideas are highly appreciated!

    Btw: We are in Germany and working with human tumor patient samples. Hence, data protection is something we need to consider critically at every step. Cloud computing is therefore probably not an option, not even if it is a private cloud (maybe if the data were guaranteed to stay in Germany, but I'm not aware of a company that can make such a guarantee).

    Thanks for reading, and for any comments!

  • #2
    Originally posted by WhatsOEver
    Hi there,
    Currently, the only suitable sequencer for us seems to be the new NovaSeq, because: HiSeqs will not be delivered anymore from the middle of the year, so support for their chemistries will probably also stop sooner rather than later.
    I don't think that is the case. Illumina may (not confirmed) stop selling the HiSeq 2500/3000 at that time, but other HiSeqs (e.g. the HiSeq 4000) would certainly still be shipping. If you want to justify getting a NovaSeq using that reason, go right ahead; we won't tell.

    Illumina still sells reagents for the GAIIx, so reagent support is not likely to stop any time soon either.

    So the main question would be: what kind of infrastructure do we need for this? Is a cluster really required here, or would something with a lower maintenance demand also suffice? We also have the possibility of using the HPC at the local university occasionally, so we could perform the computationally heaviest tasks there and do the rest on our local "whatever". Is this realistic, or are we going to spend more time sending data around than analyzing it?
    You would definitely benefit from having access to a cluster. That way you can multitask (e.g. pre-process data from one flowcell while you are aligning two others and calling SNPs on something else; you get the idea).

    Having managed an entire IT infrastructure in house and then switched to using shared resources, I have seen the whole spectrum. As long as your central IT provides reliable and responsive services, I suggest you look into collaborating with them. Sysadmin work and keeping systems secure require a professional's touch, and it is best to leave that to professionals so you can focus on doing science.

    If the network links are reliable, you could collect data from whichever sequencer you select onto network storage (gigabit links are fine); that storage could be provided by your central IT, or you could set something up locally and transfer the data to central processing offline.
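
    As a quick feasibility check on the gigabit link, here is a rough transfer-time estimate; the run size and link efficiency are assumptions to replace with your own numbers:

    Code:
    # How long to move one flowcell's output over a gigabit link?
    run_tb = 3.0                  # hypothetical ~3 TB high-output run
    link_gbps = 1.0               # gigabit Ethernet
    efficiency = 0.7              # protocol overhead, shared traffic

    mb_per_s = link_gbps * 1e3 / 8 * efficiency        # ~87 MB/s effective
    hours = run_tb * 1e6 / mb_per_s / 3600
    print(f"~{hours:.0f} h per {run_tb:.0f} TB run")   # ~10 h, i.e. overnight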

    You would certainly need to provision access to adequate storage (some fast, some slow) to manage the data efficiently. Figure on keeping 3-4 months' worth of data on disk before moving it to long-term storage (e.g. tape). If a user does not come back asking for data within that period, you are not likely to need it any time soon.
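
    At the volumes quoted in the original post, that retention policy works out roughly as follows (the headroom factor is my assumption):

    Code:
    # On-line tier sizing for a 3-4 month retention window.
    monthly_tb = 20               # from the original post
    months_on_disk = 4
    headroom = 1.25               # assumed slack for re-runs and growth

    usable_tb = monthly_tb * months_on_disk * headroom
    print(f"provision ~{usable_tb:.0f} TB usable across the fast and slow tiers")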

    Feel free to ask if you have additional questions.



    • #3
      Also in Germany, so I'm familiar with why clouds are largely a no-go. Having said that, talk to the GWDG. They provide cloud support for all of us who are forbidden from using any service from a company outside the EU (i.e., all of the big ones). We're not using this since we already have our own cluster, but if you're basically just running a single pipeline, then this might be a nice way to go.

      Definitely look into a cluster. We're lucky enough to have one IT person dedicated to our core facility, so he takes care of most of the pure sysadmin stuff and then gives me appropriate rights to handle everything else. That keeps things reasonable for me (I spend <5% of my time dealing with this sort of thing unless Galaxy is acting up).



      • #4
        Make sure to do some testing with actual data on the hardware of your choice!

        I assume that you have already set up your pipeline and know its computational requirements; otherwise, make sure to do that first.

        Networking:
        Think about unmanaged 10G for the cluster/storage interconnect (if on a budget).

        Storage: use NAS + DAS + SSDs for references.
        If there is some in-house cluster resource available, give it a try, but be prepared to spend some money on dedicated, speedy NAS storage. With current workflows I would suggest having at least 500GB of workspace per sample. Make sure your working array is RAID 10, and DO NOT USE SMR (Shingled Magnetic Recording) HDDs for scratch storage (like the 8TB Seagate Archive)!
        Reference databases are best kept on SSDs.
        If budget permits, go for all-flash.
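
        A rough scratch-sizing sketch, using the 500GB workspace figure from above (the number of samples in flight is an assumption):

        Code:
        # Scratch array sizing: per-sample workspace x samples in flight, RAID 10.
        workspace_gb = 500        # suggested workspace per sample (see above)
        in_flight = 8             # assumed samples processed concurrently

        usable_tb = workspace_gb * in_flight / 1000     # 4 TB usable scratch
        raw_tb = usable_tb * 2                          # RAID 10 mirrors everything
        print(f"{usable_tb:.0f} TB usable -> {raw_tb:.0f} TB raw, non-SMR drives")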

        Servers/worker nodes:
        If you end up buying your own servers, then have at least 256GB (better 512GB) of DDR4 RAM per node; 3.2GHz 8-core Xeons are quite good on value/performance, and go for dual-socket systems. Make sure your servers are AT LEAST 2U high (3U-4U is better); 1U boxes overheat, are extremely loud, and waste a lot of power (25-30%) pushing air through tiny fans.

        PS: When parallelising, work at the highest level possible: use each node to process a single sample from fastq->bam->vcf (to the end), rather than trying to divide each step across the nodes and checkpoint in between. Use a node's own DAS when possible (far less load on the network and better scalability that way).
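
        A minimal sketch of that one-node-one-sample pattern, assuming a typical bwa/samtools/GATK pipeline; the paths, thread counts, and flags are illustrative assumptions, not a tested setup:

        Code:
        #!/usr/bin/env python3
        # One node, one sample: fastq -> bam -> vcf, end to end, on local disks.
        # Tools (bwa/samtools/gatk) are the usual suspects; paths are assumptions.
        import pathlib, subprocess, sys

        def run(cmd):
            print("+", cmd)
            subprocess.run(cmd, shell=True, check=True)

        sample = sys.argv[1]                                # e.g. "patient042"
        scratch = pathlib.Path("/local/scratch") / sample   # node-local DAS, not NFS
        scratch.mkdir(parents=True, exist_ok=True)
        ref = "/refs/GRCh38.fa"                             # reference on local SSD

        bam = scratch / f"{sample}.sorted.bam"
        vcf = scratch / f"{sample}.vcf.gz"

        # align and sort in one pipe, entirely on the node's own storage
        run(f"bwa mem -t 16 {ref} /data/{sample}_R1.fq.gz /data/{sample}_R2.fq.gz"
            f" | samtools sort -@ 8 -o {bam} -")
        run(f"samtools index {bam}")
        run(f"gatk HaplotypeCaller -R {ref} -I {bam} -O {vcf}")

        # only the final results travel back over the network
        run(f"rsync -a {vcf} {vcf}.tbi /results/{sample}/")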

        PPS: Be prepared to do a lot of de novo work in 3-5 years' time.



        • #5
          Originally posted by WhatsOEver
          So the main question would be: what kind of infrastructure do we need for this? Is a cluster really required here or would something with a lower maintenance-demand also suffice?
          As suggested by Markiyan, I strongly agree that you should benchmark your intended pipeline so you have an approximate estimate of your computational requirements. Once you get to the point where you need to scale over multiple machines, a cluster is easier to administer than a set of independent machines; a cluster is, in essence, just a bunch of machines with some job-queuing software to manage the load. Given the volume of data you have, a single box would need to be at least a 4-socket server, which isn't a cost-effective proposition.
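
          For example, the extrapolation could look like this; every input below is a placeholder to be replaced with your measured values:

          Code:
          # Extrapolate node count from one benchmarked sample.
          cpu_hours_per_sample = 100    # measured fastq -> vcf CPU-hours at 30x
          samples_per_month = 40
          cores_per_node = 16
          utilisation = 0.7             # realistic scheduling/IO efficiency

          needed = cpu_hours_per_sample * samples_per_month      # CPU-hours/month
          available = 30 * 24 * cores_per_node * utilisation     # per node/month
          print(f"~{needed / available:.1f} nodes of {cores_per_node} cores")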

          If you have access to an external cluster, that is definitely the lowest-maintenance solution, but it does run the risk of long turnaround times for your jobs. My local inter-institute cluster is fully utilised, and it is not uncommon for a job to take two weeks just to start running; not a good scenario if you are intending to make clinical decisions based on your data.

          Originally posted by Markiyan
          If budget permits, go for all-flash.
          We've had very good results with tiered software-managed arrays. The performance is close to all-flash (since we have enough SSD capacity to keep the active working set in flash), but with much higher capacity for the same price.
          Last edited by dcameron; 01-25-2017, 04:03 PM.



          • #6
            Be sure to thoroughly examine NovaSeq data before deciding to buy one rather than another platform. Lower quality can mean a big difference in analyst time, depending on how you use the data, so that's important to factor in along with reagent costs. In fact, it would be great if you could send a sample to Illumina or elsewhere and have it sequenced on both a HiSeq 2500 and a NovaSeq (at the run density you expect to use) to accurately quantify how long it takes to process and analyze the data, and how good the results are. That would also give you a better idea of your computational needs.

