#1 |
Senior Member
Location: Germany
Join Date: Apr 2012
Posts: 215
Hi there,
I know this question comes up every now and then and is ultimately hard to answer, but we have no sysadmin at hand with enough NGS experience and we need to spend some money.

We need to / would like to upgrade our NGS throughput quite significantly. Currently, the only suitable sequencer for us seems to be the new NovaSeq, because: HiSeqs will no longer be shipped from the middle of the year, so support for their chemistries will probably also stop sooner rather than later; PacBio seems a little risky since Roche stopped its support; and Genia/Oxford Nanopore are not real options as they are still in some kind of alpha/beta stage.

We want to sequence something in the range of 40 human genomes per month at 30x, so we will have roughly 20 TB of data to process per month (see the rough estimate sketched below). Because variant calling is probably the most computationally expensive part, there is no real need to consider anything else here (transcriptome, methylome, etc.), is there?

So the main question is: what kind of infrastructure do we need for this? Is a cluster really required, or would something with a lower maintenance demand also suffice? We also have the possibility of using the HPC at the local university occasionally, so we could perform the computationally heaviest tasks there and do the rest on our local "whatever". Is this realistic, or are we going to spend more time sending data around than analyzing it?

Any ideas are highly appreciated!

Btw: we are in Germany and working with human tumor patient samples, so data protection is something we need to consider critically at every step. Cloud computing is therefore probably not an option, not even a private cloud (maybe if the data were guaranteed to stay in Germany, but I'm not aware of a company that can make such a guarantee).

Thanks for reading and for any comments!
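A quick back-of-envelope sketch of the data volumes mentioned above. The per-base sizes for compressed FASTQ and BAM, the gVCF size, and the intermediate overhead are all assumptions for illustration, not measured figures:

```python
# Back-of-envelope check of the per-genome data footprint described above.
# Every per-base size here is a rough assumption -- plug in numbers from
# your own pipeline before sizing any hardware.

GENOME_BP = 3.1e9        # haploid human genome length
COVERAGE = 30            # target depth from the post
GENOMES_PER_MONTH = 40

bases = GENOME_BP * COVERAGE                 # ~9.3e10 sequenced bases per genome
fastq_gz_gb = bases * 0.45 / 1e9             # assumption: ~0.45 bytes/base, gzipped FASTQ
bam_gb = bases * 1.0 / 1e9                   # assumption: ~1 byte/base, sorted BAM
gvcf_gb = 10                                 # assumption: per-sample gVCF
kept_gb = fastq_gz_gb + bam_gb + gvcf_gb
peak_workspace_gb = kept_gb + 2 * bam_gb     # assumption: ~2 BAM-sized intermediates at peak

print(f"per genome: ~{kept_gb:.0f} GB kept, ~{peak_workspace_gb:.0f} GB peak workspace")
print(f"per month:  ~{kept_gb * GENOMES_PER_MONTH / 1000:.1f} TB of kept output")
```

Under these assumptions the peak per-sample workspace lands in the same ballpark as the ~500 GB per sample suggested later in the thread, and raw run folders plus retained intermediates are what push the monthly total toward the ~20 TB quoted above.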
#2
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 7,080
Illumina still sells reagents for the GAIIx, so that is not likely to stop any time soon either.
Having managed an entire IT infrastructure in house and then switched to using shared resources, I have seen the entire spectrum. As long as your central IT provides reliable, responsive services, I suggest that you look into collaborating with them. Doing sysadmin tasks and keeping systems secure requires a professional's touch, and it is best to leave that to professionals so you can focus on doing science.

If the network links are reliable, you could collect data from whichever sequencer you select (gigabit links are fine) onto network storage (which could be provided by your central IT, or you could set something up locally and transfer data to central processing off-line). You would certainly need to provision access to adequate storage (some fast, some slow) to manage data efficiently. Figure on keeping 3-4 months' worth of data on disk before moving it to long-term storage (e.g. tape); if a user does not come back asking for data in that period, you are not likely to need it any time soon.

Feel free to ask if you have additional questions.
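To put rough numbers on the two sizing points above (link speed and the 3-4 month retention window), here is a small sketch; the 20 TB/month figure is taken from the opening post, and the 70% link efficiency is an assumption:

```python
# Rough numbers for two sizing questions: can the network link keep up, and
# how much disk does a 3-4 month retention window need? The 20 TB/month
# figure comes from the opening post; the 70% link efficiency is an assumption.

def transfer_hours(data_tb: float, link_gbit_s: float = 1.0, efficiency: float = 0.7) -> float:
    """Hours needed to move data_tb terabytes over the given link."""
    bytes_per_second = link_gbit_s * 1e9 / 8 * efficiency
    return data_tb * 1e12 / bytes_per_second / 3600

MONTHLY_TB = 20
print(f"one month of data over 1 Gbit/s:  ~{transfer_hours(MONTHLY_TB):.0f} h total")
print(f"one month of data over 10 Gbit/s: ~{transfer_hours(MONTHLY_TB, 10):.0f} h total")

for months in (3, 4):
    print(f"{months}-month retention window: ~{months * MONTHLY_TB} TB usable disk (before RAID overhead)")
```

Spread across a month, roughly 60-70 cumulative hours of transfer is a small fraction of a gigabit link's capacity, which is consistent with the "gigabit links are fine" point above.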
#3 |
Devon Ryan
Location: Freiburg, Germany
Join Date: Jul 2011
Posts: 3,480
Also in Germany, so I'm familiar with why clouds are largely a no-go. Having said that, talk to the GWDG. They provide cloud support for all of us who are forbidden from using any service from a company outside the EU (i.e., all of the big ones). We're not using this since we already have our own cluster, but if you're basically just running a single pipeline, then this might be a nice way to go.
Definitely look into a cluster. We're lucky enough to have one IT person dedicated to our core facility, so he takes care of most of the pure sysadmin stuff and then gives me appropriate rights to handle everything else. That keeps things reasonable for me (I spend <5% of my time dealing with this sort of thing unless Galaxy is acting up).
#4 |
Senior Member
Location: Cambridge
Join Date: Sep 2010
Posts: 116
I assume that you have set up your pipeline and know its computational requirements; otherwise, make sure to do that first.
Networking: think about 10G unmanaged switching for the cluster/storage itself if you are on a budget.

Storage: use a mix of NAS, DAS, and SSDs (the latter for references). If there is some in-house cluster resource available, give it a try, but be prepared to spend some money on dedicated, speedy NAS storage. With current workflows I would suggest having at least 500 GB of workspace per sample. Make sure your working array is RAID 10, and do NOT use SMR (Shingled Magnetic Recording) HDDs for scratch storage (like the 8 TB Seagate Archive)! Reference databases are best kept on SSDs; if the budget permits, go for all flash.

Servers/worker nodes: if you end up buying your own servers, have at least 256 GB (better 512 GB) of DDR4 RAM per node; 3.2 GHz 8-core Xeons are quite good on value/performance, and go for dual-socket systems. Make sure your servers are AT LEAST 2U high (3U-4U is better); 1U boxes overheat, are extremely loud, and waste a lot of power (25-30%) generating noise with tiny fans.

PS: When parallelising, work at the highest level possible: use each node to process a single sample from fastq -> bam -> vcf (to the end), rather than trying to divide each step across the nodes, and checkpoint in between (see the sketch below). Use each node's own DAS when possible; that puts far less load on the network and scales better.

PPS: Be prepared to do a lot of de novo work in 3-5 years' time.
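As a concrete illustration of the "one sample per node" advice, here is a minimal driver sketch: each node runs the whole fastq -> bam -> vcf chain for one sample on its own direct-attached scratch, with checkpoint files so a restart skips finished steps. The tool names (bwa-mem2, samtools, gatk), paths, and thread counts are placeholders and assumptions, not a validated pipeline:

```python
#!/usr/bin/env python3
"""One sample per node: fastq -> bam -> vcf on node-local scratch, with
checkpoints. Tool names, paths and thread counts are placeholders."""

import shutil
import subprocess
import sys
from pathlib import Path

SCRATCH = Path("/scratch")            # assumption: node-local DAS mount point
REFERENCE = Path("/refs/GRCh38.fa")   # assumption: reference kept on local SSD


def run_step(name: str, command: str, workdir: Path) -> None:
    """Run one pipeline step unless its checkpoint file already exists."""
    checkpoint = workdir / f".{name}.done"
    if checkpoint.exists():
        print(f"[{name}] checkpoint found, skipping")
        return
    subprocess.run(["bash", "-c", command], cwd=workdir, check=True)
    checkpoint.touch()


def process_sample(sample: str, fastq_r1: Path, fastq_r2: Path, outdir: Path) -> None:
    work = SCRATCH / sample
    work.mkdir(parents=True, exist_ok=True)

    run_step("align",
             f"bwa-mem2 mem -t 16 {REFERENCE} {fastq_r1} {fastq_r2} "
             f"| samtools sort -@ 8 -o {sample}.bam -", work)
    run_step("index", f"samtools index {sample}.bam", work)
    run_step("call",
             f"gatk HaplotypeCaller -R {REFERENCE} -I {sample}.bam "
             f"-O {sample}.g.vcf.gz -ERC GVCF", work)

    # Only the final deliverables leave the node; local scratch is cleaned up.
    outdir.mkdir(parents=True, exist_ok=True)
    for artefact in (f"{sample}.bam", f"{sample}.bam.bai", f"{sample}.g.vcf.gz"):
        shutil.copy2(work / artefact, outdir / artefact)
    shutil.rmtree(work)


if __name__ == "__main__":
    # e.g. one cluster job per sample:
    #   driver.py SAMPLE /data/SAMPLE_R1.fastq.gz /data/SAMPLE_R2.fastq.gz /results/SAMPLE
    process_sample(sys.argv[1], Path(sys.argv[2]), Path(sys.argv[3]), Path(sys.argv[4]))
```

Submitted as one job per sample (e.g. one scheduler array task each), only the FASTQ input and the final BAM/gVCF cross the network; everything in between stays on the node's own DAS.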
#5
Member
Location: Australia
Join Date: Mar 2013
Posts: 26
If you have access to an external cluster, that is definitely the lowest-maintenance solution, but it does run the risk of long turn-around times for your jobs. My local inter-institute cluster is fully utilised, and it is not uncommon for a job to take two weeks just to start running, which is not a good scenario if you intend to make clinical decisions based on your data.

We've had very good results with tiered storage arrays. The performance is close to all-flash (as we have sufficient SSD capacity to keep the active working set in flash), but with much higher capacity for the same price.

Last edited by dcameron; 01-25-2017 at 04:03 PM.
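To illustrate the working-set point: the flash tier only needs to hold the data for samples actively being processed, not the whole archive. All numbers below are assumptions for illustration; the ~500 GB per-sample workspace figure comes from an earlier reply:

```python
# Illustrative sizing check for a tiered array: the flash tier only needs to
# hold the *active* working set, not the archive. All numbers are assumptions.

WORKSPACE_PER_SAMPLE_GB = 500   # per-sample scratch, as suggested earlier in the thread
SAMPLES_IN_FLIGHT = 6           # assumption: samples being processed at once
SSD_TIER_TB = 4                 # assumption: usable flash capacity in the array

working_set_tb = WORKSPACE_PER_SAMPLE_GB * SAMPLES_IN_FLIGHT / 1000
print(f"active working set: ~{working_set_tb:.1f} TB")
print("fits in the flash tier" if working_set_tb <= SSD_TIER_TB
      else "flash tier too small -- hot data will spill to spinning disk")
```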
#6 |
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
Be sure to thoroughly examine NovaSeq data before deciding to buy one versus another platform. Lower quality can make a big difference in analyst time, depending on how you use the data, so that's important to factor in along with reagent costs. In fact, it would be great if you could send a sample to Illumina or elsewhere and sequence it on both a HiSeq 2500 and a NovaSeq (at the run density you expect to use) to accurately quantify how long it takes to process and analyze the data, and how good the results are. That would also give you a better idea of your computational needs.