SEQanswers

01-23-2017, 12:49 AM   #1
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Need help with NGS hardware upgrade

Hi there,
I know this question comes up every now and then and may be hard to answer, but we have no sysadmin at hand with enough NGS experience and need to spend some money.

We need to / would like to upgrade our NGS throughput quite significantly. Currently, the only suitable sequencer for us seems to be the new NovaSeq, because: HiSeqs will not be delivered anymore from the middle of the year, so support of chemistries will probably also stop sooner rather than later. PacBio seems a little risky since Roche stopped the support. Genia/Oxford are no real options as they are still in some kind of alpha/beta stage.

We want to sequence something in the range of 40 human genomes per month at 30x. So we will have something like 20TB of data to process per month. Because variant calling is probably the most computationally expensive part, there is no real need to consider anything else here (transcriptome, methylome, etc.), is there? So the main question would be: what kind of infrastructure do we need for this? Is a cluster really required here or would something with a lower maintenance demand also suffice? We also have the possibility to use the HPC at the local university occasionally, so we could perform the computationally heaviest tasks there and do the rest on our local "whatever". Is this realistic or are we going to spend more time sending data around than analyzing it?

Any ideas are highly appreciated!

Btw: We are in Germany and working with human tumor patient samples. Hence, data protection is something we need to critically consider in every step. Cloud computing is therefore probably not a possibility, not even if it is a private cloud (maybe if the data is guaranteed to stay in Germany, but I'm not aware of a company that can make such a guarantee).

Thanks for reading and for any comments

01-23-2017, 04:07 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,975

Quote:
Originally Posted by WhatsOEver
Hi there,
Currently, the only suitable sequencer for us seems to be the new NovaSeq, because: HiSeqs will not be delivered anymore from the middle of the year, so support of chemistries will probably also stop sooner rather than later.

I don't think that is the case. Illumina may (not confirmed) stop selling the HiSeq 2500/3000 at that time, but other HiSeqs (e.g. the HiSeq 4000) would certainly still be shipping. If you want to justify getting a NovaSeq using that reason, go right ahead, we won't tell.

Illumina still sells reagents for GAIIx, so that is not likely to stop any time soon either.

Quote:
So the main question would be: what kind of infrastructure do we need for this? Is a cluster really required here or would something with a lower maintenance demand also suffice? We also have the possibility to use the HPC at the local university occasionally, so we could perform the computationally heaviest tasks there and do the rest on our local "whatever". Is this realistic or are we going to spend more time sending data around than analyzing it?

You would definitely benefit from having access to a cluster. That way you can multitask (e.g. pre-process data for one flowcell while aligning two others and calling SNPs on something else, you get the idea).

Having managed entire IT infrastructure in house and then switched to using shared resources I have seen the entire spectrum. As long as your central IT provides reliable/responsive services I suggest that you look into collaborating with them. Doing system admin tasks/keeping systems secure requires a professional's touch and it is best to leave that to professionals so you can focus on doing science.

If the network links are reliable you could collect data from whichever sequencer you select (gigabit links are fine) to network storage (that could be provided by your central IT or you could set something up locally and transfer data to central processing off-line).
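
As a rough check on whether a gigabit link keeps up, here is a minimal sketch of the transfer-time arithmetic. The 20 TB per month figure is from the original post; the 70% link efficiency is only an assumption, so substitute whatever you measure on your own network.

```python
def transfer_hours(data_tb: float, link_gbit_s: float, efficiency: float = 0.7) -> float:
    """Hours needed to move data_tb (decimal terabytes) over a network link.

    efficiency is an ASSUMED fraction of the nominal line rate that is
    actually achieved; it is not a measured value.
    """
    bits = data_tb * 1e12 * 8                # terabytes -> bits
    usable_rate = link_gbit_s * 1e9 * efficiency  # bits per second
    return bits / usable_rate / 3600


if __name__ == "__main__":
    monthly_tb = 20  # monthly data volume quoted in the original post
    for gbit in (1, 10):
        hours = transfer_hours(monthly_tb, gbit)
        print(f"{gbit:>2} Gbit/s: about {hours:.1f} h to move {monthly_tb} TB")
```

Under these assumptions a month of data takes a few days of link time on 1 Gbit/s, which is small compared to the roughly 720 hours in a month, as long as transfers can run in the background.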

You would certainly need to provision access to adequate storage (some fast, some slow) to manage data efficiently. Figure on keeping 3-4 months' worth of data on disk (before moving it to long-term storage, e.g. tape). If a user does not come back asking for data in that period of time, you are not likely to need it any time soon.
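
To put a number on that retention window, a small sketch using the 20 TB per month from the original post. The `overhead` parameter (extra space for intermediate files) is a hypothetical knob, set to zero by default because the thread gives no figure for it.

```python
def on_disk_tb(monthly_tb: float, months_retained: float, overhead: float = 0.0) -> float:
    """Disk capacity (TB) needed to hold a rolling retention window.

    overhead is an ASSUMED fractional allowance for intermediate files,
    e.g. 0.5 means 50% extra on top of the raw monthly volume.
    """
    return monthly_tb * months_retained * (1 + overhead)


if __name__ == "__main__":
    for months in (3, 4):
        print(f"{months} months retained: {on_disk_tb(20, months):.0f} TB on disk")
```

So before any allowance for intermediates, the 3-4 month window works out to 60-80 TB of disk.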

Feel free to ask if you have additional questions.

01-23-2017, 05:39 AM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Also in Germany, so I'm familiar with why clouds are largely a no go. Having said that, talk to the GWDG. They're providing cloud support for all of us forbidden from using any service from a company outside of the EU (i.e., all of the big ones). We're not using this since we already have our own cluster, but if you're basically just running a single pipeline then this might be a nice way to go.

Definitely look into a cluster. We're lucky enough that we have one IT person dedicated to our core facility, so he takes care of most of the pure sysadmin stuff and then gives me appropriate rights to handle everything else. That keeps things reasonable for me (I spend <5% of my time dealing with this sort of thing unless Galaxy is acting up).

01-23-2017, 07:51 AM   #4
Markiyan
Senior Member
 
Location: Cambridge

Join Date: Sep 2010
Posts: 115
Make sure to do some testing with actual data on the hardware of choice!

I assume that you have set up your pipeline and know its computational requirements; otherwise, make sure to do that first.

Networking:
Think about 10G unmanaged for the cluster/storage itself (if on a budget).

Storage: use NAS + DAS + SSDs for ref:
If there is some in-house cluster resource available, give it a try, but be prepared to spend some money on dedicated, speedy NAS storage. With current workflows I would suggest having at least 500GB of workspace per sample. Make sure your working array is RAID 10 and DO NOT USE SMR (Shingled Magnetic Recording) HDDs for scratch storage (like the 8TB Seagate Archive)!
Reference databases are best kept on SSDs.
If budget permits go for all flash.
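
A minimal sketch of what the 500GB-per-sample rule means for the working array. The number of samples in flight is an assumption you would set from your own schedule; the factor of two reflects that RAID 10 mirrors all data, so usable capacity is half of raw.

```python
def scratch_sizing(samples_in_flight: int, workspace_gb: float = 500.0) -> dict:
    """Usable and raw scratch capacity (TB) for a RAID 10 working array.

    samples_in_flight is an ASSUMED number of samples being processed
    at the same time; workspace_gb defaults to the 500 GB suggested above.
    """
    usable_tb = samples_in_flight * workspace_gb / 1000
    raw_tb = usable_tb * 2  # RAID 10 mirrors every stripe
    return {"usable_tb": usable_tb, "raw_tb": raw_tb}


if __name__ == "__main__":
    for n in (8, 40):
        s = scratch_sizing(n)
        print(f"{n} samples: {s['usable_tb']:.0f} TB usable, {s['raw_tb']:.0f} TB raw")
```

If all 40 monthly samples were on scratch at once, that would be 20 TB usable and 40 TB of raw disk; processing in smaller batches shrinks this proportionally.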

Servers/Worknodes:
If you end up buying your own servers, then have at least 256GB (better 512GB) of DDR4 RAM per node; 3.2GHz 8-core Xeons are quite good on value/performance, and go for dual-socket systems. Make sure your servers are AT LEAST 2U high (3U-4U is better): 1U servers would overheat, be extremely loud, and waste a lot of power (25-30%) generating noise with tiny fans.

PS: When parallelising, work at the highest level possible: use each node to process a single sample from fastq->bam->vcf (to the end), rather than trying to divide each step across the nodes and checkpointing in between. Use each node's own DAS when possible (way less load on the network and better scalability that way).
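
The per-sample approach can be sketched as follows. This is only an illustration: the stage names and `run_stage` are hypothetical placeholders (in practice each would launch a real aligner or caller against local disk), and the thread pool workers stand in for nodes.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage names for the fastq -> bam -> vcf chain.
STAGES = ("align", "sort_dedup", "call_variants")


def run_stage(stage: str, sample: str) -> str:
    """Placeholder for one pipeline step; returns a fake output name."""
    return f"{sample}.{stage}"


def process_sample(sample: str) -> list[str]:
    """Run every stage for ONE sample, in order, on a single worker."""
    return [run_stage(stage, sample) for stage in STAGES]


def process_all(samples: list[str], workers: int = 2) -> dict[str, list[str]]:
    """Parallelise across samples only; no step is split between workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(samples, pool.map(process_sample, samples)))


if __name__ == "__main__":
    for sample, outputs in process_all(["tumor_01", "tumor_02", "tumor_03"]).items():
        print(sample, "->", outputs[-1])
```

The unit of work is the whole sample, so nothing has to be checkpointed or shipped between workers in the middle of a run.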

PPS: Be prepared to do a lot of de novo work in 3-5 years time.

01-25-2017, 02:55 PM   #5
dcameron
Member
 
Location: Australia

Join Date: Mar 2013
Posts: 26

Quote:
Originally Posted by WhatsOEver
So the main question would be: what kind of infrastructure do we need for this? Is a cluster really required here or would something with a lower maintenance-demand also suffice?

As suggested by Markiyan, I strongly agree that you should benchmark your intended pipeline so you have an approximate estimate of your computational requirements. As soon as you get to the point where you need to scale over multiple machines, a cluster is easier to administer than running multiple independent machines. A cluster is, in essence, just a bunch of machines with some job queuing software to manage load. Given the volume of data you have, you'd need at least a 4-socket server to do it all on one machine, which isn't a cost-effective proposition.
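
Once you have a benchmark, turning it into a node count is simple arithmetic. A minimal sketch, where the node-hours per sample and the 70% utilisation are placeholders I made up for illustration and must be replaced with your own measurements:

```python
import math


def nodes_needed(samples_per_month: int,
                 node_hours_per_sample: float,
                 utilization: float = 0.7,
                 hours_per_month: float = 30 * 24) -> int:
    """Smallest number of nodes that can keep up with the monthly load.

    node_hours_per_sample must come from benchmarking your own pipeline;
    utilization is an ASSUMED fraction of wall-clock time spent on real work.
    """
    demand = samples_per_month * node_hours_per_sample  # node-hours required
    supply_per_node = hours_per_month * utilization     # node-hours available
    return math.ceil(demand / supply_per_node)


if __name__ == "__main__":
    # 40 samples per month is from the original post; the hours are placeholders.
    for hours in (12, 24, 48):
        print(f"{hours} node-hours/sample -> {nodes_needed(40, hours)} node(s)")
```

This ignores queue waits and failed runs, so treat the result as a lower bound.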

If you have access to an external cluster, that is definitely the lowest-maintenance solution, but it does run the risk of long turnaround times for your jobs. My local inter-institute cluster is fully utilised and it is not uncommon for a job to take two weeks to even start running, which is not a good scenario if you intend to make clinical decisions based on your data.

Quote:
Originally Posted by Markiyan
If budget permits go for all flash.

We've had very good results with tiered software arrays. The performance is close to all flash (as we have sufficient SSD capacity to keep the active working set in flash), but has much higher capacity for the same price.

Last edited by dcameron; 01-25-2017 at 03:03 PM.

01-26-2017, 09:26 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Be sure to thoroughly examine NovaSeq data before deciding on buying one versus another platform. Lower quality can yield a big difference in analyst time, depending on how you use the data, so that's important to factor in along with reagent costs. In fact, it would be great if you can send a sample to Illumina or somewhere and sequence it on a HiSeq2500 and NovaSeq (at the run density you expect to use) to accurately quantify how long it takes to process and analyze the data, and how good the results are. That would also give you a better idea of your computational needs.

All times are GMT -8.