SEQanswers

Old 12-08-2015, 03:22 AM   #1
sebl
Member
 
Location: Israel

Join Date: Mar 2014
Posts: 26
Default Feedback on workstation for bioinformatics

Dear all,

Following a discussion on a good workstation for bioinformatics work...

We are working on bacterial, plasmid, and viral genomes. Things we currently do at small scale, and plan to scale up (to hundreds of bacterial genomes), include mapping, de novo assembly, genome alignment, pan/core-genome analyses, SNP comparisons, pairwise BLAST, etc.

I approached our IT dept with basic specs for a workstation based on messages I found in the forum and in the respective forums of the programs we work with.

Below is what a vendor suggested to the IT team and now I was asked to check if that looks good enough. Since my knowledge on hardware is poor, I would be glad to get some feedback here.

I already asked that the OS be changed to Linux... I first thought of a dual-boot system, but the more I read and think about it, the more I believe this should be primarily a Linux (Bio-Linux?) system, with a Windows VM if needed. I cannot think of any bioinformatics program that works on Windows but not on Linux... We are still mostly working within Windows (with a Linux VM), but I guess we should make the final switch to Linux now.

Thanks in advance!

Base Unit HP Z840 Workstation
Packaging HP Single Unit Packaging
Chassis HP Z840 1125W (1450W/200V) 90% Eff Chass
Operating System Windows 10 Pro 64 (downgrade to Windows 7 Pro 64)
Add-On Selection Operating System Load to PCIe
Recovery Media Windows 7 Pro 64-bit OS DVD+DRDVD
Processor Intel Xeon E5-2630v3 2.4 1866 8C 1stCPU
Processor 2 Intel Xeon E5-2630v3 2.4 1866 8C 2ndCPU
System Memory 128GB DDR4-2133 (16x8GB) 2CPU RegRAM
Graphics Card NVIDIA NVS 310 1GB 1st GFX
Internal Storage 01 HP Z Turbo Drive 256GB PCIe 1st SSD
Internal Storage 01 4TB 7200 RPM SATA 1st HDD
Internal Storage 02 4TB 7200 RPM SATA 2nd HDD
Internal Storage 03 4TB 7200 RPM SATA 3rd HDD
Internal Storage 04 4TB 7200 RPM SATA 4th HDD
Internal Storage 05 4TB 7200 RPM SATA 5th HDD
Optical Device 1 9.5mm Slim SuperMulti DVDRW 1st ODD
Media Card Reader HP 15-In-1 Media Card Reader
Warranty HP 3/3/3 Warranty
Country Kit HP Z840 Country Kit
Add-On Selection HP Dual Processor Air Cooling Kit
Old 12-08-2015, 05:53 AM   #2
cmccabe
Senior Member
 
Location: chicago

Join Date: Jul 2012
Posts: 354
Default

I use an HP Z640 for analysis of human NGS data. Though it was not configured optimally, it is well suited to our current needs.
That being said, a Linux OS is a good choice; the flavor (Ubuntu, CentOS, RedHat) depends on your comfort level and preference. We do run a Windows-only application, NextGENe, but we set up a VM rather than a dual-boot, as dual-booting was rather difficult with Windows.
The type of workstation you need really depends on your data and the applications you use (are they memory-dependent or processor-intensive?). I too am getting ideas from others more experienced, but you're off to a good start.
Old 12-08-2015, 06:44 AM   #3
Jessica_L
Senior Member
 
Location: Washington, D.C. metro area

Join Date: Feb 2010
Posts: 118
Default

My only word of caution regarding your linux OS is to choose it carefully.

I currently use Ubuntu in a VM and I've had issues getting certain programs to compile correctly (i.e. CASAVA, bcl2fastq from Illumina). Unfortunately, my IT then built our linux workstation with Ubuntu so I'm having to revisit a lot of the same problems when I go to install software. On the plus side, they're usually problems for which I've already identified solutions, but it can get frustrating.

I have a second VM that uses RedHat and I haven't had any problems or issues with it. Others may have a more informed experience with that OS, though.
Old 12-08-2015, 09:32 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,004
Default

Is 128 GB the max RAM for this model? I almost wonder if you should drop the second CPU and get more RAM, if you are going to be doing a lot of de novo assemblies ( I am assuming the configuration has been maxed out for your budget).
Old 12-08-2015, 09:50 AM   #5
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

I like CentOS for the operating system.
Definitely not Windows, under any condition.
You should fire your IT team for proposing Windows.
RedHat Enterprise Linux is basically the same thing as CentOS.
CentOS is the community version of RedHat Linux.

The DVD drive and the media reader are not necessary, but I suppose the cost of having them is minimal relative to the cost of the system.

I don't see the utility of the professional graphics card for next-generation sequencing, but then again if you have the budget it won't do any harm. The money spent on the graphics card could be spent on doubling the RAM.
Old 12-08-2015, 11:58 AM   #6
sebl
Member
 
Location: Israel

Join Date: Mar 2014
Posts: 26
Default

Quote:
The money spent on the graphics card could be spent on doubling the RAM.
I should really check on that.

Quote:
Is 128 GB the max RAM for this model? I almost wonder if you should drop the second CPU and get more RAM
Eh, a colleague suggested that I ask for more processors... The max RAM seems to be 512 GB. But this is already considered a very unusual purchase at our institute, so I did not want to push the specs too far. If I understand correctly, there is room to upgrade later if necessary.

About the budget: I actually just gave the IT people a basic configuration, about 128 GB RAM and 16 cores, based on suggestions I've seen in the forum, without getting into too many other details. This machine is what the vendor suggested.

I thought about Bio-Linux as the OS...
Old 12-08-2015, 12:10 PM   #7
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

I have 48 cores on my institute's server, and it is constantly overloaded.
Luckily, I have access to thousands of cores on an external computing cluster.
I do work nearly exclusively with eukaryotic NGS data, though.

You can certainly use all 16 cores, once you discover the joys of parallel processing.

It's just a question of how patient you are and what turnaround you want. The more cores, the more samples you can process in parallel, and the faster you can process individual samples when parallelization is possible.
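To give a concrete (if toy) picture of sample-level parallelism, here is a minimal shell sketch; the sample names and the per-sample command are hypothetical placeholders, with `echo` standing in for a real mapper or assembler:

```shell
# Run up to 4 "samples" concurrently with xargs -P; on a real run each job
# would be a multi-threaded tool, so 4 jobs x 4 threads would occupy 16 cores.
# The sample names and the echo command are placeholders, not real data.
printf '%s\n' sampleA sampleB sampleC sampleD sampleE |
  xargs -P 4 -I{} sh -c 'echo "processing {}"' > processed.log
sort processed.log   # one line per sample, order of completion may vary
```

The same pattern works with GNU parallel, which adds per-job logging and resumability on top of what plain `xargs -P` offers.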
Old 12-08-2015, 12:25 PM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,004
Default

@sebl appears to belong to a lab (not a core facility?), and even though the prediction of hundreds of samples sounds interesting, it may be a while before the lab actually runs that many (reagent costs add up quickly if you are really going to run hundreds of samples, even bacterial ones). If there really are hundreds of samples, then using a central compute facility becomes economical/effective.

@blancha: You can't be the only user on your local server if it has 48 cores and it still stays busy. If you are the only user, then you must be analyzing hundreds of samples a week to keep all those cores busy.
Old 12-08-2015, 12:34 PM   #9
sebl
Member
 
Location: Israel

Join Date: Mar 2014
Posts: 26
Default

@GenoMax: Indeed.

Also, once we set up an analysis pipeline, if it takes one day longer to finish it does not really matter most of the time, as long as the computer can process it in the end.

I agree that for really, really large sets we may need a bioinformatics core, etc. But we are not there yet.
Old 12-08-2015, 12:51 PM   #10
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

@GenoMax, I currently have 16 human exosome RNA-Seq samples to reprocess. I'm taking 4 cores per sample for the TopHat runs.
4*16 = 64 cores
I've already exhausted my 48 cores. A TopHat run with one core would just take far too long.
And yes, there is a proteomics web application running on the same server, so I have to be careful not to overload the server completely. I actually keep just 38 cores for my NGS pipelines, and leave the other 10 free for other uses.
I also have another project, with 6 samples to reprocess, that has been sitting for the past 2 days in the queue of the computing cluster I also use, either because the cluster is overloaded or because the scheduler is malfunctioning again.

It doesn't take hundreds of samples to use 48 cores.
Granted, I should probably switch from TopHat to a faster aligner, but it's the only program in my pipeline that I have always been able to count on for giving reliable results. The researchers also still insist on using Cuffdiff, despite my best efforts to convince them to switch to featureCounts and DESeq2.

None of this is really relevant to @sebl, since he has already said that turnaround is not an issue. But one can never really have too many cores. For most bioinformatics programs, runtime scales close to linearly with the number of cores available, up to a point.

Last edited by blancha; 12-08-2015 at 01:00 PM.
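As a back-of-the-envelope check of the arithmetic above, a small shell sketch; the numbers are the hypothetical ones from this post (a 38-core budget, 4 threads per TopHat run, 16 samples):

```shell
# Hypothetical numbers from the discussion: a 48-core server with 38 cores
# budgeted for NGS jobs, 4 threads per sample (e.g. TopHat's -p 4), 16 samples.
CORE_BUDGET=38
THREADS_PER_SAMPLE=4
SAMPLES=16

# Integer division: how many samples can run at once within the core budget.
CONCURRENT=$(( CORE_BUDGET / THREADS_PER_SAMPLE ))
# Ceiling division: how many sequential "waves" to get through all samples.
WAVES=$(( (SAMPLES + CONCURRENT - 1) / CONCURRENT ))
echo "$CONCURRENT concurrent jobs, $WAVES waves"   # → 9 concurrent jobs, 2 waves
```

So with this budget, 9 samples can align at once and the 16-sample batch finishes in two waves rather than queueing all 64 requested cores at the same time.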
Old 12-09-2015, 06:38 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,004
Default

@blancha: It sounds to me like your processes are I/O-bound (not surprising) or memory-limited. How much RAM is available per core? As you said, though, our discussion is not relevant to @sebl's question.
Old 12-09-2015, 07:46 AM   #12
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Our local server, at our institute, has 580 GB of shared memory.
So, RAM is generally not an issue.

On the Compute Canada cluster, each core requested comes with 2.7 GB RAM, which is generally sufficient.
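For what it's worth, on a SLURM-style scheduler the threads and per-core RAM are requested together. This is only a hypothetical job-script sketch; the aligner command and file names are placeholders, and the exact options and limits vary by site:

```shell
#!/bin/bash
#SBATCH --cpus-per-task=4       # threads for one sample's alignment run
#SBATCH --mem-per-cpu=2700M     # matches the ~2.7 GB RAM per core mentioned above
#SBATCH --time=12:00:00

# Placeholder alignment command; tool, index, and read files are illustrative.
tophat -p "$SLURM_CPUS_PER_TASK" genome_index reads_1.fastq reads_2.fastq
```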

Yes, there is a lot of I/O.

I should probably switch to a more efficient pipeline.
I should use STAR or Brian's BBMap, but TopHat has just been my workhorse for years.
I can't wean the researchers off Cuffdiff, mainly because they always want the isoform data, which they end up discarding anyway.

Even without TopHat or Cuffdiff, some steps monopolize a processor. For example, I had to run bedtools genomecov on dozens of samples last week. I used 42 processors at the same time, which paralyzed the proteomics web interface running on the same server. I had to reset the queue settings to use only 38 cores.

Anyway, I'm sorry to have hijacked @sebl's thread, but there can just never be too many cores, either to process multiple samples together or to process one sample in parallel threads.
Old 12-09-2015, 11:27 AM   #13
sebl
Member
 
Location: Israel

Join Date: Mar 2014
Posts: 26
Default

No problem. You keep the thread active so I may get more replies from people

What about Biolinux as OS? Any cons that I should be aware of?

Thanks again.
Old 12-09-2015, 11:59 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,004
Default

Quote:
Originally Posted by sebl View Post
No problem. You keep the thread active so I may get more replies from people

What about Biolinux as OS? Any cons that I should be aware of?

Thanks again.
Stick with a standard OS (CentOS, Ubuntu, etc.) and install apps as necessary, to keep things flexible. Leave the systems administration to someone whose job description reflects that.
Old 12-09-2015, 12:05 PM   #15
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,004
Default

Quote:
Originally Posted by blancha View Post
but there can just never be too many cores, either to process multiple samples together, or process one sample in parallel threads.
I am not 100% convinced about that but I am more patient and do have access to significant resources.

It sounds like you have a quad-socket server, which would be on the not-affordable end for @sebl. I have generally found BBMap best for my needs, and since I work mostly on a cluster, there is no point in assigning more cores to a job than exist in a physical server, since the scheduler (and in turn the admins) don't like it.
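A small defensive sketch of that point: cap a tool's thread count at what the machine actually reports. `REQUESTED` is a hypothetical desired value, and `nproc` reports the cores available to the process:

```shell
# Never ask a tool for more threads than the machine (or allocation) has.
# REQUESTED is a hypothetical desired thread count for illustration.
REQUESTED=64
AVAILABLE=$(nproc)
# Take the minimum of the two via shell arithmetic's conditional operator.
THREADS=$(( REQUESTED < AVAILABLE ? REQUESTED : AVAILABLE ))
echo "using $THREADS threads"
```

The resulting `$THREADS` would then be passed to whatever `-p`/`-t`/`threads=` option the tool in question exposes.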