SEQanswers

Old 02-19-2011, 10:55 AM   #1
quantrix
Member
 
Location: Pennsylvania

Join Date: Jan 2011
Posts: 21
Core Cluster Setup - Linux, Ubuntu, Rocks, Data Storage, BlueArc

Dear Group,
First post, I think this is a wonderful forum and full of ideas. I had a few questions which

I was hoping people could have a look at think if I am on the right track.

We are looking to working exomic data of 300 samples. This very much has the potential to scale upto more than a 1000 samples. I am planning the computational resources. I have prior experience building clusters (Scyld Beowulf, 16 node cluster). All things considered, my questions are as follows

The system I have in mind is:


Front End:

1) TWO (2) front-end nodes - regular Linux boxes, maybe AMD quad-cores. Dumb terminals, dual screens.


The main Workhorses:

2) TWO (2) high-end Linux servers. AMD Opteron 12-core machines, 128-256 GB RAM per server. Basically this would be a 2-node cluster with 24 CPUs. Mind you, we have the potential to scale up further if we feel the need. For now, the need is only to process exome data and SNP data. Do you feel these machines will satisfy our computational needs? We would need to do all the processing required for whole-exome sequencing, including alignment, base calling, etc. Also, if there are any specific requirements that would help the process, that information is welcome (e.g., Gigabit Ethernet for networking versus InfiniBand or Myrinet (does Myrinet even exist nowadays?)).
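As a rough sanity check on the compute question, here is a back-of-envelope sketch. This is not a benchmark; the per-exome CPU cost is purely an assumed figure and varies widely by pipeline:

```python
# Back-of-envelope sizing for the proposed 2-node, 24-core setup.
# All figures here are illustrative assumptions, not measured benchmarks.
CORES_TOTAL = 24            # 2 nodes x 12 cores
CPU_HOURS_PER_EXOME = 24    # assumed cost to align + call one exome
SAMPLES = 300

total_cpu_hours = SAMPLES * CPU_HOURS_PER_EXOME
wall_clock_days = total_cpu_hours / CORES_TOTAL / 24.0  # assumes perfect scaling

print(f"{total_cpu_hours} CPU-hours ~= {wall_clock_days:.1f} days of wall-clock time")
```

A measured per-exome cost from a pilot run on real data should replace the assumed constant before any purchasing decision.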



3) Database:

I am considering whether it is worthwhile going the BlueArc storage route, or whether I should build something off the shelf from a place like Penguin Computing, such as a RAID array or SCSI drive storage solution. Does anyone here have experience that points one way or the other? One thing is for sure: we intend to keep the database and the Linux servers separate. Our ideal database solution would be a standalone system.

4) Software:

Ubuntu Server?, CentOS/SUSE/Red Hat Enterprise?, Rocks cluster software? Any advantages of one versus another? Ubuntu with Kerrighed is one option. Also, one probably stupid question: when I install Ubuntu Server edition on the front end, do the Linux workhorses need a separate install of the software? Any thoughts on Ubuntu Server versus the Rocks cluster solution? How similar or different are they?

Does the Bio Roll of the Rocks cluster offer any specific advantage over installing an Ubuntu Server edition and adding the bioinformatics software separately?

I know these are a lot of questions, but I would appreciate any insights into my specific problem. If you have a better solution, I would be glad to hear it. As I mentioned, our current datasets are small (300 exomes and 300 SNP-chip datasets), but they have the potential to balloon quickly.

Thank god for the internet and this wonderful community. You guys rock!

Regards
Quantrix
Old 02-20-2011, 09:58 PM   #2
quantrix
Member
 
Location: Pennsylvania

Join Date: Jan 2011
Posts: 21
Default

Hi group,
164 views and no replies. I would appreciate ANY opinion you have; please feel free to PM me if you think it necessary.
I greatly value the opinions of this august group on these issues.
Regards
Quantrix
Old 02-21-2011, 12:24 AM   #3
jts
Member
 
Location: Cambridge

Join Date: Feb 2009
Posts: 22
Default

I don't know the requirements of the base-calling pipeline, but the amount of RAM you have suggested might be excessive for alignment/variant-calling applications, particularly on exomes. I would consider adding more servers with less RAM per server, or having just one "big-memory" machine.
Old 02-21-2011, 12:57 AM   #4
stefanoberri
Member
 
Location: Cambridge area, UK

Join Date: Jan 2010
Posts: 35
Default

Hi. Here are my 2 cents.

Some questions you should ask yourself:
How many users will run code at the same time?
Are you planning to use the cores to run many jobs at once, or to run one program that uses all 24 cores?

I have noticed the main bottleneck is file transfer/copy/backup. Make sure the place where the computation happens has very quick access to disk space.
If you have 24 CPUs to do things in parallel, will your hard drives be able to feed data to those 24 CPUs simultaneously? Scripts often do relatively simple things on very big files, and fetching the files takes a non-trivial amount of time compared to the processing time.
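This point can be put in numbers with a toy contention model. The 110 MB/s figure (roughly the usable throughput of gigabit Ethernet) and the fair-sharing assumption are illustrative only:

```python
# Toy contention model: N concurrent jobs fairly sharing one storage link.
# The 110 MB/s link figure is an assumption, not a measurement.
def read_time_minutes(file_gb, link_mb_s, n_jobs):
    """Minutes for one job to read its file when n_jobs share the link."""
    per_job_mb_s = link_mb_s / n_jobs
    return file_gb * 1024 / per_job_mb_s / 60

print(f"1 job reading a 10 GB file: {read_time_minutes(10, 110, 1):.1f} min")
print(f"24 jobs, 10 GB file each:   {read_time_minutes(10, 110, 24):.1f} min per job")
```

The model ignores caching and seek overhead, but it shows why a single shared gigabit link can dominate runtime once many jobs read big files at once.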

The alignment process (I use bwa) requires about 3 GB of RAM. You will probably benefit from large RAM when you compare 1,000 experiments (not necessarily the case, though).
Old 02-21-2011, 09:36 AM   #5
quantrix
Member
 
Location: Pennsylvania

Join Date: Jan 2011
Posts: 21
Default

Hi, thank you for the replies.

At the current time, only TWO people will be using the cluster. Most likely there will be a few jobs running at the same time; I'm assuming no more than 6 at a time.
Regards
Quantrix
Old 02-21-2011, 10:05 AM   #6
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

My main compute cluster uses BlueArc. It handles anything I throw at it -- I have no qualms about simultaneously running 30 jobs accessing the same, and large, datasets. My secondary compute cluster has Sun "Thumpers" -- big and slow. Running more than one job causes a noticeable slowdown and screams from my sysadmin. So if you have the money I suggest BlueArc or a similar solution. I/O is a major concern and much harder to correct than limits in CPU or memory power.

As for the rest of the hardware, I agree that the memory seems excessive; I can get by with 96 GB. On the other hand, it depends on what software you run and what comparisons you are doing. 24-CPU boxes are OK, but be aware that some software simply won't scale well across many CPUs.
Old 02-21-2011, 11:30 PM   #7
mapper
Member
 
Location: India

Join Date: Nov 2010
Posts: 17
Default

We have been using Rocks for a long time and it's working fine.
Old 02-22-2011, 12:41 AM   #8
quantrix
Member
 
Location: Pennsylvania

Join Date: Jan 2011
Posts: 21
Default

Hi Westerman,
Thanks for your reply. I was interested in the BlueArc solution too. What is the size of your database? Does it scale well in terms of size? Do they provide specialized tools for database administration? Any security issues? Does it play well with Linux? To start, we are looking at 6-7 TB of data, but that might scale to a couple of hundred TB in the next 3-4 years.

Any suggestions for a competitor company?
Old 02-22-2011, 12:46 AM   #9
quantrix
Member
 
Location: Pennsylvania

Join Date: Jan 2011
Posts: 21
Default

Hi Mapper,
Thanks for the reply. I was interested in the Rocks solution too. However, there is a belief that managing a Rocks cluster is not easy; i.e., if something breaks, good luck finding what caused it. That said, how easy do you find it to install and manage NGS software on a Rocks cluster?
Old 02-22-2011, 01:14 AM   #10
Thorondor
Member
 
Location: Heidelberg

Join Date: Feb 2011
Posts: 69
Default

I must agree with stefanoberri. I recommend using SSDs for the data you are actively analyzing on your cluster, and then storing it on cheaper hard disks afterwards. ;-)
Old 02-22-2011, 07:07 AM   #11
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Re: database.

We only store our meta-information in the database; we do not store the actual raw sequences or results (e.g., BAM files) in the DB. Traditional SQL-based databases are not optimized to hold a relatively low number of large files, and since most analysis programs do not deal directly with a DB, it is easier to store and work with the files outside of it. On the other hand, we may be unusual in this regard. If you want more opinions on this matter, I suggest starting a new post with the single question of what people use a DB for.

Thus the answers to your DB questions are "size is small (MBs)" and "we use MySQL -- simple, easy and cheap -- for the metadata".

No suggestion for a competitor company to BlueArc. I am sure there are some, but I have not looked lately. The home-grown idea of "SSDs as primary and cheap HDs as secondary" also has merit; we may try it on our secondary compute cluster. I still have doubts about this method (at least for us) since our secondary network is limited to 1 Gbps, but at least it would be a fairly cheap solution.
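The metadata-only approach described above can be sketched with a minimal schema. The table and column names are hypothetical, and sqlite3 stands in here for MySQL:

```python
# Minimal sketch of a metadata-only database: the DB records sample
# attributes and file *paths*; the raw FASTQ/BAM files stay on the
# filesystem. Schema and names are hypothetical; sqlite3 stands in for MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE samples (
        sample_id   TEXT PRIMARY KEY,
        capture_kit TEXT,
        run_date    TEXT,
        bam_path    TEXT,  -- pointer to the file on shared storage, not the data
        bam_md5     TEXT   -- checksum so the file can be verified later
    )
""")
conn.execute(
    "INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
    ("EX0001", "exome_kit_v2", "2011-02-22",
     "/storage/exomes/EX0001.bam", "d41d8cd98f00b204e9800998ecf8427e"),
)
row = conn.execute(
    "SELECT bam_path FROM samples WHERE sample_id = ?", ("EX0001",)
).fetchone()
print(row[0])
```

Analysis tools then open the file at `bam_path` directly, and the database stays in the MB range no matter how large the sequence files grow.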
Old 02-22-2011, 07:31 AM   #12
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

We see what Westerman said earlier: disk I/O trouble. Our solution is access to the university's cluster, which we share with other research groups (astronomy, protein folding and more).
When multiple jobs access the same (large) datasets, the jobs slow down.
When there is a lot of reading and writing of large amounts of data on the fast storage device, everything slows down (try waiting 30 seconds for ls :P).
When users who don't know what PBS is run heavy jobs on the login node, everybody gets agitated :P
So I have two points:
1. If we are careful not to run too many exomes simultaneously, this cluster's resources are more than enough. If we get too enthusiastic, I/O is the bottleneck.
2. In your specific case, is it wise to set up your own cluster, or to buy your way into an existing one?
Chrz,
Bruins
Old 02-22-2011, 08:47 AM   #13
quantrix
Member
 
Location: Pennsylvania

Join Date: Jan 2011
Posts: 21
Default

Quote:
Originally Posted by westerman View Post
Re: database.

We only store our meta information in the database. We do not store in the DB the actual raw sequences nor results (e.g., bam files). The home-grown idea of "SSDs as primary and cheap HDs as secondary" also has merit.
Thanks a lot Westerman! That is helpful.

The idea of using an SQL database to store metadata makes perfect sense, and I think it is the right solution. However, the fact that you need to store your metadata tells me that you probably have a very large dataset.

So my next question is: do you store your raw data in an unstructured format on the BlueArc storage?

I would imagine you are using the MapReduce paradigm for analyzing the data. Do you use Hadoop?

I am considering the idea of an SSD too. However, most commercial vendors I see on the market provide SSDs of no more than 160 GB. I wonder if this would become a bottleneck for me in the future.

What is the group's opinion on using a 160 GB SSD ONLY for data analysis? I.e., the data is temporarily migrated to the server containing the SSD, the analysis is done, and then the results and raw data are dumped back onto the BlueArc system. Is that a viable pipeline?
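One way to sanity-check the staging idea is a capacity check on the in-flight working set. The 4x inflation factor for intermediate files (SAM, sorted BAM, temp files) is an assumed figure, not a measurement:

```python
# Capacity check for the stage-in / analyze / stage-out idea against a
# 160 GB scratch SSD. The inflation factor is an assumption.
SSD_GB = 160
RAW_GB_PER_SAMPLE = 10   # ~10 GB raw exome data per sample (from the post)
INFLATION = 4            # assumed: raw + intermediate files, ~4x raw size

def fits_on_ssd(n_samples):
    """True if n samples' working sets fit on the scratch SSD at once."""
    return n_samples * RAW_GB_PER_SAMPLE * INFLATION <= SSD_GB

in_flight = [n for n in range(1, 7) if fits_on_ssd(n)]
print(in_flight)  # with these assumptions, up to 4 samples in flight at once
```

At 3-4 samples a day, a working set of that size fits, so the pipeline looks viable under these assumptions; the real inflation factor depends on the aligner and how aggressively temp files are cleaned up.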

My problem overall is not the size of the raw exome data itself; I compute that to be relatively small, ~10 GB per sample. What is going to get me is the numbers: I envision hundreds of samples coming my way, which I WILL need to retain in one form or another. That is the problem.
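The retention problem can be put in numbers. This assumes the upper end of the 3-4 samples/day rate and the ~10 GB/sample figure mentioned above, with everything kept:

```python
# Retention projection: samples arrive steadily and are never deleted.
SAMPLES_PER_DAY = 4   # upper end of the stated 3-4 per day
GB_PER_SAMPLE = 10    # ~10 GB raw data per sample (from the post)

def retained_tb(days):
    """Total retained data in TB after the given number of days."""
    return SAMPLES_PER_DAY * GB_PER_SAMPLE * days / 1024.0

for years in (1, 2, 4):
    print(f"after {years} year(s): {retained_tb(365 * years):.0f} TB")
```

Roughly 14 TB per year accumulates under these assumptions, which is why the storage choice matters more than the size of any single exome.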

Will look forward to more of your insights Westerman. Thank you!
Old 02-22-2011, 08:55 AM   #14
quantrix
Member
 
Location: Pennsylvania

Join Date: Jan 2011
Posts: 21
Default

Quote:
Originally Posted by Bruins View Post
We see what Westerman said earlier: disk IO trouble. [...] 2. Is it in your specific case wise to set up your own cluster, or is it wise to buy your way into an existing cluster?
Thanks, Bruins, for the reply. As I mentioned above, in my specific case I have no alternative to setting up a cluster, since the data NEEDS to stay within the firewall.

For now, we will have exclusive access to the cluster (whichever one we build), which means I decide how many jobs run on it. Also, the throughput is not that huge in the short term; i.e., I will need to run no more than 3-4 samples a day. BUT I will need to run these 3-4 samples a day for a LONGGG time (job security, thank you very much!). So the timescales are important, which is where the database issues crop up, as well as the computing issues.

30 seconds for ls?????????? ha ha ha, I'd shoot myself and quit. Or rather the other way around.

Old 02-22-2011, 10:06 PM   #15
mapper
Member
 
Location: India

Join Date: Nov 2010
Posts: 17
Default

Well, installing a Rocks cluster is about as easy as installing an OS on a standalone machine (I guess 10% more effort is required). Configuration and setup take a few hours (2-3) the first time you do it, but I would say it's not difficult.

Rocks has a community and they provide very good support.

All you need to take care of when setting up Rocks for NGS is the accessibility of data to all nodes.

Do you have anything specific in mind regarding Rocks?
Old 02-22-2011, 11:00 PM   #16
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

Quote:
Originally Posted by quantrix View Post
I decide how many jobs run on it.
I want that! I want that!

Quote:
Originally Posted by quantrix View Post
30 seconds for ls?????????? ha ha ha, I'd shoot myself and quit. Or rather the other way around.

Yeah... one of us recently had a kid, so shooting wasn't an option. We now resort to yelling and sci-fi.
Old 10-05-2012, 09:46 AM   #17
csmatyi
Member
 
Location: Nebraska

Join Date: Oct 2011
Posts: 25
Default Linux-based RAID server, at least RAID level 2, >5 TB

Hello everyone,

We're looking to purchase a disk storage unit.
It must fulfill the following requirements:

- >= 5 TB capacity
- min. RAID 2; RAID 3, 4 or 5 would be better

It will be used for backups, so drive speed (RPM) is not important.
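For choosing among the parity RAID levels, here is a quick usable-capacity sketch. It assumes equal-sized disks and ignores filesystem overhead; the four-disk, 2 TB example configuration is hypothetical:

```python
# Usable capacity under parity RAID: total capacity minus the capacity
# consumed by parity. Equal-sized disks assumed; filesystem overhead ignored.
def usable_tb(n_disks, disk_tb, parity_disks):
    """Usable TB given disk count, per-disk TB, and disks' worth of parity."""
    return (n_disks - parity_disks) * disk_tb

# e.g. four 2 TB disks:
print("RAID 5:", usable_tb(4, 2.0, 1), "TB")  # one disk's worth of parity
print("RAID 6:", usable_tb(4, 2.0, 2), "TB")  # two disks' worth of parity
```

So four 2 TB disks in RAID 5 would just clear the 5 TB requirement, while RAID 6 on the same disks would not; RAID 2 and 3 are rarely used in practice today.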

Thanks for any help!!

-csmatyi
Tags
bluarc, exome, rocks, ubuntu server, wgs
