  • Core Cluster Setup - Linux, Ubuntu, Rocks, Data Storage, BlueArc

    Dear Group,
    First post. I think this is a wonderful forum, full of ideas. I had a few questions which I was hoping people could look at, to see if I am on the right track.

    We are looking to work with exome data from 300 samples. This very much has the potential to scale up to more than 1,000 samples. I am planning the computational resources. I have prior experience building clusters (Scyld Beowulf, 16-node cluster). All things considered, my questions are as follows.

    The system I have in mind is:


    Front End:

    1) TWO (2) front-end nodes - regular Linux boxes, maybe AMD quad-cores. Dumb terminals. Dual screens.


    The main Workhorses:

    2) TWO (2) high-end Linux servers: AMD Opteron, 12-core machines, 128-256 GB RAM per server. Basically this would be a 2-node cluster with 24 CPUs. Mind you, we have the potential to scale up further if we feel the need. For now, the need is only to process exome data and SNP data. Do you feel these machines will satisfy our computational needs? We would need to do all the processing required for whole-exome sequencing, including the alignment, base calling, etc. Also, if there are any specific requirements which would help the process, that information is welcome (e.g., Gigabit Ethernet for networking versus InfiniBand or Myrinet -- do they even exist nowadays?).



    3) Database:

    I am wondering whether it is worthwhile to go the route of BlueArc storage, or whether I should build something off the shelf from a place like Penguin Computing, e.g., a RAID array / SCSI drive storage solution. Does anyone here have experience pointing one way or the other? One thing is for sure: we intend to keep the database and the Linux servers separate. Our ideal database solution would be a standalone one.

    4) Software:

    Ubuntu Enterprise? CentOS/SUSE/Red Hat Enterprise? Rocks cluster software? Any advantages of one versus another? Ubuntu with Kerrighed is one option. Also, one probably stupid question: when I install Ubuntu Server edition on the front end, do the Linux workhorses need a separate install of the software? Any ideas on Ubuntu Server versus the Rocks cluster solution? How similar or different are they?

    Does the Bio Roll of the Rocks cluster offer any specific advantages over installing an Ubuntu Server edition and installing the bioinformatics software separately on it?

    I know these are a lot of questions, but I would appreciate it if anyone had more insight into my specific problem. If you have any better solutions to this problem, I would be glad to hear them. As I mentioned, our current datasets are small (300 exomes and 300 SNP chip datasets), but they have the potential to balloon quickly.

    Thank god for the internet and this wonderful community. You guys rock!

    Regards
    Quantrix

  • #2
    Hi group,
    164 views and no replies. I would appreciate ANY opinion you guys have. Please feel free to PM me if you think it necessary.
    I would highly appreciate the opinions of this august group dealing with these issues.
    Regards
    Quantrix



    • #3
      I don't know the requirements of the base-calling pipeline, but the amount of RAM you have suggested might be excessive for alignment/variant-calling applications, particularly on exomes. I would consider adding more servers with less RAM per server, or having just one "big-memory" machine.



      • #4
        Hi. Here are my 2 cents.

        Some questions you should ask yourself:
        How many users will run code at the same time?
        Are you planning to use the cores for running many jobs at once, or to run one program that uses all 24 cores?

        I have noticed the main bottleneck is file transfer/copy/backup. Make sure the place where the computation happens has very quick access to disk space.
        If you have 24 CPUs to do things in parallel, will your hard drive be able to feed data to those 24 CPUs simultaneously? Often scripts do relatively simple things to very big files, and reading the files takes a non-trivial amount of time compared to the processing time.
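
        To put a number on that, here is a minimal sketch (Python; the file path is hypothetical) that times a sequential read of a big file on the storage you plan to use:

        import time

        CHUNK = 64 * 1024 * 1024  # read in 64 MB chunks
        PATH = "/data/big_sample.fastq"  # hypothetical large file on the storage under test

        start = time.time()
        total = 0
        with open(PATH, "rb") as fh:
            while True:
                block = fh.read(CHUNK)
                if not block:
                    break
                total += len(block)
        elapsed = time.time() - start
        print("read %.2f GB in %.1f s (%.0f MB/s sequential)"
              % (total / 1e9, elapsed, total / 1e6 / elapsed))

        Use a file bigger than RAM (or flush the page cache first), otherwise the OS cache inflates the number; then launch several copies at once and see how the throughput divides among them.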

        The alignment process (I use bwa) requires about 3 GB of RAM. You will probably benefit from large RAM when you compare 1,000 experiments (not necessarily the case, though).
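
        For reference, a single-sample paired-end run looks something like this sketch (Python driving the classic bwa aln/sampe commands; the reference and read file names are hypothetical):

        import subprocess

        REF = "hg19.fa"                # hypothetical reference, pre-indexed with "bwa index hg19.fa"
        R1, R2 = "s1_1.fq", "s1_2.fq"  # hypothetical paired-end reads for one sample

        def run(cmd, out_path):
            # Stream each command's stdout straight to disk instead of holding it in RAM.
            with open(out_path, "wb") as out:
                subprocess.run(cmd, stdout=out, check=True)

        # Align each end separately, then pair the alignments into one SAM file.
        run(["bwa", "aln", "-t", "4", REF, R1], "s1_1.sai")
        run(["bwa", "aln", "-t", "4", REF, R2], "s1_2.sai")
        run(["bwa", "sampe", REF, "s1_1.sai", "s1_2.sai", R1, R2], "s1.sam")

        The ~3 GB is mostly the in-memory index of the human reference, so it is paid once per concurrent bwa job, not per thread.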



        • #5
          Hi, thank you for the replies.

          At the current time, only TWO people will be using the cluster. Most likely there will be a few jobs running at the same time; I assume no more than 6 at a time.
          Regards
          Quantrix



          • #6
            My main compute cluster uses BlueArc. It handles anything I throw at it -- I have no qualms about simultaneously running 30 jobs accessing the same large datasets. My secondary compute cluster has Sun "Thumpers" -- big and slow. Running more than one job causes a noticeable slowdown and screams from my sysadmin. So if you have the money, I suggest BlueArc or a similar solution. I/O is a major concern and much harder to correct than limits in CPU or memory power.

            As for the rest of the hardware, I agree that the memory seems excessive. I can get by with 96 GB. On the other hand, it depends on what software you run and what comparisons you are doing. 24-CPU boxes are OK, but be aware that some software simply won't scale very well across many CPUs.



            • #7
              We have been using Rocks for a long time and it's working fine.



              • #8
                Hi Westerman,
                Thanks for your reply. I was interested in the BlueArc solution too. What is the size of the database you have? Does it scale well in terms of size? Do they provide specialized tools for database administration? Any security issues? Does it play well with Linux? To start with, we are looking at 6-7 TB of data, but that might scale to a couple of hundred TB in the next 3-4 years.

                Any suggestions for a competitor company?



                • #9
                  Hi Mapper,
                  Thanks for the reply. I was interested in the Rocks solution too. However, there is a belief that managing a Rocks cluster is not easy; i.e., if something breaks, good luck trying to find what caused it. Having said that, how easy do you find it to install and manage NGS software on a Rocks cluster?



                  • #10
                    I must agree with stefanoberri. I recommend using SSDs for the data you are actively analyzing on your cluster, and then storing it on cheaper HDs. ;-)



                    • #11
                      Re: database.

                      We only store our meta-information in the database. We do not store the actual raw sequences or results (e.g., bam files) in the DB. Traditional SQL-based databases are not optimized to hold a relatively low number of large files. Especially since most analysis programs do not deal directly with a DB, it is easier to store and work with the files outside of it. On the other hand, we may be unusual in this regard. If you want more opinions on this matter, I suggest starting a new post with the single question of what people use a DB for.

                      Thus the answers to your DB questions are "size is small (MBs)" and "we use MySQL -- simple, easy and cheap -- for the metadata".
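
                      To make "metadata only" concrete, a table like the one below is roughly all we keep (a self-contained sketch using Python's built-in sqlite3 rather than MySQL; the column names are hypothetical):

                      import sqlite3

                      # Sketch only: the real server is MySQL, but the schema idea is identical.
                      db = sqlite3.connect("metadata.db")
                      db.execute("""
                          CREATE TABLE IF NOT EXISTS sample (
                              sample_id  TEXT PRIMARY KEY,  -- e.g. "EX0042"
                              platform   TEXT,              -- sequencing platform
                              run_date   TEXT,              -- ISO date of the run
                              fastq_path TEXT,              -- where the raw reads live on the file server
                              bam_path   TEXT               -- where the alignments live (NULL until aligned)
                          )
                      """)
                      db.execute("INSERT OR REPLACE INTO sample VALUES (?, ?, ?, ?, ?)",
                                 ("EX0042", "Illumina", "2011-03-01", "/bluearc/raw/EX0042.fastq", None))
                      db.commit()

                      The DB stays in the MB range because each row is just a handful of strings pointing at files that live on the ordinary file system.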

                      No suggestion for a competitor company to BlueArc. I am sure there are some but I have not looked lately. The home-grown idea of "SSDs as primary and cheap HDs as secondary" also has merit. We may be trying this on our secondary compute cluster. I still have doubts about this method (at least for us) since our secondary network is limited to 1Gbps. But at least it will be a fairly cheap solution.



                      • #12
                        We see what Westerman said earlier: disk I/O trouble. Our solution is access to the university's cluster, which we share with other research groups (astronomy, protein folding and more).
                        When multiple jobs are accessing the same (large) datasets, the jobs slow down.
                        When there is a lot of reading and writing of large amounts of data on the fast storage device, everything slows down (try waiting 30 secs for ls :P).
                        When users who don't know what PBS is run heavy jobs on the login node, everybody gets agitated :P (see the sketch below).
                        So I have two points:
                        1. If we are careful not to run too many exomes simultaneously, this cluster's resources are more than enough. If we get too enthusiastic, I/O is the bottleneck.
                        2. Is it wise in your specific case to set up your own cluster, or to buy your way into an existing one?
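
                        On that PBS gripe: heavy work belongs in a scheduler job, not on the login node. A minimal sketch, assuming a PBS/Torque setup (the align_sample.py script and the resource numbers are hypothetical):

                        import subprocess

                        # Ask the scheduler for 4 cores and 8 GB on one compute node,
                        # instead of running the job on the shared login node.
                        job_script = """#!/bin/bash
                        #PBS -N align_EX0042
                        #PBS -l nodes=1:ppn=4
                        #PBS -l mem=8gb
                        #PBS -l walltime=12:00:00
                        cd $PBS_O_WORKDIR
                        python align_sample.py EX0042
                        """

                        with open("align_EX0042.pbs", "w") as fh:
                            fh.write(job_script)
                        subprocess.run(["qsub", "align_EX0042.pbs"], check=True)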
                        Chrz,
                        Bruins



                        • #13
                          Originally posted by westerman View Post
                          Re: database.

                          We only store our meta-information in the database. We do not store the actual raw sequences or results (e.g., bam files) in the DB. The home-grown idea of "SSDs as primary and cheap HDs as secondary" also has merit.
                          Thanks a lot Westerman! That is helpful.

                          The idea of using a SQL database to store metadata makes perfect sense, and I think it is the right solution. However, the fact that you need to store your metadata tells me that you probably have a very large dataset.

                          So my next question to you is: do you store your raw data in an unstructured format on the BlueArc storage?

                          I would imagine you are using the MapReduce paradigm for analyzing the data. Do you use Hadoop?

                          I am considering the idea of an SSD too. However, most of the commercial vendors I see on the market provide SSDs of no more than 160 GB. I am wondering if this would be a bottleneck for me in the future?

                          What is the opinion of the group on using a 160 GB SSD ONLY for data analysis? I.e., the data is temporarily migrated to the server containing the SSD, the analysis is done, and the results and raw data are then dumped onto the BlueArc solution (something like the sketch below). Is that a viable pipeline?
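
                          A rough sketch of the staging loop I am imagining (Python; all paths are hypothetical, and run_analysis stands in for whatever the real pipeline turns out to be):

                          import shutil
                          import subprocess
                          from pathlib import Path

                          BLUEARC = Path("/bluearc")      # hypothetical mount of the bulk storage
                          SCRATCH = Path("/ssd/scratch")  # hypothetical 160 GB SSD on the compute node

                          def process_sample(sample_id):
                              work = SCRATCH / sample_id
                              work.mkdir(parents=True, exist_ok=True)
                              try:
                                  # 1. Stage the raw reads from bulk storage onto the fast SSD.
                                  shutil.copy(BLUEARC / "raw" / (sample_id + ".fastq"), work)
                                  # 2. Run the analysis against the SSD copy (stand-in command).
                                  subprocess.run(["run_analysis", str(work / (sample_id + ".fastq"))], check=True)
                                  # 3. Push the results back to the bulk storage.
                                  for result in work.glob("*.bam"):
                                      shutil.copy(result, BLUEARC / "results")
                              finally:
                                  # 4. Free the SSD for the next sample.
                                  shutil.rmtree(work)

                          At ~10 GB per sample and a few samples a day, a 160 GB scratch disk would not be the limiting factor; the copies to and from BlueArc would be.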

                          My problem overall is not the size of the exome raw data itself; I compute that to be relatively small, ~10 GB per sample. What is going to get me is the numbers. I envision hundreds of samples coming my way, which I WILL need to retain in one form or another. That is the problem.

                          Will look forward to more of your insights Westerman. Thank you!



                          • #14
                            Originally posted by Bruins View Post
                            We see what Westerman said earlier: disk I/O trouble. Our solution is access to the university's cluster, which we share with other research groups (astronomy, protein folding and more).
                            When multiple jobs are accessing the same (large) datasets, the jobs slow down.
                            When there is a lot of reading and writing of large amounts of data on the fast storage device, everything slows down (try waiting 30 secs for ls :P).
                            When users who don't know what PBS is run heavy jobs on the login node, everybody gets agitated :P
                            So I have two points:
                            1. If we are careful not to run too many exomes simultaneously, this cluster's resources are more than enough. If we get too enthusiastic, I/O is the bottleneck.
                            2. Is it wise in your specific case to set up your own cluster, or to buy your way into an existing one?
                            Chrz,
                            Bruins
                            Thanks, Bruins, for the reply. As I mentioned above, in my specific case I have no alternative to setting up a cluster, since the data NEEDS to stay within the firewall.

                            For now, we will have exclusive access to the cluster (whichever one we build), which means I decide how many jobs run on it. Also, the throughput is not that huge in the short term; i.e., I will need to run no more than 3-4 samples a day. BUT I will need to run those 3-4 samples a day for a LONGGG time (job security, thank you very much!). So the timescales are important, which is where the database issues crop up as well as the computing issues.

                            30 seconds for ls?????????? ha ha ha, I'd shoot myself and quit. Or rather the other way around.



                            • #15
                              Well, installing a Rocks cluster is about as easy as installing an OS on a standalone machine (I'd guess 10% more effort is required). Configuration and setup take a few hours (2-3) the first time you do it, but I would say it's not difficult.

                              Rocks has a community behind it and they provide very good support.

                              All you need to take care of when setting up Rocks for NGS is the accessibility of data to all nodes.

                              Do you have any specific things in mind regarding Rocks?
