  • SOLiD from an IT perspective

    Hello,

    I'm a system admin for a handful of SOLiDs and I'm curious to know what other people are doing about IT-related issues with SOLiD. We've tried a few novel things in our environment with some success, but it'd be nice to hear from others who are managing these instruments.

    We currently have all our machines write directly to a central NFS server (a high-performance clustered storage solution) rather than copying data after primary analysis completes. A side effect is that our instruments can access results for any run they have ever performed (with the noted exception that run data in the instrument database was not preserved during the 2 -> 3 upgrade); whether that is good or bad remains to be determined. This also allows us to recover the results disk space and enlarge the images volume on the instrument, which is nice.
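
    To make that concrete, below is a minimal sketch of the kind of pre-run sanity check I have in mind for the central results volume; the mount point and free-space threshold are placeholders for our local setup, not anything SOLiD ships.

    Code:
    #!/usr/bin/env python
    # Sketch: refuse to start a run if the central NFS results volume is
    # missing, read-only, or low on space. Paths and thresholds are placeholders.
    import os
    import sys

    RESULTS_MOUNT = "/mnt/central/results"   # central NFS results volume (placeholder)
    MIN_FREE_TB = 2.0                        # site-specific threshold

    def free_tb(path):
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize / 1e12

    if not (os.path.ismount(RESULTS_MOUNT) and os.access(RESULTS_MOUNT, os.W_OK)):
        sys.exit("central results volume is not mounted or not writable")
    if free_tb(RESULTS_MOUNT) < MIN_FREE_TB:
        sys.exit("less than %.1f TB free on central results volume" % MIN_FREE_TB)
    print("central results volume OK, %.1f TB free" % free_tb(RESULTS_MOUNT))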

    Secondary analysis on instrument has been a challenge. We've made attempts at using Global SETS to move it off instrument with little success. We've played with increasing the core count on an instrument by adding nodes (via a VLAN to our data center) and that seems promising (a 35x2 run against the full human genome completes in ~5 days with 104 cores). All real analysis has been done on our existing compute farm and infrastructure using Corona Lite.

    We've considered using the VLAN approach to move all compute nodes off instrument to help address heat issues in the lab where these reside.

    Any feedback would be appreciated. We are doing things in a non-standard way in an attempt to make the instruments more manageable. It'd be nice if an instrument could notify an external service when primary analysis was complete, for instance (a rough sketch of what I mean is below). If anyone else has had luck making SOLiD more automated, manageable and scalable, I'd love to hear what you are doing.
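
    The completion marker file and the notification URL here are made up; the real trigger would be whatever the instrument software actually leaves behind when primary analysis finishes.

    Code:
    #!/usr/bin/env python
    # Sketch: poll a run's results directory on the central NFS volume and
    # notify an external service when primary analysis completes.
    # The marker filename and notify URL are hypothetical placeholders.
    import os
    import time
    import urllib2   # Python 2 era; use urllib.request on Python 3

    RUN_DIR = "/mnt/central/results/solid0123_20090601_FC1"   # placeholder
    MARKER = "primary.complete"                                # hypothetical marker
    NOTIFY_URL = "http://lims.example.org/notify?run=solid0123_20090601_FC1"

    while not os.path.exists(os.path.join(RUN_DIR, MARKER)):
        time.sleep(300)   # poll every five minutes

    urllib2.urlopen(NOTIFY_URL)   # tell the LIMS/queue the run is ready
    print("primary analysis complete, notification sent")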

    Thanks,

    griznog

  • #2
    No good feedback here, but I concur:

    If anyone else has had luck making SOLiD more automated, manageable and scalable I'd love to hear what you are doing.
    We have only one SOLiD and thus do not have the problems that griznog has. Nevertheless, I find the lack of automation irritating, as well as the lack of scalability.

    • #3
      Unfortunately not much feedback here either, but I am interested in how you connect these machines together.

      Originally posted by griznog View Post
      We currently have all our machines write directly to a central NFS server (a high-performance clustered storage solution) rather than copying data after primary analysis completes.
      What is the network speed of the connection you use to connect your SOLiDs to this central NFS server? The images are acquired in Windows, so do yours write to a Samba share on the onboard cluster which maps to an NFS mount?

      Unfortunately, at the moment the network speeds available at our site make dumping the images directly to our data centre via NFS unfeasible.

      • #4
        Originally posted by OneManArmy View Post
        What is the network speed of the connection you use to connect your SOLiDs to this central NFS server? The images are acquired in Windows, so do yours write to a Samba share on the onboard cluster which maps to an NFS mount?

        Unfortunately, at the moment the network speeds available at our site make dumping the images directly to our data centre via NFS unfeasible.
        Each SOLiD has a 1 Gbps uplink to an aggregate switch, which then has a 10 Gbps connection to storage (via about 3 switch hops and one router hop). It's not ideal for latency, but performance seems reasonable, and in simple benchmarks the central storage was at least as good as the head node storage for single clients and vastly better for multiple clients. Note that we were only using this for results. I have used it for images on one instrument when a failure in the MD1000 left us without an images directory for a few days, but I don't consider that a good test of central images storage because of the short duration of usage.
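
        By "simple benchmarks" I mean nothing fancier than the sketch below, run first from a single client and then from several clients at once; the paths are placeholders for our local head-node and central mount points.

        Code:
        #!/usr/bin/env python
        # Sketch: crude sequential-write benchmark comparing local head-node
        # disk with the central NFS mount. Paths and sizes are placeholders.
        import os
        import time

        def write_mb_per_s(path, size_gb=4, block_mb=8):
            """Write size_gb of zeros to path in block_mb chunks, return MB/s."""
            block = b"\0" * (block_mb * 1024 * 1024)
            nblocks = size_gb * 1024 // block_mb
            start = time.time()
            fh = open(path, "wb")
            for _ in range(nblocks):
                fh.write(block)
            fh.flush()
            os.fsync(fh.fileno())        # make sure the data really hit the server
            fh.close()
            elapsed = time.time() - start
            os.unlink(path)
            return size_gb * 1024 / elapsed

        for label, path in [("head node", "/data/results/bench.tmp"),
                            ("central NFS", "/mnt/central/results/bench.tmp")]:
            print("%-12s %.0f MB/s" % (label, write_mb_per_s(path)))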

        Since posting this thread we've had some very good interaction with ABI, and the roadmap for v3.5 of the instrument software appears to address many of our issues, so once we upgrade to 3.5 later this year we'll revert to the export model rather than using central NFS. Given the roadmap shown to us, I would withhold recommending central NFS storage until we've seen how well the new software handles exporting.

        griznog

        • #5
          Originally posted by griznog View Post
          Each SOLiD has a 1 Gbps uplink to an aggregate switch, which then has a 10 Gbps connection to storage (via about 3 switch hops and one router hop). It's not ideal for latency, but performance seems reasonable, and in simple benchmarks the central storage was at least as good as the head node storage for single clients and vastly better for multiple clients. Note that we were only using this for results. I have used it for images on one instrument when a failure in the MD1000 left us without an images directory for a few days, but I don't consider that a good test of central images storage because of the short duration of usage.

          Since posting this thread we've had some very good interaction with ABI, and the roadmap for v3.5 of the instrument software appears to address many of our issues, so once we upgrade to 3.5 later this year we'll revert to the export model rather than using central NFS. Given the roadmap shown to us, I would withhold recommending central NFS storage until we've seen how well the new software handles exporting.

          griznog
          We rely on copying the primary data (after color calling) over to NFS volumes, which allows us to have lots of cheap storage. The most recent runs are then stored on a fast distributed file system (Lustre) while alignment, variant calling, structural variant detection, and all other downstream analysis are completed. We then copy all the results and intermediate files that need to be archived back to the NFS servers. A lot of this is "human automated", whereby a person has to initiate the transfer, the secondary analysis, and the final archiving.

          I would love to hear about any successes with using some type of workflow system (Kepler, etc.) to automate not only SOLiDs but also other NGS technology, since the big problem for us is having a mix of technologies (and workflows/applications) that are constantly being developed/updated. The kind of glue we keep meaning to write is sketched below.
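
          Every path and command in this sketch is a placeholder for our site-local tools rather than anything AB ships; it just strings together the three hand-offs that are currently triggered by hand.

          Code:
          #!/usr/bin/env python
          # Sketch: stage primary data to Lustre, run secondary analysis, then
          # archive results back to NFS. All paths and commands are placeholders.
          import subprocess
          import sys

          RUN = sys.argv[1]                      # e.g. a run/flowcell identifier
          NFS = "/archive/solid/%s" % RUN        # cheap NFS archive (placeholder)
          LUSTRE = "/lustre/scratch/%s" % RUN    # fast scratch space (placeholder)

          steps = [
              ["rsync", "-a", NFS + "/primary/", LUSTRE + "/primary/"],   # stage in
              ["run_secondary_analysis.sh", LUSTRE],                      # hypothetical wrapper
              ["rsync", "-a", LUSTRE + "/results/", NFS + "/results/"],   # archive back
          ]

          for cmd in steps:
              print("running: %s" % " ".join(cmd))
              if subprocess.call(cmd) != 0:
                  sys.exit("step failed: %s" % cmd[0])
          print("run %s archived" % RUN)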

          • #6
            This is somewhat related to the above. I am with PSSC Labs (www.pssclabs.com). We are working to develop a SOLiD Offline Cluster. All of the information provided above is great. It gives me a much better understanding of the computing needs of the cluster than any of my discussions with AB.

            I have a few questions. Do any of you have experience running any AB-developed applications over InfiniBand or other high-speed network interconnects?

            Is there a maximum number of cores where the AB software will no longer scale? Or is the performance gain from adding more nodes negligible?

            Thank you

            • #7
              Originally posted by pssclabs View Post

              Is there a maximum number of cores where the AB software will no longer scale?
              There are a handful of ABI software packages out there -- e.g., Mapping, SNP calling, Transcriptome -- which mostly stand alone, although they may share programs.

              If we consider the first of these -- Mapping -- then there is a maximum number of cores. Basically, the mapping program is broken down into six sub-programs:

              1) Map the read file to each chromosome. The natural core limit on this is the number of chromosomes.

              2) Collect the map information into one overall file -- limit of 1 core.

              3) Do a per-chromosome re-mapping for the optimal matches.

              4-6) Gather back the mapping into one overall file with statistics and an index.

              Overall it is rather inefficient. Some of the other ABI programs do seem to take the number of cores into account. One could also imagine splitting the read file into parts and mapping those parts against the chromosomes in parallel, as in the toy sketch below.
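
              Splitting the reads as well as the chromosomes turns the job list into a chunks-by-chromosomes grid, so the usable core count is no longer capped at the number of chromosomes. In this toy sketch, map_chunk is a stand-in for the real mapping program, not anything ABI ships.

              Code:
              # Toy sketch: map every (read chunk, chromosome) pair independently,
              # giving N_CHUNKS x len(CHROMOSOMES) tasks instead of one per chromosome.
              from multiprocessing import Pool

              N_CHUNKS = 16
              CHROMOSOMES = ["chr%d" % i for i in range(1, 23)] + ["chrX", "chrY"]

              def map_chunk(job):
                  chunk, chrom = job
                  # here one would exec the real mapper on reads.part<chunk> vs <chrom>
                  return (chunk, chrom)

              if __name__ == "__main__":
                  jobs = [(c, chrom) for c in range(N_CHUNKS) for chrom in CHROMOSOMES]
                  pool = Pool()                     # one worker per core by default
                  done = pool.map(map_chunk, jobs)  # 16 x 24 = 384 independent tasks
                  print("%d mapping tasks finished" % len(done))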

              New AB software is due out "soon". Maybe it will be more efficient.

              • #8
                Interesting info! Especially the NFS bit.

                How about cost-effective solutions for analysis?
                I am trying to build an offline cluster with the minimum specs needed to do the analysis. I am thinking not all labs have the budget for a cluster computer that just collects dust when they are done with the analysis.

                What's the lowest-spec machine that a SOLiD user has managed to get away with?
                Has anyone done any benchmarking?
                http://kevin-gattaca.blogspot.com/

                • #9
                  Originally posted by KevinLam View Post
                  Interesting info! Especially the NFS bit.

                  How about cost-effective solutions for analysis?
                  I am trying to build an offline cluster with the minimum specs needed to do the analysis. I am thinking not all labs have the budget for a cluster computer that just collects dust when they are done with the analysis.

                  What's the lowest-spec machine that a SOLiD user has managed to get away with?
                  Has anyone done any benchmarking?
                  I doubt if anyone will bother benchmarking the lowest machine since such a task would be boring and, IMHO, not much use. Basically just grab an x86-64-based computer with 12 GB of memory and 500 GB of disk space. About $2500 from Dell. That would work. Might be slow. Might run out of disk space eventually. But if you want to low-ball it, the above should be OK.

                  Or, if you want to go high-ball, share $100,000+ machines with other people. This is what we do.

                  Seriously, you really should set a budget and then buy within that. That is generally the best bet when purchasing computer equipment.

                  • #10
                    Originally posted by westerman View Post
                    I doubt if anyone will bother benchmarking the lowest machine since such a task would be boring and, IMHO, not much use. Basically just grab an x86-64-based computer with 12 GB of memory and 500 GB of disk space. About $2500 from Dell. That would work. Might be slow. Might run out of disk space eventually. But if you want to low-ball it, the above should be OK.

                    Or, if you want to go high-ball, share $100,000+ machines with other people. This is what we do.

                    Seriously, you really should set a budget and then buy within that. That is generally the best bet when purchasing computer equipment.
                    Actually, I think benchmarking cost-effective machines can be very exciting!
                    Oftentimes when you have a super HPC you think less about algorithm speedups.

                    Anyway, I managed to find this desktop benchmark for de novo assembly by CLC bio:
                    http://www.clcngs.com/2009/11/new-be...ovo-assembler/
                    http://kevin-gattaca.blogspot.com/
