  • slny
    Member
    • Mar 2011
    • 54

    computation power requirement for sequencing analysis

    Hi,

    Our facility may get some funding from our institute to purchase new hardware for sequencing data analysis.

    What are the minimum hardware requirements for reasonably efficient analysis of sequencing data?

    Thanks,
    Slny
  • polyatail
    Member
    • Dec 2010
    • 25

    #2
    If most of what you're doing is aligning reads back to a reference genome, a heavily multithreaded sequence aligner will run fastest with many cores. If you're doing genome assembly, most of the algorithms don't have much room for parallelization, so you'd probably gain more from a high clock speed.

    These days, a decent workstation with 4-8 cores and 1-2 GB RAM per core can do almost everything. If you need more than that, you probably have a specific purpose in mind and specific hardware requirements.

    • DZhang
      Senior Member
      • Jun 2010
      • 177

      #3
      Hi,

      Polyatail has good points there. What I want to add is that some applications (e.g., alignment) are CPU-intensive (so faster and more cores help a lot); other applications (e.g., de novo assembly) are memory-intensive: if you do not have enough memory, the program will not run. To assemble a mammalian genome, you may need over 100 GB of memory.

      Douglas

      • westerman
        Rick Westerman
        • Jun 2008
        • 1104

        #4
        Memory is a key point. You can always wait twice as long for a job to complete because you have one-half of the processors you need. But if you have half of the memory you need then your job will never complete. If you have to make a choice then go for the larger memory.


        Disk speed (and, if you are using a cluster, interconnect speed) is also important. Bioinformatics uses large data sets, so skimping on I/O speed will slow your programs down. Two-tier storage is a good idea: a fast but small tier for the actual computation and a slow but large tier for long-term storage.
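
        A minimal sketch of the two-tier idea, assuming the fast tier is just a directory on fast local disk and the slow tier is a directory on bulk storage. The function name, its parameters and the `analyze` callback are hypothetical and only for illustration; a real pipeline would add error handling and checksums.

```python
import shutil
import tempfile
from pathlib import Path


def run_on_fast_scratch(input_file, archive_dir, scratch_root, analyze):
    """Stage input onto fast scratch, run `analyze` there, archive the results.

    `analyze(staged_input, work_dir)` is a hypothetical callback that writes
    its output files into `work_dir` and returns their paths.
    """
    input_file = Path(input_file)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # The working directory lives on the fast tier and is deleted on exit.
    with tempfile.TemporaryDirectory(dir=scratch_root) as work:
        work = Path(work)
        staged = work / input_file.name
        shutil.copy2(input_file, staged)      # slow tier -> fast tier

        outputs = analyze(staged, work)       # heavy I/O stays on the fast tier

        archived = []
        for out in outputs:
            dest = archive_dir / Path(out).name
            shutil.copy2(out, dest)           # fast tier -> slow tier
            archived.append(dest)
    return archived
```

        Only the final results travel back to the slow tier; intermediate and temp files never leave the fast scratch space.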


        At Purdue we share computing resources with many other disciplines. I am always telling the pure computational people that their needs are not matching mine. We just had a planning meeting for our next shared compute cluster. Most people seem to be holding out for the 48 core, 96 GB per node machines but I would prefer the 24 core, 192 GB per node machines at about the same price. Those extra cores are not going to do me as much good as the extra memory.
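
        To put rough numbers on that trade-off, here is a small sketch that counts how many identical jobs fit on each node type. The two node configurations are the ones mentioned above; the job profiles (threads and GB of RAM per job) are made-up assumptions for illustration, not measurements.

```python
def jobs_per_node(cores, mem_gb, job_threads, job_mem_gb):
    """Number of identical jobs that fit on one node, limited by cores or memory."""
    by_cores = cores // job_threads
    by_memory = int(mem_gb // job_mem_gb)
    return min(by_cores, by_memory)


# Node types from the discussion: (cores, GB of RAM)
nodes = {"48 cores / 96 GB": (48, 96), "24 cores / 192 GB": (24, 192)}

# Hypothetical job profiles: (threads, GB of RAM). Illustrative values only.
jobs = {"read alignment": (4, 8), "de novo assembly": (8, 120)}

for node_name, (cores, mem) in nodes.items():
    print(f"{node_name}: {mem / cores:.0f} GB per core")
    for job_name, (threads, job_mem) in jobs.items():
        n = jobs_per_node(cores, mem, threads, job_mem)
        print(f"  {job_name}: {n} concurrent job(s)")
```

        Under these assumed profiles the 48-core node runs more alignments at once, but the 120 GB assembly job does not fit on it at all, while the 192 GB node can run one.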

        • polyatail
          Member
          • Dec 2010
          • 25

          #5
          Couldn't agree more. Especially if your storage is lacking, and those 48 cores are pushing out a lot of data (e.g. sequence alignment, BAM files), IO will be a problem. With 192 GB, you'd even have the option of using a ramfs for scratch. Personally though, I'm holding out for the 1024 core, 16 TB per node machines. All signs seem to suggest that I'll be holding out for a while.
          Last edited by polyatail; 06-03-2011, 07:55 AM. Reason: you're -> your (don't laugh, it happens)

          • mbblack
            Senior Member
            • Aug 2009
            • 245

            #6
            It depends on what you mean by analyzing sequence data. Are you starting with raw reads, or are you only performing tertiary analysis with a core facility doing the initial heavy lifting for you? In other words, which part of the pipeline from raw read files to finished data are you looking at? And how much data are you handling at any one time (do you need to run multiple analyses simultaneously)?

            For working with raw read output, we went with one of the Penguin clusters pre-configured for use with our ABI SOLiD system (but Penguin makes clusters for any sort of use or configuration). The "base" machine from Penguin for ABI data is a Scyld Beowulf 5-node cluster (head node + 4 compute nodes). Each node has a pair of 4-core Xeons and 24 GB RAM. The whole cluster shares storage space on a ~26 TB RAID 5 array (i.e. for data storage, scratch and temp files). Thus far, it's proving to be a decent little pre-packaged cluster.

            For end-point analyses like differential sequence determination, I also have an R/Bioconductor and Partek machine with dual 4-core AMD CPUs and 32 GB RAM (your basic off-the-shelf small Dell server).

            Don't neglect file storage needs: whatever you get will need a decent amount of disk space, both to keep data files and for temp and working files.
            Michael Black, Ph.D.
            ScitoVation LLC. RTP, N.C.

            • MadsAlbertsen
              Member
              • Aug 2010
              • 26

              #7
              I would go with a small/medium sized cluster and then go for large jobs in the cloud instead.

              From my point of view it is way too expensive to size the hardware for peak requirements. At least in my department, we cannot afford clusters where we only use 1-5% of the capacity on a regular basis.

              rgds
              Mads

              • mbblack
                Senior Member
                • Aug 2009
                • 245

                #8
                Cloud is an option to consider. Just be sure of your available bandwidth for what you plan to move back and forth. I know for the setup at my Institute, we realized that we (currently at least) simply do not have a "fat enough" pipe to the outside world.
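
                A back-of-the-envelope way to check whether the pipe is fat enough is to estimate the transfer time for a typical data set. This is only a sketch: the function name and the 70% link efficiency default are assumptions, and real throughput depends on the protocol, the storage at both ends and competing traffic.

```python
def transfer_hours(size_gb, link_mbps, efficiency=0.7):
    """Estimated hours to move size_gb gigabytes over a link_mbps megabit/s link.

    `efficiency` is an assumed fraction of the nominal bandwidth actually achieved.
    """
    size_megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 3600


# Hypothetical example: a 500 GB run over a 100 Mbit/s and a 1 Gbit/s uplink.
for mbps in (100, 1000):
    print(f"{mbps} Mbit/s: {transfer_hours(500, mbps):.1f} h")
```

                With these assumed figures, 500 GB takes roughly 16 hours at 100 Mbit/s, and that is before the results have to come back.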
                Michael Black, Ph.D.
                ScitoVation LLC. RTP, N.C.
