  • De novo assembly system resources

    Hi,

    Hope someone can help me out with an IT/Systems question.

    I currently process fastq files using Trinity for assembly, and this takes roughly 4 hours per sample. I have noticed that throughout this time CPU use is almost 100%, whilst RAM usage maxes out at around 70%.

    I am using a standalone workstation with two six-core processors and 96 GB of RAM. I currently have access to 5 of these and they are all used independently. This is the system I inherited from my predecessor, so I am open to change should it increase throughput.

    My question is....

    Would creating a small Beowulf-style cluster from four of the workstations allow increased system resources and perhaps speed up my assembly and processing time?

    I am not overly familiar with the IT infrastructure side of this, so any advice would be appreciated.

    Thanks in advance.

  • #2
    I wouldn't have thought so. You require all the reads to assemble the genome, so splitting this across a cluster without a shared/distributed memory model doesn't fit the assembly paradigm, which is why most people use a big box with lots of RAM.

    See:



    • #3
      Hi Bukowski,

      Thanks for your reply.

      If we were to cluster the machines and apply a shared/distributed memory model, would I be likely to see an increase in processing speed due to the higher memory and number of available cores?

      Sorry if this is a naive question, but I need to find a way of increasing throughput if at all possible. I appreciate the advice.



      • #4
        It sounds like your best bet is just doing things in an embarrassingly parallel manner, which is what you're currently doing. I may have misinterpreted your original request, but the short answer is no.

        If you build a cluster, you get a job scheduler, and the best thing about that is that you stop having to worry about manually managing the jobs: when one finishes on one machine, the scheduler just starts the next one in the queue. That's the real benefit of building a cluster from your machines.
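
        To make the queue idea concrete, here is a minimal sketch of the same thing on a single workstation: run one assembly at a time and start the next sample automatically when the previous one finishes. The Trinity options and file names are assumptions based on a typical paired-end command line; substitute your own known-good command.

```python
# Minimal work-queue sketch: one Trinity run at a time per workstation,
# starting the next sample as soon as the previous one finishes.
# The Trinity flags and file names below are assumptions -- replace them
# with your own working command line.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SAMPLES = ["sampleA", "sampleB", "sampleC"]  # hypothetical sample names

def run_trinity(sample):
    cmd = [
        "Trinity",
        "--seqType", "fq",
        "--left",  f"{sample}_R1.fastq.gz",
        "--right", f"{sample}_R2.fastq.gz",
        "--CPU", "12",           # both six-core CPUs
        "--max_memory", "90G",   # leave headroom below the 96 GB total
        "--output", f"trinity_{sample}",
    ]
    return sample, subprocess.run(cmd).returncode

# max_workers=1 means one assembly at a time; the pool acts as the "queue".
with ThreadPoolExecutor(max_workers=1) as pool:
    for sample, rc in pool.map(run_trinity, SAMPLES):
        print(f"{sample} finished with exit code {rc}")
```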

        I also didn't spot that you were using Trinity, so I'm going to assume you're doing transcriptome assemblies. Trinity is already using the resources in the machine efficiently, so the run time you see is just the run time. Provided it's not maxing out the memory, it matters not a jot if your CPU utilisation is high; all you care about in terms of performance is that it's not swapping out to disk.
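
        If you want to check that, a quick way is to watch swap activity while an assembly is running, for example with the third-party psutil package (an assumption on my part; any system monitor such as top or vmstat tells you the same thing):

```python
# Sample RAM and swap once a minute while Trinity runs; sustained growth in
# swap usage (or in bytes swapped in/out) is the warning sign -- high CPU is not.
# Requires the third-party psutil package (pip install psutil).
import time
import psutil

for _ in range(10):  # watch for ~10 minutes; stop early with Ctrl-C
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM used: {mem.percent:5.1f}%   "
          f"swap used: {swap.used / 2**30:6.2f} GiB   "
          f"swapped in/out since boot: {swap.sin} / {swap.sout} bytes")
    time.sleep(60)
```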

        Your process is CPU bound, not memory bound. The main thing a cluster with a shared memory architecture would buy you is more RAM, which doesn't solve your apparent issue, since that issue isn't to do with RAM.

        https://github.com/trinityrnaseq/tri...g-Requirements suggests you need 256 GB of RAM in a machine, but I don't know what organism you're working on or how many reads you have in a sample.
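
        As a very rough sizing sketch: the Trinity docs have, as far as I recall, suggested something on the order of 1 GB of RAM per million Illumina read pairs, so you can ballpark the box you would need from your read counts. Treat that figure as an assumption and check the current wiki for your Trinity version.

```python
# Back-of-the-envelope RAM estimate using the ~1 GB per ~1 million Illumina
# read pairs guideline attributed to the Trinity docs -- an assumption here,
# so verify the figure against the wiki for the Trinity version you run.
def estimate_trinity_ram_gb(read_pairs_millions: float, gb_per_million: float = 1.0) -> float:
    return read_pairs_millions * gb_per_million

for million_pairs in (25, 50, 100, 250):
    print(f"{million_pairs:>4} M read pairs  ->  ~{estimate_trinity_ram_gb(million_pairs):.0f} GB RAM")
```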

        You might want to look at end of run profiling:

        Trinity RNA-Seq de novo transcriptome assembly (the trinityrnaseq/trinityrnaseq repository on GitHub).


        This might give you more of an idea where the bottleneck is.
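
        If that profiling isn't convenient, a generic alternative is to log whole-machine CPU and memory use over the course of a run and see which phases dominate. A minimal sketch, again assuming psutil is available:

```python
# Log whole-machine CPU and RAM percentages once a minute to a CSV file for
# the duration of an assembly run; plot it afterwards to see where the time
# goes. Stop with Ctrl-C when the run is done. Requires psutil.
import csv
import time
import psutil

with open("resource_log.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["epoch_seconds", "cpu_percent", "ram_percent"])
    while True:
        writer.writerow([
            int(time.time()),
            psutil.cpu_percent(interval=1),   # averaged over a 1 s window
            psutil.virtual_memory().percent,
        ])
        fh.flush()
        time.sleep(59)
```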



        • #5
          Perfect.

          Thanks for the comprehensive and helpful response. It stops me wasting any more time looking into this.

          Thanks,
          Sanderson.
