  • Extreme parallelization for NGS analysis

    I'd like to start an open discussion on parallelization for NGS data. I noticed that Galaxy recently came out with a cloud-based interface using Amazon EC2. I've been trying to learn how NGS analysis algorithms (for alignment, assembly, etc.) are actually implemented in parallel, but I have had trouble finding specific documentation and resources describing how it works. Any direction or resources that people can provide would be much appreciated.

    Also, I have seen some papers describing parallelization of various specific algorithms, especially recently (such as PASQUAL from Georgia Tech), but they all seem to operate on relatively "small" networks of distributed computing resources. Does anyone have an idea of how far the parallelization and speed-up of these analyses can be pushed? How difficult would it be to implement something that runs on a distributed network of, say, 100,000 computers, or even a million? Is there a bottleneck somewhere that would prevent that from being feasible for NGS analysis? Or would it make the analyses amazingly fast compared to what's available now? I'm thinking of a system like the one the SETI project has set up for its distributed computing user base, and I'm wondering what the limits are and how one could implement such a system if the user base is already in place.

  • #2
    I realized after posting that people might point out that other threads already cover parallelization of specific NGS analysis algorithms, but I decided to leave my thread open-ended because, in the end, the system I have in mind should work for any and all current analysis and data-processing methods.



    • #3
      NGS analysis is mostly text processing (whether the data is binary or compressed doesn't matter), so I/O is the bottleneck, whether in-house or over the Internet.

      With SETI (or maybe Folding@Home), a small data file keeps the CPU busy for a while.

      Cloud (Amazon or whatever) is a business model: buy a large number of white-box servers and rent them out in one-hour units. It does not use fancy hardware, and it does not upgrade until the previous investment has paid for itself.

      So today's situation is like this:
      1. From a 4 TB hard drive, you can only get about 100 MB/s sequential read.
      2. You might have a PB-sized array in-house, but you only have a 1 Gb/s Internet connection to the world.
      3. This won't change for some years.
      4. LHC's infrastructure is the extreme/limit for now; anything they can't do or afford, no one can.
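
To put the first two figures in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes), reads "1Gb" as 1 gigabit per second, and assumes the peak rate is sustained with no protocol or seek overhead, so real numbers would be worse. The helper name `transfer_seconds` is made up for illustration.

```python
def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to move size_bytes at a sustained rate, ignoring all overhead."""
    return size_bytes / rate_bytes_per_s


# Decimal units (assumption): 1 MB = 1e6 bytes, 1 TB = 1e12, 1 PB = 1e15.
MB = 1e6
TB = 1e12
PB = 1e15

# Point 1: reading a full 4 TB drive at 100 MB/s sequential.
disk_read_s = transfer_seconds(4 * TB, 100 * MB)

# Point 2: pushing 1 PB through a 1 Gb/s link (bits -> bytes: divide by 8).
link_bytes_per_s = 1e9 / 8
upload_s = transfer_seconds(1 * PB, link_bytes_per_s)

print(f"Read 4 TB at 100 MB/s: {disk_read_s / 3600:.1f} hours")
print(f"Send 1 PB over 1 Gb/s: {upload_s / 86400:.1f} days")
```

Under these assumptions a single drive takes about 11 hours to read end to end, and a petabyte takes roughly three months to cross the Internet link, which is why adding more remote CPUs does not help until the data is already where the CPUs are.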



      • #4
        1. This can change now if you have $$$
        2. For eight SSDs in RAID0, you can get 2500MB/s sequential read
        3. InfiniBand for 300Gbps network



        • #5
          Originally posted by ymc:
          2. For eight SSDs in RAID0, you can get 2500MB/s sequential read
          No, no, that's not my point. I would rather say you can get 2500MB/s random read (maybe; I don't have these to play with).

          Originally posted by ymc:
          3. InfiniBand for 300Gbps network
          No, no, again: I was talking about the Internet connection. The thread is asking about the cloud (unless private cloud is also included in the discussion).



          • #6
            There are links here on deploying Galaxy in a cluster (and other things).



            We have this deployed on our cluster, and jobs are basically distributed to cluster nodes by the Sun Grid Engine.

            It's up to the tools themselves to do MPI, threading, etc.

            In a cloud setting, NGS data can get quite large, so storage may be an issue.
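
The pattern described above (a scheduler hands independent jobs to workers, then the results are combined) can be illustrated in miniature. This is only a sketch under stated assumptions: a local `multiprocessing.Pool` stands in for cluster nodes, `gc_count` is a made-up stand-in for a real per-read tool, and `chunked` and `scatter_gather` are hypothetical helper names. None of this is Galaxy or Sun Grid Engine code.

```python
from multiprocessing import Pool


def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def gc_count(reads):
    """Toy per-chunk job (assumption): count G and C bases in a list of reads."""
    return sum(read.count("G") + read.count("C") for read in reads)


def scatter_gather(reads, chunk_size=2, workers=2):
    """Split reads into chunks, process chunks in worker processes, sum results."""
    chunks = list(chunked(reads, chunk_size))
    with Pool(processes=workers) as pool:
        partial_counts = pool.map(gc_count, chunks)  # one job per chunk
    return sum(partial_counts)


if __name__ == "__main__":
    example_reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT"]
    print(scatter_gather(example_reads))
```

This only works this cleanly because each chunk is independent. Every chunk still has to be read from storage and shipped to its worker, so the I/O limits raised earlier in the thread apply to the scatter step no matter how many workers are available.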
