Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Introducing pBWA [Parallel BWA]

    EDIT (July 5th, 2011): An alternate version of pBWA is now available that cleans up the workflow a bit. The user is no longer required to enter the number of reads in the FASTQ file, and SAM information is output to one file in parallel by all processors. There are also a few minor stability enhancements that should make pBWA compatible with MPICH. Performance appears to be similar to pBWA-r32. Thanks go to Rob Egan for the enhancements.

    For my master's thesis in computer science, I developed a parallel version of BWA based on the OpenMPI library, called pBWA. pBWA retains and improves upon the multithreading provided by BWA while adding efficient parallelization for its core alignment functions [aln, sampe, samse]. The wall-time speedup of pBWA is bounded only by the size of the parallel system as it can run on any number of nodes and/or cores simultaneously. With suitable computer systems, pBWA can align billions of sequence reads within hours, more efficiently facilitating the analysis of new generations of NGS data.

    Note that the improvements pBWA makes for the multithreading have been shown to Heng Li and will probably be implemented in a future release of BWA.

    I have successfully tested pBWA on a couple systems, namely the SHARCNET (www.sharcnet.ca) and a school server with the most basic OpenMPI install.

    If you have access to a cluster or parallel machine, you may want to give pBWA a try. Due to the nature of parallel computing, the optimal number of nodes/threads used will vary greatly depending on things like RAM and interconnect speeds.

    pBWA can be obtained by visiting
    Download Parallel BWA [pBWA] for free. Parallel implementation of BWA (http://bio-bwa.sourceforge.net/) using the OpenMPI library.


    A manual page is located at


    Thanks for your time!
    Last edited by dp05yk; 07-05-2011, 06:08 AM. Reason: Alternate version now available

  • #2
    Making this sticky, since a lot of users should find this useful and interesting!

    Comment


    • #3
      Thanks! I know there have been a few posts lately asking how parallel computing can work with NGS and so I thought I'd post this. There may yet be some kinks in the code and it's not published yet (still shopping for a suitable journal) but if anyone is interested in the methods I'll post a basic workflow behind the algorithm. The sourceforge page goes into a little detail for best use but not too much.

      As for the improvements to multithreading, I noticed that BWA introduces too much thread competition, which isn't a problem for small amounts of threads, but when I was going up to 24 threads I noticed that pBWA was actually faster for 24 parallel processes than BWA was for 24 threads which surprised me. I changed the way BWA handles multithreading by basing it purely on a loop counter's mod value and removed all the sequence locking, etc, and multithreading improved by ~20% for higher amounts of threads.

      Comment


      • #4
        I think multi-threading samse/sampe is a huge contribution.

        Comment


        • #5
          samse/sampe are not multithreaded yet, they are just parallelized (hence 'p'BWA). Aln is parallelized AND multithreaded. Not sure if you knew this - but there is a difference between parallelism and multithreading... multithreaded applications can only run on as many cores exist on one node, and parallel applications (like pBWA) can run on as many nodes or cores as you wish.

          However, parallelism is more efficient than multithreadiing would be (if it were implemented) for samse/sampe if your system has >=4GB RAM/core, because adding multithreading to samse/sampe would require removal of the hash table.

          That being said, I will work multithreading into samse/sampe for future releases - this will make pBWA more attractive as it will be more efficient for systems with less available RAM.

          Comment


          • #6
            So how much RAM is required for each process? I want parallelism and the benefit of shared memory.

            Comment


            • #7
              I've run pBWA on clusters so far, so shared memory does not exist between nodes, only between cores. Here's an example for what you're looking for:

              I use the Orca cluster on the SHARCNET. Orca has 320 nodes, each node with 24 cores and 32GB of RAM. We can run pBWA as follows:

              sqsub -q mpi -n 320 -N 80 --mpp 8G ./pBWA aln -t 6....

              We specify that we want 320 parallel processes spread across 80 nodes. This means each node will have 4 instances of pBWA running at 8GB/instance. 8GB/instance is overkill, but what it does is ensure that the remaining 20 cores on each node are free for threading. Since we've called aln with -t 6, each of our parallel processes will spawn six more threads for a total of 320*6 threads of execution.

              Now because samse/sampe are not multithreaded, we can only run samse/sampe as follows:

              sqsub -q mpi -n 320 ./pBWA sampe ...

              So we're only able to run samse/sampe with 320 processes and no extra threads, but it is still 319 more processes than you'd be able to run with if you were to use BWA.

              I'm definitely planning on introducing multithreading into samse/sampe, but I'm busy with other things right now (preparing for my defense).

              EDIT: totally forgot to mention, that pBWA requires as much memory per parallel process as BWA does per run. That is why combining multithreading with parallelism is a good fit for lower-RAM clusters.

              Comment


              • #8
                Excellent clarification. Thank-you.

                Comment


                • #9
                  I think we had novoalignMPI out before pBWA so it wouldnt be the first parallel alignment tool. We started out with openMPI and found many shortcoming and switched to MPICH2 which works much better based on our benchmarking tests.

                  BTW nice work and it would be great to see more projects like this for SNP calling, Indel detection, SNV detection, coverage calculation, etc.
                  Last edited by zee; 04-17-2011, 08:32 PM.

                  Comment


                  • #10
                    I think we had novoalignMPI out before pBWA so it wouldnt be the first parallel alignment tool. We started out with openMPI and found many shortcoming and switched to MPICH2 which works much better based on our benchmarking tests.
                    Thanks for the info! I'll edit that out of my original post.

                    By the way, what shortcomings did you find with OpenMPI? I'm curious as to any problems you ran into, as I may have had the same.

                    Comment


                    • #11
                      Originally posted by zee View Post
                      We started out with openMPI and found many shortcoming and switched to MPICH2 which works much better based on our benchmarking tests.
                      I am also curious to hear what shortcomings were encountered.

                      MPICH2 and Open-MPI are both compliant with the MPI 2.2 standard !

                      Comment


                      • #12
                        Hello dp05yk,

                        I am looking forward for your published paper about this work.

                        Originally posted by dp05yk View Post
                        As for the improvements to multithreading, I noticed that BWA introduces too much thread competition, which isn't a problem for small amounts of threads, but when I was going up to 24 threads I noticed that pBWA was actually faster for 24 parallel processes than BWA was for 24 threads which surprised me. I changed the way BWA handles multithreading by basing it purely on a loop counter's mod value and removed all the sequence locking, etc, and multithreading improved by ~20% for higher amounts of threads.
                        Quite interesting !

                        What exactly is a loop counter's modulo value ?

                        Do you suspect that removing the locks was THE game changer that allowed a significant 20% improvement ?



                        Originally posted by nilshomer View Post
                        I think multi-threading samse/sampe is a huge contribution.


                        I also think that it is a huge contribution !



                        Correct me if I am wrong, but you use MPI_Send/MPI_Recv in bwaseqio.c because you don't want rank 0 to receive anything, but you you want all the other ranks in MPI_COMM_WORLD to receive data prepared by rank 0.

                        What exactly are these 8-byte numbers that you are sending there ?

                        For any MPI rank greater than 0, the MPI rank 0 will send numerous 8-byte values individually. I can see there that you could manually group them and send them in agglomerates. Messages are usually sent eagerly when they are at most 4096 bytes in MPICH2 and Open-MPI.

                        Let us assume that the envelope is at most 96 bytes. Then, the data size will be at least 4000 bytes -- or 500 8-byte integers. So, since you will be manually grouping your 8-byte communications into a few larger 4000-byte communications, the input/output will be at most 500 times faster considering the overhead of sending individual small messages.






                        Awesome work !

                        Cordially,

                        Sébastien

                        Comment


                        • #13
                          Does threading improve over the "embarrassingly parallel" style of just splitting the input and running on num_available_cpus instances? I think any speedups in BWA would be with 1) aggressive optimization 2) re-workiing code to keep more data/code in close cache.

                          Comment


                          • #14
                            Hi Sebastien,

                            What exactly is a loop counter's modulo value ?
                            A mod value is the remainder after performing integer division. For instance, 7 % 3 = 1, since 3*2 = 7 - 1. So the way we handle sequence distribution for threading is:

                            Loop i = 1 to (num_seqs)

                            if (i % num_threads) = thread_id then process
                            else skip

                            End loop

                            Since we are performing the modulus on the loop counter with the number of threads, we are guaranteed to cycle through the numbers 0...num_threads for each consecutive sequence. This ensures that the sequences are evenly divided, and it also ensures that no threads will be competing, since thread i only processes sequences (i, i+num_threads, i+(2*num_threads), etc.

                            Does that make sense? Previously, threads would essentially fight over sequence distribution by locking and "reserving" sequences for processing. This is responsible for the 20% efficiency difference.

                            Correct me if I am wrong, but you use MPI_Send/MPI_Recv in bwaseqio.c because you don't want rank 0 to receive anything, but you you want all the other ranks in MPI_COMM_WORLD to receive data prepared by rank 0.

                            What exactly are these 8-byte numbers that you are sending there ?

                            For any MPI rank greater than 0, the MPI rank 0 will send numerous 8-byte values individually. I can see there that you could manually group them and send them in agglomerates. Messages are usually sent eagerly when they are at most 4096 bytes in MPICH2 and Open-MPI.

                            Let us assume that the envelope is at most 96 bytes. Then, the data size will be at least 4000 bytes -- or 500 8-byte integers. So, since you will be manually grouping your 8-byte communications into a few larger 4000-byte communications, the input/output will be at most 500 times faster considering the overhead of sending individual small messages.
                            This function in bwaseqio.c performs the input reads file indexing. In order to distribute reads evenly over processes, each process receives a contiguous block of reads. However, with paired reads, we cannot assume that both reads files will be of exactly the same length (especially when dealing with SOLiD reads), so we need to index the files to find the start and end location of each contiguous block of reads. This is a one-processor job, and processor 0 essentially scans the file, marking the start and end locations of an evenly distributed reads block. Once it finds these positions, it sends them to processor i and this becomes processor i's block of input reads. The reason these are 8-byte numbers is because some input reads files are extremely large and are larger than 2^32 bytes.

                            Comment


                            • #15
                              Originally posted by Richard Finney View Post
                              Does threading improve over the "embarrassingly parallel" style of just splitting the input and running on num_available_cpus instances? I think any speedups in BWA would be with 1) aggressive optimization 2) re-workiing code to keep more data/code in close cache.
                              It's interesting you mention this, as I originally thought pBWA with x processors would be slower than BWA with x threads, however I found them to be competitive, and even found pBWA to be faster than BWA in most instances.

                              However, once I implemented my multithreading improvement into BWA, BWA with x threads outperforms pBWA with x threads. The advantage provided by pBWA is that you can run it on multiple processors, whereas multithreaded BWA only runs on as many cores as you possess on a single processor (or x2 if you have hyperthreading). Another advantage provided with pBWA is that you can combine parallelism and multithreading,

                              ie. run pBWA with 20 processes, each process on its own node, and each process spawning x threads, where each node is an x-core processor. I've been able to run pBWA with 2000+ cores and align 500million+ SOLiD reads in under an hour.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Advancing Precision Medicine for Rare Diseases in Children
                                by seqadmin




                                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                12-16-2024, 07:57 AM
                              • seqadmin
                                Recent Advances in Sequencing Technologies
                                by seqadmin



                                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                Long-Read Sequencing
                                Long-read sequencing has seen remarkable advancements,...
                                12-02-2024, 01:49 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 12-17-2024, 10:28 AM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-13-2024, 08:24 AM
                              0 responses
                              42 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-12-2024, 07:41 AM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-11-2024, 07:45 AM
                              0 responses
                              42 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X