SEQanswers

Old 04-14-2011, 09:09 PM   #1
dp05yk
Member
Location: Brock University
Join Date: Dec 2010
Posts: 66

Introducing pBWA [Parallel BWA]

EDIT (July 5th, 2011): An alternate version of pBWA is now available that cleans up the workflow a bit. The user is no longer required to enter the number of reads in the FASTQ file, and SAM information is output to one file in parallel by all processors. There are also a few minor stability enhancements that should make pBWA compatible with MPICH. Performance appears to be similar to pBWA-r32. Thanks go to Rob Egan for the enhancements.

For my master's thesis in computer science, I developed pBWA, a parallel version of BWA based on the OpenMPI library. pBWA retains and improves upon the multithreading provided by BWA while adding efficient parallelization for its core alignment functions [aln, sampe, samse]. The wall-time speedup of pBWA is bounded only by the size of the parallel system, as it can run on any number of nodes and/or cores simultaneously. On suitable systems, pBWA can align billions of sequence reads within hours, making the analysis of ever-growing NGS datasets far more practical.

Note that pBWA's multithreading improvements have been shown to Heng Li and will probably be implemented in a future release of BWA.

I have successfully tested pBWA on a couple of systems, namely SHARCNET (www.sharcnet.ca) and a school server with the most basic OpenMPI install.

If you have access to a cluster or parallel machine, you may want to give pBWA a try. Due to the nature of parallel computing, the optimal number of nodes/threads used will vary greatly depending on things like RAM and interconnect speeds.

pBWA can be obtained by visiting
http://sourceforge.net/projects/pbwa

A manual page is located at
http://pbwa.sourceforge.net/

Thanks for your time!

Last edited by dp05yk; 07-05-2011 at 07:08 AM. Reason: Alternate version now available
Old 04-15-2011, 12:44 AM   #2
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

Making this sticky, since a lot of users should find this useful and interesting!
Old 04-15-2011, 05:37 AM   #3
dp05yk
Member
Location: Brock University
Join Date: Dec 2010
Posts: 66

Thanks! I know there have been a few posts lately asking how parallel computing can work with NGS, so I thought I'd post this. There may yet be some kinks in the code, and it's not published yet (still shopping for a suitable journal), but if anyone is interested in the methods I'll post a basic workflow behind the algorithm. The SourceForge page goes into a little detail on best use, but not too much.

As for the improvements to multithreading: I noticed that BWA introduces too much thread competition. This isn't a problem for small numbers of threads, but when I went up to 24 threads I found that pBWA was actually faster with 24 parallel processes than BWA was with 24 threads, which surprised me. I changed the way BWA handles multithreading by assigning sequences purely by the loop counter's modulo value and removing all the sequence locking, and multithreading improved by ~20% at higher thread counts.
Old 04-15-2011, 09:38 AM   #4
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

I think multi-threading samse/sampe is a huge contribution.
Old 04-15-2011, 09:44 AM   #5
dp05yk
Member
Location: Brock University
Join Date: Dec 2010
Posts: 66

samse/sampe are not multithreaded yet; they are just parallelized (hence 'p'BWA). aln is parallelized AND multithreaded. Not sure if you knew this, but there is a difference between parallelism and multithreading: a multithreaded application can only use as many cores as exist on one node, whereas a parallel application (like pBWA) can run on as many nodes or cores as you wish.

However, for samse/sampe, parallelism is more efficient than multithreading would be (if it were implemented) when your system has >=4GB RAM/core, because adding multithreading to samse/sampe would require removing the hash table.

That being said, I will work multithreading into samse/sampe for future releases - this will make pBWA more attractive as it will be more efficient for systems with less available RAM.
Old 04-15-2011, 09:46 AM   #6
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

So how much RAM is required for each process? I want parallelism and the benefit of shared memory.
Old 04-15-2011, 09:55 AM   #7
dp05yk
Member
Location: Brock University
Join Date: Dec 2010
Posts: 66

I've run pBWA on clusters so far, so shared memory does not exist between nodes, only between cores. Here's an example of what you're looking for:

I use the Orca cluster on the SHARCNET. Orca has 320 nodes, each node with 24 cores and 32GB of RAM. We can run pBWA as follows:

Code:
sqsub -q mpi -n 320 -N 80 --mpp 8G ./pBWA aln -t 6 ...

We specify that we want 320 parallel processes spread across 80 nodes, so each node runs 4 instances of pBWA at 8GB/instance. 8GB/instance is overkill, but it ensures that the remaining 20 cores on each node stay free for threading. Since we've called aln with -t 6, each of our parallel processes will run six threads, for a total of 320*6 = 1920 threads of execution.
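
For anyone without SHARCNET's sqsub wrapper, the same layout should be expressible with a plain Open MPI launcher -- the flags below are my assumption, so check your local install:

Code:
mpirun -np 320 -npernode 4 ./pBWA aln -t 6 ...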

Now because samse/sampe are not multithreaded, we can only run samse/sampe as follows:

Code:
sqsub -q mpi -n 320 ./pBWA sampe ...

So we're only able to run samse/sampe with 320 processes and no extra threads, but that is still 319 more processes than you could use with stock BWA.

I'm definitely planning on introducing multithreading into samse/sampe, but I'm busy with other things right now (preparing for my defense).

EDIT: I totally forgot to mention that pBWA requires as much memory per parallel process as BWA does per run. That is why combining multithreading with parallelism is a good fit for lower-RAM clusters.
Old 04-15-2011, 10:09 AM   #8
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

Excellent clarification. Thank-you.
Old 04-17-2011, 09:29 PM   #9
zee
NGS specialist
Location: Malaysia
Join Date: Apr 2008
Posts: 249

I think we had novoalignMPI out before pBWA, so it wouldn't be the first parallel alignment tool. We started out with OpenMPI, found many shortcomings, and switched to MPICH2, which works much better in our benchmarking tests.

BTW nice work and it would be great to see more projects like this for SNP calling, Indel detection, SNV detection, coverage calculation, etc.

Old 04-18-2011, 05:23 AM   #10
dp05yk
Member
Location: Brock University
Join Date: Dec 2010
Posts: 66

Quote:
I think we had novoalignMPI out before pBWA, so it wouldn't be the first parallel alignment tool. We started out with OpenMPI, found many shortcomings, and switched to MPICH2, which works much better in our benchmarking tests.
Thanks for the info! I'll edit that out of my original post.

By the way, what shortcomings did you find with OpenMPI? I'm curious as to any problems you ran into, as I may have had the same.
Old 04-29-2011, 10:42 AM   #11
seb567
Senior Member
Location: Québec, Canada
Join Date: Jul 2008
Posts: 260

Quote:
Originally Posted by zee
We started out with OpenMPI, found many shortcomings, and switched to MPICH2, which works much better in our benchmarking tests.
I am also curious to hear what shortcomings were encountered.

MPICH2 and Open-MPI are both compliant with the MPI 2.2 standard!

http://www.mpi-forum.org/docs/docs.html
Old 04-29-2011, 11:35 AM   #12
seb567
Senior Member
Location: Québec, Canada
Join Date: Jul 2008
Posts: 260

Hello dp05yk,

I am looking forward to your published paper about this work.

Quote:
Originally Posted by dp05yk
As for the improvements to multithreading: I noticed that BWA introduces too much thread competition. This isn't a problem for small numbers of threads, but when I went up to 24 threads I found that pBWA was actually faster with 24 parallel processes than BWA was with 24 threads, which surprised me. I changed the way BWA handles multithreading by assigning sequences purely by the loop counter's modulo value and removing all the sequence locking, and multithreading improved by ~20% at higher thread counts.
Quite interesting!

What exactly is a loop counter's modulo value?

Do you suspect that removing the locks was THE game changer that allowed a significant 20% improvement?



Quote:
Originally Posted by nilshomer
I think multi-threading samse/sampe is a huge contribution.


I also think that it is a huge contribution!



Correct me if I am wrong, but you use MPI_Send/MPI_Recv in bwaseqio.c because you don't want rank 0 to receive anything, but you want all the other ranks in MPI_COMM_WORLD to receive data prepared by rank 0.

What exactly are these 8-byte numbers that you are sending there?

For any MPI rank greater than 0, the MPI rank 0 will send numerous 8-byte values individually. I can see that you could manually group them and send them as aggregates. Messages are usually sent eagerly when they are at most 4096 bytes in MPICH2 and Open-MPI.

Let us assume that the envelope is at most 96 bytes. Then the payload can still be 4000 bytes -- that is, 500 8-byte integers -- and fit in a single eager message. So by grouping your 8-byte communications into larger 4000-byte messages, you would send up to 500 times fewer messages, a substantial win given the overhead of sending individual small messages.

Awesome work!

Cordially,

Sébastien
Old 04-29-2011, 12:51 PM   #13
Richard Finney
Senior Member
Location: Bethesda
Join Date: Feb 2009
Posts: 700

Does threading improve over the "embarrassingly parallel" style of just splitting the input and running num_available_cpus instances? I think any speedups in BWA would come from 1) aggressive optimization, 2) re-working the code to keep more data/code in close cache.
Old 04-29-2011, 01:17 PM   #14
dp05yk
Member
Location: Brock University
Join Date: Dec 2010
Posts: 66

Hi Sebastien,

Quote:
What exactly is a loop counter's modulo value ?
A mod value is the remainder after integer division. For instance, 7 % 3 = 1, since 7 = 2*3 + 1. So the way we handle sequence distribution for threading is:

Loop i = 0 to (num_seqs - 1)

if (i % num_threads) = thread_id then process
else skip

End loop

Since we take the loop counter modulo the number of threads, the result cycles through 0...num_threads-1 over consecutive sequences. This ensures that the sequences are evenly divided, and it also ensures that no threads compete, since thread i only processes sequences i, i+num_threads, i+2*num_threads, etc.

Does that make sense? Previously, threads would essentially fight over sequence distribution by locking and "reserving" sequences for processing. This is responsible for the 20% efficiency difference.
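
If it helps, here is a minimal C sketch of the scheme (illustrative only -- the names and structure are mine, not the actual pBWA/BWA source):

Code:
#include <pthread.h>

typedef struct {
    int thread_id;    /* 0 .. num_threads-1 */
    int num_threads;
    int num_seqs;
    /* a pointer to the shared sequence array would go here */
} worker_t;

/* Each thread processes only the sequences whose index is congruent
   to its thread_id modulo num_threads. The partition is fixed before
   the threads start, so there are no locks and no contention. */
static void *worker(void *arg)
{
    worker_t *w = (worker_t *)arg;
    int i;
    for (i = 0; i < w->num_seqs; ++i) {
        if (i % w->num_threads != w->thread_id) continue; /* not ours */
        /* align sequence i here */
    }
    return NULL;
}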

Quote:
Correct me if I am wrong, but you use MPI_Send/MPI_Recv in bwaseqio.c because you don't want rank 0 to receive anything, but you you want all the other ranks in MPI_COMM_WORLD to receive data prepared by rank 0.

What exactly are these 8-byte numbers that you are sending there ?

For any MPI rank greater than 0, the MPI rank 0 will send numerous 8-byte values individually. I can see there that you could manually group them and send them in agglomerates. Messages are usually sent eagerly when they are at most 4096 bytes in MPICH2 and Open-MPI.

Let us assume that the envelope is at most 96 bytes. Then, the data size will be at least 4000 bytes -- or 500 8-byte integers. So, since you will be manually grouping your 8-byte communications into a few larger 4000-byte communications, the input/output will be at most 500 times faster considering the overhead of sending individual small messages.
This function in bwaseqio.c performs the input reads file indexing. In order to distribute reads evenly over processes, each process receives a contiguous block of reads. However, with paired reads we cannot assume that both reads files will be exactly the same length (especially when dealing with SOLiD reads), so we need to index the files to find the start and end of each contiguous block of reads. This is a one-processor job: processor 0 scans the file, marking the start and end positions of each evenly sized block of reads. Once it finds these positions, it sends them to processor i, and this becomes processor i's block of input reads. The reason these are 8-byte numbers is that some input reads files are larger than 2^32 bytes.
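
In rough outline, the exchange looks something like this (a simplified sketch; find_block_start() is a hypothetical stand-in for the actual scanning code, which also handles read pairs):

Code:
#include <mpi.h>

long long find_block_start(int dest); /* hypothetical: scan the FASTQ for dest's block */

/* Rank 0 scans the FASTQ, finds the byte offset where each rank's
   contiguous block of reads begins, and sends it as a 64-bit value;
   every other rank simply receives its own offset. */
long long my_block_offset(int rank, int num_procs)
{
    long long off = 0;
    int dest;
    if (rank == 0) {
        for (dest = 1; dest < num_procs; ++dest) {
            long long o = find_block_start(dest);
            MPI_Send(&o, 1, MPI_LONG_LONG, dest, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(&off, 1, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    return off;
}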
Old 04-29-2011, 01:24 PM   #15
dp05yk
Member
Location: Brock University
Join Date: Dec 2010
Posts: 66

Quote:
Originally Posted by Richard Finney
Does threading improve over the "embarrassingly parallel" style of just splitting the input and running num_available_cpus instances? I think any speedups in BWA would come from 1) aggressive optimization, 2) re-working the code to keep more data/code in close cache.
It's interesting you mention this: I originally thought pBWA with x processes would be slower than BWA with x threads, but I found them to be competitive, and even found pBWA to be faster than BWA in most instances.

However, once I implemented my multithreading improvement in BWA, BWA with x threads outperforms pBWA with x processes. The advantage of pBWA is that you can run it on multiple nodes, whereas multithreaded BWA can only use as many cores as a single machine possesses (or 2x with hyperthreading). Another advantage of pBWA is that you can combine parallelism and multithreading --

i.e. run pBWA with 20 processes, each process on its own node, and each process spawning x threads, where each node has x cores. I've been able to run pBWA with 2000+ cores and align 500 million+ SOLiD reads in under an hour.
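
With a plain MPI launcher, that combined layout would look something like this (hypothetical flags -- adjust for your scheduler; here each 24-core node gets one process running 24 threads):

Code:
mpirun -np 20 -npernode 1 ./pBWA aln -t 24 ...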
Old 04-29-2011, 02:03 PM   #16
seb567
Senior Member
Location: Québec, Canada
Join Date: Jul 2008
Posts: 260

Quote:
Originally Posted by dp05yk
A mod value is the remainder after integer division. For instance, 7 % 3 = 1, since 7 = 2*3 + 1. [...] Previously, threads would essentially fight over sequence distribution by locking and "reserving" sequences for processing. This is responsible for the 20% efficiency difference.
I know what a modulo is.

In Ray, a de novo assembler, we were using this approach to split sequences across MPI ranks. Now, we just do a simple partition on the sequences.

Given N sequences (which can be in many files, of course) and M MPI ranks, MPI rank 0 takes sequences 0 to (N/M)-1, MPI rank 1 takes sequences (N/M) to 2*(N/M)-1, and so on. Finally, the MPI rank M-1 (the last one) also takes the remaining N%M sequences.

The partition-wise approach has the advantage that each MPI rank knows where to start and where to end.
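
In code, the partition is just a little arithmetic that every rank evaluates locally (a sketch, assuming N and M are already known to every rank):

Code:
/* Block partition of N sequences over M ranks: no communication is
   needed, since each rank derives its own inclusive [first, last]
   range. The last rank also absorbs the N % M leftover sequences. */
void my_range(long long N, int M, int rank,
              long long *first, long long *last)
{
    long long per_rank = N / M;
    *first = (long long)rank * per_rank;
    *last  = (rank == M - 1) ? N - 1 : *first + per_rank - 1;
}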

Quote:
Originally Posted by dp05yk
This function in bwaseqio.c performs the input reads file indexing. [...] The reason these are 8-byte numbers is that some input reads files are larger than 2^32 bytes.
Regardless, I think you could enhance your already-enhanced approach using message aggregation.

Example with 4 MPI ranks and 9 integers to send:

Without message aggregation

Rank 0 sends value 0 to Rank 1
Rank 0 sends value 1 to Rank 2
Rank 0 sends value 2 to Rank 3
Rank 0 sends value 3 to Rank 1
Rank 0 sends value 4 to Rank 2
Rank 0 sends value 5 to Rank 3
Rank 0 sends value 6 to Rank 1
Rank 0 sends value 7 to Rank 2
Rank 0 sends value 8 to Rank 3

(9 messages)

With message aggregation


Rank 0 sends values 0,3,6 to Rank 1
Rank 0 sends values 1,4,7 to Rank 2
Rank 0 sends values 2,5,8 to Rank 3

(3 messages)

In this toy example, agglomerated messages contain 3 values.

You can bundle 500 8-byte integers (4000 bytes) in a 4096-byte message, assuming the envelope is at most 96 bytes.

So, in your case, agglomerated messages would contain 500 values and you would cut the number of sent messages by a factor of 500, which is good given that moving a message between two MPI ranks on different computers is costly.
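
A sketch of what the sending side could look like in C (BATCH, offsets[] and dest are illustrative, not from the pBWA source):

Code:
#include <mpi.h>

enum { BATCH = 500 }; /* 500 * 8 bytes = 4000-byte payload, still eager */

/* Pack the 8-byte values bound for rank `dest` into batches and send
   each batch as one message instead of one message per value. */
void send_aggregated(const long long *offsets, int n_offsets, int dest)
{
    long long buf[BATCH];
    int i, n = 0;
    for (i = 0; i < n_offsets; ++i) {
        buf[n++] = offsets[i];
        if (n == BATCH) {
            MPI_Send(buf, n, MPI_LONG_LONG, dest, 0, MPI_COMM_WORLD);
            n = 0;
        }
    }
    if (n > 0) /* flush the final partial batch */
        MPI_Send(buf, n, MPI_LONG_LONG, dest, 0, MPI_COMM_WORLD);
}

The receiver can recover each batch's length with MPI_Probe and MPI_Get_count.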

Sébastien http://Boisvert.info
Old 05-01-2011, 03:00 PM   #17
bioinfosm
Senior Member
Location: USA
Join Date: Jan 2008
Posts: 482

We are seeing more and more that re-alignment is an amazing benefit but terribly slow... it makes the regular bwa alignment seem fast by comparison. I hope there are solutions in the works for the re-alignment step, where one needs all the reads in a window and cannot arbitrarily split and parallelize.
Old 05-05-2011, 05:34 PM   #18
YEG
Junior Member
Location: Bethesda, MD
Join Date: Apr 2008
Posts: 2

How does pBWA work with single-end reads? It's not really clear from the tutorial.
I aligned single-end reads using 3 processors. Now I have 3 *.sai files: a1-0.sai ... a1-2.sai. To get the SAM file I tried:
Code:
pBWA samse -f out.sam ~/hg18/hg18 a1-0.sai all.fq 29424134
pBWA crashes and produces the following:
Code:
[bwa_sai2sam_se_core] fail to open file 'a1-0.sai--1.sai'. Abort!
[bart:29010] *** Process received signal ***
[bart:29010] Signal: Aborted (6)
[bart:29010] Signal code:  (-6)
[bart:29010] [ 0] /lib64/libpthread.so.0 [0x7f8f5393bc00]
[bart:29010] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x7f8f528984e5]
[bart:29010] [ 2] /lib64/libc.so.6(abort+0x180) [0x7f8f528999b0]
[bart:29010] [ 3] pBWA [0x405309]
[bart:29010] [ 4] pBWA(bwa_sai2sam_se_core+0xca) [0x41597a]
[bart:29010] [ 5] pBWA(bwa_sai2sam_se+0x14a) [0x415e5a]
[bart:29010] [ 6] pBWA(main+0xe3) [0x427263]
[bart:29010] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f8f52884a7d]
[bart:29010] [ 8] pBWA [0x404f69]
[bart:29010] *** End of error message ***
Aborted (core dumped)
I tried the same command with 'regular' bwa (minus the last argument) and it executed without a problem. What am I missing?

I am using 0.5.9-r21-MPI.
Old 05-05-2011, 07:27 PM   #19
dp05yk
Member
Location: Brock University
Join Date: Dec 2010
Posts: 66

Hi YEG,

You want to input the .sai prefix (a1)... then pBWA will align every .sai file with that prefix! Also, you just want to specify "out" as your -f parameter, as pBWA appends the rank and the .sam extension to the output file names itself.

Also, you may want to use revision 30 -- always best to stay current! :-)
Old 05-06-2011, 04:45 AM   #20
dp05yk
Member
Location: Brock University
Join Date: Dec 2010
Posts: 66

Quote:
Originally Posted by dp05yk
You want to input the .sai prefix (a1)... then pBWA will align every .sai file with that prefix!
I should probably just have shown you:

Code:
./pBWA samse -f out ~/hg18/hg18 a1 all.fq 29424134


And that will align all of your .sai files at the same time!

Cheers,
Darren