Originally posted by dp05yk
In Ray, a de novo assembler, we were using this approach to split sequences across MPI ranks. Now, we just do a simple partition on the sequences.
Given N sequences (which can be in many files, of course) and M MPI ranks, MPI rank 0 takes sequences 0 to (N/M)-1, MPI rank 1 takes sequences (N/M) to 2*(N/M)-1, and so on, where N/M is integer division. Finally, MPI rank M-1 (the last one) also takes the remaining N%M sequences.
The partition-wise approach has the advantage that each MPI rank knows where to start and where to end.
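To make the partition concrete, here is a minimal sketch in Python of the rule described above. It is only an illustration, not the actual Ray code; the function name partition_range and its arguments are made up for this example.

```python
def partition_range(n_sequences, n_ranks, rank):
    """Return the inclusive (first, last) sequence indices owned by `rank`.

    Hypothetical helper: every rank gets N // M sequences, and the last
    rank also takes the remaining N % M sequences.
    """
    chunk = n_sequences // n_ranks          # integer division, N/M
    first = rank * chunk
    last = first + chunk - 1
    if rank == n_ranks - 1:
        last += n_sequences % n_ranks       # leftovers go to the last rank
    return first, last


# Example: 10 sequences over 4 ranks
for rank in range(4):
    print(rank, partition_range(10, 4, rank))
```

Each rank can compute its own range from N, M and its rank number alone, so no communication is needed to decide who reads what. If last < first, the rank simply owns no sequences (this happens when N < M).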
Originally posted by dp05yk
Example with 4 MPI ranks and 9 integers to send:
Without message aggregation
Rank 0 sends value 0 to Rank 1
Rank 0 sends value 1 to Rank 2
Rank 0 sends value 2 to Rank 3
Rank 0 sends value 3 to Rank 1
Rank 0 sends value 4 to Rank 2
Rank 0 sends value 5 to Rank 3
Rank 0 sends value 6 to Rank 1
Rank 0 sends value 7 to Rank 2
Rank 0 sends value 8 to Rank 3
(9 messages)
With message aggregation
Rank 0 sends values 0,3,6 to Rank 1
Rank 0 sends values 1,4,7 to Rank 2
Rank 0 sends values 2,5,8 to Rank 3
(3 messages)
In this toy example, each agglomerated message contains 3 values.
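The toy example can be reproduced with a few lines of Python. This is just a sketch that groups values by destination without doing any real MPI communication; the function name aggregate and the round-robin rule for picking the destination are assumptions chosen to match the listing above.

```python
from collections import defaultdict


def aggregate(values, destinations):
    """Group values by destination rank: one list per destination."""
    outbox = defaultdict(list)
    for value, dest in zip(values, destinations):
        outbox[dest].append(value)
    return dict(outbox)


values = list(range(9))
# Round-robin over ranks 1..3, as in the listing above (assumed rule).
destinations = [1 + v % 3 for v in values]

messages = aggregate(values, destinations)
for dest, payload in sorted(messages.items()):
    print("Rank 0 sends values", payload, "to Rank", dest)
print(len(values), "messages without aggregation,", len(messages), "with")
```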
You can bundle 500 8-byte integers (4000 bytes) in a 4096-byte message, assuming the envelope is at most 96 bytes.
So, in your case, each agglomerated message would contain up to 500 values, and you would divide your number of sent messages by about 500. That is worth doing because moving a message between two MPI ranks that are not on the same computer is costly.
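Here is a rough sketch of how the buffering could look, again in plain Python with no real MPI calls. The class AggregatingSender and the send callback are hypothetical names; in a real program the callback would wrap whatever send call you already use. The 4096 and 96 byte figures are the ones assumed above.

```python
MESSAGE_BYTES = 4096    # message size assumed above
ENVELOPE_BYTES = 96     # assumed upper bound on the envelope
VALUE_BYTES = 8         # one 8-byte integer
CAPACITY = (MESSAGE_BYTES - ENVELOPE_BYTES) // VALUE_BYTES  # 500 values


class AggregatingSender:
    """Buffer values per destination and send them in batches (sketch)."""

    def __init__(self, send, capacity=CAPACITY):
        self.send = send            # hypothetical callback: send(dest, values)
        self.capacity = capacity
        self.buffers = {}           # destination rank -> pending values

    def push(self, dest, value):
        buf = self.buffers.setdefault(dest, [])
        buf.append(value)
        if len(buf) == self.capacity:
            self.send(dest, buf)            # buffer is full: one big message
            self.buffers[dest] = []         # start a fresh buffer

    def flush(self):
        """Send whatever is left in the partially filled buffers."""
        for dest, buf in self.buffers.items():
            if buf:
                self.send(dest, buf)
        self.buffers = {}


sent = []
sender = AggregatingSender(lambda dest, values: sent.append((dest, len(values))))
for v in range(1200):
    sender.push(1, v)
sender.flush()
print(sent)
```

Do not forget the final flush, otherwise the values sitting in partially filled buffers are never sent.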
Sébastien http://Boisvert.info