This question is more along the lines of "has anyone else seen this behavior or am I just going crazy?" If you have seen the following, then please let me know. In a day or two I will dig further into the problem and send a bug report to Roche ... but it would be handy to know that I am not alone.
-------------------
We have the 454/Roche newbler software on our "custom compute cluster"; i.e., a box that we did not purchase from Roche. It is a 16 CPU system with lots of memory that is capable of running MPI, although I have yet to get it to do so. Instead we have been using gsRunProcessor with "GS_LAUNCH_MODE=MULTI" and "GS_NUM_PROCESSORS=16". This has worked great with version 2.0.00.20 of the software.
Recently we installed the 2.0.00.22 patch, which was released on 1-26-2009. Unfortunately this has broken our setup. Using the same MULTI mode that we have been using in the past, the .22 software bombs out (after 6 hours of running) with MPI (!) errors. This apparently causes other programs to crash; e.g., the program 'compute_FlowHist1_4'. The error log includes items such as:
application called MPI_Abort(MPI_COMM_WORLD, 1)
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)
MPI_Barrier(comm=0x84000000) failed
MPIR_Barrier(77)
MPIC_Sendrecv(126)
And so on. It is a really ugly error log, especially considering I am not requesting the MPI mode but rather the MULTI launch mode.
As I mentioned, I will try troubleshooting this more in a couple of days, after my current runs are processed. In the meantime, any words of advice or sympathy?