  • Originally posted by talioto View Post
    I compiled Ray with openmpi 1.4.2, gcc version 4.1.2 20080704 (Red Hat 4.1.2-44), x86_64 architecture, and ran it with "mpirun -mca btl ^sm". The data is 3 simulated Illumina libraries comprising 52x coverage of a 225MB chromosome: 40x 500bp PE 95nt reads (inward facing), 8x 5kb mate-paired 36nt reads (outward facing), 4x 10kb mate-paired 36nt reads (outward facing).

    Using 128 cores (16 8-core nodes), it runs fine up until the "Extending seeds" step. After a while the printing of the dots seems to slow down to a glacial pace. I've let it sit for several days with no progress. Do you think this is an Open MPI problem? Any ideas on getting around it?
    The same to u!



    • sheepyuan: what do you mean by "The same to u!"?

      If you are referring to talioto's post, I don't think it is a good idea to disable shared memory,
      as it is the fastest way to pass messages between processes on the same machine.

      Also, Open-MPI 1.4.2 is very old. The current stable release of Open-MPI is 1.6.1,
      and a lot of improvements have been added to Open-MPI since 1.4.2.

      And gcc 4.1.2 is very old too, although I don't think that will change much.
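
      In practice that means simply dropping the "-mca btl ^sm" part so Open-MPI keeps using shared memory within each node. A minimal sketch, assuming Ray's usual paired-library syntax (the process count, k-mer size and file names are placeholders, not the original command):

      [CODE]
      # Hedged sketch: launch Ray without excluding the shared-memory (sm) BTL.
      # Process count, k-mer size and library files are placeholders.
      mpirun -np 128 Ray -k 31 \
          -p lib500_1.fastq lib500_2.fastq \
          -p lib5k_1.fastq  lib5k_2.fastq \
          -p lib10k_1.fastq lib10k_2.fastq \
          -o RayOutput
      [/CODE]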

      Originally posted by sheepyuan View Post
      The same to u!
      Last edited by seb567; 09-25-2012, 03:48 AM.



      • I'm attempting to run Ray on our local cluster and after building I get an error:

        Ray: error while loading shared libraries: libmpi_cxx.so.0: cannot open shared object file: No such file or directory

        Thoughts?



        • You need to install the openmpi package. For example, if you are using Fedora do a 'yum install openmpi openmpi-devel'. If the packages are already installed, make sure that they are in your path (you can add them to your .bash_profile). If you are trying to run Ray from a remote 'screen' job, make sure you source your .bash_profile too.
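
          For instance, on a Fedora/Red Hat box the Open-MPI binaries and libraries usually sit under /usr/lib64/openmpi (a minimal sketch; the exact path is an assumption, so check where your distribution installs them):

          [CODE]
          # Hedged sketch for ~/.bash_profile: make the Open-MPI wrappers and runtime visible.
          # /usr/lib64/openmpi/... is the usual Red Hat/Fedora location; adjust for your system.
          export PATH=/usr/lib64/openmpi/bin:$PATH
          export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
          [/CODE]

          The LD_LIBRARY_PATH line is what lets the dynamic linker find libmpi_cxx.so.0 at run time.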



          • Problem solved- I had forgotten to set my mpi version on the cluster using mpi-selector.
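
            For anyone hitting the same thing, a hedged sketch of that mpi-selector step (the installation name is only an example; list first to see what your cluster actually provides, then log in again or source your profile):

            [CODE]
            # Hedged sketch: choose an MPI stack with mpi-selector (the name below is an example).
            mpi-selector --list                                # show available MPI installations
            mpi-selector --user --set openmpi-1.4-gcc-x86_64   # pick one of the listed names
            mpi-selector --query                               # confirm the selection
            [/CODE]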



            • Guys,

              I examined this thread from the very beginning but could not find an answer to my problem. Sorry for the silly question. I tried to install Ray 2.0.0 and failed on two machines, one running SciLinux 5.5 and another RHEL 5.5, which are essentially the same. Here is the output:

              [CODE]
              [yaximik@SciLinux55 Ray-v2.0.0]$ make PREFIX=ray-build
              make[1]: Entering directory `/home/yaximik/Bioinformatics/Ray-v2.0.0/RayPlatform'
              mpic++ -Wall -ansi -O3 -D MAXKMERLENGTH=32 -D RAY_VERSION=\"2.0.0\" -D RAYPLATFORM_VERSION=\"1.0.3\" -I. -c -o memory/ReusableMemoryStore.o memory/ReusableMemoryStore.cpp
              make[1]: mpic++: Command not found
              make[1]: *** [memory/ReusableMemoryStore.o] Error 127
              make[1]: Leaving directory `/home/yaximik/Bioinformatics/Ray-v2.0.0/RayPlatform'
              make[1]: Entering directory `/home/yaximik/Bioinformatics/Ray-v2.0.0/code'
              mpic++ -Wall -ansi -O3 -D MAXKMERLENGTH=32 -D RAY_VERSION=\"2.0.0\" -I ../RayPlatform -I. -c -o application_core/ray_main.o application_core/ray_main.cpp
              make[1]: mpic++: Command not found
              make[1]: *** [application_core/ray_main.o] Error 127
              make[1]: Leaving directory `/home/yaximik/Bioinformatics/Ray-v2.0.0/code'
              mpic++ code/TheRayGenomeAssembler.a RayPlatform/libRayPlatform.a -o Ray
              make: mpic++: Command not found
              make: *** [Ray] Error 127
              [yaximik@SciLinux55 Ray-v2.0.0]$
              [/CODE]


              Here is the output from RHEL 5.5:

              [CODE]
              [yaximik@G5NNJN1 Ray-v2.0.0]$ make PREFIX=ray-build
              make[1]: Entering directory `/home/yaximik/Bioinformatics/Ray-v2.0.0/RayPlatform'
              mpicxx -Wall -ansi -O3 -D MAXKMERLENGTH=32 -D RAY_VERSION=\"2.0.0\" -D RAYPLATFORM_VERSION=\"1.0.3\" -I. -c -o memory/ReusableMemoryStore.o memory/ReusableMemoryStore.cpp
              make[1]: mpicxx: Command not found
              make[1]: *** [memory/ReusableMemoryStore.o] Error 127
              make[1]: Leaving directory `/home/yaximik/Bioinformatics/Ray-v2.0.0/RayPlatform'
              make[1]: Entering directory `/home/yaximik/Bioinformatics/Ray-v2.0.0/code'
              mpicxx -Wall -ansi -O3 -D MAXKMERLENGTH=32 -D RAY_VERSION=\"2.0.0\" -I ../RayPlatform -I. -c -o application_core/ray_main.o application_core/ray_main.cpp
              make[1]: mpicxx: Command not found
              make[1]: *** [application_core/ray_main.o] Error 127
              make[1]: Leaving directory `/home/yaximik/Bioinformatics/Ray-v2.0.0/code'
              mpicxx code/TheRayGenomeAssembler.a RayPlatform/libRayPlatform.a -o Ray
              make: mpicxx: Command not found
              make: *** [Ray] Error 127
              [yaximik@G5NNJN1 Ray-v2.0.0]$
              [/CODE]


              Essentially the same. I have

              openmpiwrappers-openmpi-1-4.el5.x86_64
              openmpi-1.4-4.el5.x86_64
              openmpi-devel-1.4-4.el5.x86_64
              openmpi-libs-1.4-4.el5.x86_64

              installed. Both machines are 64-bit; one has 2 processors and 8 GB RAM, the other has 16 processors and 96 GB RAM. Please help, as I'd like to try Ray 2.0.0 on my project.



              • If you look at both outputs:

                make[1]: mpicxx: Command not found

                Make sure to add the directory containing the Open-MPI executables (mpicc, mpicxx, mpic++, etc.) to your PATH. That should fix the problem.
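
                As a hedged sketch, you can ask the RPM database where the wrappers ended up and prepend that directory before re-running make (the export path below is an example from an EL5 layout, not necessarily yours):

                [CODE]
                # Hedged sketch: locate the mpicxx/mpic++ wrappers installed by the openmpi RPMs.
                rpm -ql openmpi openmpi-devel | grep 'bin/mpic'
                # Put the directory printed above on PATH (the path here is only an example).
                export PATH=/usr/lib64/openmpi/1.4-gcc/bin:$PATH
                make PREFIX=ray-build
                [/CODE]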



                • Ray runs well when I use a single node, but when using more than one I get an MPI exit code like this:

                  Ray:25109 terminated with signal 11 at PC=5718e0 SP=7fff9eb8a838. Backtrace:
                  /home/bstamps/Ray/Ray-v2.0.0/Ray(_ZNK14ReadAnnotation7getRankEv+0x0)[0x5718e0]
                  /home/bstamps/Ray/Ray-v2.0.0/Ray(_ZN40Adapter_RAY_MPI_TAG_REQUEST_VERTEX_READS4$
                  /home/bstamps/Ray/Ray-v2.0.0/Ray(_ZN18MessageTagExecutor11callHandlerEiP7Messag$
                  /home/bstamps/Ray/Ray-v2.0.0/Ray(_ZN11ComputeCore3runEv+0x3cc)[0x5985ec]
                  /home/bstamps/Ray/Ray-v2.0.0/Ray(_ZN7Machine5startEv+0x1d8d)[0x46906d]
                  /home/bstamps/Ray/Ray-v2.0.0/Ray(main+0x73)[0x464d73]
                  /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b3fbc934cdd]
                  /home/bstamps/Ray/Ray-v2.0.0/Ray[0x464c39]
                  --------------------------------------------------------------------------
                  mpirun has exited due to process rank 4 with PID 25094 on
                  node c310 exiting without calling "finalize". This may
                  have caused other processes in the application to be
                  terminated by signals sent by mpirun (as reported here).
                  --------------------------------------------------------------------------

                  Thoughts?



                  • Originally posted by bstamps View Post
                    Ray runs well when I use a single node, but when using more than one I get an MPI exit code like this:
                    ...
                    --------------------------------------------------------------------------
                    mpirun has exited due to process rank 4 with PID 25094 on
                    node c310 exiting without calling "finalize". This may
                    have caused other processes in the application to be
                    terminated by signals sent by mpirun (as reported here).
                    --------------------------------------------------------------------------
                    I am not a big Ray user, but I will sometimes get the above problem and then, when I do a re-run, the problem goes away. I think it has to do with my cluster's setup. I suggest trying a small run with one job per node, just to make sure that everything works.

                    Not much help, I know, but the general idea is that the problem may be with your hardware setup and not with Ray.



                    • It appears setting my ptile below the maximum per node (16) has solved the problem...I'll have to go bug my computing center as to why 15 is kosher and 16 causes MPI to die. Either way I'm very happy with Ray's performance- being able to span my job across 4500 cores has sped assembly up quite a bit...
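
                      (Assuming an LSF-style scheduler here, since "ptile" is LSF's span[] keyword for slots per node; the queue name, core total and input files below are placeholders, not the actual submission.)

                      [CODE]
                      # Hedged sketch, assuming LSF: request 15 slots per node instead of the full 16.
                      bsub -q normal -n 240 -R "span[ptile=15]" \
                           mpirun Ray -k 31 -p reads_1.fastq reads_2.fastq -o RayOutput
                      [/CODE]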



                      • I spoke a little too soon- Ray appears to be throwing segmentation faults randomly through the assembly process on random nodes. Adding in "route-messages" seems to have helped, but my jobs still fail every so often. The computing center seem to think it's an issue with Ray, but I'm curious as to what the community thinks.
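
                        (For reference, a hedged sketch of a launch with routing turned on; everything except the -route-messages flag mentioned above is a placeholder.)

                        [CODE]
                        # Hedged sketch: enable Ray's message routing; all other arguments are placeholders.
                        mpiexec -n 256 Ray -route-messages \
                            -p reads_1.fastq reads_2.fastq -o RayOutput -k 31
                        [/CODE]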



                        • Originally posted by bstamps View Post
                          Ray runs well when I use a single node, but when using more than one I get an MPI exit code like this:

                          Ray:25109 terminated with signal 11 at PC=5718e0 SP=7fff9eb8a838. Backtrace:
                          /home/bstamps/Ray/Ray-v2.0.0/Ray(_ZNK14ReadAnnotation7getRankEv+0x0)[0x5718e0]
                          /home/bstamps/Ray/Ray-v2.0.0/Ray(_ZN40Adapter_RAY_MPI_TAG_REQUEST_VERTEX_READS4$
                          /home/bstamps/Ray/Ray-v2.0.0/Ray(_ZN18MessageTagExecutor11callHandlerEiP7Messag$
                          /home/bstamps/Ray/Ray-v2.0.0/Ray(_ZN11ComputeCore3runEv+0x3cc)[0x5985ec]
                          /home/bstamps/Ray/Ray-v2.0.0/Ray(_ZN7Machine5startEv+0x1d8d)[0x46906d]
                          /home/bstamps/Ray/Ray-v2.0.0/Ray(main+0x73)[0x464d73]
                          /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b3fbc934cdd]
                          /home/bstamps/Ray/Ray-v2.0.0/Ray[0x464c39]
                          --------------------------------------------------------------------------
                          mpirun has exited due to process rank 4 with PID 25094 on
                          node c310 exiting without calling "finalize". This may
                          have caused other processes in the application to be
                          terminated by signals sent by mpirun (as reported here).
                          --------------------------------------------------------------------------

                          Thoughts?
                          Hi,

                          Ray v2.1.0 was released today. There are a lot of bug fixes, including fixes for two bugs that could lead to segmentation faults.



                          • Originally posted by westerman View Post
                            I am not a big Ray user, but I will sometimes get the above problem and then, when I do a re-run, the problem goes away. I think it has to do with my cluster's setup. I suggest trying a small run with one job per node, just to make sure that everything works.

                            Not much help, I know, but the general idea is that the problem may be with your hardware setup and not with Ray.
                            It sounds like a race condition. The bug may be in Ray, who knows.



                            • Originally posted by bstamps View Post
                              It appears setting my ptile below the maximum per node (16) has solved the problem...I'll have to go bug my computing center as to why 15 is kosher and 16 causes MPI to die. Either way I'm very happy with Ray's performance- being able to span my job across 4500 cores has sped assembly up quite a bit...
                              What is "ptile"? Are you using a fancy architecture (a Cray XE6 or Blue Gene/Q, for instance)?

                              Originally posted by bstamps View Post
                              across 4500 cores
                              I guess you are playing with fancy hardware, right?



                              • Originally posted by bstamps View Post
                                I spoke a little too soon- Ray appears to be throwing segmentation faults randomly through the assembly process on random nodes. Adding in "route-messages" seems to have helped, but my jobs still fail every so often. The computing center seem to think it's an issue with Ray, but I'm curious as to what the community thinks.
                                It could well be a bug in Ray; all software has bugs. Can you try the new Ray v2.1.0 to see whether the numerous bug fixes alleviate your problem?

                                Can you send an email to the mailing list with your hardware details and your Ray command?

                                Pure MPI applications may not be the answer for very large clusters; hybrid programming models are likely better.

                                We have work in progress on a new hybrid programming model. At the moment, Ray only uses MPI (as of v2.1.0), so when you run on 8 nodes * 24 cores/node = 192 cores, Ray is launched as 192 processes, with 24 processes per node.

                                We have devised a new programming model called "mini-ranks". If you Google "mini-ranks", you will mostly find hits about Lego blocks, because "mini-ranks" in parallel programming is new; I believe we coined the term ourselves!

                                Our implementation of the mini-ranks model uses 1 MPI process per node, with 23 POSIX threads running mini-ranks plus one additional communication thread per node. The mini-ranks run inside the POSIX threads, and the MPI rank itself does not do much.

                                Ray is already ported to that model (mini-ranks implemented with MPI+POSIX threads) in the git source tree.

                                Instead of launching like this:

                                [CODE]
                                mpiexec -n 192 Ray ...
                                [/CODE]

                                you launch it like this:

                                [CODE]
                                mpiexec -n 8 -bynode Ray -mini-ranks-per-rank 23 ...
                                [/CODE]

                                Note that our "mini-ranks" implementation needs 1 thread for communication for each node.

                                Although this is experimental, you may be interested to test that on your hardware.


                                The branch is called minirank-model, should you want to check it out.


                                Sébastien Boisvert
                                Ray maintainer
