Seqanswers Leaderboard Ad

**victor73** · 11-24-2008, 08:38 PM

We ourselves have run into this problem as well. Basically, the processes get SIGTERM'd for us because the Linux kernel is killing the processes off because all the system's memory has been consumed by them... We also had aspirations of using Sun Grid Engine with a properly configured MPI parallel environment but I don't think we can get there until we can get direct MPI usage working reliably...

**Tom Bair** · 12-11-2008, 11:33 AM

Don't think this will help with your problems but I thought I would post some notes from my experience getting MPI working on a single 8 core redhat system.

1. have had persistent problems getting the pipeline to work with more than 6 cores. Running with >6 cores leads to hard lockup, but have not had time to track this down have run several runs with 6 cores takes ~12 hours to process

2.
needed to add the following to .bashrc or similar
ulimit -l unlimited
export RIGDIR="/opt/454"
export LD_LIBRARY_PATH=/usr/lib64/openmpi/1.2.5-gcc/lib/
export PATH=$PATH:/opt/454:/usr/lib/openmpi/1.2.5-gcc/bin
export GS_LAUNCH_MODE="MPI"
export GS_MPIARGS=" --n 6 "
export GS_XML_PORT=4540
export GS_CACHEDIR=/cache

note you may need to make some changes to /etc/security/limits.conf to have the ulimit work. You will know that this is a problem if the runPipeAnalysis complains that it can only allocate 32k

the command to run the analysis then would be runAnalysisPipe R_2008_11...

this will run with 6 process (based on GS_MPIARGS)

Documentation on this was sparse so not sure if it is canonically correct but it seems to work.

Any other info on MPI for titanium?

**cdwan** · 04-02-2009, 07:43 AM

Titanium off rig analysis

I've been having 'fun' trying to get the titanium off-rig analysis to work properly on a small linux cluster running Sun Grid Engine. We've had limited success.

I would be deeply grateful for anyone else's thoughts on this.

Here are some notes, in case they might help anyone else:

* The cluster consists of nine Linux servers running Centos 5.
* Each machine has 8 cores of x86_64, and 8GB of RAM.
* All nodes are connected via gigabit ethernet to a 90TB NFS share.
* The cluster is in moderate use for BLAST and other standard bioinformatic processing, and has never seen lockups or crashes before.

Environment variables that seem important to runAnalysisPipe are:

* export GS_MPIARGS="--n $NSLOTS --machinefile $TMPDIR/machines"
* export GS_LAUNCH_MODE=MPI
* export PATH=${PATH}:/opt/454/bin

I'm very curious to try this "GS_CACHEDIR", but I don't know what it does.

Note that the lines above are from my SGE job submission script. $NSLOTS and $TMPDIR/machines are created by a wrapper script and get set up based on how I submit the job. $NSLOTS is "how many parallel threads to start.". The machines files is a list of hostnames to start them on.

I found the "--progress" and "--verbose" flags to be quite useful in figuring out if processing is making progress or not.

We also encountered the hard-lockup behavior. I still have no idea of the *cause* of these lockups - but we've managed to work around. Here are my observations:

* openmpi jobs run on a single machine never finish, no matter how many threads I give them (1, 2, 4, 8, 16). I wave my hands in the direction of "16GB of RAM required".

* If I start 8 threads, four on each of two machines, those jobs run in a few hours.

* If I start more than 4 threads on any one machine, I have high odds of locking up (requiring a hard power cycle) at least one of the machines involved in that run.

* If I run threads from more than one job at a time on a particular machine, odds are high that I will lock up that machine.

* I can run 4 BLAST jobs and 4 threads of gsRunProcessor without too much contention on the same 8 core machine.

* gsRunProcessor leaves zombie processes all over the place when one of the compute nodes locks up during a run. I encounter fewer lockups if I clean those up prior to starting a run. This is made simple by the observation that I can't run two jobs on the same node anyway.

There is some correlation of the node lock-ups with heavy loads on the NFS file server - but I have yet to encounter any smoking gun with this.

Anyone else?

**countzero** · 06-03-2009, 05:00 PM

Originally posted by cdwan View Post

I've been having 'fun' trying to get the titanium off-rig analysis to work properly on a small linux cluster running Sun Grid Engine. We've had limited success.

I would be deeply grateful for anyone else's thoughts on this.

Here are some notes, in case they might help anyone else:

* The cluster consists of nine Linux servers running Centos 5.
* Each machine has 8 cores of x86_64, and 8GB of RAM.
* All nodes are connected via gigabit ethernet to a 90TB NFS share.
* The cluster is in moderate use for BLAST and other standard bioinformatic processing, and has never seen lockups or crashes before.

Environment variables that seem important to runAnalysisPipe are:

* export GS_MPIARGS="--n $NSLOTS --machinefile $TMPDIR/machines"
* export GS_LAUNCH_MODE=MPI
* export PATH=${PATH}:/opt/454/bin

I'm very curious to try this "GS_CACHEDIR", but I don't know what it does.

Note that the lines above are from my SGE job submission script. $NSLOTS and $TMPDIR/machines are created by a wrapper script and get set up based on how I submit the job. $NSLOTS is "how many parallel threads to start.". The machines files is a list of hostnames to start them on.

I found the "--progress" and "--verbose" flags to be quite useful in figuring out if processing is making progress or not.

We also encountered the hard-lockup behavior. I still have no idea of the *cause* of these lockups - but we've managed to work around. Here are my observations:

* openmpi jobs run on a single machine never finish, no matter how many threads I give them (1, 2, 4, 8, 16). I wave my hands in the direction of "16GB of RAM required".

* If I start 8 threads, four on each of two machines, those jobs run in a few hours.

* If I start more than 4 threads on any one machine, I have high odds of locking up (requiring a hard power cycle) at least one of the machines involved in that run.

* If I run threads from more than one job at a time on a particular machine, odds are high that I will lock up that machine.

* I can run 4 BLAST jobs and 4 threads of gsRunProcessor without too much contention on the same 8 core machine.

* gsRunProcessor leaves zombie processes all over the place when one of the compute nodes locks up during a run. I encounter fewer lockups if I clean those up prior to starting a run. This is made simple by the observation that I can't run two jobs on the same node anyway.

There is some correlation of the node lock-ups with heavy loads on the NFS file server - but I have yet to encounter any smoking gun with this.

Anyone else?

Hello,

First of all, I want to thank you for sharing your experience. I've been searching the web for information on how to setup the GS FLX Titanium software with Sun Grid Engine and your post is the first concrete reference that I've found.

So, basically, I don't have experience with the software nor Sun Grid Engine and I'm trying to setup an "off rig" cluster with SGE (though challenge). The (would be) cluster specs are:

* 1 head node with 8 cores, x86_64, 16 GB of RAM;
* 3 nodes with 4 cores, x86_64, 8 GB of RAM;
* CentOS 5.3;
* ~ 4 TB of storage.

The GS FLX Titanium software is already installed, on all nodes, with OpenMPI support.

I would be really great if you could share any information about how to setup Sun Grid Engine with this software like a tutorial, howtos, wikis, or even any *good* documentation about setting up SGE, its architecture, etc., would be excellent!.

Regarding the NFS server lockup: you could try sending the system and kernel logs to a remote syslog and see if the (high) load triggers some sort of kernel panic, just a thought...

If I can put the cluster together, I'll be happy to share our experiences

Thank you!

Best regards,
Joao

**dan** · 06-22-2009, 05:31 AM

Does the Newbler application 'runAssembly' work in an SGE environment? The Celera Assembler at least has some docs on this (although they look complex). I don't even know how to begin to submit my runAssembly to 'the cluster'.

AFAICT, we have several 16 Gb 8 core boxes. I am trying to assemble 2 full runs of GS FLX Titanium (~1 Bn bases at ~400 bp per run).

The progress of the assembly seems to get slower and slower ... (or perhaps I'm getting increasingly impatient). I did get CA to run on this data, giving me an assembly in about 1 day (on one box).

Thanks for any hints,

Dan.

**joa_ds** · 06-22-2009, 06:59 AM

hi everybody.

I am having a hard time keeping up with analysing data those lab people keep generating...

So i just need a quickie answer on my question before it start losing time again with figuring it out myself...

I configured the sequencer so that it automatically transfers everything after the imageprocessing step to our monster server to do basecalling.

My questions. This is the first shotgun experiment ever and also the first Titanium run. So I am a bit confused because I know Titanium has software updates but i dont know if I already installed it on the server.

So my questions: gsRunProcessor 2.0.00.22 (Build 184) -> is that Titanium ok software? Second question, is anything different for shotgun basecalling compared to amplicon basecalling?

I have the .cwf files here, and i was planning on hitting the 'runanalysispipe'. Will that work all right?

Everybody here keeps telling me i have to lookout for the software versions when using Titanium, but i guess there is not much i can do wrong when all i need to do is basecalling, right? I have my own tools to process the fasta/fastq files so i think i should not worry, right?

greetings from belgium

**westerman** · 06-22-2009, 07:29 AM

[QUOTE=joa_ds;6081]hi everybody.

So my questions: gsRunProcessor 2.0.00.22 (Build 184) -> is that Titanium ok software?

It certainly is not the latest. I am running 2.0.01.12 for my Titanium runs. Is there any reason why you can't upgrade to the latest version in order to make sure everything is compatible?

Second question, is anything different for shotgun basecalling compared to amplicon basecalling?

I am not sure but I believe not.

**dan** · 06-22-2009, 08:03 AM

Version sounds fine. Don't know about other questions.

Oh... I just saw westerman's reply. I'd run the latest version... where is that downloadable from?

**joa_ds** · 06-24-2009, 01:04 AM

hi, i just received an email with the new version 1.12 thingie. I installed it and I started my basecalling again and it appears that is working from the beginning now... I am hopefully

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 39 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 41 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 35 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Titanium software (MPI mode)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News