We are building an application for analyzing NGS data. The application uses the Kepler workflow engine (v2.4) for workflow execution, and we have integrated HTCondor (v8.2.3) with Kepler for grid enablement. The application is installed on Ubuntu 12.04 as a single-node deployment, running on a high-end AWS instance.
The problem we are facing is this:
When we submit a workflow containing multiple files to Kepler, it creates a generic job launcher actor for each step. Each generic job launcher actor has a JDL which it submits to the Condor manager. We have observed that when more than 4 JDLs are present, intermittently one of the JDLs is not processed (i.e. one of the actors does not execute). The job remains in the idle state for some time and then gets evicted. Our JDL directs Condor to generate three files on execution (.out, .log, and .err), but some of the JDLs never execute at all, so no files are generated.
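For reference, a minimal submit description file of the shape our generated JDLs take might look like the sketch below; the executable path and file names are placeholders, not our actual values:

```
# Hypothetical sketch of one generated submit file (paths/names are placeholders)
universe   = vanilla
executable = /opt/ngs/bin/analysis_step.sh
arguments  = input_$(Process).fastq

# The three files we expect Condor to produce per job
output     = job_$(Cluster).$(Process).out
error      = job_$(Cluster).$(Process).err
log        = job_$(Cluster).$(Process).log

request_cpus   = 1
request_memory = 2048

queue
```

If `request_cpus`/`request_memory` exceed what the single node's slots advertise, jobs will sit idle with no match, which would be consistent with what we see.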
We have tried to troubleshoot this by changing the memory parameters (heap size) for Condor and Kepler, but that has not helped.
Please note that we can run workflows containing up to 2 files successfully. Our requirement, however, is to run 1000 or more jobs simultaneously.
Our thoughts on this issue:
1) There might be a problem with resource (core) allocation.
2) Jobs are being evicted, or possibly preempted.
3) A condor_suspend signal may be sent to the job; if the job stays in the idle/"wait" state for long enough, Condor evicts it automatically after a specified time.
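To narrow down the three possibilities above, we are planning to use HTCondor's own diagnostic tools. A few commands like the following should show why a job stays idle and what the pool actually advertises (the job id 123.0 and log file name are placeholders):

```shell
# Explain why a specific idle job is not matching any slot
condor_q -better-analyze 123.0

# Summarize the slots/cores the pool advertises on this node
condor_status -total

# Search the job's user log for eviction events and their context
grep -A 2 "Job was evicted" job_123.0.log
```

`condor_q -better-analyze` in particular reports how many machines rejected the job's requirements, which should distinguish a matchmaking problem (point 1) from eviction/preemption after a match (points 2 and 3).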
Has anybody else faced a similar problem with Condor?