Hi,
I followed the instructions in the README.md file and got the Docker Compose cluster working after #1.
However, running the example results in the error below.
kinow@ranma:/tmp/docker-compss-runtime$ docker-compose exec compss-master bash
(eddl_onnx_last) root@c6e014895899:~# cd pyeddl/third_party/compss_runtime/
(eddl_onnx_last) root@c6e014895899:~/pyeddl/third_party/compss_runtime# runcompss --lang=python --python_interpreter=python3 --project=linux-based/project.xml --resources=linux-based/resources.xml eddl_train_batch_compss.py
[ INFO] Using default execution type: compss
----------------- Executing eddl_train_batch_compss.py --------------------------
WARNING: COMPSs Properties file is null. Setting default values
[(778) API] - Starting COMPSs Runtime v2.6.rc2003 (build 20200408-1126.rcbac84bafe556637e165de38764868ac68a8a75e)
Sleeping 30 seconds...
E: uname_result(system='Linux', node='c6e014895899', release='5.4.0-120-generic', version='#136-Ubuntu SMP Fri Jun 10 13:40:48 UTC 2022', machine='x86_64', processor='x86_64')
Generating Random Table
---------------------------------------------
---------------------------------------------
None
CS with low memory setup
Model training...
Number of epochs: 1
Number of epochs for parameter syncronization: 1
Training epochs [ 1 - 1 ] ...
Num workers: 4
Num images per worker: 15000
Workers batch size: 250
[ERRMGR] - WARNING: There was an exception when initiating worker deephealth_compss-worker_4.
[ERRMGR] - WARNING: There was an exception when initiating worker deephealth_compss-worker_2.
Stack trace:
Stack trace:
es.bsc.compss.exceptions.InitNodeException: [START_CMD_ERROR]: Could not start the NIO worker in resource deephealth_compss-worker_4 through user .
es.bsc.compss.exceptions.InitNodeException: [START_CMD_ERROR]: Could not start the NIO worker in resource deephealth_compss-worker_2 through user .
OUTPUT:
OUTPUT:
ERROR:ssh: Could not resolve hostname deephealth_compss-worker_2: Name or service not known
at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:90)
at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:142)
at es.bsc.compss.nio.master.NIOWorkerNode.start(NIOWorkerNode.java:153)
at es.bsc.compss.types.resources.ResourceImpl.start(ResourceImpl.java:119)
at es.bsc.compss.scheduler.types.allocatableactions.StartWorkerAction$1.run(StartWorkerAction.java:109)
[ERRMGR] - ERROR: [START_CMD_ERROR]: Could not start the NIO worker in resource deephealth_compss-worker_2 through user .
OUTPUT:
ERROR:ssh: Could not resolve hostname deephealth_compss-worker_2: Name or service not known
[ERRMGR] - Shutting down COMPSs...
ERROR:ssh: Could not resolve hostname deephealth_compss-worker_4: Name or service not known
at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:90)
at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:142)
at es.bsc.compss.nio.master.NIOWorkerNode.start(NIOWorkerNode.java:153)
at es.bsc.compss.types.resources.ResourceImpl.start(ResourceImpl.java:119)
at es.bsc.compss.scheduler.types.allocatableactions.StartWorkerAction$1.run(StartWorkerAction.java:109)
[(163161) API] - Execution Finished
Shutting down the running process
Error running application
(eddl_onnx_last) root@c6e014895899:~/pyeddl/third_party/compss_runtime#
Thanks!
-Bruno
The last time I used the --scale argument was a long time ago, with the PBS Torque Docker image. It looks like Docker Compose now adds a slug (a random hash appended to the container name).
That makes it harder to use --scale as described in the documentation, since the master cannot resolve the worker hostnames.
Here's a diff that made the README instructions work (it could serve as a replacement for #1).
I tried using a single worker, but I think the master configuration is set to 4 workers, so I found it easier to just add the four workers directly in docker-compose.yaml.
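For readers who want a rough picture of the approach described above (declaring the four workers explicitly instead of relying on --scale), here is a minimal sketch. It is not the diff from this thread. The image name is a placeholder, the worker service names are made up for illustration, and the names deephealth_compss-worker_1 and deephealth_compss-worker_3 are assumed by analogy, since only worker_2 and worker_4 appear in the log. It relies on Docker's built-in DNS resolving container names on the default Compose network.

```yaml
# Hypothetical sketch only; adapt names and images to the real docker-compose.yaml.
version: "3"
services:
  compss-master:                  # service name taken from the exec command in the log
    image: compss-example-image   # placeholder image name (assumption)
  compss-worker-1:                # illustrative service name (assumption)
    image: compss-example-image
    # Fixed container name, so no random slug is appended and the
    # master can resolve the host it tries to reach over ssh.
    container_name: deephealth_compss-worker_1   # assumed by analogy
  compss-worker-2:
    image: compss-example-image
    container_name: deephealth_compss-worker_2   # name seen in the log
  compss-worker-3:
    image: compss-example-image
    container_name: deephealth_compss-worker_3   # assumed by analogy
  compss-worker-4:
    image: compss-example-image
    container_name: deephealth_compss-worker_4   # name seen in the log
```

A service with a fixed container_name cannot be scaled beyond one container, which is why each worker gets its own entry here.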