ERROR:ssh: Could not resolve hostname deephealth_compss-worker_2: Name or service not known #2

Open
kinow opened this issue Jun 20, 2022 · 1 comment


kinow commented Jun 20, 2022

Hi,

I followed the instructions in the README.md file and got the Docker Compose cluster working after #1.

However, running the example results in the error below:

kinow@ranma:/tmp/docker-compss-runtime$ docker-compose exec compss-master bash
(eddl_onnx_last) root@c6e014895899:~# cd pyeddl/third_party/compss_runtime/
(eddl_onnx_last) root@c6e014895899:~/pyeddl/third_party/compss_runtime# runcompss --lang=python --python_interpreter=python3 --project=linux-based/project.xml --resources=linux-based/resources.xml eddl_train_batch_compss.py
[  INFO] Using default execution type: compss

----------------- Executing eddl_train_batch_compss.py --------------------------

WARNING: COMPSs Properties file is null. Setting default values
[(778)    API]  -  Starting COMPSs Runtime v2.6.rc2003 (build 20200408-1126.rcbac84bafe556637e165de38764868ac68a8a75e)
Sleeping 30 seconds...
E:  uname_result(system='Linux', node='c6e014895899', release='5.4.0-120-generic', version='#136-Ubuntu SMP Fri Jun 10 13:40:48 UTC 2022', machine='x86_64', processor='x86_64')
Generating Random Table
---------------------------------------------
---------------------------------------------

None
CS with low memory setup
Model training...
Number of epochs:  1
Number of epochs for parameter syncronization:  1
Training epochs [ 1  -  1 ] ...
Num workers:  4
Num images per worker:  15000
Workers batch size:  250
[ERRMGR]  -  WARNING: There was an exception when initiating worker deephealth_compss-worker_4.
[ERRMGR]  -  WARNING: There was an exception when initiating worker deephealth_compss-worker_2.
                      Stack trace:
                      Stack trace:
                      es.bsc.compss.exceptions.InitNodeException: [START_CMD_ERROR]: Could not start the NIO worker in resource deephealth_compss-worker_4 through user .
                      es.bsc.compss.exceptions.InitNodeException: [START_CMD_ERROR]: Could not start the NIO worker in resource deephealth_compss-worker_2 through user .
                      OUTPUT:
                      OUTPUT:
                      ERROR:ssh: Could not resolve hostname deephealth_compss-worker_2: Name or service not known
                      
                      	at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:90)
                      	at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:142)
                      	at es.bsc.compss.nio.master.NIOWorkerNode.start(NIOWorkerNode.java:153)
                      	at es.bsc.compss.types.resources.ResourceImpl.start(ResourceImpl.java:119)
                      	at es.bsc.compss.scheduler.types.allocatableactions.StartWorkerAction$1.run(StartWorkerAction.java:109)
[ERRMGR]  -  ERROR:   [START_CMD_ERROR]: Could not start the NIO worker in resource deephealth_compss-worker_2 through user .
                      OUTPUT:
                      ERROR:ssh: Could not resolve hostname deephealth_compss-worker_2: Name or service not known
[ERRMGR]  -  Shutting down COMPSs...
                      ERROR:ssh: Could not resolve hostname deephealth_compss-worker_4: Name or service not known
                      
                      	at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:90)
                      	at es.bsc.compss.nio.master.starters.WorkerStarter.startWorker(WorkerStarter.java:142)
                      	at es.bsc.compss.nio.master.NIOWorkerNode.start(NIOWorkerNode.java:153)
                      	at es.bsc.compss.types.resources.ResourceImpl.start(ResourceImpl.java:119)
                      	at es.bsc.compss.scheduler.types.allocatableactions.StartWorkerAction$1.run(StartWorkerAction.java:109)
[(163161)    API]  -  Execution Finished
Shutting down the running process

Error running application

(eddl_onnx_last) root@c6e014895899:~/pyeddl/third_party/compss_runtime#

Thanks!
-Bruno

kinow commented Jun 20, 2022

The last time I used the --scale argument was a long time ago, with the PBS Torque Docker image. It looks like Docker Compose now adds a slug (a random hash appended to the container name).

That makes it harder to use --scale as described in the documentation, since the master cannot resolve the worker hostnames.
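To see which worker names the master can actually resolve, a small check along these lines could be run inside the master container. This is only a sketch: the function names are made up for illustration, and the `deephealth_compss-worker_N` pattern and the count of 4 are taken from the log above, not from the COMPSs configuration files.

```python
import socket


def worker_hostnames(project="deephealth", service="compss-worker", count=4):
    """Build the names seen in the log, e.g. deephealth_compss-worker_1.

    The pattern and defaults are assumptions based on the error output.
    """
    return [f"{project}_{service}_{i}" for i in range(1, count + 1)]


def unresolved_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that the resolver cannot map to an address."""
    missing = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing
```

Calling `unresolved_hosts(worker_hostnames())` from the master should return an empty list once every worker is reachable by name; any name it returns is one that ssh would also fail to resolve.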

Here's a diff that made the README instructions work (it could serve as a replacement for #1):

diff --git a/docker-compose.yaml b/docker-compose.yaml
index 59bedae..ba31cc6 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -1,12 +1,29 @@
 version: '3.7'
 services:
-  compss-worker:
+  compss-worker_1:
     image: "bscppc/compss-deephealth-demo"
     command: ["-c", "/usr/sbin/sshd -D"]
-       
+    container_name: deephealth_compss-worker_1
+  compss-worker_2:
+    image: "bscppc/compss-deephealth-demo"
+    command: ["-c", "/usr/sbin/sshd -D"]
+    container_name: deephealth_compss-worker_2
+  compss-worker_3:
+    image: "bscppc/compss-deephealth-demo"
+    command: ["-c", "/usr/sbin/sshd -D"]
+    container_name: deephealth_compss-worker_3
+  compss-worker_4:
+    image: "bscppc/compss-deephealth-demo"
+    command: ["-c", "/usr/sbin/sshd -D"]
+    container_name: deephealth_compss-worker_4
+
   compss-master:
     image: "bscppc/compss-deephealth-demo"
     stdin_open: true
     tty: true
     depends_on:
-      - compss-worker
+      - compss-worker_1
+      - compss-worker_2
+      - compss-worker_3
+      - compss-worker_4
+    container_name: deephealth_compss-master_1

I tried using a single worker, but I think the master configuration is set to 4 workers, so I found it easier to add the four workers directly in docker-compose.yaml.
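If the number of workers ever needs to change, the repeated service entries could be generated rather than copied by hand. The following is a rough sketch, not part of the repository: `compose_yaml` is a hypothetical helper, and it simply reproduces the layout of the diff above (image, command, and `container_name` values) for a given worker count.

```python
def compose_yaml(workers=4, project="deephealth",
                 image="bscppc/compss-deephealth-demo"):
    """Render a docker-compose.yaml with one explicitly named service per worker."""
    names = [f"compss-worker_{i}" for i in range(1, workers + 1)]
    lines = ["version: '3.7'", "services:"]
    for name in names:
        lines += [
            f"  {name}:",
            f'    image: "{image}"',
            '    command: ["-c", "/usr/sbin/sshd -D"]',
            # Fixed container name so the master can resolve it by hostname.
            f"    container_name: {project}_{name}",
        ]
    lines += [
        "",
        "  compss-master:",
        f'    image: "{image}"',
        "    stdin_open: true",
        "    tty: true",
        "    depends_on:",
    ]
    lines += [f"      - {name}" for name in names]
    lines.append(f"    container_name: {project}_compss-master_1")
    return "\n".join(lines) + "\n"
```

Writing the result of `compose_yaml()` to docker-compose.yaml should give the same file as the diff for four workers. The worker count would still need to match whatever the master's project.xml and resources.xml expect.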

Thanks!
