(raylet) socket.gaierror: [Errno -2] Name or service not known #8

xunaichao opened this issue May 31, 2022 · 9 comments


When I run the TensorFlow 2 example, I get the following error:
(raylet) Traceback (most recent call last):
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/", line 334, in
(raylet) raise e
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/", line 323, in
(raylet) loop.run_until_complete(
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/asyncio/", line 568, in run_until_complete
(raylet) return future.result()
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/", line 138, in run
(raylet) modules = self._load_modules()
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/", line 92, in _load_modules
(raylet) c = cls(self)
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/modules/reporter/", line 72, in init
(raylet) self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/", line 76, in init
(raylet) namespace="ray", port=metrics_export_port)))
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/", line 334, in new_stats_exporter
(raylet) options=option, gatherer=option.registry, collector=collector)
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/", line 266, in init
(raylet) self.serve_http()
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/", line 321, in serve_http
(raylet) port=self.options.port, addr=str(self.options.address))
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/prometheus_client/", line 168, in start_wsgi_server
(raylet) TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/prometheus_client/", line 157, in _get_best_family
(raylet) infos = socket.getaddrinfo(address, port)
(raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/", line 753, in getaddrinfo
(raylet) for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet) socket.gaierror: [Errno -2] Name or service not known
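
The traceback ends in `socket.getaddrinfo`, so one quick check is whether the relevant names resolve inside the container. The following is a minimal sketch using only the standard library; `can_resolve` is a hypothetical helper name, and checking the container's own hostname and `localhost` is an assumption about which address the metrics exporter tried to bind.

```python
import socket


def can_resolve(host, port=0):
    """Return True if socket.getaddrinfo can resolve `host`, else False."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Same exception type as the one at the end of the traceback.
        return False


if __name__ == "__main__":
    # Check the container's own hostname and localhost.
    for name in (socket.gethostname(), "localhost"):
        print(name, "resolves" if can_resolve(name) else "does NOT resolve")
```

If the container hostname does not resolve, adding it to /etc/hosts is one possible fix to try.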

Hosts file

After running the example, session files are generated under /tmp/ray/ on the system.

Runtime environment: a Docker deployment that uses Miniconda to install Analytics Zoo (AZ) and Ray

conda create -n zoo python=3.7
conda activate zoo
pip install --pre --upgrade analytics-zoo
pip install analytics-zoo[ray]
pip install tensorflow==2.3.0

conda list

1. Check Python:
from zoo.util.utils import detect_python_location

2. Check Ray installation:
/usr/local/miniconda3/envs/zoo/bin/python /usr/local/miniconda3/envs/zoo/bin/ray start --head --include-dashboard true --dashboard-host --port 35413 --redis-password 123456 --num-cpus 1

/usr/local/miniconda3/envs/zoo/bin/python /usr/local/miniconda3/envs/zoo/bin/ray start --address --redis-password 123456 --num-cpus 1

ray start --address='' --redis-password='0'



Please help solve it. Thank you

I'm going crazy

hkvision commented Jun 1, 2022

Hi @xunaichao

I checked the code and ran it on Google Colab, and I get this error as well. However, the error doesn't seem to affect or interrupt the run: you can find the training and evaluation results in your log. The error seems to come from the Ray dashboard; I'm not sure whether it is caused by the out-of-date Ray version.

As mentioned above, we highly recommend switching to the latest version of BigDL. I ran the same BigDL example in Google Colab and there was no such error.

xunaichao commented Jun 1, 2022

@jason-dai @hkvision
Thanks for your response. I have followed the instructions you gave.
I now run my code and get an exception.

run logs:

2022-06-01 10:01:10.069315: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2022-06-01 10:01:10.074183: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2022-06-01 10:01:10.074198: I tensorflow/stream_executor/cuda/] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Initializing orca context
Current pyspark location is : /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/pyspark/
Start to getOrCreate SparkContext
pyspark_submit_args is: --driver-class-path /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/core/lib/all-2.1.0-20220314.094552-2.jar:/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/dllib/lib/bigdl-dllib-spark_2.4.6-2.0.0-jar-with-dependencies.jar:/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/orca/lib/bigdl-orca-spark_2.4.6-2.0.0-jar-with-dependencies.jar pyspark-shell
2022-06-01 10:01:13 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-06-01 10:01:14,896 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-01 10:01:14,898 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-01 10:01:14,899 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-01 10:01:14,899 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
22-06-01 10:01:14 [Thread-4] INFO Engine$:121 - Auto detect executor number and executor cores number
22-06-01 10:01:14 [Thread-4] INFO Engine$:123 - Executor number is 1 and executor cores number is 4

User settings:


Effective settings:

KMP_CPUINFO_FILE: value is not defined
KMP_FORCE_REDUCTION: value is not defined
OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
OMP_PLACES: value is not defined
OMP_TOOL_LIBRARIES: value is not defined

22-06-01 10:01:15 [Thread-4] INFO ThreadPool$:95 - Set mkl threads to 1 on thread 30
2022-06-01 10:01:15 WARN SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
22-06-01 10:01:15 [Thread-4] INFO Engine$:446 - Find existing spark context. Checking the spark conf...
BigDLBasePickler registering: bigdl.dllib.utils.common Sample
BigDLBasePickler registering: bigdl.dllib.utils.common EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
Successfully got a SparkContext
2022-06-01 10:01:18,220 INFO -- View the Ray dashboard at
2022-06-01 10:01:18,225 WARNING -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
{'node_ip_address': '', 'raylet_ip_address': '', 'redis_address': '', 'object_store_address': '/tmp/ray/session_2022-06-01_10-01-15_641395_1703868/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-06-01_10-01-15_641395_1703868/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2022-06-01_10-01-15_641395_1703868', 'metrics_export_port': 47074, 'node_id': 'a6dd76c71c04c32df5e009bc951165e1b0e85486a8a75d23fb5ab9ed'}
(Worker pid=1704437) 2022-06-01 10:01:19.629608: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
(Worker pid=1704437) 2022-06-01 10:01:19.634737: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/cv2/../../lib64:
(Worker pid=1704437) 2022-06-01 10:01:19.634753: I tensorflow/stream_executor/cuda/] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=1704437) WARNING:tensorflow:From /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/ _CollectiveAllReduceStrategyExperimental.init (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
(Worker pid=1704437) Instructions for updating:
(Worker pid=1704437) use distribute.MultiWorkerMirroredStrategy instead
(Worker pid=1704437) 2022-06-01 10:01:21.270040: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/cv2/../../lib64:
(Worker pid=1704437) 2022-06-01 10:01:21.270095: W tensorflow/stream_executor/cuda/] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=1704437) 2022-06-01 10:01:21.270135: I tensorflow/stream_executor/cuda/] kernel driver does not appear to be running on this host (816d2073a24f): /proc/driver/nvidia/version does not exist
(Worker pid=1704437) 2022-06-01 10:01:21.271364: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
(Worker pid=1704437) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=1704437) 2022-06-01 10:01:21.297690: I tensorflow/core/distributed_runtime/rpc/] Initialize GrpcChannelCache for job worker -> {0 ->}
(Worker pid=1704437) 2022-06-01 10:01:21.297883: I tensorflow/core/distributed_runtime/rpc/] Initialize GrpcChannelCache for job worker -> {0 ->}
(Worker pid=1704437) 2022-06-01 10:01:21.299556: I tensorflow/core/distributed_runtime/rpc/] Started server with target: grpc://
(raylet) /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/dashboard/ DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(raylet) if LooseVersion(aiohttp.version) < LooseVersion("4.0.0"):
(raylet) /usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/dashboard/ DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(raylet) if LooseVersion(aiohttp.version) < LooseVersion("4.0.0"):
Traceback (most recent call last):
File "", line 656, in
File "", line 643, in main
trainer = Estimator.from_keras(model_creator=model_creator)
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/", line 69, in from_keras
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/", line 96, in init
for i, worker in enumerate(self.remote_workers)])
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/_private/", line 105, in wrapper
(Worker pid=1704437) 2022-06-01 10:01:27.086318: W tensorflow/core/util/] Could not open ./yolov3/yolov3.weights: DATA_LOSS: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
return func(*args, **kwargs)
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/ray/", line 1713, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::Worker.setup_distributed() (pid=1704437, ip=, repr=<bigdl.orca.learn.dl_cluster.Worker object at 0x7faab3e7fcd0>)
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/", line 321, in setup_distributed
self.model = self.model_creator(self.config)
File "", line 571, in model_creator
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/keras/utils/", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/h5py/_hl/", line 394, in init
File "/usr/local/miniconda3/envs/py37/lib/python3.7/site-packages/h5py/_hl/", line 170, in make_fid
fid =, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 85, in
OSError: Unable to open file (file signature not found)
Stopping orca context

The code I used is pasted here:

conda list:

Thank you for your help!

It seems you may be trying to load the wrong weights:

./yolov3/yolov3.weights: DATA_LOSS: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

You may need to convert the pre-trained darknet weights first, as is done in the YOLOv3 example.
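
The `OSError: Unable to open file (file signature not found)` in the traceback comes from h5py, which suggests the file passed to the Keras loader is not an HDF5 file. A minimal standard-library sketch to check this before loading is shown here; `looks_like_hdf5` is a hypothetical helper name, and it only checks for the HDF5 signature at offset 0, which is where it normally sits.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # 8-byte signature at the start of an HDF5 file


def looks_like_hdf5(path):
    """Return True if the file at `path` starts with the HDF5 signature."""
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

A raw darknet `yolov3.weights` file would be expected to return False here, which is consistent with needing a conversion step first.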

You can also refer to our YOLOv3 example in BigDL. Hope that helps.

May I ask whether you hit the same error with your plain TensorFlow code (without using BigDL), i.e. in your tflocal mode?

We used this example to save the model, with a modified save step.
We now get the .pb file successfully, but we hit an exception when using the OpenVINO Model Optimizer to convert the model to IR format. The error is:

Model Optimizer arguments:
Common parameters:
- Path to the Input Model: /az/test1/saved_model.pb
- Path for generated IR: /opt/intel/openvino_2021.4.752/deployment_tools/model_optimizer/.
- IR output name: saved_model
- Log level: ERROR
- Batch: Not specified, inherited from the model
- Input layers: Not specified, inherited from the model
- Output layers: Not specified, inherited from the model
- Input shapes: [1,120,120,3]
- Mean values: Not specified
- Scale values: Not specified
- Scale factor: Not specified
- Precision of IR: FP32
- Enable fusing: True
- Enable grouped convolutions fusing: True
- Move mean values to preprocess section: None
- Reverse input channels: False
TensorFlow specific parameters:
- Input model in text protobuf format: False
- Path to model dump for TensorBoard: None
- List of shared libraries with TensorFlow custom layers implementation: None
- Update the configuration file with input/output node names: None
- Use configuration file used to generate the model with Object Detection API: None
- Use the config file: None
- Inference Engine found in: /opt/intel/openvino_2021.4.752/python/python3.6/openvino
Inference Engine version: 2021.4.2-3974-e2a469a3450-releases/2021/4
Model Optimizer version: 2021.4.2-3974-e2a469a3450-releases/2021/4
2022-06-02 09:03:05.880400: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/opt/intel/openvino_2021.4.752/deployment_tools/model_optimizer/mo/utils/../../../inference_engine/lib/intel64:/opt/intel/openvino_2021.4.752/deployment_tools/model_optimizer/mo/utils/../../../inference_engine/external/tbb/lib:/opt/intel/openvino_2021.4.752/deployment_tools/model_optimizer/mo/utils/../../../ngraph/lib
2022-06-02 09:03:05.880451: I tensorflow/stream_executor/cuda/] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/ DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
[ FRAMEWORK ERROR ] Cannot load input model: TensorFlow cannot read the model file: "/az/test1/saved_model.pb" is incorrect TensorFlow model file.
The file should contain one of the following TensorFlow graphs:

  1. frozen graph in text or binary format
  2. inference graph for freezing with checkpoint (--input_checkpoint) in text or binary format
  3. meta graph

Make sure that --input_model_is_text is provided for a model in text format. By default, a model is interpreted in binary format. Framework error details: Error parsing message.
For more information please refer to Model Optimizer FAQ, question #43.
Can you help us? Thank you very much!
@yushan111 thank you for the example, it helps a lot!

You can get a tf.keras model with est.get_model(), and you can then save the model with the tf.saved_model API.
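
Note that the tf.saved_model API writes a directory (a `saved_model.pb` file plus a `variables/` subdirectory) rather than a single frozen graph file. A minimal standard-library sketch to check that a path has this layout is shown here; `is_saved_model_dir` is a hypothetical helper name, not part of TensorFlow or BigDL.

```python
import os


def is_saved_model_dir(path):
    """Return True if `path` looks like a TensorFlow SavedModel directory."""
    has_pb = os.path.isfile(os.path.join(path, "saved_model.pb"))
    has_vars = os.path.isdir(os.path.join(path, "variables"))
    return has_pb and has_vars
```

If `/az/test1` passes this check, its `saved_model.pb` is a SavedModel protobuf, not one of the graph types listed in the Model Optimizer error. The Model Optimizer documents a `--saved_model_dir` option for this layout; whether it resolves the error in this case is untested.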

After that, how you use your TensorFlow model is up to you.

As for using OpenVINO to convert your TensorFlow model, you could open an issue in the OpenVINO project.

@yushan111 thanks for your help

