[Dask] Socket recv error, code: 54 on macOS #4116
Linking #3782 (comment) and #4064 (comment).
It has been about a month since we merged #4132, and since then I haven't seen this error in tests. Have you, @StrikerRUS? If not, I think this can be closed.
Yeah, it seems that #4132 was a great bug fix!
Hi, I'm a newbie to Dask.

Scenario 1: Jupyter Lab on macOS

Here are the package versions:

# conda list
python 3.8.10 h0e5c897_0_cpython conda-forge
dask 2021.9.0 pyhd8ed1ab_0 conda-forge
dask-core 2021.9.0 pyhd8ed1ab_0 conda-forge
distributed 2021.9.0 py38h50d1736_0 conda-forge
lightgbm 3.2.1 py38ha048514_0 conda-forge

When I ran the following example code in Jupyter Lab, everything went well in the beginning:

import dask.array as da
from distributed import Client, LocalCluster
from sklearn.datasets import make_regression
import lightgbm as lgb
cluster = LocalCluster(n_workers=2, dashboard_address=':9797')
client = Client(cluster)
X, y = make_regression(n_samples=7000, n_features=50)
dX = da.from_array(X, chunks=(100, 50))
dy = da.from_array(y, chunks=(100,))
print("beginning training")
dask_model = lgb.DaskLGBMRegressor(n_estimators=10)
dask_model.fit(dX, dy)
assert dask_model.fitted_
print("done training") But when I start changing the This error would be raised when Finding random open ports for workers
[LightGBM] [Info] Listening...
[LightGBM] [Info] Trying to bind port 51608...
[LightGBM] [Info] Binding port 51608 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Fatal] Socket recv error, code: 54
distributed.nanny - WARNING - Restarting worker
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
distributed.worker - WARNING - Compute Failed
Function: _train_part
args: ()
kwargs: {'model_factory': <class 'lightgbm.sklearn.LGBMRegressor'>, 'params': {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 10000, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'tree_learner': 'data', 'num_threads': 4, 'machines': '192.168.16.136:51606,192.168.16.136:51608', 'local_listen_port': 51606, 'time_out': 120, 'num_machines': 2}, 'list_of_parts': [{'data': array([[ 0.78009344, 1.54505727, -0.09284188, ..., 0.25296354,
-0.51492386, -0.8339314 ],
[-0.70814669, -1.26873321, 1.16311847, ..., -0.73389737,
0.0318654 , 0.40117427],
[ 0.19502484, 0.46833338, 0.64913442, ..., 0.3898734 ,
0.03733873, 0.43668199],
...,
[-
Exception: LightGBMError('Socket recv error, code: 54')
---------------------------------------------------------------------------
LightGBMError Traceback (most recent call last)
/var/folders/vv/s18mz7fj3pg2t9y0rvg_s6dw0000gq/T/ipykernel_93644/2905138063.py in <module>
2
3 dask_model = lgb.DaskLGBMRegressor(n_estimators=10000)
----> 4 dask_model.fit(dX, dy)
5 assert dask_model.fitted_
6
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/dask.py in fit(self, X, y, sample_weight, init_score, **kwargs)
882 ) -> "DaskLGBMRegressor":
883 """Docstring is inherited from the lightgbm.LGBMRegressor.fit."""
--> 884 return self._lgb_dask_fit(
885 model_factory=LGBMRegressor,
886 X=X,
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/dask.py in _lgb_dask_fit(self, model_factory, X, y, sample_weight, init_score, group, **kwargs)
615 params.pop("client", None)
616
--> 617 model = _train(
618 client=_get_dask_client(self.client),
619 data=X,
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/dask.py in _train(client, data, label, params, model_factory, sample_weight, init_score, group, **kwargs)
448 ]
449
--> 450 results = client.gather(futures_classifiers)
451 results = [v for v in results if v]
452 model = results[0]
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1941 else:
1942 local_worker = None
-> 1943 return self.sync(
1944 self._gather,
1945 futures,
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
838 return future
839 else:
--> 840 return sync(
841 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
842 )
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
324 if error[0]:
325 typ, exc, tb = error[0]
--> 326 raise exc.with_traceback(tb)
327 else:
328 return result[0]
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/distributed/utils.py in f()
307 if callback_timeout is not None:
308 future = asyncio.wait_for(future, callback_timeout)
--> 309 result[0] = yield future
310 except Exception:
311 error[0] = sys.exc_info()
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1806 exc = CancelledError(key)
1807 else:
-> 1808 raise exception.with_traceback(traceback)
1809 raise exc
1810 if errors == "skip":
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/dask.py in _train_part()
116 model.fit(data, label, sample_weight=weight, init_score=init_score, group=group, **kwargs)
117 else:
--> 118 model.fit(data, label, sample_weight=weight, init_score=init_score, **kwargs)
119
120 finally:
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/sklearn.py in fit()
816 callbacks=None, init_model=None):
817 """Docstring is inherited from the LGBMModel."""
--> 818 super().fit(X, y, sample_weight=sample_weight, init_score=init_score,
819 eval_set=eval_set, eval_names=eval_names, eval_sample_weight=eval_sample_weight,
820 eval_init_score=eval_init_score, eval_metric=eval_metric,
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/sklearn.py in fit()
681 init_model = init_model.booster_
682
--> 683 self._Booster = train(params, train_set,
684 self.n_estimators, valid_sets=valid_sets, valid_names=eval_names,
685 early_stopping_rounds=early_stopping_rounds,
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/engine.py in train()
247 evaluation_result_list=None))
248
--> 249 booster.update(fobj=fobj)
250
251 evaluation_result_list = []
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/basic.py in update()
2641 if self.__set_objective_to_none:
2642 raise LightGBMError('Cannot update due to null objective function.')
-> 2643 _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
2644 self.handle,
2645 ctypes.byref(is_finished)))
~/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/basic.py in _safe_call()
108 """
109 if ret != 0:
--> 110 raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
111
112
LightGBMError: Socket recv error, code: 54

Scenario 2: Jupyter Lab on Linux

I knew the Dask integration is only tested on Linux, so I built a simple image using the following Dockerfile:

FROM continuumio/miniconda3
COPY . /app
WORKDIR /app
RUN conda install -c conda-forge jupyterlab pandas matplotlib -y && \
conda install -c conda-forge lightgbm=3.2.1 -y && \
conda install -c conda-forge dask -y && \
conda install -c conda-forge graphviz python-graphviz -y
EXPOSE 8888
CMD ["/bin/bash"] The packages version within container. python 3.9.5 h12debd9_4
dask 2021.9.0 pyhd8ed1ab_0 conda-forge
dask-core 2021.9.0 pyhd8ed1ab_0 conda-forge
distributed 2021.9.0 py39hf3d152e_0 conda-forge
lightgbm 3.2.1 py39he80948d_0 conda-forge

I ran the same example code in Jupyter Lab as described above, also changing n_estimators to larger values. The only warning appeared when I set n_estimators to a large value:

distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 1.51 GiB -- Worker memory limit: 1.92 GiB

However, the Dask integration on Linux still works well for me, and I won't run the cluster on macOS in production. My question: is this situation still a kind of bug affecting macOS only?
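For reference, the page linked in that warning describes lowering glibc's memory trim threshold so that worker processes return freed memory to the OS. Below is a minimal sketch of that mitigation for a LocalCluster; the variable name and value come from that Dask page rather than from this thread, and it assumes spawned worker processes inherit the parent environment:

import os

# Suggested at https://distributed.dask.org/en/latest/worker.html#memtrim:
# lower glibc's trim threshold so freed memory is returned to the OS.
# This only has an effect on Linux with glibc malloc.
os.environ["MALLOC_TRIM_THRESHOLD_"] = "65536"

from distributed import Client, LocalCluster

# Worker processes started by the cluster inherit this environment variable.
cluster = LocalCluster(n_workers=2)
client = Client(cluster)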
@orcahmlee thanks very much for using LightGBM and the Dask interface! I noticed you're using lightgbm 3.2.1, which predates the fix merged in #4132. Please try installing from source and let us know whether or not that resolves the Socket recv error.

git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM/python-package
python setup.py install

For the warning about unmanaged memory use: this warning is more likely to happen for a larger value of n_estimators, since more boosting rounds mean a larger model held in memory. To learn more about that, you might be interested in my Dask Summit talk from earlier this year, "How Distributed LightGBM on Dask Works": https://github.com/jameslamb/talks/tree/main/how-distributed-lightgbm-on-dask-works.
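For reference, a quick way to confirm the source build is the one actually being imported (this assumes nothing beyond lightgbm's standard __version__ attribute; the value shown matches what the next comment reports):

import lightgbm

# A build from the master branch reports a dev version such as '3.2.1.99'.
print(lightgbm.__version__)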
@jameslamb Thanks for the kind and quick reply. Sorry for my late response; this is my first time building from source. Now my lightgbm version is 3.2.1.99:

# conda list
dask 2021.9.0 pyhd8ed1ab_0 conda-forge
dask-core 2021.9.0 pyhd8ed1ab_0 conda-forge
dask-kubernetes 2021.3.1 pyhd8ed1ab_0 conda-forge
distributed 2021.9.0 py38h50d1736_0 conda-forge
lightgbm 3.2.1.99 pypi_0 pypi

However, when I ran the same code, I kept receiving the Restarting worker warning shown in the output below:

import dask.array as da
from distributed import Client, LocalCluster
from sklearn.datasets import make_regression
import lightgbm as lgb
cluster = LocalCluster(n_workers=2, dashboard_address=':9797')
client = Client(cluster)
X, y = make_regression(n_samples=7000, n_features=50)
dX = da.from_array(X, chunks=(100, 50))
dy = da.from_array(y, chunks=(100,))
print("beginning training")
dask_model = lgb.DaskLGBMRegressor(n_estimators=10)
dask_model.fit(dX, dy)
assert dask_model.fitted_
print("done training") Output from Jupyter /Users/andrew/miniconda3/envs/lightgbm/lib/python3.8/site-packages/lightgbm/dask.py:525: UserWarning: Parameter n_jobs will be ignored.
_log_warning(f"Parameter {param_alias} will be ignored.")
Finding random open ports for workers
[LightGBM] [Info] Trying to bind port 51431...
[LightGBM] [Info] Binding port 51431 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 260 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 338 milliseconds
[LightGBM] [Info] Trying to bind port 51432...
[LightGBM] [Info] Binding port 51432 succeeded
[LightGBM] [Info] Listening...
distributed.nanny - WARNING - Restarting worker
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
distributed.nanny - WARNING - Restarting worker

Did I miss something?
No problem, we are here to help! Looks like you're running into a situation where Dask is restarting one of your workers. One possibility is that your worker ran out of memory and was killed by Dask. You can try re-running training and watching the memory utilization in the Dask dashboard. If you do that and that doesn't seem to be the case, please report a new issue at https://github.com/microsoft/LightGBM/issues using the excellent example code in your post above, and we can work on it there (since you don't seem to be facing the Socket recv error from this issue anymore).

LightGBM distributed training is not currently resilient to worker failures (partially documented in #3775, but we probably need another feature request for training being interrupted), and in the situation where a Dask worker dies you might not get an informative error.
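A minimal sketch of one way to test the out-of-memory theory, using only the standard memory_limit argument of distributed.LocalCluster and Client.get_worker_logs(); the 4GB limit is illustrative, not a recommendation from this thread:

import dask.array as da
from distributed import Client, LocalCluster
from sklearn.datasets import make_regression
import lightgbm as lgb

# Give each worker an explicit memory budget; if training succeeds with a
# larger limit, the earlier restarts were probably memory-related.
cluster = LocalCluster(n_workers=2, memory_limit="4GB")
client = Client(cluster)
print(client.dashboard_link)  # watch per-worker memory here during training

X, y = make_regression(n_samples=7000, n_features=50)
dX = da.from_array(X, chunks=(100, 50))
dy = da.from_array(y, chunks=(100,))
lgb.DaskLGBMRegressor(n_estimators=10000).fit(dX, dy)

# If a worker was restarted, its logs often contain the underlying error.
print(client.get_worker_logs())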
Yes, I did watch the Dask dashboard.
@orcahmlee I'd be happy to look into this when I have some availability today or tomorrow. For now, can you please take this excellent writeup from your recent comment and open a new issue, as I requested in #4116 (comment)?
@jameslamb Thanks for your help. I opened a new issue to track this situation: #4625.
Thanks so much for the excellent write-up there! I'm going to lock discussion on this issue to encourage others to start new ones in the future.
Just saw this error in master for macOS in one of the examples. Full logs: