Description

I am training a LightGBM model on a distributed cluster using the Dask interface. When I set the early_stopping_rounds parameter, the training job hangs indefinitely whenever the early stopping condition appears to be triggered.
For example, here are the logs from one of the machines in the cluster, with early_stopping_rounds set to 4:
```
2024-03-04T16:58:39.986-05:00  [85]  train's binary_logloss: 0.0897921  validation's binary_logloss: 0.0998643
2024-03-04T16:58:39.986-05:00  [85]  train's binary_logloss: 0.0897921  validation's binary_logloss: 0.0998643
2024-03-04T16:58:45.000-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:45.000-05:00  [86]  train's binary_logloss: 0.0896056  validation's binary_logloss: 0.100004
2024-03-04T16:58:45.000-05:00  [86]  train's binary_logloss: 0.0896056  validation's binary_logloss: 0.100004
2024-03-04T16:58:50.001-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:50.002-05:00  [87]  train's binary_logloss: 0.0894872  validation's binary_logloss: 0.0999423
2024-03-04T16:58:50.002-05:00  [87]  train's binary_logloss: 0.0894872  validation's binary_logloss: 0.0999423
2024-03-04T16:58:56.011-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:56.011-05:00  [88]  train's binary_logloss: 0.0893585  validation's binary_logloss: 0.0999157
2024-03-04T16:58:56.011-05:00  [88]  train's binary_logloss: 0.0893585  validation's binary_logloss: 0.0999157
2024-03-04T16:59:00.013-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:59:00.013-05:00  [89]  train's binary_logloss: 0.0892702  validation's binary_logloss: 0.0998719
2024-03-04T16:59:00.013-05:00  [89]  train's binary_logloss: 0.0892702  validation's binary_logloss: 0.0998719
2024-03-04T16:59:00.013-05:00  [LightGBM] [Info] Finished linking network in 382.644965 seconds
```
It seems that in a distributed Dask training setting, if any individual worker in the cluster hits the early stopping condition, the entire job hangs indefinitely. No error. No warning.
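Reproducible example

Here is a minimal sketch of the setup described above. The synthetic data, cluster sizing, and hyperparameter values are illustrative stand-ins, not the original job's:

```python
import dask.array as da
import lightgbm as lgb
from distributed import Client, LocalCluster

# Sketch of the setup: a synthetic binary-classification problem trained
# through the Dask interface with early_stopping_rounds set to 4.
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

X = da.random.random((10_000, 20), chunks=(1_000, 20))
y = (da.random.random(10_000, chunks=1_000) > 0.5).astype(int)
X_valid = da.random.random((2_000, 20), chunks=(1_000, 20))
y_valid = (da.random.random(2_000, chunks=1_000) > 0.5).astype(int)

clf = lgb.DaskLGBMClassifier(
    client=client,
    n_estimators=500,
    early_stopping_rounds=4,  # the hang appears once this condition is hit
)
clf.fit(X, y, eval_set=[(X_valid, y_valid)])
```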
Environment info

LightGBM version or commit hash: 3.3.5

Command(s) you used to install LightGBM: installed on Linux via the python:3.9.16-bullseye Docker image.
Additional Comments
jameslamb changed the title to "[dask] [python-package] Early stopping causes DaskLGBMClassifier to hang" on Mar 5, 2024
Early stopping is not currently supported in the Dask interface. You can subscribe to #3712 to be notified when that work is picked up. We'd also welcome a contribution if you'd like to work on it!
If you're using lightgbm.dask, please upgrade to at least LightGBM 4.0 (and preferably to the latest version, v4.3.0). There have been 2+ years of improvements and bug fixes since v3.3.5.
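Until that lands, a possible stopgap is to strip early-stopping settings before they reach lightgbm.dask. A minimal sketch, assuming your parameters live in a plain dict (the parameter values below are illustrative):

```python
import lightgbm as lgb

# Illustrative parameters; "early_stopping_rounds" is the setting that
# lightgbm.dask cannot honor (see #3712).
params = {
    "objective": "binary",
    "n_estimators": 500,
    "early_stopping_rounds": 4,
}

# Drop the early-stopping aliases so training runs to n_estimators
# instead of hanging when the stopping condition would have triggered.
dask_params = {
    k: v
    for k, v in params.items()
    if k not in ("early_stopping_round", "early_stopping_rounds")
}
clf = lgb.DaskLGBMClassifier(**dask_params)
```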
Ah, okay. In that case, the Dask interface should raise a warning if early_stopping_rounds is passed in. Maybe even an error, since passing it actually seems to cause problems.
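Something like the following guard is what I have in mind; the function name and where it would hook in are hypothetical, not LightGBM's actual internals:

```python
def _reject_early_stopping(params: dict) -> None:
    # Hypothetical sketch of the requested guard, not lightgbm.dask's real
    # code: refuse early-stopping settings up front instead of letting the
    # distributed job hang.
    for alias in ("early_stopping_round", "early_stopping_rounds"):
        if params.get(alias) is not None:
            raise ValueError(
                f"'{alias}' is not supported by the Dask interface; "
                "see https://github.com/microsoft/LightGBM/issues/3712"
            )

# Example: _reject_early_stopping({"early_stopping_rounds": 4}) raises.
```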