Description

I am training a LightGBM model on a distributed cluster using the Dask interface. When I set the early_stopping_rounds parameter, the training job hangs indefinitely whenever the early stopping condition appears to be triggered.
For example, here are the logs from one of the machines in the cluster, with early_stopping_rounds set to 4:
```
2024-03-04T16:58:39.986-05:00  [85]  train's binary_logloss: 0.0897921  validation's binary_logloss: 0.0998643
2024-03-04T16:58:39.986-05:00  [85]  train's binary_logloss: 0.0897921  validation's binary_logloss: 0.0998643
2024-03-04T16:58:45.000-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:45.000-05:00  [86]  train's binary_logloss: 0.0896056  validation's binary_logloss: 0.100004
2024-03-04T16:58:45.000-05:00  [86]  train's binary_logloss: 0.0896056  validation's binary_logloss: 0.100004
2024-03-04T16:58:50.001-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:50.002-05:00  [87]  train's binary_logloss: 0.0894872  validation's binary_logloss: 0.0999423
2024-03-04T16:58:50.002-05:00  [87]  train's binary_logloss: 0.0894872  validation's binary_logloss: 0.0999423
2024-03-04T16:58:56.011-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:56.011-05:00  [88]  train's binary_logloss: 0.0893585  validation's binary_logloss: 0.0999157
2024-03-04T16:58:56.011-05:00  [88]  train's binary_logloss: 0.0893585  validation's binary_logloss: 0.0999157
2024-03-04T16:59:00.013-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:59:00.013-05:00  [89]  train's binary_logloss: 0.0892702  validation's binary_logloss: 0.0998719
2024-03-04T16:59:00.013-05:00  [89]  train's binary_logloss: 0.0892702  validation's binary_logloss: 0.0998719
2024-03-04T16:59:00.013-05:00  [LightGBM] [Info] Finished linking network in 382.644965 seconds
```
It seems that in a distributed Dask training setting, if any individual worker in the cluster hits the early stopping condition, the entire job hangs indefinitely. No error. No warning.
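Reproducible example

Here is a minimal sketch of the setup described above. The synthetic data, cluster sizing, and hyperparameter values are illustrative stand-ins, not the original job's:

```python
import dask.array as da
import lightgbm as lgb
from distributed import Client, LocalCluster

# Sketch of the setup: a synthetic binary-classification problem trained
# through the Dask interface with early_stopping_rounds set to 4.
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

X = da.random.random((10_000, 20), chunks=(1_000, 20))
y = (da.random.random(10_000, chunks=1_000) > 0.5).astype(int)
X_valid = da.random.random((2_000, 20), chunks=(1_000, 20))
y_valid = (da.random.random(2_000, chunks=1_000) > 0.5).astype(int)

clf = lgb.DaskLGBMClassifier(
    client=client,
    n_estimators=500,
    early_stopping_rounds=4,  # the hang appears once this condition is hit
)
clf.fit(X, y, eval_set=[(X_valid, y_valid)])
```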
Environment info

LightGBM version or commit hash: 3.3.5

Command(s) you used to install LightGBM: installed on Linux via the python:3.9.16-bullseye Docker image.
Additional Comments
jameslamb changed the title to "[dask] [python-package] Early stopping causes DaskLGBMClassifier to hang" on Mar 5, 2024
Early stopping is not currently supported in the Dask interface. You can subscribe to #3712 to be notified when that work is picked up. We'd also welcome a contribution if you'd like to work on it!
If you're using lightgbm.dask, please upgrade to at least LightGBM 4.0 (and preferably to the latest version, v4.3.0). There have been 2+ years of improvements and bug fixes since v3.3.5.
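Until that lands, a possible stopgap is to strip early-stopping settings before they reach lightgbm.dask. A minimal sketch, assuming your parameters live in a plain dict (the parameter values below are illustrative):

```python
import lightgbm as lgb

# Illustrative parameters; "early_stopping_rounds" is the setting that
# lightgbm.dask cannot honor (see #3712).
params = {
    "objective": "binary",
    "n_estimators": 500,
    "early_stopping_rounds": 4,
}

# Drop the early-stopping aliases so training runs to n_estimators
# instead of hanging when the stopping condition would have triggered.
dask_params = {
    k: v
    for k, v in params.items()
    if k not in ("early_stopping_round", "early_stopping_rounds")
}
clf = lgb.DaskLGBMClassifier(**dask_params)
```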
Ah, okay. In that case, the Dask interface should raise a warning if early_stopping_rounds is passed in. Maybe even an error, since passing it actually seems to cause problems.
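Something like the following guard is what I have in mind; the function name and where it would hook in are hypothetical, not LightGBM's actual internals:

```python
def _reject_early_stopping(params: dict) -> None:
    # Hypothetical sketch of the requested guard, not lightgbm.dask's real
    # code: refuse early-stopping settings up front instead of letting the
    # distributed job hang.
    for alias in ("early_stopping_round", "early_stopping_rounds"):
        if params.get(alias) is not None:
            raise ValueError(
                f"'{alias}' is not supported by the Dask interface; "
                "see https://github.com/microsoft/LightGBM/issues/3712"
            )

# Example: _reject_early_stopping({"early_stopping_rounds": 4}) raises.
```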