
[dask] [python-package] Early stopping causes DaskLGBMClassifier to hang #6351

Open · tristers-at-square opened this issue Mar 4, 2024 · 2 comments

tristers-at-square commented Mar 4, 2024

Description

I am training a LightGBM model in a distributed cluster setting using the Dask interface. When I set the early_stopping_rounds parameter, it causes the training job to hang indefinitely whenever the condition for early stopping seems to be triggered.

For example, here are the logs for one of the machines in the cluster where early_stopping_rounds is set to a value of 4:

2024-03-04T16:58:39.986-05:00  [85]  train's binary_logloss: 0.0897921  validation's binary_logloss: 0.0998643
2024-03-04T16:58:39.986-05:00  [85]  train's binary_logloss: 0.0897921  validation's binary_logloss: 0.0998643
2024-03-04T16:58:45.000-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:45.000-05:00  [86]  train's binary_logloss: 0.0896056  validation's binary_logloss: 0.100004
2024-03-04T16:58:45.000-05:00  [86]  train's binary_logloss: 0.0896056  validation's binary_logloss: 0.100004
2024-03-04T16:58:50.001-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:50.002-05:00  [87]  train's binary_logloss: 0.0894872  validation's binary_logloss: 0.0999423
2024-03-04T16:58:50.002-05:00  [87]  train's binary_logloss: 0.0894872  validation's binary_logloss: 0.0999423
2024-03-04T16:58:56.011-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:58:56.011-05:00  [88]  train's binary_logloss: 0.0893585  validation's binary_logloss: 0.0999157
2024-03-04T16:58:56.011-05:00  [88]  train's binary_logloss: 0.0893585  validation's binary_logloss: 0.0999157
2024-03-04T16:59:00.013-05:00  [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
2024-03-04T16:59:00.013-05:00  [89]  train's binary_logloss: 0.0892702  validation's binary_logloss: 0.0998719
2024-03-04T16:59:00.013-05:00  [89]  train's binary_logloss: 0.0892702  validation's binary_logloss: 0.0998719
2024-03-04T16:59:00.013-05:00  [LightGBM] [Info] Finished linking network in 382.644965 seconds

It seems that in a distributed Dask training setting, if any individual worker in the cluster hits the early stopping condition, the entire job hangs indefinitely. No error. No warning.

Reproducible example

import lightgbm

# `client`, `X_train`, `y_train`, `train_sample_weights`, `eval_set`,
# `eval_names`, and `args` are defined earlier in the training script.
hyperparameters = {
    "num_estimators": 225,
    "early_stopping_rounds": 4,
    "max_depth": 8,
}
lightgbm_trainer = lightgbm.DaskLGBMClassifier(
    client=client, silent=False, **hyperparameters
)
callbacks = [
    lightgbm.log_evaluation(period=1),
]
info_level_verbosity = 1
lightgbm_trainer.fit(
    X=X_train,
    y=y_train,
    sample_weight=train_sample_weights,
    eval_set=eval_set,
    eval_names=eval_names,
    eval_metric=args.eval_metric,
    callbacks=callbacks,
    verbose=info_level_verbosity,
)

Environment info

LightGBM version or commit hash: 3.3.5

Command(s) you used to install LightGBM: lightgbm==3.3.5

Operating system: Linux via the python:3.9.16-bullseye Docker image.

Additional Comments

jameslamb changed the title Early Stopping Causes DaskLGBMClassifier to Hange [dask] [python-package] Early Stopping Causes DaskLGBMClassifier to Hange Mar 5, 2024
jameslamb added the dask label Mar 5, 2024
jameslamb changed the title [dask] [python-package] Early Stopping Causes DaskLGBMClassifier to Hange [dask] [python-package] Early stopping causes DaskLGBMClassifier to hang Mar 5, 2024
jameslamb (Collaborator) commented:
Thanks for using LightGBM.

Early stopping is not currently supported in the Dask interface. You can subscribe to #3712 to be notified when that work is picked up. We'd also welcome a pull request if you'd like to contribute it!

If you're using lightgbm.dask, please upgrade to at least LightGBM 4.0 (and preferably to the latest version, v4.3.0). There have been 2+ years of improvements and bug fixes since v3.3.5.
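
For reference, a minimal sketch of the workaround implied here, reusing the client, X_train, y_train, and hyperparameters objects from the reproducible example above (a sketch, not a confirmed recommendation from the maintainers): strip the unsupported early-stopping key before constructing the Dask estimator, so distributed training runs all boosting rounds.

import lightgbm

# Drop the key that lightgbm.dask does not support yet.
dask_params = {
    k: v for k, v in hyperparameters.items() if k != "early_stopping_rounds"
}

lightgbm_trainer = lightgbm.DaskLGBMClassifier(client=client, **dask_params)
lightgbm_trainer.fit(X=X_train, y=y_train)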

tristers-at-square (Author) commented:
> Thanks for using LightGBM.
>
> Early stopping is not currently supported in the Dask interface. You can subscribe to #3712 to be notified when that work is picked up. We'd also welcome a pull request if you'd like to contribute it!
>
> If you're using lightgbm.dask, please upgrade to at least LightGBM 4.0 (and preferably to the latest version, v4.3.0). There have been 2+ years of improvements and bug fixes since v3.3.5.

Ah okay. In that case, the Dask interface should throw a warning if early_stopping_rounds is passed in. Maybe even an error, since passing it in actually seems to cause issues.

Will also give the new version a shot, thanks!
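
A rough sketch of the kind of guard being suggested, written as a hypothetical helper rather than actual lightgbm.dask code; the function name and where it would be called from are assumptions:

# Hypothetical guard; not part of the actual lightgbm.dask module.
def check_no_early_stopping(params: dict) -> None:
    """Raise if early stopping was requested for distributed (Dask) training."""
    # Known aliases of early_stopping_round in LightGBM's parameter docs.
    aliases = {
        "early_stopping_round",
        "early_stopping_rounds",
        "early_stopping",
        "n_iter_no_change",
    }
    requested = aliases & params.keys()
    if requested:
        raise ValueError(
            "Early stopping is not supported in lightgbm.dask "
            f"(got parameter(s): {sorted(requested)}). "
            "See https://github.com/microsoft/LightGBM/issues/3712."
        )

# Hypothetical usage inside DaskLGBMClassifier.fit():
# check_no_early_stopping(self.get_params())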
