Feature Request: add timeout parameter to the .fit() method #6596
Thanks for using LightGBM and taking the time to open this.

I'm -1 on adding this to LightGBM. I understand why this might be useful, but I don't think LightGBM is the right place for this logic. It would introduce some non-trivial maintenance burden and complexity, and would be better handled outside of LightGBM, in the code you use to invoke it.

Since you mentioned the `.fit()` method... alternatively, you could use a callback, like this:

```python
import lightgbm as lgb
from datetime import datetime
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000, n_features=20)
dtrain = lgb.Dataset(X, label=y)


class TimeoutCallback:

    def __init__(self, timeout_seconds: int):
        # tell LightGBM to run this callback after (not before) each boosting round
        self.before_iteration = False
        self.timeout_seconds = timeout_seconds
        self._start = datetime.utcnow()

    def __call__(self, *args, **kwargs) -> None:
        if (datetime.utcnow() - self._start).total_seconds() > self.timeout_seconds:
            raise RuntimeError(
                f"timing out: elapsed time has exceeded {self.timeout_seconds} seconds"
            )


bst = lgb.train(
    params={
        "objective": "regression",
        "num_leaves": 100,
    },
    train_set=dtrain,
    num_boost_round=1000,
    callbacks=[TimeoutCallback(2)],
)
```

I just tested that with LightGBM 4.5.0 and saw the following:
That's not perfect, as it only runs after each iteration, and individual iterations could run for much longer on a realistic dataset. But hopefully that imperfection also shows one example of how complex this would be to implement in LightGBM. I'm only one vote here, though; maybe other maintainers will have a different perspective.
I did not think of this approach! If I'm using early stopping, are the best "weights" applied to the model after this exception is thrown? In other words, is best_iter set correctly? The goal would be to stay within the time budget but not lose the training progress made up to that point.
Oh interesting! It wasn't clear to me that you would want to see training time out but also keep that model.

No, not with a plain `RuntimeError`. In the Python package, a dedicated Python exception is used to tell the training process that early stopping has been triggered, and to carry forward details like the best iteration and evaluation results:

LightGBM/python-package/lightgbm/callback.py Line 436 in e7edb6c
LightGBM/python-package/lightgbm/callback.py Lines 40 to 44 in e7edb6c
LightGBM/python-package/lightgbm/engine.py Lines 327 to 330 in e7edb6c

You could rely on that behavior in your own callback and have it raise an `EarlyStopException` instead (see the sketch after this comment).

Alternatively... have you tried a timeout at the hyperparameter-tuner level? (That might apply to the entire experiment, though, not per-trial... I'm not sure.)
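To illustrate that suggestion, here is a minimal sketch (not from the thread; it assumes `EarlyStopException` is importable from `lightgbm.callback` and takes `(best_iteration, best_score)`, as in the permalinked lines above):

```python
import time

from lightgbm.callback import EarlyStopException


class TimeoutKeepBestCallback:
    """Stop training after a wall-clock budget while keeping early-stopping bookkeeping."""

    def __init__(self, timeout_seconds: float):
        # run after each boosting round, like the built-in early-stopping callback
        self.before_iteration = False
        self.timeout_seconds = timeout_seconds
        self._start = time.monotonic()

    def __call__(self, env) -> None:
        # env is the CallbackEnv that lgb.train() passes to every callback
        if time.monotonic() - self._start > self.timeout_seconds:
            # lgb.train() catches EarlyStopException and records
            # best_iteration / best_score on the returned Booster.
            # This reports the *current* iteration; reporting the truly best
            # one would require tracking evaluation scores yourself.
            raise EarlyStopException(env.iteration, env.evaluation_result_list)
```

Passing an instance via `callbacks=[TimeoutKeepBestCallback(2)]`, together with `valid_sets` so that `env.evaluation_result_list` is populated, should then leave `bst.best_iteration` set when the budget is exceeded.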
Hah! :) I'm planning to create my own cool hyperparameter tuner; that's one of the reasons why I'm interested in this functionality. I can easily see how to do time budgeting at the level of the tuner - just check the budget in the hyperparameter-search loop after each combination has been tried (see the sketch below) - but the underlying estimator has to finish its training gracefully before that check runs, and for some combinations that can take an extremely long time. Writing a great hyperparameter optimizer is one more use case for this timeout feature.

Now I think it's the EarlyStopping callback I should subclass (as I can hardly imagine training without early stopping). Does it make sense to prepare a PR that adds a timeout parameter to the EarlyStopping callback?

That said, it still seems more natural to me to be able to specify a timeout directly in the estimator's fit or init methods, the same as we do with n_iters - just in this case we are interested in a maximum number of seconds, not trees.
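A minimal sketch of that tuner-level budgeting loop (not from the thread; `candidates` and `train_and_score` are hypothetical stand-ins), showing the limitation just described: the budget is only checked between trials, so a single slow trial can overshoot it:

```python
import time


def tune(candidates, train_and_score, budget_seconds):
    """Try hyperparameter combinations until a wall-clock budget runs out.

    candidates: iterable of hyperparameter dicts (hypothetical)
    train_and_score: maps a hyperparameter dict to a validation score (hypothetical)
    """
    start = time.monotonic()
    best_params, best_score = None, float("-inf")
    for params in candidates:
        # this call may itself run for an arbitrarily long time,
        # which is exactly the problem described above
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
        # the budget check only happens between trials
        if time.monotonic() - start > budget_seconds:
            break
    return best_params, best_score
```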
I understand why you want that, but it'd be pretty difficult to get right in a thorough way; there are a number of practical challenges.
Even if we chose to ignore all of these concerns and treat them as "not yet implemented", that'd still add complexity in the form of additional warnings, errors, and notes in documentation.
Sorry it took so long to respond to this, @fingoldo. No, I wouldn't support such a PR here in LightGBM. To be honest, I agree with @trivialfis (dmlc/xgboost#10684 (comment))... this feature is not something that should be in libraries like CatBoost / LightGBM / XGBoost. I think it's better implemented outside of those libraries, e.g. in the hyperparameter tuner you're writing and in similar tools.

I think this should be treated as "won't do" and closed. I'll leave it open a bit longer to give you and others a chance to comment.
Adding a timeout parameter to the .fit() method, which would force the library to return the best solution found so far as soon as the given number of seconds has passed since the start of training, would make it possible to satisfy training SLAs when a user has only a limited time budget to finish a model's training. It would also make fair comparisons between different hyperparameter combinations possible.

Reaching the timeout should have the same effect as reaching the maximum number of iterations, possibly with an additional warning and/or an attribute set so that the reason the training job finished is clear to the end user.
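To make the request concrete, here is a hypothetical illustration of the proposed interface (neither the `timeout` parameter nor the `stopped_reason_` attribute exists in LightGBM; both are invented here for illustration):

```python
import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000, n_features=20)

model = lgb.LGBMRegressor(n_estimators=10_000)

# hypothetical: stop after 600 seconds and keep the best model found so far
model.fit(X, y, timeout=600)

# hypothetical attribute exposing why training finished:
# "timeout" vs. "max_iterations" vs. "early_stopping"
print(model.stopped_reason_)
```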