
[python-package] Early stopping not reproducible when nthreads>1 #5758

Closed

CHY-WANG opened this issue Mar 1, 2023 · 3 comments

CHY-WANG commented Mar 1, 2023

Description

Training models with exactly the same settings (including the seed) can produce different numbers of trees when nthreads>1 and early_stopping_rounds>0. I found a workaround by modifying the Python package, but I am not sure whether this problem occurs only in Python.

Reproducible example

The example is in Python. The result is not 100% reproducible because the randomness comes from multithreading, but models with different numbers of trees do occur frequently.
Code:

import lightgbm
import numpy as np
import pandas as pd

# Generate some toy data
train_data = pd.DataFrame({"y":np.zeros(1024*8)})
predictors = []
for i in range(10):
    train_data["x" + str(i)] = np.sin(np.array(range(1024*8))+i/1000) + 1
    train_data["y"] = train_data["y"] + train_data["x" + str(i)]
    predictors.append("x" + str(i))
valid_data = train_data.copy()
train_data_lgb = lightgbm.Dataset(train_data[["x" + str(i) for i in range(9)]],
                                  label = np.array(train_data["y"]),
                                  free_raw_data = False)
valid_data_lgb = lightgbm.Dataset(valid_data[["x" + str(i) for i in range(9)]],
                                  label = np.array(valid_data["y"]),
                                  free_raw_data = False)

# Use goss
params = {'boosting_type': 'goss',
          'cat_l2': 55.08312413819303,
          'feature_fraction': 0.5,
          'learning_rate': 0.02,
          'max_cat_threshold': 26,
          'max_depth': 4,
          'min_data_in_leaf': 35,
          'min_gain_to_split': 0.37881681266148276,
          'other_rate': 0.1,
          'reg_alpha': 4.1101593522818165,
          'reg_lambda': 65.70629589637446,
          'top_rate': 0.3,
          'bagging_fraction': 1.0,
          'bagging_freq': 0,
          'bagging_seed': 121,
          'feature_fraction_seed': 214,
          'feature_pre_filter': False,
          'num_iterations': 1000,
          'num_leaves': 127,
          'seed': 214,
          'verbosity': -1,
          'objective': 'gamma'}

# Train 100 models with the same parameters with multithread and early stopping
# Record numbers of trees in the 100 models
params["nthreads"] = 8
num_trees = []
for i in range(100):
    model_8 = lightgbm.train(params,
                        train_data_lgb,
                        valid_sets = [train_data_lgb, valid_data_lgb],
                        valid_names = ['train', 'valid'],
                        verbose_eval = False,
                        early_stopping_rounds = 25)
    num_trees += [model_8.num_trees()]
print(pd.Series(num_trees).value_counts())

Output:

506    80
522    20
dtype: int64

The reproducibility problem goes away when setting nthreads=1:

params["nthreads"] = 1
num_trees = []
for i in range(100):
    model_8 = lightgbm.train(params,
                        train_data_lgb,
                        valid_sets = [train_data_lgb, valid_data_lgb],
                        valid_names = ['train', 'valid'],
                        verbose_eval = False,
                        early_stopping_rounds = 25)
    num_trees += [model_8.num_trees()]
print(pd.Series(num_trees).value_counts())

Output:

481    100
dtype: int64

Environment info

LightGBM version or commit hash: 3.3.5

Command(s) you used to install LightGBM

pip install lightgbm==3.3.5

Additional Comments

After a closer look at the trees built in each model, we found why early stopping happens at different numbers of trees. At some iterations, the training function fails to grow a tree because it cannot find any split. In theory, the evaluation result of such an iteration should be identical to the previous iteration's, because the model is unchanged. However, because the evaluation is done with multithreading, the result can change by a small numeric error. As a result, an early stop that should have happened may not happen, because the callback sees an "improvement" in the evaluation result that is actually just numeric error.
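The numeric-error effect described above can be shown in isolation. The sketch below is illustrative only (it is not LightGBM code): floating-point addition is not associative, so summing the same float32 values in a different order, as parallel reductions may do on each run, can give slightly different totals.

```python
import numpy as np

# The same float32 values summed in two different orders.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000).astype(np.float32)

# Naive left-to-right accumulation.
sequential = np.float32(0.0)
for v in x:
    sequential += v

# Pairwise (tree-shaped) summation, which groups the additions differently.
pairwise = np.sum(x)

# The two totals typically differ by a tiny epsilon, which is enough to
# make a "no change" iteration look like an improvement to early stopping.
print(sequential, pairwise, sequential == pairwise)
```

A multithreaded metric evaluation effectively changes the grouping of these additions from run to run, which is why the reported metric can fluctuate even when the model itself did not change.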

This problem can be fixed by checking both for an improvement in the evaluation result and for an increase in the number of trees when checking for early stopping: an iteration should be considered better than a previous one only if the evaluation improves and the number of trees increases. For example, I fixed it by modifying the early-stopping callback https://github.com/microsoft/LightGBM/blob/v3.3.5/python-package/lightgbm/callback.py#L254
to the following, where at the beginning I set current_iter = [-1]:

  def _callback(env: CallbackEnv) -> None:
      if not cmp_op:
          _init(env)
      if not enabled[0]:
          return
      is_updated = current_iter[0] != env.model.current_iteration()
      current_iter[0] = env.model.current_iteration()
      for i in range(len(env.evaluation_result_list)):
          score = env.evaluation_result_list[i][2]
          if is_updated and (
              best_score_list[i] is None or cmp_op[i](score, best_score[i])
          ):
              best_score[i] = score
              best_iter[i] = env.iteration
              best_score_list[i] = env.evaluation_result_list
          # split is needed for "<dataset type> <metric>" case (e.g. "train l1")
          eval_name_splitted = env.evaluation_result_list[i][1].split(" ")
          if first_metric_only and first_metric[0] != eval_name_splitted[-1]:
              continue  # use only the first metric for early stopping
          if (
              env.evaluation_result_list[i][0] == "cv_agg"
              and eval_name_splitted[0] == "train"
              or env.evaluation_result_list[i][0] == env.model._train_data_name
          ):
              _final_iteration_check(env, eval_name_splitted, i)
              continue  # train data for lgb.cv or sklearn wrapper (underlying lgb.train)
          elif env.iteration - best_iter[i] >= stopping_rounds:
              if verbose:
                  eval_result_str = "\t".join(
                      [_format_eval_result(x) for x in best_score_list[i]]
                  )
                  _log_info(
                      f"Early stopping, best iteration is:\n[{best_iter[i] + 1}]\t{eval_result_str}"
                  )
                  if first_metric_only:
                      _log_info(f"Evaluated only: {eval_name_splitted[-1]}")
              raise EarlyStopException(best_iter[i], best_score_list[i])
          _final_iteration_check(env, eval_name_splitted, i)
@jameslamb (Collaborator)
Thanks for using LightGBM, and for your detailed report. Sorry it took so long for someone to respond here.

I don't support changing the behavior of early stopping in the Python package in the way you're proposing, which I believe is:

# current
* eval metrics have not improved for {early_stopping_rounds} consecutive iterations

# proposed
* eval metrics have not improved for {early_stopping_rounds} consecutive iterations (ignoring any iterations which produced no-split trees)

In my opinion, the risk of bugs and the maintenance burden from the added complexity that such a change would introduce into an already-complex part of the codebase isn't worth it in exchange for improving the reproducibility of lgb.cv() when using multithreading + random sampling of rows / columns.

The behavior you've observed is only because you're passing feature_fraction = 0.5.

With feature_fraction = 1.0 and bagging_fraction = 1.0, all of the rows and columns will be considered for potential splits in every iteration, meaning that once you observe one iteration where LightGBM fails to find a split, all future iterations could also be expected not to have any splits.

If you want close-to-reproducible behavior from lgb.cv(), while still using multithreading, consider the following changes:

  • reduce min_gain_to_split
  • set feature_fraction = 1.0 and bagging_fraction = 1.0
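The two suggestions above amount to a small edit to the parameter dict from the reproducible example. This is an illustrative sketch (values are for demonstration, not a recommendation for all workloads):

```python
# Starting from the issue's parameters, the suggested changes make every row
# and column available to each split search, so an iteration that finds no
# split implies later iterations will not find one either.
params = {
    "boosting_type": "goss",
    "objective": "gamma",
    "num_iterations": 1000,
    "seed": 214,
    # suggested changes:
    "feature_fraction": 1.0,    # was 0.5
    "bagging_fraction": 1.0,
    "min_gain_to_split": 0.0,   # was ~0.379; a lower threshold makes
                                # no-split iterations less likely
}
```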

If you want to treat very small improvements in eval metrics as "not actually an improvement", for the purpose of early stopping, use the min_delta option introduced in #4580. See https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.early_stopping.html.

Note

min_delta has not made it into a LightGBM release yet. To use it, you'll have to install LightGBM from source, following the instructions at https://github.com/microsoft/LightGBM/blob/master/python-package/README.rst#install-from-github. Follow #5153 to be notified when the next release of LightGBM is out.

@github-actions

github-actions bot commented Jul 2, 2023

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@github-actions github-actions bot closed this as completed Jul 2, 2023
@github-actions

github-actions bot commented Oct 4, 2023

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 4, 2023