[R-package] respect 'verbose' argument in lgb.cv() (fixes #4667) #4903

Merged · 11 commits · Jan 6, 2022
1 change: 1 addition & 0 deletions R-package/R/lgb.cv.R
@@ -333,6 +333,7 @@ lgb.cv <- function(params = list()

booster <- Booster$new(params = params, train_set = dtrain)
booster$add_valid(data = dtest, name = "valid")
booster$reset_parameter(params)

Collaborator Author:

note to reviewers: I don't totally understand why this change fixes #4667. At first I was worried that this function might suffer from a larger problem like "lgb.cv() does not respect values passed to params", but that is not true.

Consider the following example. Setting min_gain_to_split to a large number for the binary objective results in no splits, as expected.

library(lightgbm)
data("agaricus.train")
dtrain <- lgb.Dataset(
    data = agaricus.train$data
    , label = agaricus.train$label
)
lgb.cv(
    params = list(
        min_gain_to_split = 100000.0
    )
    , data = dtrain
    , verbose = 0L
    , nrounds = 3L
    , nfold = 2L
    , obj = "binary"
)

The logs are full of the following:

[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements

After changing to min_gain_to_split = 0.0, those warnings go away and training produces trees with multiple leaves.

Collaborator:

Maybe this is related to the lack of the corresponding code lines in the R package?

finally:
    train_set._reverse_update_params()
    for valid_set in reduced_valid_sets:
        valid_set._reverse_update_params()

Collaborator Author:

Maybe? I'm not really sure. The code in lgb.train() and lgb.cv() in the R package is very similar, and lgb.train() doesn't have this issue with the parameter verbose.

Anyway, I think this fix is still worth merging even if it isn't fully understood... I don't think adding another $reset_parameter() call with the passed-in params introduces any risk.

Collaborator:

> I don't think adding another $reset_parameter() call with the passed-in params introduces any risk.

Agree. But this introduces performance degradation, because this method calls a C API function under the hood and is executed for each fold.
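
For a rough sense of that cost, something like this could be timed (just a sketch: it repeats the call in a loop rather than reproducing lgb.cv()'s exact per-fold setup, and absolute numbers will vary by machine):

library(lightgbm)
data("agaricus.train", package = "lightgbm")
dtrain <- lgb.Dataset(
    data = agaricus.train$data
    , label = agaricus.train$label
)
bst <- lightgbm:::Booster$new(
    params = list(objective = "binary")
    , train_set = dtrain
)
# time repeated calls to approximate the overhead of one extra
# reset_parameter() call per fold
print(system.time({
    for (i in seq_len(100L)) {
        bst$reset_parameter(list(verbose = -1L))
    }
}))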

Collaborator Author:

> but this introduces performance degradation

Sure, but I think that in this case, the cost of calling this method once per fold (one time at setup, not on each iteration, just to be clear) is worth it in exchange for ensuring that lgb.cv()'s behavior is correct.

Collaborator:

OK, I'll let you decide.

Collaborator Author:

OK, looking into this more closely tonight, I think I've figured out:

  1. why verbose is not being respected
  2. why it seems like ONLY verbose is not being respected and all other parameters are working correctly

Short Summary

Two bugs in the R package:

  1. Dataset$slice() calls LGBM_DatasetGetSubset(), which will globally set the log level to INFO if verbosity isn't in the parameters passed to it. And verbosity is never passed through.
  2. When Dataset$update_params(new_params) is called on a Dataset that has already been constructed, it doesn't store new_params in memory in the R object (private$params).

I just pushed 619eaf4 which reverts the addition of booster$reset_parameter() and adds fixes for these two bugs.

Much Longer Summary

I was able to reduce the internals of lgb.cv() to the following code that helps narrow down the issue.

Sample code:
library(lightgbm)

data("agaricus.train", package = "lightgbm")

params <- list(
    max_bin = 123
    , verbose = -7L
)

dtrain <- lgb.Dataset(
    data = agaricus.train$data
    , label = agaricus.train$label
    , params = params
)
dtest <- slice(
    dataset = dtrain
    , idxset = seq_len(100L)
)

# with a validation set, you get logs
booster_w_valid <- lightgbm:::Booster$new(
    params = params
    , train_set = dtrain
)
booster_w_valid$add_valid(data = dtest, name = "valid")
booster_w_valid$update()

# without a validation set, no logs!
booster_no_valid <- lightgbm:::Booster$new(
    params = params
    , train_set = dtrain
)
booster_no_valid$update()

Root cause: the use of slice() on a Dataset resets LightGBM's log level globally to the default value (INFO).

More details:

  1. Dataset$slice() gets params from Dataset$get_params()
  2. Dataset$get_params() filters down to only Dataset-specific parameters.
    • get_params = function() {
          dataset_params <- unname(unlist(.DATASET_PARAMETERS()))
          ret <- list()
          for (param_key in names(private$params)) {
              if (param_key %in% dataset_params) {
                  ret[[param_key]] <- private$params[[param_key]]
              }
          }
          return(ret)
      },
  3. verbose is not in the Dataset-specific parameters, so it is filtered out (see the short illustration after this list).
  4. Dataset$slice() eventually calls Dataset$construct(), which calls LGBM_DatasetGetSubset(), which runs config.Set().
  5. config.Set() calls Log::ResetLogLevel(), which resets the log level globally for the current process.
    • LightGBM/src/io/config.cpp, lines 241 to 249 at af5b40e:

          if (verbosity == 1) {
            LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Info);
          } else if (verbosity == 0) {
            LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Warning);
          } else if (verbosity >= 2) {
            LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Debug);
          } else {
            LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Fatal);
          }
  6. If verbosity isn't found in those params, it takes its default value (1), so the log level is reset to INFO.
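
A short, self-contained illustration of the filtering described above (this assumes the agaricus data used earlier in the thread, and that Dataset$get_params() behaves as quoted in item 2):

library(lightgbm)
data("agaricus.train", package = "lightgbm")
dtrain <- lgb.Dataset(
    data = agaricus.train$data
    , label = agaricus.train$label
    , params = list(max_bin = 123L, verbose = -7L)
)
# only Dataset-specific parameters (e.g. max_bin) survive the filtering;
# 'verbose' is dropped, so it never reaches LGBM_DatasetGetSubset() when
# slice() is called later
print(dtrain$get_params())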

Dataset.subset() in the Python package doesn't suffer from this issue because it references self.params (the dictionary of parameters stored as an attribute on the Dataset instance) instead of using Dataset.get_params(), so verbose is not filtered out.

params = self.params

At first, I tried to just fix this by adding "verbosity" to the list of parameters returned by Dataset$get_params(). Unfortunately, that can lead to a new type of problem: the value of verbosity stored in a constructed Dataset would take precedence over values passed to lgb.train() / lgb.cv() if that Dataset is re-used multiple times, because those functions merge the Dataset's stored params over the passed-in params:

params <- utils::modifyList(params, train_set$get_params())
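
To make that precedence problem concrete, here is a tiny stand-in for the merge above (hypothetical values; params_from_dataset plays the role of train_set$get_params() if it also returned verbosity):

# what the user passed to lgb.train() / lgb.cv()
params_passed_in <- list(verbose = 1L)
# stand-in for what train_set$get_params() would return if verbosity were kept
params_from_dataset <- list(max_bin = 123L, verbose = -7L)

# the merge used above: values from the Dataset override the passed-in values
params <- utils::modifyList(params_passed_in, params_from_dataset)
print(params$verbose)  # -7, not the 1 the user asked for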

Next, I tried just changing Dataset$slice() to use the Dataset's stored private$params instead of Dataset$get_params(). That uncovered another bug: it still suffered from the problem that, once the training Dataset had been constructed, verbosity was locked in to the value from the last time it was constructed.

  1. When Dataset$update_params() is called on a Dataset that has already been constructed, R calls LGBM_DatasetUpdateParamChecking()
  2. That call is wrapped in tryCatch(). If it fails, then the code does the right thing based on whether or not raw data is still available. However, if that call does NOT fail, private$params is not updated! (A toy sketch of this is shown after the snippet below.)
    •   } else {
            tryCatch({
                .Call(
                    LGBM_DatasetUpdateParamChecking_R
                    , lgb.params2str(params = private$params)
                    , lgb.params2str(params = params)
                )
            }, error = function(e) {
                # If updating failed but raw data is not available, raise an error because
                # achieving what the user asked for is not possible
                if (is.null(private$raw_data)) {
                    stop(e)
                }
                # If updating failed but raw data is available, modify the params
                # on the R side and re-set ("deconstruct") the Dataset
                private$params <- utils::modifyList(private$params, params)
                self$finalize()
            })
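
To isolate that second bug from the rest of the Dataset class, here is a toy R6 sketch (not LightGBM code; the class and method names are made up) showing how the success path loses the new params and what storing them on the R side would change:

library(R6)

ToyDataset <- R6::R6Class(
    "ToyDataset"
    , public = list(
        update_params_buggy = function(params) {
            # pretend the LGBM_DatasetUpdateParamChecking_R call succeeded:
            # nothing is written back to the R-side copy of the params
            return(invisible(self))
        }
        , update_params_fixed = function(params) {
            # on success, also record the new values on the R side
            private$params <- utils::modifyList(private$params, params)
            return(invisible(self))
        }
        , get_stored_params = function() {
            return(private$params)
        }
    )
    , private = list(
        params = list(verbose = 1L)
    )
)

d <- ToyDataset$new()
d$update_params_buggy(list(verbose = -1L))
print(d$get_stored_params()$verbose)  # still 1: the new value was silently lost

d$update_params_fixed(list(verbose = -1L))
print(d$get_stored_params()$verbose)  # -1 once the new params are stored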

return(
list(booster = booster)
)
50 changes: 50 additions & 0 deletions R-package/tests/testthat/test_basic.R
@@ -1901,6 +1901,56 @@ test_that("early stopping works with lgb.cv()", {
)
})

test_that("lgb.cv() respects changes to logging verbosity", {
dtrain <- lgb.Dataset(
data = train$data
, label = train$label
)
# (verbose = 1) should be INFO and WARNING level logs
lgb_cv_logs <- capture.output({
cv_bst <- lgb.cv(
params = list()
, nfold = 2L
, nrounds = 5L
, data = dtrain
, obj = "binary"
, verbose = 1L
)
})
expect_true(any(grepl("\\[LightGBM\\] \\[Info\\]", lgb_cv_logs)))
expect_true(any(grepl("\\[LightGBM\\] \\[Warning\\]", lgb_cv_logs)))

# (verbose = 0) should be WARNING level logs only
lgb_cv_logs <- capture.output({
cv_bst <- lgb.cv(
params = list()
, nfold = 2L
, nrounds = 5L
, data = dtrain
, obj = "binary"
, verbose = 0L
)
})
expect_false(any(grepl("\\[LightGBM\\] \\[Info\\]", lgb_cv_logs)))
expect_true(any(grepl("\\[LightGBM\\] \\[Warning\\]", lgb_cv_logs)))

# (verbose = -1) no logs
lgb_cv_logs <- capture.output({
cv_bst <- lgb.cv(
params = list()
, nfold = 2L
, nrounds = 5L
, data = dtrain
, obj = "binary"
, verbose = -1L
)
})
# NOTE: this is not length(lgb_cv_logs) == 0 because lightgbm's
# dependencies might print other messages
expect_false(any(grepl("\\[LightGBM\\] \\[Info\\]", lgb_cv_logs)))
expect_false(any(grepl("\\[LightGBM\\] \\[Warning\\]", lgb_cv_logs)))
})

context("linear learner")

test_that("lgb.train() fit on linearly-relatead data improves when using linear learners", {