
[R-package] respect 'verbose' argument in lgb.cv() (fixes #4667) #4903

Merged: 11 commits merged into master on Jan 6, 2022

Conversation

@jameslamb (Collaborator) commented Dec 21, 2021

Fixes #4667.

On master, the tests added in this PR fail and the code below prints INFO and WARNING level logs.

library(lightgbm)
data("agaricus.train")
dtrain <- lgb.Dataset(
    data = agaricus.train$data
    , label = agaricus.train$label
)
lgb.cv(
    params = list()
    , data = dtrain
    , verbose = -1L
    , nrounds = 3L
    , nfold = 2L
    , obj = "binary"
)

As of this PR, the tests pass and no logs are produced.

@@ -333,6 +333,7 @@ lgb.cv <- function(params = list()

booster <- Booster$new(params = params, train_set = dtrain)
booster$add_valid(data = dtest, name = "valid")
booster$reset_parameter(params)
jameslamb (author):

note to reviewers: I don't totally understand why this change fixes #4667. At first I was worried that this function might suffer from a larger problem like "lgb.cv() does not respect values passed to params", but that is not true.

Consider the following example. Setting min_gain_to_split to a large number for the binary objective results in no splits, as expected.

library(lightgbm)
data("agaricus.train")
dtrain <- lgb.Dataset(
    data = agaricus.train$data
    , label = agaricus.train$label
)
lgb.cv(
    params = list(
        min_gain_to_split = 100000.0
    )
    , data = dtrain
    , verbose = 0L
    , nrounds = 3L
    , nfold = 2L
    , obj = "binary"
)

The logs are full of the following

[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements

Changing to min_gain_to_split = 0.0, those warnings go away and training produces trees with multiple leaves.

Collaborator:

Maybe this is related to the lack of the corresponding code lines in the R package?

finally:
    train_set._reverse_update_params()
    for valid_set in reduced_valid_sets:
        valid_set._reverse_update_params()

jameslamb (author):

Maybe? I'm not really sure. The code in lgb.train() and lgb.cv() in the R package is very similar, and lgb.train() doesn't have this issue with the parameter verbose.

Anyway, I think this fix is still worth merging even if it isn't fully understood... I don't think adding another $update_params() call with the passed-in params introduces any risk.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think adding another $update_params() with the passed-in params introduces any risk.

Agree. But this introduces a performance degradation, because this method calls a C API function under the hood and is executed for each fold.

jameslamb (author):

but this introduces performance degradation

Sure, but I think that in this case the cost of calling this method once per fold (one time at setup, not on each iteration, to be clear) is worth it in exchange for ensuring that lgb.cv()'s behavior is correct.

Collaborator:

OK, I'm letting you decide.

jameslamb (author):

OK, looking into this more closely tonight, I think I've figured out:

  1. why verbose is not being respected
  2. why it seems like ONLY verbose is not being respected and all other parameters are working correctly

Short Summary

Two bugs in the R package:

  1. Dataset$slice() calls LGBM_DatasetGetSubset(), which will globally set the log level to INFO if verbosity isn't in the parameters passed to it. And verbosity is never passed through.
  2. When Dataset$update_params(new_params) is called on a Dataset that has already been constructed, it doesn't store new_params in memory in the R object (private$params).

I just pushed 619eaf4 which reverts the addition of booster$reset_parameter() and adds fixes for these two bugs.

Much Longer Summary

I was able to reduce the internals of lgb.cv() to the following code that helps narrow down the issue.

Sample code:
library(lightgbm)

data("agaricus.train", package = "lightgbm")

params <- list(
    max_bin = 123
    , verbose = -7L
)

dtrain <- lgb.Dataset(
    data = agaricus.train$data
    , label = agaricus.train$label
    , params = params
)
dtest <- slice(
    dataset = dtrain
    , idxset = seq_len(100L)
)

# with a validation set, you get logs
booster_w_valid <- lightgbm:::Booster$new(
    params = params
    , train_set = dtrain
)
booster_w_valid$add_valid(data = dtest, name = "valid")
booster_w_valid$update()

# without a validation set, no logs!
booster_no_valid <- lightgbm:::Booster$new(
    params = params
    , train_set = dtrain
)
booster_no_valid$update()

root cause: The use of slice() on a Dataset resets LightGBM's log level globally to the default value (INFO).

More details:

  1. Dataset$slice() gets params from Dataset$get_params()
  2. Dataset$get_params() filters down to only Dataset-specific parameters.
    get_params = function() {
        dataset_params <- unname(unlist(.DATASET_PARAMETERS()))
        ret <- list()
        for (param_key in names(private$params)) {
            if (param_key %in% dataset_params) {
                ret[[param_key]] <- private$params[[param_key]]
            }
        }
        return(ret)
    },
  3. verbose is not in the Dataset-specific parameters.
  4. Dataset$slice() eventually calls Dataset$construct() which calls LGBM_DatasetGetSubset(), which runs config.Set()
  5. config.Set() calls Log::ResetLogLevel() which resets log level globally for the current process.
    • LightGBM/src/io/config.cpp

      Lines 241 to 249 in af5b40e

      if (verbosity == 1) {
        LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Info);
      } else if (verbosity == 0) {
        LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Warning);
      } else if (verbosity >= 2) {
        LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Debug);
      } else {
        LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Fatal);
      }
  6. If verbosity isn't found in params, it defaults to INFO-level
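The verbosity-to-log-level mapping in config.cpp can be mirrored in a few lines. This is a standalone sketch of the C++ logic quoted above, not an actual LightGBM API:

```python
def log_level_for(verbosity: int) -> str:
    """Mirror of the verbosity -> log level mapping in src/io/config.cpp."""
    if verbosity == 1:
        return "Info"
    elif verbosity == 0:
        return "Warning"
    elif verbosity >= 2:
        return "Debug"
    else:
        return "Fatal"

# LightGBM's default verbosity is 1, which maps to Info. So any params string
# that omits verbosity ends up resetting the global log level to INFO.
print(log_level_for(1))   # Info
print(log_level_for(-1))  # Fatal
```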

Dataset.subset() in the Python package doesn't suffer from this issue because it references self.params (the dictionary of parameters stored as an attribute on the Dataset instance) instead of using Dataset.get_params(), so verbose is not filtered out.

params = self.params
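To illustrate the difference, here is a minimal, hypothetical sketch (not the real lightgbm classes) of a Dataset that stores all params but only reports Dataset-specific ones from its get_params() method, as the R package does:

```python
# Hypothetical sketch: 'verbosity' survives in the full stored params
# (the Python subset() path) but is dropped by a filtering get_params()
# (the R slice() path).
DATASET_PARAMETERS = {"max_bin", "min_data_in_bin"}

class Dataset:
    def __init__(self, params):
        self.params = params

    def get_params(self):
        # Filters down to Dataset-specific parameters only
        return {k: v for k, v in self.params.items() if k in DATASET_PARAMETERS}

ds = Dataset({"max_bin": 123, "verbosity": -1})
print(ds.get_params())  # {'max_bin': 123}: 'verbosity' is filtered out
print(ds.params)        # the full dict still carries 'verbosity'
```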

At first, I tried to just fix this by adding "verbosity" to the list of parameters returned by Dataset$get_params(). Unfortunately, that can lead to a new type of problem where the value of verbosity stored in a constructed Dataset will take precedence over values passed to lgb.train() / lgb.cv() if that Dataset is re-used multiple times.

params <- utils::modifyList(params, train_set$get_params())
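The precedence problem can be illustrated with a plain dict-merge analogue of utils::modifyList (a hypothetical sketch, not LightGBM code): values from the second argument win.

```python
# utils::modifyList(params, train_set$get_params()) behaves like a dict merge
# in which the Dataset's stored values override what the user passed in.
train_params = {"verbosity": -1, "max_bin": 255}   # passed to lgb.cv()
dataset_params = {"verbosity": 1, "max_bin": 123}  # remembered by the constructed Dataset

merged = {**train_params, **dataset_params}
print(merged)  # {'verbosity': 1, 'max_bin': 123}: the user's verbosity is silently lost
```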

Next, I tried changing Dataset$slice() to use Dataset$private$params instead of Dataset$get_params(). That uncovered another bug: once the training Dataset had been constructed, verbosity was locked in to the value from the last time it was constructed.

  1. When calling Dataset$construct(), if the Dataset has already been constructed, R calls LGBM_DatasetUpdateParamChecking()
  2. That call is wrapped in tryCatch(). If it fails, Dataset$construct() does the right thing based on whether or not raw data is still available. However, if that call does NOT fail, private$params is not updated!
    } else {
        tryCatch({
            .Call(
                LGBM_DatasetUpdateParamChecking_R
                , lgb.params2str(params = private$params)
                , lgb.params2str(params = params)
            )
        }, error = function(e) {
            # If updating failed but raw data is not available, raise an error because
            # achieving what the user asked for is not possible
            if (is.null(private$raw_data)) {
                stop(e)
            }
            # If updating failed but raw data is available, modify the params
            # on the R side and re-set ("deconstruct") the Dataset
            private$params <- utils::modifyList(private$params, params)
            self$finalize()
        })

@StrikerRUS (Collaborator) left a comment:

Although I'm not fully satisfied with this fix, which causes a performance degradation, and believe there may be a way of specifying params without resetting them, I don't want to block this PR from merging. Please consider this my approval, just as a comment that allows you to click the "Merge" button. This isn't a "real" review, because like you (#4903 (comment)) I don't understand why this fixes the original issue.

@jameslamb (author):

OK, thanks @StrikerRUS! I'm going to try to investigate a bit more before merging this.

@jameslamb (author) commented Jan 3, 2022

I'm going to try to investigate a bit more

Happy to say I found the root cause for this bug, and was able to revert the additional call to Booster$reset_parameter()! See #4903 (comment) for a detailed description.

This taught me something important about LightGBM's internals. Any time config.Set() is called, it will globally reset LightGBM's log level.

LightGBM/src/io/config.cpp

Lines 241 to 249 in af5b40e

if (verbosity == 1) {
  LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Info);
} else if (verbosity == 0) {
  LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Warning);
} else if (verbosity >= 2) {
  LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Debug);
} else {
  LightGBM::Log::ResetLogLevel(LightGBM::LogLevel::Fatal);
}

This becomes important considering #4904 (comment)

throughout LightGBM, "params" isn't used to represent "all of the configuration for LightGBM" but more like "overrides to default configuration values"

Any code that creates a Config object on the C/C++ side from a params_str that doesn't include verbosity will end up resetting the log level to INFO for the current process, until some other call sets it to something else.

@StrikerRUS (Collaborator) left a comment:

Brilliant investigation! I'm super excited that you've uncovered a deep underlying bug. Thank you very much!

(Review thread on R-package/R/lgb.Dataset.R: resolved.)
@jameslamb jameslamb merged commit 1d8bfd8 into master Jan 6, 2022
@jameslamb jameslamb deleted the fix/r-lgb-cv-verbose branch January 6, 2022 00:55
Successfully merging this pull request may close these issues.

[R-package] lgb.cv ignores verbose argument