
[python-package][R-package] load parameters from model file (fixes #2613) #5424

Merged
merged 28 commits on Oct 11, 2022

Conversation

jmoralez
Collaborator

Given that #4802 seems to have stalled, this PR intends to supersede it, mainly to make #4323 easier by providing access to the features that would have to be turned into factors, and to unblock #5246.

Fixes #2613.

@jmoralez
Collaborator Author

@StrikerRUS @jameslamb I'd appreciate your take on this approach. If you agree I can make the required changes to the R-package as well. The main idea (sketched in Python after the list) is:

  1. Parse the contents of loaded_parameter_ into a JSON string.
  2. Parse the parameter types from the config.
  3. Load the parameters as strings in the wrappers and parse them to their corresponding types.
  4. Assign the result as the parameters of the created Booster when calling the load_from_string/load_from_file methods.
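
To make the flow concrete, here is a rough Python sketch of steps 2-4. The names (PARAM_TYPES, cast_loaded_params, raw_params) are illustrative placeholders, not the actual helpers added in this PR:

    # Hypothetical sketch: cast the parameter strings parsed out of the model
    # file into the types declared in the config.
    def parse_bool(value: str) -> bool:
        return value.lower() in ("1", "true")

    PARAM_TYPES = {
        "learning_rate": float,     # double in the config
        "num_leaves": int,          # int in the config
        "linear_tree": parse_bool,  # bool in the config
        "objective": str,           # string in the config
    }

    def cast_loaded_params(raw_params: dict) -> dict:
        """raw_params maps parameter names to the strings stored in the model file."""
        return {name: PARAM_TYPES.get(name, str)(value) for name, value in raw_params.items()}

    cast_loaded_params({"learning_rate": "0.05", "num_leaves": "31"})
    # -> {"learning_rate": 0.05, "num_leaves": 31}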

@jmoralez jmoralez changed the title from "Retrieve params" to "load parameters from model file" on Aug 16, 2022
@StrikerRUS
Collaborator

I'm afraid some floating point values will not survive float->string->float round trip...
Do we need something like the following?
dmlc/xgboost#5772

@jmoralez
Collaborator Author

I'm afraid some floating point values will not survive float->string->float round trip...

Yes, this will probably lose precision in some cases. We can address loading the parameters here and maybe open a feature request for making the round-tripped floating point values match the originals as closely as possible.
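
To illustrate the concern with a small Python example (not tied to LightGBM's actual serialization format), a float written with limited precision does not round-trip exactly:

    x = 1.0 / 3.0
    s = f"{x:.6g}"          # serialize with 6 significant digits -> "0.333333"
    assert float(s) != x    # the value read back differs from the original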

@StrikerRUS
Collaborator

Well, I don't think there can be a better solution than the proposed one.

@jmoralez jmoralez marked this pull request as ready for review August 29, 2022 17:04
@jmoralez
Collaborator Author

I don't think precision loss in the parameters is that big of a deal. There's already some loss in the thresholds and leaf values when writing the model file, so the parameters should be fine.

Collaborator

@jameslamb jameslamb left a comment


Left one quick comment about testing which might help in your work on this. I'll try to review more thoroughly later this week.

R-package/tests/testthat/test_lgb.Booster.R (outdated review thread)
@jmoralez
Collaborator Author

@shiyu1994 does cuda_exp set force_col_wise to false?

@StrikerRUS
Collaborator

@jmoralez

does cuda_exp set force_col_wise to false?

Yes. Some GPU parameters are hardcoded:

LightGBM/src/io/config.cpp, lines 336 to 355 in 83627ff:

    if (device_type == std::string("gpu") || device_type == std::string("cuda")) {
      // force col-wise for gpu, and cuda version
      force_col_wise = true;
      force_row_wise = false;
      if (deterministic) {
        Log::Warning("Although \"deterministic\" is set, the results ran by GPU may be non-deterministic.");
      }
    } else if (device_type == std::string("cuda_exp")) {
      // force row-wise for cuda_exp version
      force_col_wise = false;
      force_row_wise = true;
      if (deterministic) {
        Log::Warning("Although \"deterministic\" is set, the results ran by GPU may be non-deterministic.");
      }
    }
    // force gpu_use_dp for CUDA
    if (device_type == std::string("cuda") && !gpu_use_dp) {
      Log::Warning("CUDA currently requires double precision calculations.");
      gpu_use_dp = true;
    }

@shiyu1994
Collaborator

@jmoralez
Yes, cuda_exp forces force_col_wise=false and force_row_wise=true.

@jameslamb jameslamb self-requested a review August 31, 2022 19:38
@jameslamb
Collaborator

jameslamb commented Aug 31, 2022

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/2967193209

Status: success ✔️.

Collaborator

@jameslamb jameslamb left a comment


I was able to review this thoroughly today.

I love that you were able to accomplish this with only private changes to the R and Python packages, and one new public entrypoint in c_api. Nice work! That makes me confident that we could change this implementation in the future without breaking users.

I don't have any specific suggestions at the moment (tests look awesome!), but I asked two questions that I feel I need a better understanding of before I can approve this.

@@ -2764,6 +2764,7 @@ def __init__(
     ctypes.byref(out_num_class)))
     self.__num_class = out_num_class.value
     self.pandas_categorical = _load_pandas_categorical(file_name=model_file)
+    params = self._get_loaded_param()
Collaborator


Suggested change
-    params = self._get_loaded_param()
+    params_from_model_file = self._get_loaded_param()
+    params = {**params_from_model_file, **params}

In this code block, there are two sources of parameter values:

  • params keyword argument from the constructor
  • parameters parsed out of the model file

I think that in the R and Python packages, wherever those conflict, the value passed through the keyword argument params should be preferred. But I'd like to hear your and @StrikerRUS's opinions before you make any changes.
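
For reference, in a merge like the one suggested above, keys from the second dict win, so explicitly passed params would override whatever was read from the model file:

    params_from_model_file = {"learning_rate": 0.05, "num_leaves": 31}
    params = {"learning_rate": 0.1}  # passed to the constructor
    merged = {**params_from_model_file, **params}
    # merged == {"learning_rate": 0.1, "num_leaves": 31}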

Why we might want to support this behavior

That's consistent with how the R and Python packages treat the params argument generally (#4904 (comment)).

And it'd be useful for non-Dask distributed training (for example), where some of the parameters like machines might change between the training run that produced an initial model and another one that performs training continuation.

Why we might not want to support this behavior

I believe that the R and Python packages already ignore passed-in params if you're creating a Booster from a model file.

So maybe we want to continue to ignore them until a specific user report or reproducible example demonstrates that it's a problem or surprising.

Collaborator Author

@jmoralez jmoralez Sep 1, 2022


I prefer ignoring them, since they won't be used by the loaded booster and overriding them could cause confusion. We can raise a warning when both params and model_file are passed to the Booster constructor to make this more evident. When using the loaded booster as the init_model, users can override the parameters for the new iterations in the train function, so that should be enough.

I haven't done much incremental learning so I hadn't given much thought to this haha, please feel free to correct me.
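
As a rough illustration of that idea (not the exact code added later in this PR), the constructor could warn and then keep the values loaded from the file:

    import warnings

    # Hypothetical check inside Booster.__init__ for the model_file branch.
    if params:
        warnings.warn(
            "Ignoring params argument, using parameters from model file.",
            UserWarning,
        )
    params = self._get_loaded_param()  # parameters parsed from the model file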

Collaborator


We can raise a warning when passing both params and model_file to the Booster constructor to make this more evident.

I strongly believe we need a warning for users here.

And I think it's better to override the params loaded from the model string with the params passed in via kwargs.

Collaborator Author


Sure, I can add the warning. What would be an example use case for overriding the parameters from the model file when loading the booster?

Collaborator


Anything around continued training might result in loading a model from a file and then performing additional training with different parameters.

For examples:

  • my example in this thread, about wanting to change machines (since the IP addresses of machines used for distributed training might be different from when the model file was created)
  • wanting to use a different num_leaves or learning_rate when performing an update like "train a few more iterations on newly-arrived data" (see my explanation in this Stack Overflow answer)

Collaborator Author


But I'm struggling to understand when Booster.update would be called on a model loaded from a file. It requires a bit of setup, like defining the training set, and if you pass the training set to the constructor it takes a different path where it calls BoosterCreate with only the training set and the parameters.

The other case would be passing it to lgb.train as init_model and that already allows you to define the parameters for the new iterations.

Comment on lines +1226 to +1227
assert set_params == params
assert bst.params['categorical_feature'] == [1, 2]
Collaborator


These tests are checking that the params attribute on the Python Booster object matches what is in the file... but that being true doesn't necessarily guarantee that if you used this Booster for continued training it would still actually use those parameters on the C++ side, right?

I'm bringing this up because I'm struggling to understand the relationship between GBDT::loaded_parameter_

loaded_parameter_ = ss.str();

and GBDT::config_

/*! \brief Config of gbdt */
std::unique_ptr<Config> config_;

In #2613 (comment), I'd recommended using GBDT::config_ as the source of the parameter information, because it seems to me like the parameters section of the model file is loaded up into that property loaded_parameter_, but not actually used to configure the GBDT object. But now I'm not sure that that's right either haha.

For example... if you tried continued training after loading a model from a text file like this, would LightGBM respect feature_fraction=0.7? Or would it fall back to the default of 1.0?
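
One way to check that empirically (a sketch, assuming a training dataset X, y is available) would be to train with feature_fraction=0.7, save the model, and then continue training from the saved file:

    import lightgbm as lgb

    train_set = lgb.Dataset(X, y)
    bst = lgb.train({"feature_fraction": 0.7}, train_set, num_boost_round=5)
    bst.save_model("model.txt")

    # Continue training from the file: does the new model still use
    # feature_fraction=0.7, or does it fall back to the default of 1.0?
    bst2 = lgb.train({}, train_set, num_boost_round=5, init_model="model.txt")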

Collaborator Author


that being true doesn't necessarily guarantee that if you used this Booster for continued training that it would still actually use those parameters at the C++ side, right?

I believe the loaded_parameter_ attribute was added in #1495 to store the model parameters, but it isn't actually used anywhere other than when writing the parameters back to a file when there's no config.

    if (config_ != nullptr) {
      ss << "\nparameters:" << '\n';
      ss << config_->ToString() << "\n";
      ss << "end of parameters" << '\n';
    } else if (!loaded_parameter_.empty()) {
      ss << "\nparameters:" << '\n';
      ss << loaded_parameter_ << "\n";
      ss << "end of parameters" << '\n';
    }
Not sure when this happens though; maybe from the CLI when using the refit task.

So I see these parameters more as "informative" about how the Booster was trained, rather than something to be used for continued training. The original issue wanted objective to be loaded back for use in SHAP, and I think having the parameters may help when trying to replicate previous trainings or when categorical_feature is needed for inference. I actually thought of adding an argument to decide whether or not to load them; they may not always be useful, and that computation could then be skipped.
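
For example, the kind of usage this enables (a sketch, assuming a previously saved model.txt):

    import lightgbm as lgb

    bst = lgb.Booster(model_file="model.txt")
    # With this PR the parameters from the model file are available again:
    print(bst.params["objective"])                # e.g. "regression"
    print(bst.params.get("categorical_feature"))  # e.g. [1, 2]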

Collaborator


@guolinke Could you please help us understand the connection between config_ and loaded_parameter_?

Collaborator

@guolinke guolinke Sep 8, 2022


IMO, I think the "loaded_parameter_" is used to check the parameters used for the model file. For continued training, the current (new) "config_" is used.

Collaborator Author


I got them from loaded_parameter_ because it seemed easier. If the contents of both attributes are equivalent we can document that, i.e. when loading a model from a file, the contents of the booster's params attribute are also the configuration for continued training.

Collaborator


@jmoralez @jameslamb any further comments about this?

Collaborator Author


I added a warning about passing parameters to the booster constructor in 9467814. In order to call Booster.update you need a training set, and if you initialize a booster with a training set, lgb.Booster(model_file=..., train_set=...), the model_file argument gets ignored, so I don't think there's a way to do incremental training based on the Booster object alone. Thus, I don't think it really matters what the parameters of the loaded booster are on the cpp side, because incremental training would go through lgb.train, which already supports overriding the parameters for the new iterations. WDYT?
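
For completeness, the incremental-training path referred to here looks roughly like this (a sketch, assuming new_X and new_y hold newly arrived data):

    import lightgbm as lgb

    new_data = lgb.Dataset(new_X, new_y)
    # Parameters passed here apply to the additional iterations, regardless of
    # what is stored in the model file used as init_model.
    bst = lgb.train(
        {"learning_rate": 0.01, "num_leaves": 63},
        new_data,
        num_boost_round=10,
        init_model="model.txt",
    )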

Collaborator


it sounds good to me

Collaborator


Thanks @jmoralez

It's an excellent point that Booster's constructor has logic like

if train_set is not None:
    ...
elif model_file is not None:
    ...

Meaning that you can't have both. I agree with the warning you've added and appreciate you adding a test for it as well!

I'm sorry for holding up this PR for so long over this. I'm still confused about the relationship between config_ and loaded_parameter_, but I don't have any more concerns about this specific PR.

Collaborator


I should also add, as I mentioned in #5424 (review) ... I think we should merge this because future changes probably wouldn't require any user-facing breaking changes. The fact that the new function you've added in c_api takes in a BoosterHandle means that we could swap it to using a different property of the Booster (e.g. config_ instead of loaded_parameter_) in the future without an API-level breaking change. Very nice!

@jmoralez jmoralez changed the title from "load parameters from model file" to "[python-package][R-package] load parameters from model file (fixes #2613)" on Sep 1, 2022
Collaborator

@jameslamb jameslamb left a comment


Thanks for the awesome work, and sorry I held this up for so long. The logic for how parameters flow through LightGBM is fairly complicated, and I wanted to be sure I understood the full implications of this change.

I think we should merge this 😀

@jameslamb
Collaborator

jameslamb commented Oct 11, 2022

@jmoralez You put SO MUCH work into this, only seems right that you be the one to push the merge button 😊

@github-actions

This pull request has been automatically locked because there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023

Successfully merging this pull request may close these issues.

Load back saved parameters with save_model to Booster object
5 participants