-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Load back parameters from saved model file (fixes #2613) #4802
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @zyxue , sorry for the very long delay without a review! We really appreciate you working on this. I'm ready to review this and help move it forward.
- Please remove all commented-out code.
- Please merge in the latest changes from
master
- Please try to get this to a state where at least one test in the Python package is passing for one parameter...then we can iterate from there based on review comments.
At first glance, I have one suggestion about a change to the approach...instead of passing the raw content of the parameters
block from a text format back through LGBM_BoosterGetConfig()
, I think it would be better to have code on the C/C++ side parse that information and pass back a JSON string.
That way, every interface to LightGBM (for example, the Python and R packages in this repo) doesn't need to have some version of this code (with one line per parameter):
for line in io.StringIO(_params_str):
if line.startswith('[boosting: '):
self.params['boosting'] = line.strip().replace(f"[boosting: ", "").replace("]", "")
and could instead just pass the result of LGBM_BoosterGetConfig()
to a JSON parser like json
in Python or {jsonlite}
in R.
@zyxue thanks for coming back to this pull request! As you work on it, please do not rebase + force push. Use merge commits instead (e.g. This project squashes all pull request commits into a single commit on merge, so you don't need to worry about having too many commits here. |
@jameslamb , gotcha, thank you for the tip. |
Is https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/utils/json11.h expected to be the same as https://raw.githubusercontent.com/dropbox/json11/master/json11.hpp ? the example on https://github.com/dropbox/json11 doesn't seem to work with the json11 in lightgbm... |
That is where the |
@jameslamb , I've reimplemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much! I left some suggestions for a reorganization of the code on the C/C++ side. But let's see what other maintainers say.
char* out_str | ||
) { | ||
API_BEGIN(); | ||
Booster* ref_booster = reinterpret_cast<Booster*>(handle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to propose a different organization of this code, but please don't make any changes to the organization until another maintainer like @StrikerRUS, @shiyu1994, or @tongwu-msft comments.
Instead of all this logic being in src/c_api.cpp
, I think most of the implementation for LGBM_BoosterGetConfig()
should live somewhere else. Similar to how LGBM_BoosterSaveModelToString()
is defined here in src/c_api.cpp
.
Line 2233 in cf38071
int LGBM_BoosterSaveModelToString(BoosterHandle handle, |
And then it is just a thin wrapper on GBDT::SaveModelToString()
.
LightGBM/src/boosting/gbdt_model_text.cpp
Line 410 in a06fadf
bool GBDT::SaveModelToFile(int start_iteration, int num_iteration, int feature_importance_type, const char* filename) const { |
I also think it's desirable for any code that needs to list every parameter by name should be in the automatically-generated methods in src/io/config_auto.cpp
.
So I think this should be approached as following:
- Add a method like
GBDT::GetLoadedConfig()
which reformatsloaded_parameter_
into the appropriate format a parameters string and then initializes aConfig
object by callingConfig::GetMembersFromString()
LightGBM/src/io/config_auto.cpp
Line 321 in cf38071
void Config::GetMembersFromString(const std::unordered_map<std::string, std::string>& params) {
- Add a method in
src/io/config_auto.cpp
likeConfig::DumpConfigToJson()
, which returns a JSON representation of aConfig
object.- NOTE: all the code there is generated by https://github.com/microsoft/LightGBM/blob/a06fadfb7ac3fdb26da3d4afc061a8b976070c50/helpers/parameter_generator.py. But if I was working on this, my approach would be to just write code directly in
src/io/config_auto.cpp
, get the tests into a good state, and then makeparameter_generator.py
write that code. - this will be useful if LightGBM adopts the proposal at [RFC] Unify model format customize string or Json #4887 in the future
- NOTE: all the code there is generated by https://github.com/microsoft/LightGBM/blob/a06fadfb7ac3fdb26da3d4afc061a8b976070c50/helpers/parameter_generator.py. But if I was working on this, my approach would be to just write code directly in
- Alter
LGBM_BoosterGetConfig()
to get the output ofConfig::DumpConfigToJson()
.
Sorry if this is overly complicated. Let's see what other maintainers say. And we are here to help! Thanks again for your help with this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reformats
loaded_parameter_
into the appropriate format a parameters string
@jameslamb , do you actually mean reformatting loaded_parameter_
into a std::unordered_map<std::string, std::string>
object?
I'm a bit confused by the name GetMembersFromString
, it actually means GetMembersFrom an unordered_map, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean doing whatever is necessary to turn loaded_parameter_
into a Config
object, yes. You're right that the method Config::GetMembersFromString()
takes an unordered map with string keys and string values.
|
As I said, please wait until another maintainer offers their opinion. While you wait on that, you can try working through the other suggestions like #4802 (comment). |
} else if (line.rfind("[tree_learner: ", 0) == 0) { | ||
obj["tree_learner"] = Json{extract_param("tree_learner", line)}; | ||
} else if (line.rfind("[verbose: ", 0) == 0) { | ||
obj["verbose"] = Json{std::stoi(extract_param("verbose", line))}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can maintain a Key-value map, and load parameters in a for-loop, rather than use many if-elsees.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per #5244, I would find extracting the categorical_feature
param helpful as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would find extracting the categorical_feature param helpful as well
this is now being addressed in #5424 . @johnpaulett if you're interested in influencing the shape that this support takes in LightGBM, you're welcome to leave review comments on that PR.
@shiyu1994 can you take a look for this PR? |
Due to a lack of activity on this pull request, we've decided to move forward with a separate PR for this feature in #5424, so I'm going to close this. @zyxue thanks very much for your interest in LightGBM and for attempting this! If you have more time to work with us in the future, we'd welcome additional contributions. I'd be happy to suggest some smaller contributions which might not involve as much discussion and require as much of your time and energy. |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
This is still just an idea to collect feedbacks.
The approach in this PR:
loaded_parameter_
an attribute ofBoosting
instead ofGBDT
so it can be accessed inc_api.cpp
.LGBM_BoosterGetConfig
to returnloaded_parameter_
as a string.loaded_parameter_
can be parsed in Python code into proper types for theparams
dictionary.test
test_booster_load_params_when_passed_model_str
passesThis PR is related to #2613
A couple of questions:
loaded_parameter_
a parameter ofBoosting
instead ofGBDT
already? The other boosting types don't have parameters?c_api.cpp
can only deal withgbdt
as I see it's hard-coded inLightGBM/src/c_api.cpp
Line 109 in 874e635