[python-package] add scikit-learn-style API for early stopping #5808
base: master
---
Why was this type hint removed? If it was just an accident, please put it back to reduce the size of the diff.
---
There are not any tests in `test_sklearn.py` that pass `validation_fraction`. Please add some, covering both the default behavior and that passing a non-default value (like `0.4`) works as expected.

I don't know the exact code paths off the top of my head and would appreciate it if you could investigate... but I think it should be possible to test this by checking the size of the datasets added to `valid_sets` and confirming that they're as expected (e.g. that the automatically-added validation set has 4,000 rows if the input data `X` has 40,000 rows and `validation_fraction=0.1` is passed).

If that's not observable through the public API, try to use mocking/patching to observe it instead of adding any additional properties to the Booster / estimators' public API.

Comment in-thread here if you have questions or need help with that.
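A test along those lines could use `unittest.mock.patch` as a spy on the internal split. The sketch below is illustrative only: `ToyEstimator` and `_make_valid_set` are hypothetical stand-ins, not LightGBM code; a real test would patch whatever lightgbm internal assembles `valid_sets`.

```python
from unittest.mock import patch

class ToyEstimator:
    """Hypothetical stand-in for an estimator that builds its own validation set."""

    def _make_valid_set(self, X, validation_fraction):
        # hold out the first `validation_fraction` share of rows
        return X[: int(len(X) * validation_fraction)]

    def fit(self, X, validation_fraction=0.1):
        return len(self._make_valid_set(X, validation_fraction))

def test_validation_fraction():
    est = ToyEstimator()
    # spy on the internal split without changing its behaviour
    with patch.object(ToyEstimator, "_make_valid_set",
                      side_effect=est._make_valid_set) as spy:
        n_valid = est.fit(list(range(40_000)), validation_fraction=0.1)
    assert n_valid == 4_000              # 10% of 40,000 rows
    assert spy.call_args.args[1] == 0.1  # the fraction reached the split
```

Using `side_effect` with the original bound method keeps the real splitting behaviour while still recording the call, so no extra public properties are needed on the estimator.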
---
Thanks for your comment. I have added a couple of tests for that, using patching; let me know if you think there is more that can be improved in those tests.
---
In #5808 (comment), you said:

But this block of code directly contradicts that... it enables early stopping by default if there are more than 10K rows in the training data.

From all the prior discussion on this PR, it seems clear to me that @borchero and @jmoralez both requested that you not do that, and instead limit the scope of this PR to just providing an API for early stopping that matches how `scikit-learn` estimators do it. I agree with them... it shouldn't be enabled by default as part of this PR.

Please do the following:
- revert the PR title to "[python-package] add scikit-learn-style API for early stopping"
- remove support for `early_stopping_round = "auto"` (and the corresponding changes further down, where you moved parameter processing to after input data validation so it could reference `self._n_rows_train`)
- remove the rule that `HistGradientBoostingClassifier` / `HistGradientBoostingRegressor` currently have for enabling early stopping

---
Hey @jameslamb, thanks a lot for your review; I really appreciate it.

You're right, it was a mistake on my side; I have fixed it in my latest commits. Thank you for spotting it.
In the latest version, `HistGradientBoostingClassifier` / `HistGradientBoostingRegressor` and the current implementation match, except that:

- `HistGradientBoostingClassifier` / `HistGradientBoostingRegressor` support `early_stopping_round = "auto"`, as this is their default parameter value, but as we discussed above we do not want this behaviour.
- `early_stopping` is an alias for `early_stopping_round(s)` and `n_iter_no_change`. In the current implementation, setting `early_stopping_round(s)` to an integer will override the default value for `n_iter_no_change`, but setting `early_stopping` to an integer will not enable early stopping and will not override the default value for `n_iter_no_change`. I have opted for this implementation to preserve backward compatibility as much as possible.

Please let me know if you agree with it or have a different opinion; I'm happy to change it.
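The behaviour described above can be illustrated with a toy resolver (hypothetical names and logic, not the PR's actual code): an integer `early_stopping_rounds` overrides the default patience, while an integer `early_stopping` neither enables early stopping nor changes the default.

```python
def effective_patience(early_stopping=False, early_stopping_rounds=None, default_rounds=10):
    """Toy illustration (not LightGBM code) of the semantics described above."""
    if early_stopping_rounds is not None:
        return early_stopping_rounds   # explicit rounds override the default
    if early_stopping is True:
        return default_rounds          # enabled, with the default patience
    return None                        # disabled, including early_stopping=<int>

assert effective_patience(early_stopping_rounds=25) == 25
assert effective_patience(early_stopping=True) == 10
assert effective_patience(early_stopping=7) is None  # an int does NOT enable it
```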
---
Thanks for that explanation.
Agreed, thank you for removing it.
Thanks for being careful about backwards compatibility!
It's unfortunate that we have `early_stopping` as an alias for `early_stopping_rounds` and that `scikit-learn`'s estimators now assign a different meaning to that :/

LightGBM/include/LightGBM/config.h, lines 394 to 397 in 1090a93

https://github.com/scikit-learn/scikit-learn/blob/6cccd99aee3483eb0f7562afdd3179ccccab0b1d/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py#L1572-L1575

Seems like that has been in `scikit-learn` for a long time (since v0.23):

https://github.com/scikit-learn/scikit-learn/blob/6cccd99aee3483eb0f7562afdd3179ccccab0b1d/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py#L1577
I see your implementation of this comment in `LGBMModel._process_params()`.

I think I see a path to resolving this. It'll be complex, but it would keep LightGBM in compliance with the rules from https://scikit-learn.org/1.5/developers/develop.html#instantiation.

- for `n_iter_no_change` (and property `self.n_iter_no_change`)... ensure there are unit tests confirming that passing keyword argument `n_iter_no_change` through the scikit-learn estimators correctly sets the number of early stopping rounds
- have `early_stopping` have type `Union[bool, int]`
- in `LGBMModel._process_params()`, rewrite this block:

  LightGBM/python-package/lightgbm/sklearn.py, lines 850 to 854 in 1090a93

  to something like this:

And then any other places in the code where you have `self.early_stopping is True` should use `isinstance(self.early_stopping, bool) and self.early_stopping is True`, to avoid this SyntaxWarning:

I think that can work and provide backwards compatibility while allowing us to match scikit-learn's API. And then eventually (after several `lightgbm` releases), that warning could be converted to an error and the type of `early_stopping` could be narrowed to just `bool`.

What do you think?
---
That's a great suggestion, thanks a lot! I have implemented it in the latest version.
---
I am not a huge fan of how the validation set is created from the train set using the `_make_n_folds` function: if `1 / validation_fraction` is not an integer, the actual validation set size will not match the validation fraction specified by the user. For example, if the validation fraction is 0.4, the number of splits calculated here will be 2, which results in a fraction of 0.5 instead of 0.4.

Using something like `train_test_split` from scikit-learn would solve the issue for the classification and regression cases, but for ranking tasks our best option is `GroupShuffleSplit`, which will inevitably suffer from the same issue described above. The options I thought of to solve this issue are:

I would lean more towards option 2, but this will make the PR bigger.

@jameslamb I would like to hear your opinion on it; do you perhaps already have something else in mind?
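To make the rounding concern concrete, here is a small sketch (hypothetical helper names; the first function mimics how deriving a split count from the fraction behaves, it is not `_make_n_folds` itself):

```python
def fraction_via_n_folds(validation_fraction):
    # assumption: n-folds-style code derives an integer split count from the fraction
    n_splits = round(1 / validation_fraction)
    return 1 / n_splits  # the fraction actually delivered

def fraction_via_direct_split(n_rows, validation_fraction):
    # sizing the hold-out set directly honours the requested fraction
    n_valid = round(n_rows * validation_fraction)
    return n_valid / n_rows

assert fraction_via_n_folds(0.4) == 0.5               # 0.4 requested, 0.5 delivered
assert fraction_via_direct_split(10_000, 0.4) == 0.4  # matches the request
```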