[python-package] Fix inconsistency in predict() output shape for 1-tree models #6753
Conversation
python-package/lightgbm/basic.py (Outdated)

```diff
@@ -1248,7 +1248,7 @@ def predict(
         if pred_leaf:
             preds = preds.astype(np.int32)
         is_sparse = isinstance(preds, (list, scipy.sparse.spmatrix))
-        if not is_sparse and preds.size != nrow:
+        if not is_sparse and preds.size != nrow or pred_leaf:
```
Should we add `or pred_contrib` here as well? From the C API docs:

- for normal and raw score, its length is equal to `num_class * num_data`;
- for leaf index, its length is equal to `num_class * num_data * num_iteration`;
- for feature contributions, its length is equal to `num_class * num_data * (num_feature + 1)`.

https://lightgbm.readthedocs.io/en/latest/C-API.html#c.LGBM_BoosterPredictForMat
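To make those length formulas concrete, here is a small arithmetic sketch; the dimension values are made up purely for illustration, not taken from this PR:

```python
# hypothetical dimensions, chosen only to illustrate the C API length formulas
num_class, num_data, num_iteration, num_feature = 3, 100, 5, 10

normal_len = num_class * num_data                       # normal / raw score
leaf_len = num_class * num_data * num_iteration         # leaf index
contrib_len = num_class * num_data * (num_feature + 1)  # feature contributions

print(normal_len, leaf_len, contrib_len)
# 300 1500 3300
```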
I think it seems unnecessary since there's no place where the shape is used after `pred_contrib=True`. `pred_leaf` is necessary because of LightGBM/python-package/lightgbm/basic.py, lines 4882 to 4888 in b33a12e:

```python
leaf_preds: np.ndarray = predictor.predict(  # type: ignore[assignment]
    data=data,
    start_iteration=-1,
    pred_leaf=True,
    validate_features=validate_features,
)
nrow, ncol = leaf_preds.shape
```
> there's no place where the shape is used after pred_contrib=True.

@RektPunk `Booster.predict()` is an important part of LightGBM's public API. If it's producing incorrect results, that needs to be fixed. We've set the expectation that `.predict(X, pred_leaf=True)`, for example, returns a matrix with shape `(X.shape[0], num_trees)` for regression, and right now that is not true for a single-tree model.
```python
import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000, n_features=10)
dtrain = lgb.Dataset(X, label=y)
params = {
    "objective": "regression",
    "verbosity": -1,
}
lgb.train(params, dtrain, num_boost_round=1).predict(X, pred_leaf=True).shape
# (10000,)
lgb.train(params, dtrain, num_boost_round=2).predict(X, pred_leaf=True).shape
# (10000, 2)
```
Even if no other code inside `lightgbm` referenced the `.shape` attribute of that output, this would still be a bug and we'd still want to fix it... because any other code using `lightgbm` should be able to form a consistent expectation of the return shape, and not need to have a special case for 1-tree models.

> Should we add `or pred_contrib` here as well?

I think we should, for completeness, but the existing condition is technically already enough to get the correct output shape. `preds.size != nrow` will always be true for `pred_contrib`, because the output contains 1 value per feature AND the Shapley base values. In other words... it will always have `nrow * (num_features + 1)` elements, so the reshaping will always be done, so there's no bug here for `pred_contrib` like there is for `pred_leaf`.
So please, let's do the following:

- change this condition to `if not is_sparse and preds.size != nrow or (pred_leaf or pred_contrib)`
- add explicit tests checking the output shapes for `pred_leaf` and `pred_contrib` with 1-iteration and 2-iteration models (using the snippet I provided above as a reference)
@RektPunk if you'd prefer that I add those additional tests, let me know and I'll push to your branch.
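As a rough sketch of the reasoning in this thread, the reshaping decision can be mimicked with plain NumPy. The `reshape_preds` helper below is hypothetical, a simplified stand-in for the logic in `basic.py`, not LightGBM's actual code:

```python
import numpy as np


def reshape_preds(preds, nrow, pred_leaf=False, pred_contrib=False):
    # simplified stand-in for the condition in basic.py: reshape whenever the
    # flat output has a size other than nrow, or whenever pred_leaf /
    # pred_contrib promise a 2-D result (even with a single column)
    if preds.size != nrow or pred_leaf or pred_contrib:
        preds = preds.reshape(nrow, -1)
    return preds


# 1-tree leaf predictions: flat length equals nrow, so a size-only check
# would skip the reshape and return a 1-D array
one_tree = reshape_preds(np.zeros(10, dtype=np.int32), nrow=10, pred_leaf=True)
print(one_tree.shape)
# (10, 1)

# 2-tree leaf predictions: flat length is 2 * nrow, so the reshape always ran
two_tree = reshape_preds(np.zeros(20, dtype=np.int32), nrow=10, pred_leaf=True)
print(two_tree.shape)
# (10, 2)
```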
I changed the condition to `if not is_sparse and (preds.size != nrow or pred_leaf or pred_contrib)` since the previous one made a test fail: link.
Ah great, thank you! Sorry for suggesting the wrong form... I never remember exactly how a condition like `if A and B or C or D` is evaluated in Python 😅 The parentheses make it easier to understand.
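For reference, `and` binds more tightly than `or` in Python, which is why the original suggestion grouped differently than intended. A quick check with arbitrary boolean values:

```python
A, B, C, D = False, True, True, False

# `A and B or C or D` parses as `((A and B) or C) or D` ...
assert (A and B or C or D) == (((A and B) or C) or D)

# ... which is not the same as the intended grouping `A and (B or C or D)`
print(A and B or C or D)    # -> True  (C short-circuits the `or` chain)
print(A and (B or C or D))  # -> False (A is False, so `and` fails)
```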
No worries at all, thank you so much for your kind suggestion :)
Thanks for this! I tested this locally and confirmed that this fixes the reported bug... thanks for diagnosing that and taking the time to submit a patch!

I left some suggestions on improving `lightgbm`'s tests to improve our confidence in this and prevent future refactorings from re-introducing bugs like this.

I've also changed the title of the PR, to reflect the fact that the bug here is not specific to `refit()`.
```python
params = {"objective": "binary", "num_trees": 1, "verbosity": -1}
model = lgb.train(params, lgb_train, num_boost_round=1)
```
```diff
-params = {"objective": "binary", "num_trees": 1, "verbosity": -1}
-model = lgb.train(params, lgb_train, num_boost_round=1)
+params = {"objective": "binary", "verbosity": -1}
+model = lgb.train(params, lgb_train, num_boost_round=1)
```
Passing both `num_trees` through `params` and the keyword argument `num_boost_round=1` is unnecessary... they mean the same thing. Let's just use one here.

Can you also please add tests for regression and multi-class classification? Note that #6737 was specifically about regression... we should make sure this fixes that issue for that specific case.
For the multi-class problem, since `preds.size != nrow`, I think there is no failure without this fix.

That's fine, but please add a multi-class classification test anyway. That'll help prevent us from introducing this bug for multi-class classification in the future.
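A quick arithmetic check of why multi-class never hit the missing-reshape branch; the dimensions here are made up for illustration:

```python
# with num_class > 1, the flat prediction buffer holds num_class * nrow
# elements, so `preds.size != nrow` is always true and the reshape always
# runs; only the single-output, single-tree case could skip it
nrow, num_class = 100, 3
preds_size = num_class * nrow
print(preds_size != nrow)
# True
```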
Title changed: "refit to work with a single Tree" to "predict() output shape for 1-tree models"
Thanks, left some more suggestions for the tests.
```python
dtrain = lgb.Dataset(X, label=y)
params = {"objective": "regression", "verbosity": -1}
assert lgb.train(params, dtrain, num_boost_round=1).predict(X, pred_leaf=True).shape == (10_000, 1)
assert lgb.train(params, dtrain, num_boost_round=2).predict(X, pred_leaf=True).shape == (10_000, 2)
```
Thanks!

Please put this test down here, by the other `.predict()` tests: LightGBM/tests/python_package_test/test_engine.py, line 3827 in b33a12e (`def test_predict_stump(rng, use_init_score):`).

And let's please:

- test all of `.predict(X, pred_leaf=True)`, `.predict(X, pred_contrib=True)`, and `.predict(X)`
- change the test names to e.g. `test_predict_regression_output_shape`
- add similar tests for binary classification and 3-class multi-class classification
Thanks! I added multiple shape-check tests for `predict`, including `pred_leaf` and `pred_contrib`, for the binary and multiclass cases as you mentioned, in d73f189.
```python
params = {"objective": "regression", "verbosity": -1}
model = lgb.train(params, lgb_train, num_boost_round=1)
model_refit = model.refit(X, y)
assert isinstance(model_refit, lgb.Booster)
```
If this test for regression is going to repeat all of the same code (totally fine, repetition in test code can be helpful!), then let's please just make it a separate test case.
```python
def test_refit_with_one_tree_regression():
    ...

def test_refit_with_one_tree_binary_classification():
    ...

def test_refit_with_one_tree_multiclass_classification():
    ...
```

That way, each test could be targeted individually, like `pytest './tests/python_package_test/test_engine.py::test_refit_with_one_tree_binary_classification'`.
Thanks, I added a multiclass example and split the tests in 03e41c6.
Thanks for the tests! I left some small suggestions for re-organizing them.
Those are mostly minor personal preference, so if you'd prefer that I push such changes here directly let me know and I'd be happy to.
```diff
@@ -2307,6 +2313,33 @@ def test_refit():
     assert err_pred > new_err_pred


+def test_refit_with_one_tree_regression():
+    X, y = make_regression(n_samples=10_000, n_features=10)
```
```diff
-X, y = make_regression(n_samples=10_000, n_features=10)
+X, y = make_regression(n_samples=1_000, n_features=2)
```
Let's use a slightly smaller dataset, to make this a little faster and to reduce how much memory it uses.
Sure, I changed the test to use a smaller example in 6ad5c49.

Hi @jameslamb, just a quick follow-up on the review. I've addressed the comments... please let me know if there's anything else I should look into. Thanks!
This looks great, thanks very much @RektPunk ! We really appreciate you picking up and fixing this bug 😊
Fixes #6737