
[python-package] Fix inconsistency in predict() output shape for 1-tree models #6753

Merged: 16 commits into microsoft:master from feat/fix-refit on Dec 22, 2024

Conversation

RektPunk
Contributor

@RektPunk RektPunk commented Dec 14, 2024

Fixes #6737

@@ -1248,7 +1248,7 @@ def predict(
         if pred_leaf:
             preds = preds.astype(np.int32)
         is_sparse = isinstance(preds, (list, scipy.sparse.spmatrix))
-        if not is_sparse and preds.size != nrow:
+        if not is_sparse and preds.size != nrow or pred_leaf:
Collaborator

Should we add or pred_contrib here as well?

for normal and raw score, its length is equal to num_class * num_data;
for leaf index, its length is equal to num_class * num_data * num_iteration;
for feature contributions, its length is equal to num_class * num_data * (num_feature + 1).
https://lightgbm.readthedocs.io/en/latest/C-API.html#c.LGBM_BoosterPredictForMat

Contributor Author

I think it's unnecessary, since there's no place where the shape is used after pred_contrib=True. pred_leaf is necessary because of

leaf_preds: np.ndarray = predictor.predict(  # type: ignore[assignment]
    data=data,
    start_iteration=-1,
    pred_leaf=True,
    validate_features=validate_features,
)
nrow, ncol = leaf_preds.shape

Collaborator

there's no place where the shape is used after pred_contrib=True.

@RektPunk Booster.predict() is an important part of LightGBM's public API. If it's producing incorrect results, that needs to be fixed.

We've set the expectation that .predict(X, pred_leaf=True), for example, returns a matrix with shape (X.shape[0], num_trees) for regression, and right now that is not true for a single-tree model.

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000, n_features=10)

dtrain = lgb.Dataset(X, label=y)
params = {
    "objective": "regression",
    "verbosity": -1,
}

lgb.train(params, dtrain, num_boost_round=1).predict(X, pred_leaf=True).shape
# (10000,)

lgb.train(params, dtrain, num_boost_round=2).predict(X, pred_leaf=True).shape
# (10000, 2)

Even if no other code inside lightgbm referenced the .shape attribute of that output, this would still be a bug and we'd still want to fix it... because any other code using lightgbm should be able to form a consistent expectation of the return shape, and not need to have a special case for 1-tree models.

Should we add or pred_contrib here as well?

I think we should, for completeness, but the existing condition is technically already enough to get the correct output shape.

preds.size != nrow will always be true for pred_contrib, because the output contains 1 value per feature AND the Shapley base values.

In other words... it will always have nrow*(num_features + 1) elements, so the reshaping will always be done, so there's no bug here for pred_contrib like there is for pred_leaf.
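A quick numpy sketch of that size argument, using illustrative numbers (not taken from the PR itself):

```python
import numpy as np

nrow, num_features = 1_000, 10

# pred_contrib output: one value per feature plus the Shapley base value,
# flattened to a 1-D array of nrow * (num_features + 1) elements.
preds = np.zeros(nrow * (num_features + 1))

# Because of the extra base-value column, the size can never equal nrow,
# so the existing `preds.size != nrow` check already triggers the reshape.
assert preds.size != nrow
contribs = preds.reshape(nrow, -1)
assert contribs.shape == (1_000, 11)
```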

So please, let's do the following:

  1. change this condition to if not is_sparse and preds.size != nrow or (pred_leaf or pred_contrib)
  2. add explicit tests checking the output shapes for pred_leaf and pred_contrib with 1-iteration and 2-iteration models (using the snippet I provided above as a reference)

@RektPunk if you'd prefer that I add those additional tests, let me know and I'll push to your branch.

Contributor Author

Thanks for the quick review. As you suggested, the condition was changed in 088a2b9, and an explicit test was added in ba39a6f. I'm not sure about the exact location for the newly added test, though.

Contributor Author

I changed the condition to if not is_sparse and (preds.size != nrow or pred_leaf or pred_contrib), since the previous one made a test fail: link.

Collaborator

Ah great, thank you! Sorry for suggesting the wrong form... I never remember exactly how a condition like if A and B or C or D is evaluated in Python 😅

The parentheses make it easier to understand.
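For reference, `and` binds tighter than `or` in Python, which is why the parenthesized form behaves differently:

```python
# `A and B or C` parses as `(A and B) or C`, not `A and (B or C)`.
# With A=False the unparenthesized condition is still True, while the
# parenthesized one short-circuits to False.
A, B, C = False, True, True
assert (A and B or C) is True
assert (A and (B or C)) is False
```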

Contributor Author

No worries at all, thank you so much for your kind suggestion :)

@StrikerRUS StrikerRUS added the fix label Dec 14, 2024
Collaborator

@jameslamb jameslamb left a comment

Thanks for this! I tested this locally and confirmed that this fixes the reported bug... thanks for diagnosing that and taking the time to submit a patch!

I left some suggestions on improving lightgbm's tests to improve our confidence in this and prevent future refactorings from re-introducing bugs like this.

I've also changed the title of the PR, to reflect the fact that the bug here is not specific to refit().

Comment on lines 2313 to 2314
params = {"objective": "binary", "num_trees": 1, "verbosity": -1}
model = lgb.train(params, lgb_train, num_boost_round=1)
Collaborator

Suggested change
-params = {"objective": "binary", "num_trees": 1, "verbosity": -1}
-model = lgb.train(params, lgb_train, num_boost_round=1)
+params = {"objective": "binary", "verbosity": -1}
+model = lgb.train(params, lgb_train, num_boost_round=1)

Passing both num_trees through params and the keyword argument num_boost_round=1 is unnecessary... they mean the same thing. Let's just use one here.

Can you also please add tests for regression and multi-class classification? Note that #6737 was specifically about regression... we should make sure this fixes that issue for that specific case.

Contributor Author

Sure, it is fixed in b6bb3c9, and a regression example is also added in 46a0ddc. For the multi-class problem, since preds.size != nrow, I think there is no failure even without this fix.

Collaborator

For multi-class problem, since preds.size != nrow, I think there is no fail without this fix.

That's fine, but please add a multi-class classification test anyway. That'll help prevent us from introducing this bug for multi-class classification in the future.


@jameslamb jameslamb changed the title [python-package] Enable refit to work with a single Tree [python-package] Fix inconsistency in predict() output shape for 1-tree models Dec 15, 2024
Collaborator

@jameslamb jameslamb left a comment

Thanks, left some more suggestions for the tests.

dtrain = lgb.Dataset(X, label=y)
params = {"objective": "regression", "verbosity": -1}
assert lgb.train(params, dtrain, num_boost_round=1).predict(X, pred_leaf=True).shape == (10_000, 1)
assert lgb.train(params, dtrain, num_boost_round=2).predict(X, pred_leaf=True).shape == (10_000, 2)
Collaborator

Thanks!

Please put this test down here by other .predict() tests:

def test_predict_stump(rng, use_init_score):

And let's please:

  • test all of .predict(X, pred_leaf=True), .predict(X, pred_contrib=True), and .predict(X)
  • change the test names to e.g. test_predict_regression_output_shape
  • add similar tests for binary classification and 3-class multi-class classification
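The shape contract these tests should pin down can be sketched as a small helper (illustrative only, not part of LightGBM's API; the column counts follow the C API documentation quoted earlier in this thread):

```python
def expected_predict_shape(nrow, num_features, num_trees, num_class=1,
                           pred_leaf=False, pred_contrib=False):
    """Expected Booster.predict() output shape after this fix (sketch)."""
    if pred_leaf:
        # one leaf index per tree per class, always 2-D after the fix
        return (nrow, num_class * num_trees)
    if pred_contrib:
        # one value per feature plus the base value, per class
        return (nrow, num_class * (num_features + 1))
    # normal / raw score: 1-D for single-output models
    return (nrow,) if num_class == 1 else (nrow, num_class)

# The 1-tree pred_leaf case is exactly the bug this PR fixes:
assert expected_predict_shape(10_000, 10, 1, pred_leaf=True) == (10_000, 1)
assert expected_predict_shape(10_000, 10, 2, pred_leaf=True) == (10_000, 2)
assert expected_predict_shape(10_000, 10, 1, pred_contrib=True) == (10_000, 11)
assert expected_predict_shape(10_000, 10, 1) == (10_000,)
```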

Contributor Author

Thanks! I added multiple shape-check tests for predict, including pred_leaf and pred_contrib for binary and multiclass as you mentioned, in d73f189.


params = {"objective": "regression", "verbosity": -1}
model = lgb.train(params, lgb_train, num_boost_round=1)
model_refit = model.refit(X, y)
assert isinstance(model_refit, lgb.Booster)
Collaborator

If this test for regression is going to repeat all of the same code (totally fine, repetition in test code can be helpful!), then let's please just make it a separate test case.

def test_refit_with_one_tree_regression():
   ...

def test_refit_with_one_tree_binary_classification():
   ...

def test_refit_with_one_tree_multiclass_classification():
   ...

That way, the test could be targeted individually like

pytest './tests/python_package_test/test_engine.py::test_refit_with_one_tree_binary_classification'

Contributor Author

Thanks, I added a multiclass example and split the tests in 03e41c6.

@RektPunk RektPunk requested a review from jameslamb December 15, 2024 08:49
Collaborator

@jameslamb jameslamb left a comment

Thanks for the tests! I left some small suggestions for re-organizing them.

Those are mostly minor personal preference, so if you'd prefer that I push such changes here directly let me know and I'd be happy to.

@@ -2307,6 +2313,33 @@ def test_refit():
assert err_pred > new_err_pred


def test_refit_with_one_tree_regression():
X, y = make_regression(n_samples=10_000, n_features=10)
Collaborator

Suggested change
-X, y = make_regression(n_samples=10_000, n_features=10)
+X, y = make_regression(n_samples=1_000, n_features=2)

Let's use a slightly smaller dataset, to make this a little faster and to reduce how much memory it uses.

Contributor Author

Sure, I changed the test to use a smaller example in 6ad5c49.

@RektPunk RektPunk requested a review from jameslamb December 17, 2024 06:56
@RektPunk
Contributor Author

Hi @jameslamb, just a quick follow-up on the review. I’ve addressed the comments—please let me know if there’s anything else I should look into. Thanks!

Collaborator

@jameslamb jameslamb left a comment

This looks great, thanks very much @RektPunk ! We really appreciate you picking up and fixing this bug 😊

@jameslamb jameslamb merged commit 60b0155 into microsoft:master Dec 22, 2024
48 checks passed
@RektPunk RektPunk deleted the feat/fix-refit branch December 22, 2024 22:53
Successfully merging this pull request may close these issues.

[python-package] Refit does not work for regression (Python API)
3 participants