
[python-package] Fix inconsistency in predict() output shape for 1-tree models #6753

Merged · 16 commits · Dec 22, 2024
2 changes: 1 addition & 1 deletion python-package/lightgbm/basic.py
@@ -1248,7 +1248,7 @@ def predict(
         if pred_leaf:
             preds = preds.astype(np.int32)
         is_sparse = isinstance(preds, (list, scipy.sparse.spmatrix))
-        if not is_sparse and preds.size != nrow:
+        if not is_sparse and preds.size != nrow or pred_leaf:
Collaborator:
Should we add `or pred_contrib` here as well?

> for normal and raw score, its length is equal to num_class * num_data;
> for leaf index, its length is equal to num_class * num_data * num_iteration;
> for feature contributions, its length is equal to num_class * num_data * (num_feature + 1).
https://lightgbm.readthedocs.io/en/latest/C-API.html#c.LGBM_BoosterPredictForMat

Contributor Author:
I think it's unnecessary, since the shape is never used after pred_contrib=True. pred_leaf is necessary because of:

leaf_preds: np.ndarray = predictor.predict(  # type: ignore[assignment]
    data=data,
    start_iteration=-1,
    pred_leaf=True,
    validate_features=validate_features,
)
nrow, ncol = leaf_preds.shape

Collaborator:

> there's no place where the shape is used after pred_contrib=True.

@RektPunk Booster.predict() is an important part of LightGBM's public API. If it's producing incorrect results, that needs to be fixed.

We've set the expectation that .predict(X, pred_leaf=True), for example, returns a matrix with shape (X.shape[0], num_trees) for regression, and right now that is not true for a single-tree model.

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000, n_features=10)

dtrain = lgb.Dataset(X, label=y)
params = {
    "objective": "regression",
    "verbosity": -1,
}

lgb.train(params, dtrain, num_boost_round=1).predict(X, pred_leaf=True).shape
# (10000,)

lgb.train(params, dtrain, num_boost_round=2).predict(X, pred_leaf=True).shape
# (10000, 2)

Even if no other code inside lightgbm referenced the .shape attribute of that output, this would still be a bug and we'd still want to fix it... because any other code using lightgbm should be able to form a consistent expectation of the return shape, and not need to have a special case for 1-tree models.

> Should we add `or pred_contrib` here as well?

I think we should, for completeness, but the existing condition is technically already enough to get the correct output shape.

`preds.size != nrow` will always be true for `pred_contrib`, because the output contains one value per feature AND the Shapley base values.

In other words... it will always have `nrow * (num_features + 1)` elements, so the reshaping will always be done, so there's no bug here for `pred_contrib` like there is for `pred_leaf`.

So please, let's do the following:

  1. change this condition to `if not is_sparse and preds.size != nrow or (pred_leaf or pred_contrib)`
  2. add explicit tests checking the output shapes for pred_leaf and pred_contrib with 1-iteration and 2-iteration models (using the snippet I provided above as a reference)

@RektPunk if you'd prefer that I add those additional tests, let me know and I'll push to your branch.

Contributor Author:

Thanks for the quick review. As you suggested, the condition is changed in 088a2b9 and an explicit test is added in ba39a6f. I'm not sure about the best location for the newly added test, though.

Contributor Author:

I changed the condition to `if not is_sparse and (preds.size != nrow or pred_leaf or pred_contrib)`, since the previous one made a test fail: link.

Collaborator:

Ah great, thank you! Sorry for suggesting the wrong form... I never remember exactly how a condition like `if A and B or C or D` is evaluated in Python 😅

The parentheses make it easier to understand.
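As a pure-Python aside (my sketch, not code from the PR): `and` binds tighter than `or`, so `A and B or C` parses as `(A and B) or C`, which is why the unparenthesized suggestion misbehaved for sparse outputs:

```python
# "and" binds tighter than "or": `A and B or C` parses as `(A and B) or C`.
# Consider a sparse prediction result with pred_leaf=True:
is_sparse = True       # e.g. preds is a scipy sparse matrix
size_mismatch = False  # placeholder for preds.size != nrow
pred_leaf = True

wrong = not is_sparse and size_mismatch or pred_leaf    # (A and B) or C
fixed = not is_sparse and (size_mismatch or pred_leaf)  # A and (B or C)

print(wrong)  # True  -- would wrongly try to reshape a sparse result
print(fixed)  # False -- sparse results are correctly left alone
```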

Contributor Author:

No worries at all, thank you so much for your kind suggestion :)

         if preds.size % nrow == 0:
             preds = preds.reshape(nrow, -1)
         else:
9 changes: 9 additions & 0 deletions tests/python_package_test/test_engine.py
@@ -2307,6 +2307,15 @@ def test_refit():
assert err_pred > new_err_pred


def test_refit_with_one_tree():
    X, y = load_breast_cancer(return_X_y=True)
    lgb_train = lgb.Dataset(X, label=y)
    params = {"objective": "binary", "num_trees": 1, "verbosity": -1}
    model = lgb.train(params, lgb_train, num_boost_round=1)
Collaborator:

Suggested change:
-    params = {"objective": "binary", "num_trees": 1, "verbosity": -1}
-    model = lgb.train(params, lgb_train, num_boost_round=1)
+    params = {"objective": "binary", "verbosity": -1}
+    model = lgb.train(params, lgb_train, num_boost_round=1)

Passing both `num_trees` through params and the keyword argument `num_boost_round=1` is unnecessary... they mean the same thing. Let's just use one here.

Can you also please add tests for regression and multi-class classification? Note that #6737 was specifically about regression... we should make sure this fixes that issue for that specific case.

Contributor Author:

Sure, it is fixed in b6bb3c9, and a regression example is added in 46a0ddc. For the multi-class problem, since preds.size != nrow, I think there is no failure even without this fix.

Collaborator:

> For multi-class problem, since preds.size != nrow, I think there is no fail without this fix.

That's fine, but please add a multi-class classification test anyway. That'll help prevent us from introducing this bug for multi-class classification in the future.

    model_refit = model.refit(X, y)
    assert isinstance(model_refit, lgb.Booster)


def test_refit_dataset_params(rng):
    # check refit accepts dataset_params
    X, y = load_breast_cancer(return_X_y=True)