
[python-package] use 2d collections for predictions, grads and hess in multiclass custom objective #4925

Merged (7 commits) on Feb 23, 2022

Conversation

jmoralez (Collaborator) commented Jan 4, 2022:

Closes #4046.

This makes the predictions input for a custom objective be a (num_data, num_class) matrix and allows the user to return matrices of the same shape as gradients and hessians.
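
For illustration, a minimal sketch of a custom multiclass objective under the new 2-D convention (this is not code from the PR; the function name is hypothetical, and the derivatives are the standard softmax cross-entropy ones):

import numpy as np

def multiclass_objective(preds, train_data):
    # preds now arrive as a (num_data, num_class) matrix of raw scores
    labels = train_data.get_label().astype(int)
    # row-wise softmax
    exp = np.exp(preds - preds.max(axis=1, keepdims=True))
    prob = exp / exp.sum(axis=1, keepdims=True)
    # one-hot encoding of the true classes
    onehot = np.zeros_like(prob)
    onehot[np.arange(labels.size), labels] = 1.0
    # softmax cross-entropy derivatives, returned in the same 2-D shape
    grad = prob - onehot
    hess = 2.0 * prob * (1.0 - prob)  # the factor 2 mirrors LightGBM's built-in multiclass hessian
    return grad, hess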

@jmoralez jmoralez changed the title [python-package] use 2d collections for predictions, grads and hess in multiclass custom objective [WIP][python-package] use 2d collections for predictions, grads and hess in multiclass custom objective Feb 15, 2022
jmoralez (Collaborator, Author):

@StrikerRUS could you take a quick look at this? It's still missing the pandas collections (grads and hess as dataframes, etc.) and lists of lists; do you think they should be allowed as well? Since y_true and y_pred are numpy arrays, I don't know if there are many cases where a pandas collection or list would be returned as grad or hess.

StrikerRUS (Collaborator):

@jmoralez
Thanks for the ping!

It's still missing the pandas collections (grads and hess as dataframes, etc.) and lists of lists; do you think they should be allowed as well?

I don't have any objections to restricting the data types here to numpy arrays only, for the sake of a great codebase simplification. I don't see any problems with calling np.array(data) for lists and data.values for pandas, even if a user prefers to do their data manipulations with those types. And LightGBM 4.0 is a good time for such a breaking change.
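
(For illustration, a minimal sketch of the kind of user-side conversion being suggested; the helper name is hypothetical:)

import numpy as np
import pandas as pd

def as_numpy(data):
    # convert grad/hess built as pandas objects or (nested) lists to numpy arrays
    if isinstance(data, (pd.DataFrame, pd.Series)):
        return data.values
    return np.asarray(data)  # handles lists and lists of lists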

@jmoralez jmoralez changed the title [WIP][python-package] use 2d collections for predictions, grads and hess in multiclass custom objective [python-package] use 2d collections for predictions, grads and hess in multiclass custom objective Feb 17, 2022
@jmoralez jmoralez marked this pull request as ready for review February 17, 2022 04:19
Comment on lines +116 to +122
if grad.ndim == 2:  # multi-class
    num_data = grad.shape[0]
    if weight.size != num_data:
        raise ValueError("grad and hess should be of shape [n_samples, n_classes]")
    weight = weight.reshape(num_data, 1)
    grad *= weight
    hess *= weight
jmoralez (Collaborator, Author):

The grad and hess are weighted in the sklearn interface but they're not in basic; should we weight them there as well?
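
(For context: with the basic API the Dataset is passed into the objective, so users can apply the weights themselves. A hedged sketch, not from this PR; my_unweighted_fobj is a hypothetical helper:)

def weighted_fobj(preds, train_data):
    grad, hess = my_unweighted_fobj(preds, train_data)  # hypothetical helper
    w = train_data.get_weight()  # returns None if no weights were set
    if w is not None:
        w = w.reshape(-1, 1) if grad.ndim == 2 else w  # broadcast over classes
        grad *= w
        hess *= w
    return grad, hess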

Collaborator:

@guolinke Hey! Do you remember the reason for doing this?

Collaborator:

I think with the interfaces in basic.py, the weighting is ultimately done on the C++ side. I'll double-check why weighting is done directly here with the sklearn interfaces.

Collaborator:

@shiyu1994 You've merged this PR without resolving this conversation. Could you please share your findings about weighting derivatives here?

shiyu1994 (Collaborator) commented Feb 23, 2022:

Sorry, I did not notice that what we discussed above is a customized objective; I thought we were discussing LightGBM's native objectives. I just noticed that weights are not handled correctly with a customized objective function in the Python API. See the code below.

import numpy as np
import lightgbm as lgb

def fobj(preds, train_data):
    # squared-error gradient and constant hessian;
    # note that the dataset weights are never applied here
    labels = train_data.get_label()
    return preds - labels, np.ones_like(labels)

def test():
    np.random.seed(123)
    num_data = 10000
    num_feature = 100
    train_X = np.random.randn(num_data, num_feature)
    train_y = np.mean(train_X, axis=-1)
    valid_X = np.random.randn(num_data, num_feature)
    valid_y = np.mean(valid_X, axis=-1)
    weights = np.random.rand(num_data)
    train_data = lgb.Dataset(train_X, train_y, weight=weights)
    valid_data = lgb.Dataset(valid_X, valid_y)
    params = {
        "verbose": 2,
        "metric": "rmse",
        "learning_rate": 0.2,
        "num_trees": 20,
    }
    booster = lgb.train(train_set=train_data, valid_sets=[valid_data], valid_names=["valid"], params=params, fobj=fobj)

if __name__ == "__main__":
    test()

If we comment out the weights in the training dataset construction, the code produces exactly the same output as below.

[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Debug] init for col-wise cost 0.000012 seconds, init for row-wise cost 0.001697 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004134 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 25500
[LightGBM] [Info] Number of data points in the train set: 10000, number of used features: 100
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
[1]	valid's rmse: 0.100043
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 6
[2]	valid's rmse: 0.099099
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 8
[3]	valid's rmse: 0.0982311
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
[4]	valid's rmse: 0.0974867
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
[5]	valid's rmse: 0.0965613
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
[6]	valid's rmse: 0.0957191
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 8
[7]	valid's rmse: 0.0949163
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 6
[8]	valid's rmse: 0.0940159
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
[9]	valid's rmse: 0.0932777
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 8
[10]	valid's rmse: 0.0924858
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
[11]	valid's rmse: 0.0917661
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 8
[12]	valid's rmse: 0.0909356
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
[13]	valid's rmse: 0.0901323
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 8
[14]	valid's rmse: 0.0894671
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 8
[15]	valid's rmse: 0.0888048
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 8
[16]	valid's rmse: 0.0881257
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 8
[17]	valid's rmse: 0.0874723
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
[18]	valid's rmse: 0.0868133
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 8
[19]	valid's rmse: 0.0862182
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
[20]	valid's rmse: 0.0856057

We need a separate PR to fix this.

Collaborator:

BTW, I found an additional issue: the latest master branch does not produce any evaluation results in the log as above; I got this log with version 3.3.2 instead. This is another issue we need to investigate.

jmoralez (Collaborator, Author):

I just noticed that weights are not handled correctly with a customized objective function in the Python API.

Yes, that's what I noticed when I saw that in the scikit-learn interface grad and hess are weighted before boosting. I don't know if the reason is that in basic you get a Dataset, have access to the weights, and can apply them in the objective function yourself, while in sklearn you can't; if that's the case, it's worth mentioning in the docs.

The latest master branch does not produce any evaluation results in the log as above.

I believe this is because callbacks are now preferred (#4878); to log the evaluation you have to specify callbacks=[lgb.log_evaluation(1)].
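
(For reference, a minimal usage sketch of that callback, reusing the params/train_data/valid_data names from the script above; fobj= was the 3.x-era argument used there:)

import lightgbm as lgb

booster = lgb.train(
    params=params,
    train_set=train_data,
    valid_sets=[valid_data],
    valid_names=["valid"],
    fobj=fobj,
    callbacks=[lgb.log_evaluation(1)],  # print evaluation results every iteration
)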

@@ -3159,8 +3165,8 @@ def eval(self, data, name, feval=None):
     is_higher_better : bool
         Is eval result higher better, e.g. AUC is ``is_higher_better``.

-    For multi-class task, the preds is group by class_id first, then group by row_id.
-    If you want to get i-th row preds in j-th class, the access way is preds[j * num_data + i].
+    For multi-class task, preds are a [n_samples, n_classes] numpy 2-D array,
shiyu1994 (Collaborator):

Could you please also check that the customized evaluation function works correctly with multi-class? I've read the code, and it seems that the customized evaluation function will ultimately take the output of __inner_predict as input, which is of shape n_sample * n_class. This is inconsistent with the hint here.

feval_ret = eval_function(self.__inner_predict(data_idx), cur_data)

jmoralez (Collaborator, Author):

Hmm, you're right. I've only modified the portions required for fobj; I'll work on feval.

jmoralez (Collaborator, Author):

I moved the reshaping to __inner_predict in 5a56a30 so that it works in both places and added a test to check that we get the same result using the built-in log loss and computing it manually.
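
(As a rough illustration of what such a test checks, a sketch of a manual multiclass log-loss feval; it assumes preds arrive as a (num_data, num_class) array of probabilities for the built-in multiclass objective, and is not the PR's actual test code:)

import numpy as np

def manual_logloss(preds, eval_data):
    labels = eval_data.get_label().astype(int)
    eps = 1e-15
    # pick each row's predicted probability for its true class
    p = np.clip(preds[np.arange(labels.size), labels], eps, 1 - eps)
    return "manual_logloss", -np.mean(np.log(p)), False  # lower is better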

shiyu1994 (Collaborator):

@jmoralez Thank you for working on this! I just left a comment about the customized evaluation function. The other parts LGTM.

@@ -2999,6 +2998,9 @@ def update(self, train_set=None, fobj=None):
     if not self.__set_objective_to_none:
         self.reset_parameter({"objective": "none"}).__set_objective_to_none = True
     grad, hess = fobj(self.__inner_predict(0), self.train_set)
+    if self.num_model_per_iteration() > 1:
jmoralez (Collaborator, Author) commented Feb 23, 2022:

Is it safe to use _Booster__num_class here instead to avoid the lib call? I don't fully understand where __num_class gets converted to _Booster__num_class.

Collaborator:

Yes. It is safe, since Booster.__num_class comes from the lib call. See

_safe_call(_LIB.LGBM_BoosterGetNumClasses(
    self.handle,
    ctypes.byref(out_num_class)))
self.__num_class = out_num_class.value

and

out_num_class = ctypes.c_int(0)
_safe_call(_LIB.LGBM_BoosterGetNumClasses(
    self.handle,
    ctypes.byref(out_num_class)))
self.__num_class = out_num_class.value

jmoralez (Collaborator, Author):

I mean that the attribute changes name. I see it's used as self.__num_class in some places, but if I add a breakpoint at that line the object doesn't have that attribute; it has self._Booster__num_class instead, which is the part that confuses me. Do you think the performance impact of calling the lib on each iteration is noticeable, and that this should be changed to use the attribute instead?
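
(The renaming described here is Python's standard name mangling of double-underscore attributes; a standalone sketch, unrelated to LightGBM's actual code:)

class Booster:
    def __init__(self):
        self.__num_class = 3  # Python stores this as _Booster__num_class

b = Booster()
print(b._Booster__num_class)  # 3 -- the mangled name is the real attribute name
# b.__num_class would raise AttributeError outside the class body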

shiyu1994 (Collaborator) left a comment:

LGTM. Waiting for the CI tests to finish.

shiyu1994 (Collaborator) left a comment:

Approved, given that the CI tests pass.

github-actions (bot):

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

The github-actions bot locked this pull request as resolved and limited conversation to collaborators on Aug 23, 2023.

Successfully merging this pull request may close these issues.

[python-package] init_score and data structures in custom functions shape for multiclass classification