-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R-package] Remove reshape
argument in predict
#4971
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for writing this up!
I'm convinced that {lightgbm}
should produce a (num_observations, num_classes)
matrix for multi-class classification, based on your examples of {randomForest}
and {ranger}
.
Instead of silently changing this behavior by altering a default, I think for the next release {lightgbm}
should just remove the reshape
argument from predict.lgb.Booster()
and Predictor$predict()
entirely. I cannot think of a reason that getting a vector would be preferable to getting a matrix like this:
LightGBM/R-package/demo/multiclass.R
Lines 59 to 63 in 60e72d5
# A (30x3) matrix with the predictions, use parameter reshape | |
# class1 class2 class3 | |
# obs1 obs1 obs1 | |
# obs2 obs2 obs2 | |
# .... .... .... |
Removing reshape
entirely would:
- make
{lightgbm}
's behavior more consistent with other statistical packages in R (including R's standard library) - break existing user code in a loud way with error that makes it clear what has change
- alleviate the inconsistency you've pointed out of the vector produced by
predict()
being in row-major order when R assumes column-major order - simplify the codebase
Before making any changes like that, let's get some opinions from others.
@Laurae2 @mayer79 @StrikerRUS @shiyu1994 @guolinke what do you think about the following proposed breaking changes for the R package?
- remove argument
reshape
inlgb.predict.Booster()
andPredictor$predict()
- if
reshape
is found in...
forlgb.predict.Booster()
, raise an informative error explaining that that parameter is no longer supported - for predictions like those from multi-class classification, where more than one number is produced per row in the input data, have
lgb.predict.Booster()
andPredictor$predict()
return a numeric matrix of shape(num_observations, num_classes)
I support this. What about similar LightGBM/python-package/lightgbm/basic.py Lines 777 to 778 in 417c732
|
reshape
argument in predict
reshape
argument in predict
Updated with a removal of the argument instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, and for being thorough by updating the docs and demos as well. I'm excited to see the prediction logic be simplified and become less-surprising at the same time 🎉
I think this will also reduce the complexity of the changes + tests in #4977.
I left a few suggestions below. Please also add a new unit test, at a minimum one like "predictions for multi-class classification are returned as a matrix with the expected dimensions". Every time user-facing behavior is changed, tests should be added to ensure that future development doesn't accidentally break that behavior.
R-package/R/lgb.Booster.R
Outdated
if ("reshape" %in% names(additional_params)) { | ||
warning("'reshape' argument is no longer supported.") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ("reshape" %in% names(additional_params)) { | |
warning("'reshape' argument is no longer supported.") | |
} | |
if ("reshape" %in% names(additional_params)) { | |
stop"'reshape' argument is no longer supported.") | |
} |
As mentioned in #4971 (review), please make this an error, not a warning, so that users of the package are notified of this change loudly, right here at the point where they need to change something.
For breaking changes like this, I have a strong preference for them being loud and close to the source of the error, to avoid the case where this shows up somewhere further downstream in users' code (e.g., in some subsetting logic that expected a vector and is now getting a matrix).
R-package/demo/multiclass.R
Outdated
|
||
# We can also get the predicted scores before the Sigmoid/Softmax application | ||
my_preds <- predict(model, test[, 1L:4L], rawscore = TRUE) | ||
|
||
# Raw score predictions as matrix instead of vector | ||
my_preds <- predict(model, test[, 1L:4L], rawscore = TRUE, reshape = TRUE) | ||
my_preds <- predict(model, test[, 1L:4L], rawscore = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is now identical to the one above it, since the only thing that was difference about it was the use of reshape = TRUE
. Please remove this line and its comment.
R-package/demo/multiclass.R
Outdated
|
||
# We can also get the leaf index | ||
my_preds <- predict(model, test[, 1L:4L], predleaf = TRUE) | ||
|
||
# Predict leaf index as matrix instead of vector | ||
my_preds <- predict(model, test[, 1L:4L], predleaf = TRUE, reshape = TRUE) | ||
my_preds <- predict(model, test[, 1L:4L], predleaf = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is now identical to the one above it, since the only thing that was difference about it was the use of reshape = TRUE
. Please remove this line and its comment.
Thanks for the quick feedback @StrikerRUS ! I'd love to see similar changes in the Python package (though not in this PR, please), but would want to do some research to see if |
I would really appreciate it! |
Updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look great to me. Thanks very much for the tests. I'm very happy to see this simplification in the prediction code.
@jmoralez , want a chance to review now that you're back?
@StrikerRUS I created #5115 to capture the work for making a similar change on the Python side. Thankfully the default there was to always reshape, so I think the change there will be less disruptive to users (since I expect that the most common behavior people want is for multi-class classification and other cases like |
Thanks again for the help, @david-cortes ! |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
ref #4968
When requesing a prediction type with several entries per row, the prediction function in the R package by default produces outputs as a 1-D vector in row-major order. This can be quite confusing and lead to user errors when used externally, especially since it is not documented that the output is row-major while R assumes that conversions to matrices are to be done assuming column-major order and uses vectors in column-major order in built-in operators such as
+
,*
,[
, or[<-
and functions such assweep
.What's more, base R's
stats
package and many core R packages for decision tree ensembles such asranger
orrandomForest
produce multi-output predictions as a matrix. Examples:This PR changes the default towards
reshape=TRUE
and adds a note about the storage order in the documentation.