-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R package] Accept data frames as inputs #4207
Conversation
Would be helpful to have access to the full package check log from the CI pipeline so as to know which specific test is failing and whether it's related to this PR or not. The only info I could extract from it is this:
Which doesn't where it failed. The failure seems to be only in R 3.6 so I suspect it could be related to modification of the global RNG state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for your interest in LightGBM and your efforts here. I quickly scanned this today and I've left a few comments suggesting some reductions in the scope of this PR.
We can provide a more thorough review once those changes are made.
label = NULL, | ||
weight = NULL, | ||
init_score = NULL, | ||
group = NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change and the corresponding changes below necessary to support the feature "Accept data frames as inputs"? Or do you just have a personal preference for explicit keyword arguments instead of the info
construct that exists in this class today?
If it's not strictly required to support this feature, please revert this change, make this PR work with the existing interface of lgb.Dataset
, and open an issue describing what you'd like to change about the interface to lgb.Dataset()
and how you think that proposed change improves LightGBM.
That will reduce the size of the diff here, which should allow us to provide a higher-quality review and bring this PR to a resolution more quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's necessary, as otherwise it wouldn't be possible to pass them as column names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise it wouldn't be possible to pass them as column names.
Ok, please revert any changes related to being able to pass these characteristics as column names. I'm happy to have a discussion about that in a separate feature request issue, but I don't think that change is strictly required to accomplish the behavior "accept data frames as inputs".
We have a strong preference in this project for incremental progress through as-small-as-possible pull requests focused on a single thing. That enables the team of maintainers here (mostly volunteers) to provide higher-quality reviews and prevents the situation where pull requests touching a lot of code drag on for a long time and block or complicate other development.
Managed to reproduce the failing test in R 3.6.3 in windows with RTools35. The failing test is among the newly introduced ones (L1750 in # check that it accepts data frames even when fitting to matrices
X <- mtcars[, -1L]
model_new <- lightgbm(data = as.matrix(X), label = y,
params = list(objective = "regression", min_data = 1L),
verbose = -1L)
pred <- predict(model_new, as.matrix(X))
pred_new <- predict(model_new, X)
expect_equal(pred, pred_new) What differs is the predictions as they come from the C++ function Will investigate further. |
@jameslamb I finally managed to find out the root cause behind the tests that are failing. It's actually just a coincidence that it failed here as there's a much bigger problem beneath. The issue is actually with the function In my setup, for the tests that fail, I get wildly different pointer addresses:
I suspect it might be related to R variables that have references to non-owned data, and the issue probably arises from trying to re-define the structs behind SEXPs instead of including the R headers, as those are not meant to be constant between R versions. The problem can be easily reproduced outside of this PR by something as simple as this: X <- as.matrix(df)
some_model <- lightgbm(...)
pred_before <- predict(some_model, X)
mode(X) <- "double"
pred_after <- predict(some_model, X) The two will not match as the second one will get a pointer to random data and will be incorrect. BTW this means that the minimum R version is actually >=4.0.0 as the replacements for R headers do not work correctly with 3.6. If the data were larger I think it might even be able to crash the R session by producing a segmentation fault. |
I'm willing to believe that there are some issues with the fact that LightGBM doesn't use R's headers directly. This is something we've been wanting to fix for a while, just haven't gotten to it yet. You can see #3016 for all of the relevant context. But I'm not totally convinced. Every build of this project includes 7 different CI jobs that test the R package on R 3.6.3, and those all use exactly the same unit tests as the R 4.x CI jobs. The R package's tests pretty thoroughly cover the package and check the correctness of results, and there aren't any conditions in the code or tests that are like "if R version < 4.0...". Are you able to provide a self-contained reproducible example on Are you very familiar with C++ and particularly R's headers? Would you be interested in trying to address #3016? |
Created an issue about it, and yes it's reproducible with the current version from master: #4216 |
I'm working right now on #3016 and #4216, which I think should resolve some of the issues you uncovered in this PR's tests. Thanks for all the information you've given us so far about those and for drawing my attention back to them (the R package has not gotten much new development recently). Once those issues are resolved, I'll ask you to update this PR to |
Ok @david-cortes , we've merged several changes over the last two weeks and #3016 is now resolved...LightGBM's custom R-to-C interface has been totally replaced with the builtins from Can you please update this PR with the latest changes from Thanks very much. |
@jameslamb It's ready for review. The failing tests are not from this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for the idea and the contributions!
I left a few comments related to code style. Please, wherever possible, try to match the existing style of the R package. There are limits to what we can automatically check for with {lintr}
, so I apologize for the fact that you have to receive some style recommendations as review comments.
I'm supportive of allowing users to bring data frames as the format for training data! I think it's a welcome and very useful addition. However, I believe this pull request is currently trying to do too much at once, and is difficult to review as a result. Could you please limit the scope to "accept all-numeric data frames as inputs"? That would allow us to really focus on the API and how to change or restructure the package internals.
Then these other features could be documented in feature requests and could be added in later PRs:
- automatically converting categorical-looking columns in LightGBM-specific way (including "remembering" factor levels)
- allowing users to use column names for
label
,weight
,init_score
, andgroup
instead of needing to break them out into separate vectors
data <- Dataset$public_methods$process_data_frame_columns( | ||
data, | ||
private$colnames, | ||
private$categorical_feature, | ||
private$factor_levels | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data <- Dataset$public_methods$process_data_frame_columns( | |
data, | |
private$colnames, | |
private$categorical_feature, | |
private$factor_levels | |
) | |
data <- Dataset$public_methods$process_data_frame_columns( | |
data = data | |
, colnames = private$colnames | |
, categorical_feature = private$categorical_feature | |
, factor_levels = private$factor_levels | |
) |
Please match the style in used in the rest of the R package, which is comma-first and closing parentheses vertically aligned with the beginning of the line.
And please use keyword arguments for any function calls to internal methods with more than one argument. This makes the code a bit easier to read and prevents bugs caused by mistakes in the order that arguments are given.
if (!is.null(label)) info[["label"]] <- label | ||
if (!is.null(weight)) info[["weight"]] <- weight | ||
if (!is.null(init_score)) info[["init_score"]] <- init_score | ||
if (!is.null(group)) info[["group"]] <- group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!is.null(label)) info[["label"]] <- label | |
if (!is.null(weight)) info[["weight"]] <- weight | |
if (!is.null(init_score)) info[["init_score"]] <- init_score | |
if (!is.null(group)) info[["group"]] <- group | |
if (!is.null(label)) { | |
info[["label"]] <- label | |
} | |
if (!is.null(weight)) { | |
info[["weight"]] <- weight | |
} | |
if (!is.null(init_score)) { | |
info[["init_score"]] <- init_score | |
} | |
if (!is.null(group)) { | |
info[["group"]] <- group | |
} |
Following the style in the rest of the R package, please use {}
for the expression in every control statement (for
, if
, while
) and do not perform assignments in one-liners.
if (!nrow(data) || !ncol(data)) | ||
stop("'data' is empty.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!nrow(data) || !ncol(data)) | |
stop("'data' is empty.") | |
if (!nrow(data) || !ncol(data)) { | |
stop("'data' is empty.") | |
} |
if (is.null(reference)) { | ||
|
||
# Factors are taken directly in data frames, so should not be supplied | ||
if (!is.null(categorical_feature)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree with this decision. If my data are already in integer or numeric format, this would mean that to tell {lightgbm}
to treat them as categorical I'd have to convert them to factors first, right? I don't support adding that friction into users' workflows.
I'm supportive of the "automatically treat factors as categorical features" proposal, but I'd prefer to have behavior like "if the input is a data.frame
and that data.frame
contains any factors, convert those factors to integers and add the column names to `categorical_feature".
if (!is.null(colnames)) | ||
stop("Cannot pass 'colnames' for data.frame. Column names will be taken from it directly.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!is.null(colnames)) | |
stop("Cannot pass 'colnames' for data.frame. Column names will be taken from it directly.") | |
if (!is.null(colnames)) { | |
stop("Cannot pass 'colnames' for data.frame. Column names will be taken from it directly.") | |
} |
data, | ||
reference$get_colnames(), | ||
reference$get_categorical_feature(), | ||
reference$get_factor_levels() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data, | |
reference$get_colnames(), | |
reference$get_categorical_feature(), | |
reference$get_factor_levels() | |
data = data, | |
colnames = reference$get_colnames(), | |
categorical_feature = reference$get_categorical_feature(), | |
factor_levels = reference$get_factor_levels() |
if (!is.null(colnames)) | ||
data <- data[, colnames, with = FALSE] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!is.null(colnames)) | |
data <- data[, colnames, with = FALSE] | |
if (!is.null(colnames)) { | |
data <- data[, colnames, with = FALSE] | |
} |
if (any(sapply(data, function(x) is.character(x) || is.factor(x)))) | ||
stop("'data' contains categorical columns, but 'reference' did not have encodings for them.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (any(sapply(data, function(x) is.character(x) || is.factor(x)))) | |
stop("'data' contains categorical columns, but 'reference' did not have encodings for them.") | |
if (any(sapply(data, function(x) is.character(x) || is.factor(x)))) { | |
stop("'data' contains categorical columns, but 'reference' did not have encodings for them.") | |
} |
Dataset$private_methods$substitute_from_df_cols( | ||
data, label, weight, NULL, | ||
substitute(label), substitute(weight), NULL, | ||
environment() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dataset$private_methods$substitute_from_df_cols( | |
data, label, weight, NULL, | |
substitute(label), substitute(weight), NULL, | |
environment() | |
) | |
Dataset$private_methods$substitute_from_df_cols( | |
data = data | |
, label = label | |
, weight = weight | |
, init_score = init_score | |
, label_name = substitute(label) | |
, weight_name = substitute(weight) | |
, init_score_name = substitute(init_score) | |
, env_where_to_substitute = environment() | |
) |
label = NULL, | ||
weight = NULL, | ||
init_score = NULL, | ||
group = NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise it wouldn't be possible to pass them as column names.
Ok, please revert any changes related to being able to pass these characteristics as column names. I'm happy to have a discussion about that in a separate feature request issue, but I don't think that change is strictly required to accomplish the behavior "accept data frames as inputs".
We have a strong preference in this project for incremental progress through as-small-as-possible pull requests focused on a single thing. That enables the team of maintainers here (mostly volunteers) to provide higher-quality reviews and prevents the situation where pull requests touching a lot of code drag on for a long time and block or complicate other development.
I'd prefer to leave the implementation of this idea to someone else, at least for the time being. I also do not think it'd be very useful (and I'd say it'd be counterintuitive from a user POV) to accept data frames without adding non-standard evaluation for column names and handling of factors, as usually the reason why one wants to pass data frames to a model is to let it handle the categorical features and use column names as labels/weights/etc. Otherwise it could be more easily done by adding an |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
This PR adds functionality for accepting data frames in the model fitting and prediction functions, keeping track of the column order and encodings of categorical features.
I wasn't very sure how to structure it as the relationships between the internal object types are not quite accommodating to the idea, so I decided to save metadata in the dataset class, which is then saved again in the booster class when it receives a dataset object with data frame metadata.
Additionally, it allows using a data frame column as label if passing a name that matches with it.
There was some issue with the hard-coded numbers in the tests, as it seems calling
inherits
inside an R6 method somehow modifies the global RNG status and the tests would fail if the same numbers were kept, since the randomly-generated data is now changed. I've modified them to match against what it ends up generating after these changes.BTW I'm not very sure if missing values in categorical features are meant to be passed as negative integers like in Python. Is this how it's supposed to be, or should it use R's own
NA
value?