Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R package] Accept data frames as inputs #4207

Closed

Conversation

david-cortes
Copy link
Contributor

@david-cortes david-cortes commented Apr 21, 2021

This PR adds functionality for accepting data frames in the model fitting and prediction functions, keeping track of the column order and encodings of categorical features.

I wasn't very sure how to structure it as the relationships between the internal object types are not quite accommodating to the idea, so I decided to save metadata in the dataset class, which is then saved again in the booster class when it receives a dataset object with data frame metadata.

Additionally, it allows using a data frame column as label if passing a name that matches with it.

There was some issue with the hard-coded numbers in the tests, as it seems calling inherits inside an R6 method somehow modifies the global RNG status and the tests would fail if the same numbers were kept, since the randomly-generated data is now changed. I've modified them to match against what it ends up generating after these changes.

BTW I'm not very sure if missing values in categorical features are meant to be passed as negative integers like in Python. Is this how it's supposed to be, or should it use R's own NA value?

@david-cortes
Copy link
Contributor Author

david-cortes commented Apr 21, 2021

Would be helpful to have access to the full package check log from the CI pipeline so as to know which specific test is failing and whether it's related to this PR or not. The only info I could extract from it is this:

2021-04-21T04:26:23.1469020Z * checking tests ...
2021-04-21T04:26:39.1305080Z   Running ‘testthat.R’ [16s/16s]
2021-04-21T04:26:39.1818810Z  � � � ERROR
2021-04-21T04:26:39.1875160Z Running the tests in ‘tests/testthat.R’ failed.
2021-04-21T04:26:39.1875810Z Last 13 lines of output:
2021-04-21T04:26:39.1876510Z   30/32 mismatches (average diff: 11.4)
2021-04-21T04:26:39.1877320Z   [1] 21.0 - 30.4 ==  -9.4
2021-04-21T04:26:39.1878520Z   [2] 21.0 - 30.4 ==  -9.4
2021-04-21T04:26:39.1879300Z   [3] 22.8 - 30.4 ==  -7.6
2021-04-21T04:26:39.1879920Z   [4] 21.4 - 30.4 ==  -9.0
2021-04-21T04:26:39.1880660Z   [5] 18.7 - 30.4 == -11.7
2021-04-21T04:26:39.1881450Z   [6] 18.1 - 30.4 == -12.3
2021-04-21T04:26:39.1882070Z   [7] 14.3 - 30.4 == -16.1
2021-04-21T04:26:39.1882670Z   [8] 24.4 - 30.4 ==  -6.0
2021-04-21T04:26:39.1883980Z   [9] 22.8 - 30.4 ==  -7.6
2021-04-21T04:26:39.1884310Z   ...
2021-04-21T04:26:39.1884580Z   
2021-04-21T04:26:39.1884910Z   [ FAIL 1 | WARN 0 | SKIP 3 | PASS 645 ]
2021-04-21T04:26:39.1885330Z   Error: Test failures
2021-04-21T04:26:39.1885710Z   Execution halted
2021-04-21T04:26:43.9557750Z * checking PDF version of manual ... � OK
2021-04-21T04:26:43.9562660Z * checking for detritus in the temp directory ... OK
2021-04-21T04:26:43.9567380Z * DONE
2021-04-21T04:26:43.9572480Z 
2021-04-21T04:26:43.9574120Z Status: 1 ERROR
2021-04-21T04:26:43.9576130Z See
2021-04-21T04:26:43.9577240Z   ‘/Users/runner/work/LightGBM/LightGBM/lightgbm.Rcheck/00check.log’
2021-04-21T04:26:43.9577820Z for details.

Which doesn't where it failed.

The failure seems to be only in R 3.6 so I suspect it could be related to modification of the global RNG state.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for your interest in LightGBM and your efforts here. I quickly scanned this today and I've left a few comments suggesting some reductions in the scope of this PR.

We can provide a more thorough review once those changes are made.

R-package/R/lgb.Dataset.R Outdated Show resolved Hide resolved
R-package/R/lgb.Dataset.R Outdated Show resolved Hide resolved
Comment on lines +35 to +38
label = NULL,
weight = NULL,
init_score = NULL,
group = NULL,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change and the corresponding changes below necessary to support the feature "Accept data frames as inputs"? Or do you just have a personal preference for explicit keyword arguments instead of the info construct that exists in this class today?

If it's not strictly required to support this feature, please revert this change, make this PR work with the existing interface of lgb.Dataset, and open an issue describing what you'd like to change about the interface to lgb.Dataset() and how you think that proposed change improves LightGBM.

That will reduce the size of the diff here, which should allow us to provide a higher-quality review and bring this PR to a resolution more quickly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's necessary, as otherwise it wouldn't be possible to pass them as column names.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise it wouldn't be possible to pass them as column names.

Ok, please revert any changes related to being able to pass these characteristics as column names. I'm happy to have a discussion about that in a separate feature request issue, but I don't think that change is strictly required to accomplish the behavior "accept data frames as inputs".

We have a strong preference in this project for incremental progress through as-small-as-possible pull requests focused on a single thing. That enables the team of maintainers here (mostly volunteers) to provide higher-quality reviews and prevents the situation where pull requests touching a lot of code drag on for a long time and block or complicate other development.

@david-cortes
Copy link
Contributor Author

Managed to reproduce the failing test in R 3.6.3 in windows with RTools35. The failing test is among the newly introduced ones (L1750 in test_basic.R):

  # check that it accepts data frames even when fitting to matrices
  X <- mtcars[, -1L]
  model_new <- lightgbm(data = as.matrix(X), label = y,
                        params = list(objective = "regression", min_data = 1L),
                        verbose = -1L)
  pred <- predict(model_new, as.matrix(X))
  pred_new <- predict(model_new, X)
  expect_equal(pred, pred_new)

What differs is the predictions as they come from the C++ function LGBM_BoosterPredictForMat_R. It's odd that the predictions differ, since the inputs to the function match exactly according to testthat.

Will investigate further.

@david-cortes
Copy link
Contributor Author

david-cortes commented Apr 22, 2021

@jameslamb I finally managed to find out the root cause behind the tests that are failing. It's actually just a coincidence that it failed here as there's a much bigger problem beneath.

The issue is actually with the function R_REAL_PTR in the C++ wrapper, which does not output the same result as the R C function REAL.

In my setup, for the tests that fail, I get wildly different pointer addresses:

  • REAL(data) = 0x145b88a8 (obtained by using R's own .Call with proper SEXP from the R headers)
  • R_REAL_PTR(data = 0x118edaa0

I suspect it might be related to R variables that have references to non-owned data, and the issue probably arises from trying to re-define the structs behind SEXPs instead of including the R headers, as those are not meant to be constant between R versions.

The problem can be easily reproduced outside of this PR by something as simple as this:

X <- as.matrix(df)
some_model <- lightgbm(...)
pred_before <- predict(some_model, X)
mode(X) <- "double"
pred_after <- predict(some_model, X)

The two will not match as the second one will get a pointer to random data and will be incorrect.

BTW this means that the minimum R version is actually >=4.0.0 as the replacements for R headers do not work correctly with 3.6. If the data were larger I think it might even be able to crash the R session by producing a segmentation fault.

@jameslamb
Copy link
Collaborator

BTW this means that the minimum R version is actually >=4.0.0 as the replacements for R headers do not work correctly with 3.6

I'm willing to believe that there are some issues with the fact that LightGBM doesn't use R's headers directly. This is something we've been wanting to fix for a while, just haven't gotten to it yet. You can see #3016 for all of the relevant context.

But I'm not totally convinced. Every build of this project includes 7 different CI jobs that test the R package on R 3.6.3, and those all use exactly the same unit tests as the R 4.x CI jobs. The R package's tests pretty thoroughly cover the package and check the correctness of results, and there aren't any conditions in the code or tests that are like "if R version < 4.0...".

Are you able to provide a self-contained reproducible example on master (not this branch) that demonstrates the bug that you think exists? If you can, we would very much appreciate an issue with that example and a description of your theory about what is wrong.


Are you very familiar with C++ and particularly R's headers? Would you be interested in trying to address #3016?

@david-cortes
Copy link
Contributor Author

BTW this means that the minimum R version is actually >=4.0.0 as the replacements for R headers do not work correctly with 3.6

I'm willing to believe that there are some issues with the fact that LightGBM doesn't use R's headers directly. This is something we've been wanting to fix for a while, just haven't gotten to it yet. You can see #3016 for all of the relevant context.

But I'm not totally convinced. Every build of this project includes 7 different CI jobs that test the R package on R 3.6.3, and those all use exactly the same unit tests as the R 4.x CI jobs. The R package's tests pretty thoroughly cover the package and check the correctness of results, and there aren't any conditions in the code or tests that are like "if R version < 4.0...".

Are you able to provide a self-contained reproducible example on master (not this branch) that demonstrates the bug that you think exists? If you can, we would very much appreciate an issue with that example and a description of your theory about what is wrong.

Are you very familiar with C++ and particularly R's headers? Would you be interested in trying to address #3016?

Created an issue about it, and yes it's reproducible with the current version from master: #4216

@jameslamb
Copy link
Collaborator

I'm working right now on #3016 and #4216, which I think should resolve some of the issues you uncovered in this PR's tests. Thanks for all the information you've given us so far about those and for drawing my attention back to them (the R package has not gotten much new development recently).

Once those issues are resolved, I'll ask you to update this PR to master and then I'll come give this a review.

@jameslamb
Copy link
Collaborator

Ok @david-cortes , we've merged several changes over the last two weeks and #3016 is now resolved...LightGBM's custom R-to-C interface has been totally replaced with the builtins from Rinternals.h. So I think the issues you mentioned in #4207 (comment) should be resolved.

Can you please update this PR with the latest changes from master, and @ me whenever you think it's ready for another review?

Thanks very much.

@david-cortes
Copy link
Contributor Author

@jameslamb It's ready for review. The failing tests are not from this PR.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for the idea and the contributions!

I left a few comments related to code style. Please, wherever possible, try to match the existing style of the R package. There are limits to what we can automatically check for with {lintr}, so I apologize for the fact that you have to receive some style recommendations as review comments.

I'm supportive of allowing users to bring data frames as the format for training data! I think it's a welcome and very useful addition. However, I believe this pull request is currently trying to do too much at once, and is difficult to review as a result. Could you please limit the scope to "accept all-numeric data frames as inputs"? That would allow us to really focus on the API and how to change or restructure the package internals.

Then these other features could be documented in feature requests and could be added in later PRs:

  • automatically converting categorical-looking columns in LightGBM-specific way (including "remembering" factor levels)
  • allowing users to use column names for label, weight, init_score, and group instead of needing to break them out into separate vectors

Comment on lines +526 to +531
data <- Dataset$public_methods$process_data_frame_columns(
data,
private$colnames,
private$categorical_feature,
private$factor_levels
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data <- Dataset$public_methods$process_data_frame_columns(
data,
private$colnames,
private$categorical_feature,
private$factor_levels
)
data <- Dataset$public_methods$process_data_frame_columns(
data = data
, colnames = private$colnames
, categorical_feature = private$categorical_feature
, factor_levels = private$factor_levels
)

Please match the style in used in the rest of the R package, which is comma-first and closing parentheses vertically aligned with the beginning of the line.

And please use keyword arguments for any function calls to internal methods with more than one argument. This makes the code a bit easier to read and prevents bugs caused by mistakes in the order that arguments are given.

Comment on lines +53 to +56
if (!is.null(label)) info[["label"]] <- label
if (!is.null(weight)) info[["weight"]] <- weight
if (!is.null(init_score)) info[["init_score"]] <- init_score
if (!is.null(group)) info[["group"]] <- group
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!is.null(label)) info[["label"]] <- label
if (!is.null(weight)) info[["weight"]] <- weight
if (!is.null(init_score)) info[["init_score"]] <- init_score
if (!is.null(group)) info[["group"]] <- group
if (!is.null(label)) {
info[["label"]] <- label
}
if (!is.null(weight)) {
info[["weight"]] <- weight
}
if (!is.null(init_score)) {
info[["init_score"]] <- init_score
}
if (!is.null(group)) {
info[["group"]] <- group
}

Following the style in the rest of the R package, please use {} for the expression in every control statement (for, if, while) and do not perform assignments in one-liners.

Comment on lines +72 to +73
if (!nrow(data) || !ncol(data))
stop("'data' is empty.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!nrow(data) || !ncol(data))
stop("'data' is empty.")
if (!nrow(data) || !ncol(data)) {
stop("'data' is empty.")
}

if (is.null(reference)) {

# Factors are taken directly in data frames, so should not be supplied
if (!is.null(categorical_feature))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with this decision. If my data are already in integer or numeric format, this would mean that to tell {lightgbm} to treat them as categorical I'd have to convert them to factors first, right? I don't support adding that friction into users' workflows.

I'm supportive of the "automatically treat factors as categorical features" proposal, but I'd prefer to have behavior like "if the input is a data.frame and that data.frame contains any factors, convert those factors to integers and add the column names to `categorical_feature".

Comment on lines +82 to +83
if (!is.null(colnames))
stop("Cannot pass 'colnames' for data.frame. Column names will be taken from it directly.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!is.null(colnames))
stop("Cannot pass 'colnames' for data.frame. Column names will be taken from it directly.")
if (!is.null(colnames)) {
stop("Cannot pass 'colnames' for data.frame. Column names will be taken from it directly.")
}

Comment on lines +130 to +133
data,
reference$get_colnames(),
reference$get_categorical_feature(),
reference$get_factor_levels()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data,
reference$get_colnames(),
reference$get_categorical_feature(),
reference$get_factor_levels()
data = data,
colnames = reference$get_colnames(),
categorical_feature = reference$get_categorical_feature(),
factor_levels = reference$get_factor_levels()

Comment on lines +505 to +506
if (!is.null(colnames))
data <- data[, colnames, with = FALSE]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!is.null(colnames))
data <- data[, colnames, with = FALSE]
if (!is.null(colnames)) {
data <- data[, colnames, with = FALSE]
}

Comment on lines +527 to +528
if (any(sapply(data, function(x) is.character(x) || is.factor(x))))
stop("'data' contains categorical columns, but 'reference' did not have encodings for them.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (any(sapply(data, function(x) is.character(x) || is.factor(x))))
stop("'data' contains categorical columns, but 'reference' did not have encodings for them.")
if (any(sapply(data, function(x) is.character(x) || is.factor(x)))) {
stop("'data' contains categorical columns, but 'reference' did not have encodings for them.")
}

Comment on lines +110 to +114
Dataset$private_methods$substitute_from_df_cols(
data, label, weight, NULL,
substitute(label), substitute(weight), NULL,
environment()
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Dataset$private_methods$substitute_from_df_cols(
data, label, weight, NULL,
substitute(label), substitute(weight), NULL,
environment()
)
Dataset$private_methods$substitute_from_df_cols(
data = data
, label = label
, weight = weight
, init_score = init_score
, label_name = substitute(label)
, weight_name = substitute(weight)
, init_score_name = substitute(init_score)
, env_where_to_substitute = environment()
)

Comment on lines +35 to +38
label = NULL,
weight = NULL,
init_score = NULL,
group = NULL,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise it wouldn't be possible to pass them as column names.

Ok, please revert any changes related to being able to pass these characteristics as column names. I'm happy to have a discussion about that in a separate feature request issue, but I don't think that change is strictly required to accomplish the behavior "accept data frames as inputs".

We have a strong preference in this project for incremental progress through as-small-as-possible pull requests focused on a single thing. That enables the team of maintainers here (mostly volunteers) to provide higher-quality reviews and prevents the situation where pull requests touching a lot of code drag on for a long time and block or complicate other development.

@david-cortes
Copy link
Contributor Author

I'd prefer to leave the implementation of this idea to someone else, at least for the time being.

I also do not think it'd be very useful (and I'd say it'd be counterintuitive from a user POV) to accept data frames without adding non-standard evaluation for column names and handling of factors, as usually the reason why one wants to pass data frames to a model is to let it handle the categorical features and use column names as labels/weights/etc. Otherwise it could be more easily done by adding an as.matrix(data) call at the beginning of the function.

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants