[R package] Accept data frames as inputs #4207

david-cortes · 2021-04-21T01:38:57Z

This PR adds functionality for accepting data frames in the model fitting and prediction functions, keeping track of the column order and encodings of categorical features.

I wasn't very sure how to structure it as the relationships between the internal object types are not quite accommodating to the idea, so I decided to save metadata in the dataset class, which is then saved again in the booster class when it receives a dataset object with data frame metadata.

Additionally, it allows using a data frame column as label if passing a name that matches with it.

There was some issue with the hard-coded numbers in the tests, as it seems calling inherits inside an R6 method somehow modifies the global RNG status and the tests would fail if the same numbers were kept, since the randomly-generated data is now changed. I've modified them to match against what it ends up generating after these changes.

BTW I'm not very sure if missing values in categorical features are meant to be passed as negative integers like in Python. Is this how it's supposed to be, or should it use R's own NA value?

…nt OSes

david-cortes · 2021-04-21T04:31:49Z

Would be helpful to have access to the full package check log from the CI pipeline so as to know which specific test is failing and whether it's related to this PR or not. The only info I could extract from it is this:

2021-04-21T04:26:23.1469020Z * checking tests ...
2021-04-21T04:26:39.1305080Z   Running ‘testthat.R’ [16s/16s]
2021-04-21T04:26:39.1818810Z  � � � ERROR
2021-04-21T04:26:39.1875160Z Running the tests in ‘tests/testthat.R’ failed.
2021-04-21T04:26:39.1875810Z Last 13 lines of output:
2021-04-21T04:26:39.1876510Z   30/32 mismatches (average diff: 11.4)
2021-04-21T04:26:39.1877320Z   [1] 21.0 - 30.4 ==  -9.4
2021-04-21T04:26:39.1878520Z   [2] 21.0 - 30.4 ==  -9.4
2021-04-21T04:26:39.1879300Z   [3] 22.8 - 30.4 ==  -7.6
2021-04-21T04:26:39.1879920Z   [4] 21.4 - 30.4 ==  -9.0
2021-04-21T04:26:39.1880660Z   [5] 18.7 - 30.4 == -11.7
2021-04-21T04:26:39.1881450Z   [6] 18.1 - 30.4 == -12.3
2021-04-21T04:26:39.1882070Z   [7] 14.3 - 30.4 == -16.1
2021-04-21T04:26:39.1882670Z   [8] 24.4 - 30.4 ==  -6.0
2021-04-21T04:26:39.1883980Z   [9] 22.8 - 30.4 ==  -7.6
2021-04-21T04:26:39.1884310Z   ...
2021-04-21T04:26:39.1884580Z   
2021-04-21T04:26:39.1884910Z   [ FAIL 1 | WARN 0 | SKIP 3 | PASS 645 ]
2021-04-21T04:26:39.1885330Z   Error: Test failures
2021-04-21T04:26:39.1885710Z   Execution halted
2021-04-21T04:26:43.9557750Z * checking PDF version of manual ... � OK
2021-04-21T04:26:43.9562660Z * checking for detritus in the temp directory ... OK
2021-04-21T04:26:43.9567380Z * DONE
2021-04-21T04:26:43.9572480Z 
2021-04-21T04:26:43.9574120Z Status: 1 ERROR
2021-04-21T04:26:43.9576130Z See
2021-04-21T04:26:43.9577240Z   ‘/Users/runner/work/LightGBM/LightGBM/lightgbm.Rcheck/00check.log’
2021-04-21T04:26:43.9577820Z for details.

Which doesn't where it failed.

The failure seems to be only in R 3.6 so I suspect it could be related to modification of the global RNG state.

jameslamb

Thanks very much for your interest in LightGBM and your efforts here. I quickly scanned this today and I've left a few comments suggesting some reductions in the scope of this PR.

We can provide a more thorough review once those changes are made.

R-package/R/lgb.Dataset.R

jameslamb · 2021-04-21T22:34:45Z

R-package/R/lgb.Dataset.R

+                          label = NULL,
+                          weight = NULL,
+                          init_score = NULL,
+                          group = NULL,


Is this change and the corresponding changes below necessary to support the feature "Accept data frames as inputs"? Or do you just have a personal preference for explicit keyword arguments instead of the info construct that exists in this class today?

If it's not strictly required to support this feature, please revert this change, make this PR work with the existing interface of lgb.Dataset, and open an issue describing what you'd like to change about the interface to lgb.Dataset() and how you think that proposed change improves LightGBM.

That will reduce the size of the diff here, which should allow us to provide a higher-quality review and bring this PR to a resolution more quickly.

It's necessary, as otherwise it wouldn't be possible to pass them as column names.

otherwise it wouldn't be possible to pass them as column names.

Ok, please revert any changes related to being able to pass these characteristics as column names. I'm happy to have a discussion about that in a separate feature request issue, but I don't think that change is strictly required to accomplish the behavior "accept data frames as inputs".

We have a strong preference in this project for incremental progress through as-small-as-possible pull requests focused on a single thing. That enables the team of maintainers here (mostly volunteers) to provide higher-quality reviews and prevents the situation where pull requests touching a lot of code drag on for a long time and block or complicate other development.

david-cortes · 2021-04-21T23:41:35Z

Managed to reproduce the failing test in R 3.6.3 in windows with RTools35. The failing test is among the newly introduced ones (L1750 in test_basic.R):

  # check that it accepts data frames even when fitting to matrices
  X <- mtcars[, -1L]
  model_new <- lightgbm(data = as.matrix(X), label = y,
                        params = list(objective = "regression", min_data = 1L),
                        verbose = -1L)
  pred <- predict(model_new, as.matrix(X))
  pred_new <- predict(model_new, X)
  expect_equal(pred, pred_new)

What differs is the predictions as they come from the C++ function LGBM_BoosterPredictForMat_R. It's odd that the predictions differ, since the inputs to the function match exactly according to testthat.

Will investigate further.

david-cortes · 2021-04-22T00:44:13Z

@jameslamb I finally managed to find out the root cause behind the tests that are failing. It's actually just a coincidence that it failed here as there's a much bigger problem beneath.

The issue is actually with the function R_REAL_PTR in the C++ wrapper, which does not output the same result as the R C function REAL.

In my setup, for the tests that fail, I get wildly different pointer addresses:

REAL(data) = 0x145b88a8 (obtained by using R's own .Call with proper SEXP from the R headers)
R_REAL_PTR(data = 0x118edaa0

I suspect it might be related to R variables that have references to non-owned data, and the issue probably arises from trying to re-define the structs behind SEXPs instead of including the R headers, as those are not meant to be constant between R versions.

The problem can be easily reproduced outside of this PR by something as simple as this:

X <- as.matrix(df)
some_model <- lightgbm(...)
pred_before <- predict(some_model, X)
mode(X) <- "double"
pred_after <- predict(some_model, X)

The two will not match as the second one will get a pointer to random data and will be incorrect.

BTW this means that the minimum R version is actually >=4.0.0 as the replacements for R headers do not work correctly with 3.6. If the data were larger I think it might even be able to crash the R session by producing a segmentation fault.

jameslamb · 2021-04-22T00:59:02Z

BTW this means that the minimum R version is actually >=4.0.0 as the replacements for R headers do not work correctly with 3.6

I'm willing to believe that there are some issues with the fact that LightGBM doesn't use R's headers directly. This is something we've been wanting to fix for a while, just haven't gotten to it yet. You can see #3016 for all of the relevant context.

But I'm not totally convinced. Every build of this project includes 7 different CI jobs that test the R package on R 3.6.3, and those all use exactly the same unit tests as the R 4.x CI jobs. The R package's tests pretty thoroughly cover the package and check the correctness of results, and there aren't any conditions in the code or tests that are like "if R version < 4.0...".

Are you able to provide a self-contained reproducible example on master (not this branch) that demonstrates the bug that you think exists? If you can, we would very much appreciate an issue with that example and a description of your theory about what is wrong.

Are you very familiar with C++ and particularly R's headers? Would you be interested in trying to address #3016?

david-cortes · 2021-04-22T01:00:56Z

BTW this means that the minimum R version is actually >=4.0.0 as the replacements for R headers do not work correctly with 3.6

I'm willing to believe that there are some issues with the fact that LightGBM doesn't use R's headers directly. This is something we've been wanting to fix for a while, just haven't gotten to it yet. You can see #3016 for all of the relevant context.

But I'm not totally convinced. Every build of this project includes 7 different CI jobs that test the R package on R 3.6.3, and those all use exactly the same unit tests as the R 4.x CI jobs. The R package's tests pretty thoroughly cover the package and check the correctness of results, and there aren't any conditions in the code or tests that are like "if R version < 4.0...".

Are you able to provide a self-contained reproducible example on master (not this branch) that demonstrates the bug that you think exists? If you can, we would very much appreciate an issue with that example and a description of your theory about what is wrong.

Are you very familiar with C++ and particularly R's headers? Would you be interested in trying to address #3016?

Created an issue about it, and yes it's reproducible with the current version from master: #4216

jameslamb · 2021-05-02T02:43:27Z

I'm working right now on #3016 and #4216, which I think should resolve some of the issues you uncovered in this PR's tests. Thanks for all the information you've given us so far about those and for drawing my attention back to them (the R package has not gotten much new development recently).

Once those issues are resolved, I'll ask you to update this PR to master and then I'll come give this a review.

jameslamb · 2021-05-12T21:53:56Z

Ok @david-cortes , we've merged several changes over the last two weeks and #3016 is now resolved...LightGBM's custom R-to-C interface has been totally replaced with the builtins from Rinternals.h. So I think the issues you mentioned in #4207 (comment) should be resolved.

Can you please update this PR with the latest changes from master, and @ me whenever you think it's ready for another review?

Thanks very much.

david-cortes · 2021-05-13T18:13:28Z

@jameslamb It's ready for review. The failing tests are not from this PR.

jameslamb

Thanks very much for the idea and the contributions!

I left a few comments related to code style. Please, wherever possible, try to match the existing style of the R package. There are limits to what we can automatically check for with {lintr}, so I apologize for the fact that you have to receive some style recommendations as review comments.

I'm supportive of allowing users to bring data frames as the format for training data! I think it's a welcome and very useful addition. However, I believe this pull request is currently trying to do too much at once, and is difficult to review as a result. Could you please limit the scope to "accept all-numeric data frames as inputs"? That would allow us to really focus on the API and how to change or restructure the package internals.

Then these other features could be documented in feature requests and could be added in later PRs:

automatically converting categorical-looking columns in LightGBM-specific way (including "remembering" factor levels)
allowing users to use column names for label, weight, init_score, and group instead of needing to break them out into separate vectors

jameslamb · 2021-05-13T20:15:29Z

R-package/R/lgb.Booster.R

+          data <- Dataset$public_methods$process_data_frame_columns(
+                    data,
+                    private$colnames,
+                    private$categorical_feature,
+                    private$factor_levels
+                  )


Suggested change

data <- Dataset$public_methods$process_data_frame_columns(

data,

private$colnames,

private$categorical_feature,

private$factor_levels

)

data <- Dataset$public_methods$process_data_frame_columns(

data = data

, colnames = private$colnames

, categorical_feature = private$categorical_feature

, factor_levels = private$factor_levels

)

Please match the style in used in the rest of the R package, which is comma-first and closing parentheses vertically aligned with the beginning of the line.

And please use keyword arguments for any function calls to internal methods with more than one argument. This makes the code a bit easier to read and prevents bugs caused by mistakes in the order that arguments are given.

jameslamb · 2021-05-13T20:17:17Z

R-package/R/lgb.Dataset.R

+      if (!is.null(label)) info[["label"]] <- label
+      if (!is.null(weight)) info[["weight"]] <- weight
+      if (!is.null(init_score)) info[["init_score"]] <- init_score
+      if (!is.null(group)) info[["group"]] <- group


Suggested change

if (!is.null(label)) info[["label"]] <- label

if (!is.null(weight)) info[["weight"]] <- weight

if (!is.null(init_score)) info[["init_score"]] <- init_score

if (!is.null(group)) info[["group"]] <- group

if (!is.null(label)) {

info[["label"]] <- label

}

if (!is.null(weight)) {

info[["weight"]] <- weight

}

if (!is.null(init_score)) {

info[["init_score"]] <- init_score

}

if (!is.null(group)) {

info[["group"]] <- group

}

Following the style in the rest of the R package, please use {} for the expression in every control statement (for, if, while) and do not perform assignments in one-liners.

jameslamb · 2021-05-13T20:17:43Z

R-package/R/lgb.Dataset.R

+        if (!nrow(data) || !ncol(data))
+          stop("'data' is empty.")


Suggested change

if (!nrow(data) || !ncol(data))

stop("'data' is empty.")

if (!nrow(data) || !ncol(data)) {

stop("'data' is empty.")

}

jameslamb · 2021-05-13T20:21:04Z

R-package/R/lgb.Dataset.R

+        if (is.null(reference)) {
+
+          # Factors are taken directly in data frames, so should not be supplied
+          if (!is.null(categorical_feature))


I don't agree with this decision. If my data are already in integer or numeric format, this would mean that to tell {lightgbm} to treat them as categorical I'd have to convert them to factors first, right? I don't support adding that friction into users' workflows.

I'm supportive of the "automatically treat factors as categorical features" proposal, but I'd prefer to have behavior like "if the input is a data.frame and that data.frame contains any factors, convert those factors to integers and add the column names to `categorical_feature".

jameslamb · 2021-05-13T20:22:16Z

R-package/R/lgb.Dataset.R

+          if (!is.null(colnames))
+            stop("Cannot pass 'colnames' for data.frame. Column names will be taken from it directly.")


Suggested change

if (!is.null(colnames))

stop("Cannot pass 'colnames' for data.frame. Column names will be taken from it directly.")

if (!is.null(colnames)) {

stop("Cannot pass 'colnames' for data.frame. Column names will be taken from it directly.")

}

jameslamb · 2021-05-13T20:23:35Z

R-package/R/lgb.Dataset.R

+            data,
+            reference$get_colnames(),
+            reference$get_categorical_feature(),
+            reference$get_factor_levels()


Suggested change

data,

reference$get_colnames(),

reference$get_categorical_feature(),

reference$get_factor_levels()

data = data,

colnames = reference$get_colnames(),

categorical_feature = reference$get_categorical_feature(),

factor_levels = reference$get_factor_levels()

jameslamb · 2021-05-13T20:24:03Z

R-package/R/lgb.Dataset.R

+      if (!is.null(colnames))
+        data <- data[, colnames, with = FALSE]


Suggested change

if (!is.null(colnames))

data <- data[, colnames, with = FALSE]

if (!is.null(colnames)) {

data <- data[, colnames, with = FALSE]

}

jameslamb · 2021-05-13T20:24:21Z

R-package/R/lgb.Dataset.R

+        if (any(sapply(data, function(x) is.character(x) || is.factor(x))))
+          stop("'data' contains categorical columns, but 'reference' did not have encodings for them.")


Suggested change

if (any(sapply(data, function(x) is.character(x) || is.factor(x))))

stop("'data' contains categorical columns, but 'reference' did not have encodings for them.")

if (any(sapply(data, function(x) is.character(x) || is.factor(x)))) {

stop("'data' contains categorical columns, but 'reference' did not have encodings for them.")

}

jameslamb · 2021-05-13T20:25:59Z

R-package/R/lgb.cv.R

+      Dataset$private_methods$substitute_from_df_cols(
+        data, label, weight, NULL,
+        substitute(label), substitute(weight), NULL,
+        environment()
+      )


Suggested change

Dataset$private_methods$substitute_from_df_cols(

data, label, weight, NULL,

substitute(label), substitute(weight), NULL,

environment()

)

Dataset$private_methods$substitute_from_df_cols(

data = data

, label = label

, weight = weight

, init_score = init_score

, label_name = substitute(label)

, weight_name = substitute(weight)

, init_score_name = substitute(init_score)

, env_where_to_substitute = environment()

)

jameslamb · 2021-05-13T20:30:55Z

R-package/R/lgb.Dataset.R

+                          label = NULL,
+                          weight = NULL,
+                          init_score = NULL,
+                          group = NULL,


otherwise it wouldn't be possible to pass them as column names.

Ok, please revert any changes related to being able to pass these characteristics as column names. I'm happy to have a discussion about that in a separate feature request issue, but I don't think that change is strictly required to accomplish the behavior "accept data frames as inputs".

We have a strong preference in this project for incremental progress through as-small-as-possible pull requests focused on a single thing. That enables the team of maintainers here (mostly volunteers) to provide higher-quality reviews and prevents the situation where pull requests touching a lot of code drag on for a long time and block or complicate other development.

david-cortes · 2021-05-13T23:36:34Z

I'd prefer to leave the implementation of this idea to someone else, at least for the time being.

I also do not think it'd be very useful (and I'd say it'd be counterintuitive from a user POV) to accept data frames without adding non-standard evaluation for column names and handling of factors, as usually the reason why one wants to pass data frames to a model is to let it handle the categorical features and use column names as labels/weights/etc. Otherwise it could be more easily done by adding an as.matrix(data) call at the beginning of the function.

github-actions · 2023-08-23T22:38:23Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

accept data frames as inputs

e398823

david-cortes requested review from jameslamb and Laurae2 as code owners April 21, 2021 01:38

david-cortes added 5 commits April 20, 2021 22:54

solve linter complains

7bd4ed6

attempt at solving issues with random seed reproducibility in differe…

f90304a

…nt OSes

more linter complains

a911c37

missing file

d6671df

fix error

08d8a34

jameslamb added the feature label Apr 21, 2021

jameslamb requested changes Apr 21, 2021

View reviewed changes

jameslamb added the in progress label Apr 21, 2021

revert corrections

9fb2ffa

david-cortes mentioned this pull request Apr 22, 2021

(R) Functions that replace R headers are not compatible with all R versions #4216

Closed

This was referenced Apr 22, 2021

[R-package] Move error handling into C++ side #4163

Merged

[R-package] fix warnings in unit tests #4225

Merged

solve merge conflict

cb55287

jameslamb requested changes May 13, 2021

View reviewed changes

david-cortes closed this May 13, 2021

jameslamb mentioned this pull request May 16, 2021

[RFC][R-package] Request: provide more idiomatic interface #4295

Open

jameslamb mentioned this pull request May 26, 2021

[R-package] Accept data frames as inputs #4323

Closed

8 tasks

github-actions bot removed the in progress label Aug 23, 2023

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R package] Accept data frames as inputs #4207

[R package] Accept data frames as inputs #4207

david-cortes commented Apr 21, 2021 •

edited

Loading

david-cortes commented Apr 21, 2021 •

edited

Loading

jameslamb left a comment

jameslamb Apr 21, 2021

david-cortes Apr 21, 2021

jameslamb May 13, 2021

david-cortes commented Apr 21, 2021

david-cortes commented Apr 22, 2021 •

edited

Loading

jameslamb commented Apr 22, 2021

david-cortes commented Apr 22, 2021

jameslamb commented May 2, 2021

jameslamb commented May 12, 2021

david-cortes commented May 13, 2021

jameslamb left a comment

jameslamb May 13, 2021

jameslamb May 13, 2021

jameslamb May 13, 2021

jameslamb May 13, 2021

jameslamb May 13, 2021

jameslamb May 13, 2021

jameslamb May 13, 2021

jameslamb May 13, 2021

jameslamb May 13, 2021

jameslamb May 13, 2021

david-cortes commented May 13, 2021

github-actions bot commented Aug 23, 2023

		if (!is.null(colnames))
		stop("Cannot pass 'colnames' for data.frame. Column names will be taken from it directly.")

-            data,
-            reference$get_colnames(),
-            reference$get_categorical_feature(),
-            reference$get_factor_levels()
+            data = data,
+            colnames = reference$get_colnames(),
+            categorical_feature = reference$get_categorical_feature(),
+            factor_levels = reference$get_factor_levels()

		if (!is.null(colnames))
		data <- data[, colnames, with = FALSE]

		if (any(sapply(data, function(x) is.character(x) \|\| is.factor(x))))
		stop("'data' contains categorical columns, but 'reference' did not have encodings for them.")

-      Dataset$private_methods$substitute_from_df_cols(
-        data, label, weight, NULL,
-        substitute(label), substitute(weight), NULL,
-        environment()
-      )
+    Dataset$private_methods$substitute_from_df_cols(
+      data = data
+      , label = label
+      , weight = weight
+      , init_score = init_score
+      , label_name = substitute(label)
+      , weight_name = substitute(weight)
+      , init_score_name = substitute(init_score)
+      , env_where_to_substitute = environment()
+    )

[R package] Accept data frames as inputs #4207

[R package] Accept data frames as inputs #4207

Conversation

david-cortes commented Apr 21, 2021 • edited Loading

david-cortes commented Apr 21, 2021 • edited Loading

jameslamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

david-cortes commented Apr 21, 2021

david-cortes commented Apr 22, 2021 • edited Loading

jameslamb commented Apr 22, 2021

david-cortes commented Apr 22, 2021

jameslamb commented May 2, 2021

jameslamb commented May 12, 2021

david-cortes commented May 13, 2021

jameslamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

david-cortes commented May 13, 2021

github-actions bot commented Aug 23, 2023

david-cortes commented Apr 21, 2021 •

edited

Loading

david-cortes commented Apr 21, 2021 •

edited

Loading

david-cortes commented Apr 22, 2021 •

edited

Loading