
[R-package] Error during basic_walkthrough.R example script #3583

Closed · tonyk7440 opened this issue Nov 21, 2020 · 6 comments · Fixed by #3598

@tonyk7440 (Contributor)

How are you using LightGBM?

LightGBM component: R package

Environment info

Operating System: Windows 10 Pro 1909

R version: 4.0.2

LightGBM version or commit hash: 3.1.0 via CRAN

Error message and / or logs

While stepping through the first example, basic_walkthrough.R, I encountered an error on line 155.

> # lgb.Dataset can also be saved using lgb.Dataset.save
> lgb.Dataset.save(dtrain, "dtrain.buffer")
[LightGBM] [Warning] File dtrain.buffer exists, cannot save binary to it
> # To load it in, simply call lgb.Dataset
> dtrain2 <- lgb.Dataset("dtrain.buffer")
> bst <- lgb.train(
+   data = dtrain2
+   , num_leaves = 4L
+   , learning_rate = 1.0
+   , nrounds = 2L
+   , valids = valids
+   , nthread = 2L
+   , objective = "binary"
+ )
Error in data$get_colnames() : 
  dim: cannot get dimensions before dataset has been constructed, please call lgb.Dataset.construct explicitly

Reproducible example(s)

library(lightgbm)
#> Warning: package 'lightgbm' was built under R version 4.0.3
#> Loading required package: R6
library(methods)

# We load in the agaricus dataset
# In this example, we are aiming to predict whether a mushroom is edible
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test

# The loaded data is stored in a sparse matrix, and the label is a numeric vector in {0,1}
class(train$label)
#> [1] "numeric"
class(train$data)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
# This is the basic usage of lightgbm: you can put a matrix in the data field
# Note: we are putting in a sparse matrix here; lightgbm naturally handles sparse input
# Use a sparse matrix when your features are sparse (e.g. when you are using one-hot encoded vectors)
print("Training lightgbm with sparseMatrix")
#> [1] "Training lightgbm with sparseMatrix"
bst <- lightgbm(
  data = train$data
  , label = train$label
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004075 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_logloss:0.198597"
#> [1] "[2]:  train's binary_logloss:0.111535"

# Alternatively, you can put in a dense matrix, i.e. a basic R matrix
print("Training lightgbm with Matrix")
#> [1] "Training lightgbm with Matrix"
bst <- lightgbm(
  data = as.matrix(train$data)
  , label = train$label
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004063 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_logloss:0.198597"
#> [1] "[2]:  train's binary_logloss:0.111535"

# You can also put in an lgb.Dataset object, which stores label, data and other metadata needed for advanced features
print("Training lightgbm with lgb.Dataset")
#> [1] "Training lightgbm with lgb.Dataset"
dtrain <- lgb.Dataset(
  data = train$data
  , label = train$label
)
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004049 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_logloss:0.198597"
#> [1] "[2]:  train's binary_logloss:0.111535"

# Verbose = 0,1,2
print("Train lightgbm with verbose 0, no message")
#> [1] "Train lightgbm with verbose 0, no message"
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
  , verbose = 0L
)
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004079 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.

print("Train lightgbm with verbose 1, print evaluation metric")
#> [1] "Train lightgbm with verbose 1, print evaluation metric"
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , nthread = 2L
  , objective = "binary"
  , verbose = 1L
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002496 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_logloss:0.198597"
#> [1] "[2]:  train's binary_logloss:0.111535"

print("Train lightgbm with verbose 2, also print information about tree")
#> [1] "Train lightgbm with verbose 2, also print information about tree"
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , nthread = 2L
  , objective = "binary"
  , verbose = 2L
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.930600
#> [LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.433362
#> [LightGBM] [Debug] init for col-wise cost 0.002159 seconds, init for row-wise cost 0.002144 seconds
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002577 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Debug] Using Sparse Multi-Val Bin
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [LightGBM] [Debug] Trained a tree with leaves = 4 and max_depth = 3
#> [1] "[1]:  train's binary_logloss:0.198597"
#> [LightGBM] [Debug] Trained a tree with leaves = 4 and max_depth = 3
#> [1] "[2]:  train's binary_logloss:0.111535"

# You can also specify data as a file path to a LibSVM/TSV/CSV format input
# Since we do not have this file with us, the following line is just for illustration
# bst <- lightgbm(
#     data = "agaricus.train.svm"
#     , num_leaves = 4L
#     , learning_rate = 1.0
#     , nrounds = 2L
#     , objective = "binary"
# )
# You can do prediction using the following line
# You can put in Matrix, sparseMatrix, or lgb.Dataset
pred <- predict(bst, test$data)
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
#> [1] "test-error= 0.0217256362507759"
# Save model to binary local file
lgb.save(bst, "lightgbm.model")

# Load binary model to R
bst2 <- lgb.load("lightgbm.model")
pred2 <- predict(bst2, test$data)

# pred2 should be identical to pred
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))
#> [1] "sum(abs(pred2-pred))= 0"
# To use advanced features, we need to put data in lgb.Dataset
dtrain <- lgb.Dataset(data = train$data, label = train$label, free_raw_data = FALSE)
dtest <- lgb.Dataset.create.valid(dtrain, data = test$data, label = test$label)
# valids is a list of lgb.Dataset objects, each of them tagged with a name
valids <- list(train = dtrain, test = dtest)

# To train with valids, use lgb.train, which contains more advanced features
# valids allows us to monitor the evaluation result on all data in the list
print("Train lightgbm using lgb.train with valids")
#> [1] "Train lightgbm using lgb.train with valids"
bst <- lgb.train(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , nthread = 2L
  , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007114 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_logloss:0.198597  test's binary_logloss:0.204754"
#> [1] "[2]:  train's binary_logloss:0.111535  test's binary_logloss:0.113096"

# We can change evaluation metrics, or use multiple evaluation metrics
print("Train lightgbm using lgb.train with valids, watch logloss and error")
#> [1] "Train lightgbm using lgb.train with valids, watch logloss and error"
bst <- lgb.train(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , eval = c("binary_error", "binary_logloss")
  , nthread = 2L
  , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002980 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_error:0.0304007  train's binary_logloss:0.198597  test's binary_error:0.0335196  test's binary_logloss:0.204754"
#> [1] "[2]:  train's binary_error:0.0222632  train's binary_logloss:0.111535  test's binary_error:0.0217256  test's binary_logloss:0.113096"

# lgb.Dataset can also be saved using lgb.Dataset.save
lgb.Dataset.save(dtrain, "dtrain.buffer")
#> [LightGBM] [Info] Saving data to binary file dtrain.buffer

# To load it in, simply call lgb.Dataset
dtrain2 <- lgb.Dataset("dtrain.buffer")
bst <- lgb.train(
  data = dtrain2
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , nthread = 2L
  , objective = "binary"
)
#> Error in data$get_colnames(): dim: cannot get dimensions before dataset has been constructed, please call lgb.Dataset.construct explicitly

# information can be extracted from lgb.Dataset using getinfo
label <- getinfo(dtest, "label")
pred <- predict(bst, test$data)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err))
#> [1] "test-error= 0.0217256362507759"

Created on 2020-11-21 by the reprex package (v0.3.0)

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_Ireland.1252  LC_CTYPE=English_Ireland.1252    LC_MONETARY=English_Ireland.1252
[4] LC_NUMERIC=C                     LC_TIME=English_Ireland.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lightgbm_3.1.0 R6_2.4.1      

loaded via a namespace (and not attached):
 [1] rstudioapi_0.11   knitr_1.30        whisker_0.4       magrittr_1.5      lattice_0.20-41   rlang_0.4.7      
 [7] tools_4.0.2       grid_4.0.2        data.table_1.13.0 xfun_0.18         tinytex_0.26      clipr_0.7.0      
[13] htmltools_0.5.0   ellipsis_0.3.1    digest_0.6.25     tibble_3.0.3      lifecycle_0.2.0   crayon_1.3.4     
[19] processx_3.4.4    Matrix_1.2-18     callr_3.4.4       ps_1.3.4          vctrs_0.3.4       fs_1.5.0         
[25] evaluate_0.14     rmarkdown_2.4     reprex_0.3.0      compiler_4.0.2    pillar_1.4.6      jsonlite_1.7.1   
[31] pkgconfig_2.0.3  

Steps to reproduce

  1. Install package from CRAN via install.packages("lightgbm")
  2. Run code from basic_walkthrough.R

I did some debugging and found that adding

lgb.Dataset.construct(dtrain2)

after the dataset is reloaded fixes the issue.
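
Concretely, the workaround looks like this (a minimal sketch reusing the objects from the reprex above):

# workaround: construct the reloaded Dataset explicitly before training
dtrain2 <- lgb.Dataset("dtrain.buffer")
lgb.Dataset.construct(dtrain2)  # binds the object to the saved binary file
bst <- lgb.train(
  data = dtrain2
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , nthread = 2L
  , objective = "binary"
)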

Would you like me to submit a pull request with the suggested fix above?

Lastly, many thanks for the great work on this package!

@guolinke (Collaborator)

@jameslamb any ideas about this? I think we had tested these demos on CRAN.

@jameslamb (Collaborator)

jameslamb commented Nov 23, 2020

Thanks for the thorough report and for using LightGBM @tonyk7440 !

I just installed from master on my Mac (running R 4.0.2) and I can reproduce the error you reported.

Would you like me to submit a pull request using this suggested fix for above?

I'd welcome a pull request, but calling that function in the demos isn't the right fix. lgb.train() should just take care of this for users. I think that this call to $construct() needs to be moved up further in the body of lgb.train():

# Construct datasets, if needed
data$construct()
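
For illustration, the change might look roughly like this (a simplified sketch, not the actual body of lgb.train(), which takes many more arguments and does much more validation):

# simplified sketch: construct the Dataset before anything queries it
lgb.train <- function(params = list(), data, nrounds = 10L, valids = list(), ...) {
  # a Dataset created from a file is lazy, so construct it first,
  # before calls like data$get_colnames() can fail
  data$construct()
  # ... remainder of the training logic ...
}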

I think it will be a little complicated, but the fix should be:

  1. Move that data$construct() call further up in the body of lgb.train(), before any code that asks the Dataset for its column names or dimensions.
  2. Make the same change in lgb.cv().
  3. Add a test to test_dataset.R that a Dataset saved to a file and read back in with lgb.Dataset() can be trained on right away, and then another test like that.

If you're interested and have the time this week, we'd welcome the contribution. If not, just let me know and I'll fix this.

I think we had tested these demos on CRAN.

The examples in the package are tested, but not these demos, so these demos definitely need some attention. Whenever we switch from demos to "vignettes" (#1944), they'll be tested by R CMD check.

@tonyk7440 (Contributor, Author)

@jameslamb OK, great, I will try what you suggested above! Hopefully I will have the pull request ready to review this week.

I am a little confused about your third bullet point, though. I presume you want to add a test to test_dataset.R to make sure that a dataset can be saved, then reloaded, and still trained on successfully (is that correct?). I'm not sure what you would like the second test to do?

@jameslamb (Collaborator)

Thanks!

that a dataset can be saved, then reloaded, and still trained on successfully (is that correct?)

The error you hit is because when you ran lgb.Dataset(a_file_name), the resulting object hasn't been constructed yet, and the data$construct() call in lgb.train() is never reached.
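
That lazy behavior is easy to see directly (a quick sketch using the buffer file from the report):

# nothing is read from disk until the Dataset is constructed
dtrain2 <- lgb.Dataset("dtrain.buffer")
# dim(dtrain2) at this point errors: "cannot get dimensions before dataset has been constructed"
lgb.Dataset.construct(dtrain2)  # explicitly bind to the binary file
dim(dtrain2)                    # now returns the dimensions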

So the test should look like this:

test_that("should be able to train immediately after using lgb.Dataset() on a file", {

  # test_data, test_label, and params are assumed to already be defined,
  # like the agaricus objects and params list used elsewhere in test_dataset.R
  dtest <- lgb.Dataset(test_data, label = test_label)
  tmp_file <- tempfile("lgb.Dataset_")
  lgb.Dataset.save(dtest, tmp_file)

  # read from a local file
  dtest_read_in <- lgb.Dataset(tmp_file)

  # should be able to train right away
  bst <- lgb.train(params, dtest_read_in)

  expect_true(lgb.is.Booster(bst))
})

@jameslamb (Collaborator)

Oh! I realize now that in my third bullet, I made a mistake. That was supposed to say "and then another test like that for lgb.cv()", sorry for the confusion!
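
Following that correction, the second test might look like this (a sketch following the same pattern, assuming the same test_data, test_label, and params objects):

test_that("should be able to run lgb.cv() immediately after using lgb.Dataset() on a file", {

  dtest <- lgb.Dataset(test_data, label = test_label)
  tmp_file <- tempfile("lgb.Dataset_")
  lgb.Dataset.save(dtest, tmp_file)

  # read from a local file
  dtest_read_in <- lgb.Dataset(tmp_file)

  # should be able to run cross-validation right away
  cv_bst <- lgb.cv(params, dtest_read_in, nfold = 3L)

  expect_true(methods::is(cv_bst, "lgb.CVBooster"))
})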

jameslamb added a commit that referenced this issue Dec 1, 2020
…3583) (#3598)

* construct dataset earlier in lgb.train and lgb.cv

* Update R-package/tests/testthat/test_dataset.R

Co-authored-by: James Lamb <[email protected]>

* Update R-package/R/lgb.cv.R

Co-authored-by: James Lamb <[email protected]>

* Update R-package/R/lgb.train.R

Co-authored-by: James Lamb <[email protected]>

* Update R-package/tests/testthat/test_dataset.R

Co-authored-by: James Lamb <[email protected]>

* fixing lint issues

* styling updates

* fix failing test

Co-authored-by: James Lamb <[email protected]>
@github-actions (bot)

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023