[R-package] lightgbm::lgb.model.dt.tree() error caused by lgb.dump() error with large models #6380

Open
p-schaefer opened this issue Mar 22, 2024 · 12 comments

Comments

@p-schaefer

Description

When models or data sets reach a certain level of complexity, lgb.dump() fails in R with: Error: R character strings are limited to 2^31-1 bytes.
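For a sense of scale, the limit in that message works out to just under 2 GiB for a single string. The check below uses only base R. The per-tree size is an assumed figure for illustration, not a measurement from LightGBM.

```r
# Maximum size of a single R character string, per the error message
max_bytes <- 2^31 - 1

# Same value as the largest integer R can represent
stopifnot(max_bytes == .Machine$integer.max)

# Expressed in GiB: just under 2
max_gib <- max_bytes / 1024^3

# Hypothetical: if each dumped tree needed about 5000 bytes of JSON
# (an assumption, not a measured value), this many trees would fit
assumed_bytes_per_tree <- 5000
max_trees <- floor(max_bytes / assumed_bytes_per_tree)

print(max_gib)
print(max_trees)
```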

Reproducible example

library(dplyr)
library(lightgbm)
library(nycflights13)

dt<-nycflights13::flights %>%
  mutate(origin=factor(origin),
         dest=factor(dest),
         carrier=factor(carrier)
  ) %>%
  select(-tailnum,-time_hour)

spt1<-round(nrow(dt)*(3/4))
spt2<-round(nrow(dt)*(1/4))
train<-head(dt,spt1)
test<-tail(dt,spt2)

dtrain <- lgb.Dataset(as.matrix(train[,colnames(train)!="arr_delay"]),
                      categorical_feature = c("origin","dest","carrier"),
                      label = train[,colnames(train)=="arr_delay"][[1]])

params <- list(
  objective = "regression"
  , metric = "l2"
  , min_data = 1L
  , learning_rate = 1.0
  , num_threads = 2L
  , max_cat_threshold = 2L
)

model <- lgb.train(
  params = params
  , data = dtrain
  , nrounds = 1000000L
)

json_model <- lightgbm::lgb.dump(model) # This will cause the error

A potential solution may be to dump the data directly to a temporary file, then stream in the data from the temporary file (https://rdrr.io/cran/jsonlite/man/stream_in.html)
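A minimal sketch of that idea, using only jsonlite. Note that stream_in() reads newline-delimited JSON (one record per line), so this assumes the dump could be written in that form. The `nodes` data frame is a hypothetical stand-in for per-node tree records, not actual LightGBM output.

```r
library(jsonlite)

# Hypothetical stand-in for per-node records; not LightGBM output
nodes <- data.frame(
  tree_index = rep(0:2, each = 2),
  split_gain = c(1.5, 0.7, 1.1, 0.4, 0.9, 0.2)
)

# Write one JSON record per line to a temporary file
tmp <- tempfile(fileext = ".ndjson")
stream_out(nodes, file(tmp), verbose = FALSE)

# Read the file back in small pages, so no single large string is created
pages <- list()
stream_in(
  file(tmp),
  handler = function(page) pages[[length(pages) + 1L]] <<- page,
  pagesize = 2,
  verbose = FALSE
)
restored <- do.call(rbind, pages)

# Clean up the temporary file
unlink(tmp)
```

Each page arrives as a small data frame, so peak memory depends on `pagesize` rather than on the size of the whole file.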

Environment info

Session info:
─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.3 (2024-02-29)
 os       Ubuntu 22.04.4 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Etc/UTC
 date     2024-03-22
 rstudio  2023.12.1+402 Ocean Storm (server)
 pandoc   3.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package        * version    date (UTC) lib source
 base64enc        0.1-3      2015-07-28 [3] CRAN (R 4.0.2)
 bslib            0.6.1      2023-11-28 [2] CRAN (R 4.3.2)
 cachem           1.0.8      2023-05-01 [2] CRAN (R 4.3.0)
 callr            3.7.5      2024-02-19 [2] CRAN (R 4.3.2)
 cellranger       1.1.0      2016-07-27 [3] CRAN (R 4.0.1)
 class            7.3-22     2023-05-03 [4] CRAN (R 4.3.1)
 classInt         0.4-10     2023-09-05 [2] CRAN (R 4.3.1)
 cli              3.6.2      2023-12-11 [2] CRAN (R 4.3.2)
 codetools        0.2-19     2023-02-01 [4] CRAN (R 4.2.2)
 colorspace       2.1-0      2023-01-23 [1] CRAN (R 4.3.0)
 cowplot          1.1.3      2024-01-22 [2] CRAN (R 4.3.2)
 crosstalk        1.2.1      2023-11-23 [2] CRAN (R 4.3.2)
 DALEX            2.4.3      2023-01-15 [2] CRAN (R 4.2.3)
 data.table       1.15.2     2024-02-29 [2] CRAN (R 4.3.2)
 datamods         1.4.5      2024-02-28 [2] CRAN (R 4.3.2)
 DBI              1.2.2      2024-02-16 [2] CRAN (R 4.3.2)
 dbplyr           2.5.0      2024-03-19 [2] CRAN (R 4.3.3)
 digest           0.6.35     2024-03-11 [2] CRAN (R 4.3.3)
 dplyr          * 1.1.4      2023-11-17 [2] CRAN (R 4.3.2)
 e1071            1.7-14     2023-12-06 [2] CRAN (R 4.3.2)
 ellipsis         0.3.2      2021-04-29 [3] CRAN (R 4.1.1)
 esquisse         1.2.0      2024-01-10 [2] CRAN (R 4.3.2)
 evaluate         0.23       2023-11-01 [2] CRAN (R 4.3.1)
 extrafont        0.19       2023-01-18 [1] CRAN (R 4.3.0)
 extrafontdb      1.0        2012-06-11 [1] CRAN (R 4.3.0)
 fansi            1.0.6      2023-12-08 [2] CRAN (R 4.3.2)
 fastmap          1.1.1      2023-02-24 [3] CRAN (R 4.2.2)
 forcats        * 1.0.0      2023-01-29 [3] CRAN (R 4.2.2)
 fs               1.6.3      2023-07-20 [3] CRAN (R 4.3.1)
 generics         0.1.3      2022-07-05 [2] CRAN (R 4.2.3)
 ggiraph          0.8.9      2024-02-24 [2] CRAN (R 4.3.2)
 ggiraphExtra     0.3.0      2020-10-06 [2] CRAN (R 4.3.2)
 ggplot2        * 3.5.0      2024-02-23 [2] CRAN (R 4.3.2)
 ggrepel          0.9.5      2024-01-10 [2] CRAN (R 4.3.2)
 glue             1.7.0      2024-01-09 [2] CRAN (R 4.3.2)
 gridExtra        2.3        2017-09-09 [2] CRAN (R 4.2.3)
 gtable           0.3.4      2023-08-21 [2] CRAN (R 4.3.1)
 hardhat          1.3.1      2024-02-02 [2] CRAN (R 4.3.2)
 here             1.0.1      2020-12-13 [2] CRAN (R 4.2.3)
 hms              1.1.3      2023-03-21 [3] CRAN (R 4.2.3)
 htmltools        0.5.7      2023-11-03 [2] CRAN (R 4.3.1)
 htmlwidgets      1.6.4      2023-12-06 [2] CRAN (R 4.3.2)
 httpuv           1.6.14     2024-01-26 [2] CRAN (R 4.3.2)
 httr             1.4.7      2023-08-15 [2] CRAN (R 4.3.1)
 iBreakDown       2.1.2      2023-12-01 [2] CRAN (R 4.3.2)
 insight          0.19.9     2024-03-15 [2] CRAN (R 4.3.3)
 jquerylib        0.1.4      2021-04-26 [3] CRAN (R 4.1.2)
 jsonlite         1.8.8      2023-12-04 [1] CRAN (R 4.3.2)
 KernSmooth       2.23-22    2023-07-10 [4] CRAN (R 4.3.1)
 knitr            1.45       2023-10-30 [2] CRAN (R 4.3.1)
 later            1.3.2      2023-12-06 [2] CRAN (R 4.3.2)
 lattice          0.22-5     2023-10-24 [4] CRAN (R 4.3.1)
 lazyeval         0.2.2      2019-03-15 [2] CRAN (R 4.2.3)
 leafem           0.2.3      2023-09-17 [2] CRAN (R 4.3.2)
 leaflet          2.2.1      2023-11-13 [2] CRAN (R 4.3.1)
 lifecycle        1.0.4      2023-11-07 [2] CRAN (R 4.3.1)
 lightgbm       * 4.3.0      2024-01-18 [2] CRAN (R 4.3.2)
 lubridate      * 1.9.3      2023-09-27 [2] CRAN (R 4.3.1)
 magrittr         2.0.3      2022-03-30 [2] CRAN (R 4.2.3)
 mapview          2.11.2     2023-10-13 [2] CRAN (R 4.3.1)
 MASS             7.3-60.0.1 2024-01-13 [4] CRAN (R 4.3.2)
 Matrix           1.6-5      2024-01-11 [2] CRAN (R 4.3.2)
 mgcv             1.9-1      2023-12-21 [4] CRAN (R 4.3.2)
 mime             0.12       2021-09-28 [3] CRAN (R 4.2.0)
 munsell          0.5.0      2018-06-12 [2] CRAN (R 4.2.3)
 mycor            0.1.1      2018-04-10 [2] CRAN (R 4.3.2)
 NADA             1.6-1.1    2020-03-22 [2] CRAN (R 4.3.1)
 nlme             3.1-163    2023-08-09 [4] CRAN (R 4.3.1)
 nycflights13   * 1.0.2      2021-04-12 [1] CRAN (R 4.3.3)
 openxlsx         4.2.5.2    2023-02-06 [2] CRAN (R 4.2.3)
 parsnip          1.2.0      2024-02-16 [2] CRAN (R 4.3.2)
 phosphoricons    0.2.0      2023-05-17 [2] CRAN (R 4.3.1)
 pillar           1.9.0      2023-03-22 [2] CRAN (R 4.2.3)
 pkgconfig        2.0.3      2019-09-22 [2] CRAN (R 4.2.3)
 plotly           4.10.4     2024-01-13 [2] CRAN (R 4.3.2)
 plyr             1.8.9      2023-10-02 [2] CRAN (R 4.3.1)
 png              0.1-8      2022-11-29 [2] CRAN (R 4.2.3)
 pool             1.0.3      2024-02-14 [2] CRAN (R 4.3.2)
 ppcor            1.1        2015-12-03 [2] CRAN (R 4.3.2)
 processx         3.8.4      2024-03-16 [2] CRAN (R 4.3.3)
 promises         1.2.1      2023-08-10 [2] CRAN (R 4.3.1)
 proxy            0.4-27     2022-06-09 [2] CRAN (R 4.2.3)
 ps               1.7.6      2024-01-18 [2] CRAN (R 4.3.2)
 purrr          * 1.0.2      2023-08-10 [2] CRAN (R 4.3.1)
 R6               2.5.1      2021-08-19 [2] CRAN (R 4.2.3)
 raster           3.6-26     2023-10-14 [2] CRAN (R 4.3.1)
 RColorBrewer     1.1-3      2022-04-03 [2] CRAN (R 4.2.3)
 Rcpp             1.0.12     2024-01-09 [1] CRAN (R 4.3.2)
 reactable        0.4.4      2023-03-12 [2] CRAN (R 4.3.1)
 readr          * 2.1.5      2024-01-10 [2] CRAN (R 4.3.2)
 readxl           1.4.3      2023-07-06 [1] CRAN (R 4.3.0)
 reprex           2.1.0      2024-01-11 [2] CRAN (R 4.3.2)
 reshape2         1.4.4      2020-04-09 [2] CRAN (R 4.3.1)
 rio              1.0.1      2023-09-19 [2] CRAN (R 4.3.1)
 rlang            1.1.3      2024-01-10 [2] CRAN (R 4.3.2)
 rmarkdown        2.26       2024-03-05 [2] CRAN (R 4.3.2)
 rpivotTable      0.3.0      2018-01-30 [2] CRAN (R 4.3.1)
 rprojroot        2.0.4      2023-11-05 [2] CRAN (R 4.3.1)
 rstudioapi       0.15.0     2023-07-07 [1] CRAN (R 4.3.0)
 Rttf2pt1         1.3.12     2023-01-22 [1] CRAN (R 4.3.0)
 sass             0.4.9      2024-03-15 [2] CRAN (R 4.3.3)
 satellite        1.0.5      2024-02-10 [2] CRAN (R 4.3.2)
 scales           1.3.0      2023-11-28 [2] CRAN (R 4.3.2)
 sessioninfo      1.2.2      2021-12-06 [2] CRAN (R 4.2.3)
 sf               1.0-15     2023-12-18 [2] CRAN (R 4.3.2)
 shiny            1.8.0      2023-11-17 [2] CRAN (R 4.3.2)
 shinybusy        0.3.3      2024-03-09 [2] CRAN (R 4.3.3)
 shinyWidgets     0.8.3      2024-03-21 [2] CRAN (R 4.3.3)
 sjlabelled       1.2.0      2022-04-10 [2] CRAN (R 4.3.2)
 sjmisc           2.8.9      2021-12-03 [2] CRAN (R 4.3.2)
 sp               2.1-3      2024-01-30 [2] CRAN (R 4.3.2)
 stringi          1.8.3      2023-12-11 [2] CRAN (R 4.3.2)
 stringr        * 1.5.1      2023-11-14 [2] CRAN (R 4.3.2)
 survival         3.5-8      2024-02-14 [4] CRAN (R 4.3.3)
 systemfonts      1.0.6      2024-03-07 [2] CRAN (R 4.3.3)
 tibble         * 3.2.1      2023-03-20 [2] CRAN (R 4.2.3)
 tidyr          * 1.3.1      2024-01-24 [2] CRAN (R 4.3.2)
 tidyselect       1.2.1      2024-03-11 [2] CRAN (R 4.3.3)
 tidyverse      * 2.0.0      2023-02-22 [2] CRAN (R 4.2.3)
 timechange       0.3.0      2024-01-18 [2] CRAN (R 4.3.2)
 treeshap         0.3.1      2024-01-22 [2] CRAN (R 4.3.2)
 tzdb             0.4.0      2023-05-12 [3] CRAN (R 4.3.0)
 units            0.8-5      2023-11-28 [2] CRAN (R 4.3.2)
 utf8             1.2.4      2023-10-22 [2] CRAN (R 4.3.1)
 uuid             1.2-0      2024-01-14 [2] CRAN (R 4.3.2)
 vctrs            0.6.5      2023-12-01 [1] CRAN (R 4.3.2)
 viridisLite      0.4.2      2023-05-02 [2] CRAN (R 4.3.0)
 withr            3.0.0      2024-01-16 [2] CRAN (R 4.3.2)
 writexl          1.5.0      2024-02-09 [2] CRAN (R 4.3.2)
 xfun             0.42       2024-02-08 [2] CRAN (R 4.3.2)
 xgboost          1.7.7.1    2024-01-25 [2] CRAN (R 4.3.2)
 xtable           1.8-4      2019-04-21 [2] CRAN (R 4.2.3)
 yaml             2.3.8      2023-12-11 [2] CRAN (R 4.3.2)
 zip              2.3.1      2024-01-27 [2] CRAN (R 4.3.2)

 [1] /home/pschaefer/R/x86_64-pc-linux-gnu-library/4.3
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Additional Comments

@jameslamb
Collaborator

Thanks very much for the excellent write-up!!

I do wish this had been posted in the existing discussion we were having on the exact same topic at #6288, to not split the conversation. But now that we have this issue with a reproducible example, I'll close that one and we can focus here.

dump the data directly to a temporary file

Interesting idea! I hope we can avoid touching the filesystem to support larger models, if possible, since that can introduce its own set of problems (permissions errors, space issues, files being left behind, etc.).

Some combination of that and these other ideas might help here:

  • not using JSON in the middle of this operation (it's quite costly memory-wise because the keys that become column headers are repeated many times)
  • providing an entrypoint in LightGBM's C API with an iterator over chunks of the data instead of trying to dump it into an R string all at once
  • arranging the data into array format on the C/C++ side and creating an R data frame there
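To illustrate the first point about repeated keys, here is a small comparison using jsonlite. The column names and values are made up for the example. It serializes the same table once as row records (keys repeated for every row) and once as columns (each key written once).

```r
library(jsonlite)

set.seed(1)

# Hypothetical node table: 100 trees with 10 nodes each
nodes <- data.frame(
  tree_index = rep(0:99, each = 10),
  split_feature = sample(c("dep_time", "distance", "air_time"), 1000, replace = TRUE),
  split_gain = round(runif(1000), 3)
)

# Row-oriented JSON repeats every key for every record
rows_json <- toJSON(nodes, dataframe = "rows")

# Column-oriented JSON writes each key once, followed by an array of values
cols_json <- toJSON(nodes, dataframe = "columns")

print(nchar(rows_json))
print(nchar(cols_json))
```

The row-oriented string is the longer of the two, and the gap grows with the number of rows because the keys are written again for each record.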

Are you interested in working on this? If not, no worries... we appreciate the thorough write-up and you can subscribe to this issue to be notified if / when someone addresses it.

@mayer79
Contributor

mayer79 commented Mar 22, 2024

@jameslamb Maybe we can dump/parse single trees (or m=NULL trees) instead of the full model.

@p-schaefer
Author

Thanks everyone. Apologies for raising this in so many places. It seems there are some potential solutions on the table. Unfortunately, I'm not very familiar with C/C++, so I'm afraid I would be of little help there. If there is anything I can help with on the R or Python side, I'd be happy to. I could be mistaken, but it seems to me that a lot of this is handled on the C side.

I think @mayer79's suggestion would have utility in a lot of places, but in terms of computation time and overhead, either not using JSON or arranging the data into array format on the C/C++ side would probably be optimal.

@jameslamb
Collaborator

Maybe we can dump/parse single trees (or m=NULL trees) instead of the full model.

Yep! This is one specific version of the more general statement I made, "an iterator over chunks of the data".

Looking through the C API... the underlying API for dumping to JSON actually already supports iterating over ranges of trees 😁

LightGBM/src/c_api.cpp

Lines 2687 to 2689 in 28536a0

int LGBM_BoosterDumpModel(BoosterHandle handle,
int start_iteration,
int num_iteration,

So I think we can probably do this with 0 API changes.

@mayer79
Contributor

mayer79 commented Mar 31, 2024

@jameslamb: Very neat! I can work on this after #6364 is merged.

@jameslamb
Collaborator

Great thank you! I'd like to merge #6364 soon, but we're blocked until I can get some help with #6316 (comment)

@p-schaefer
Author

Has there been any progress on this? Is there anything I can do to help move this along?

@jameslamb
Collaborator

It's being worked on in #6397; you can subscribe there.

@mayer79
Contributor

mayer79 commented Jun 13, 2024

Note on the status:

We have finished #6397, and I have started to draft a solution for this issue, but it is not yet in the state I want it to have.

@jameslamb Is there a robust way to find out the number of boosting rounds in a model? I see there is a current_iter() method, but I think it does not work when a model is loaded from disk. The alternative is a while loop that keeps requesting chunks of iterations until a returned chunk contains fewer iterations than the chunk size:

Pseudo-code

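# R-style sketch. Assumes `model` is a trained booster, `chunk_size` is set,
# and lgb.model.dt.tree() accepts start_iteration / num_iteration.
m <- 1L
list_of_chunks <- list()
finished <- FALSE
while (!finished) {
  chunk <- lgb.model.dt.tree(model, start_iteration = m, num_iteration = chunk_size)
  list_of_chunks[[length(list_of_chunks) + 1L]] <- chunk
  m <- m + chunk_size
  # Fewer trees than requested means the last chunk was reached
  # (assumes one tree per iteration)
  if (data.table::uniqueN(chunk$tree_index) < chunk_size) {
    finished <- TRUE
  }
}
all_iterations <- data.table::rbindlist(list_of_chunks)
# calculate importance from all_iterations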

@jameslamb
Collaborator

Is there a robust way to find out the number of boosting rounds in a model?

I believe you could call these methods in the C API:

int LGBM_BoosterNumberOfTotalModel(BoosterHandle handle, int* out_models) {

int LGBM_BoosterNumModelPerIteration(BoosterHandle handle, int* out_tree_per_iteration) {

And the number of boosting rounds should be num_models / num_model_per_iteration. I know that using "model" to mean "tree" there differs from much of LightGBM's R and Python APIs, but that's what those methods mean:

/*! \brief Trained models(trees) */
std::vector<std::unique_ptr<Tree>> models_;
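As a worked example of that division, here is a small base R helper. The function name `num_boosting_rounds` is hypothetical and is not part of the lightgbm package. The inputs stand in for the values the two C API calls would return.

```r
# Hypothetical helper: boosting rounds from total tree count and trees per round
num_boosting_rounds <- function(num_models, num_model_per_iteration) {
  stopifnot(
    num_model_per_iteration > 0,
    num_models %% num_model_per_iteration == 0
  )
  num_models %/% num_model_per_iteration
}

# Regression: one tree per round, so 1000 trees means 1000 rounds
rounds_regression <- num_boosting_rounds(1000L, 1L)

# Multiclass with 3 classes: one tree per class per round,
# so 300 trees means 100 rounds
rounds_multiclass <- num_boosting_rounds(300L, 3L)

print(rounds_regression)
print(rounds_multiclass)
```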

@mayer79
Contributor

mayer79 commented Jun 14, 2024

Ahh, actually, I once had a look at these functions in c_api.cpp!

We can add these and a num_iteration() method to the Booster class and use that to make clean chunks.
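Once the total number of iterations is known, the chunk boundaries can be computed up front instead of probing with a loop. A rough sketch in base R follows. `chunk_ranges` is a hypothetical helper, and it assumes start_iteration is 1-based as in the pseudo-code earlier in this thread.

```r
# Hypothetical helper: split n_iter iterations into chunks of at most chunk_size
chunk_ranges <- function(n_iter, chunk_size) {
  stopifnot(n_iter >= 1L, chunk_size >= 1L)
  # First iteration of each chunk (1-based)
  starts <- seq.int(from = 1L, to = n_iter, by = chunk_size)
  data.frame(
    start_iteration = starts,
    # The last chunk may be shorter than chunk_size
    num_iteration = pmin(chunk_size, n_iter - starts + 1L)
  )
}

ranges <- chunk_ranges(10L, 4L)
print(ranges)
```

Each row could then be passed as start_iteration / num_iteration to a per-chunk dump, with the results bound together at the end.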

Would you make two PRs? One that brings these methods into the R package and then another one that closes the issue?

@jameslamb
Collaborator

I think 2 PRs would be better, yes please.


3 participants