[R-package] lightgbm::lgb.model.dt.tree() error caused by lgb.dump() error with large models #6380
Comments
Thanks very much for the excellent write-up!! I do wish this had been posted in the existing discussion we were having on the exact same topic at #6288, to not split the conversation. But now that we have this issue with a reproducible example, I'll close that one and we can focus here.
Interesting idea! I hope we can avoid touching the filesystem to support larger models, if possible, since that can introduce its own set of problems (permissions errors, space issues, files being left behind, etc.). Some combination of that and the other ideas discussed above might help here.
Are you interested in working on this? If not, no worries... we appreciate the thorough write-up, and you can subscribe to this issue to be notified if / when someone addresses it.
@jameslamb Maybe we can dump/parse single trees (or m=NULL trees) instead of the full model.
Thanks everyone. Apologies for opening a new issue on this in so many places, but it seems like there are some potential solutions on the table. Unfortunately, I'm not very familiar with C/C++, so I'm afraid I would be of little help there. But if there is anything I can help with on the R or Python side, I'd be happy to. I could be mistaken, but it seems to me that a lot of this is handled on the C side. I think @mayer79's suggestion would have utility in a lot of places, but in terms of computation time and overhead, either not using JSON or arranging the data into array format on the C/C++ side would probably be optimal.
Yep! This is one specific version of the more general statement I made, "an iterator over chunks of the data". Looking through the C API... the underlying API for dumping to JSON actually already supports iterating over ranges of trees 😁 (see lines 2687 to 2689 of the C API header at commit 28536a0, presumably the `start_iteration` / `num_iteration` parameters of `LGBM_BoosterDumpModel`).
So I think we can probably do this with 0 API changes.
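To make that concrete, here is a rough sketch of what chunked dumping could look like from R. This is a hypothetical illustration, not the package's actual API: at the time of this discussion, `lgb.dump()` only accepted `num_iteration`, so the `start_iteration` argument used below is assumed to be passed through to the C function ("0 API changes" refers to the C API; the R wrapper would still need to expose the argument).

```r
# Hypothetical sketch: dump and parse the model in chunks of trees so that no
# single JSON string approaches R's 2^31 - 1 byte limit on character strings.
# The start_iteration argument of lgb.dump() is assumed here; only
# num_iteration existed in the R API at the time of this discussion.
dump_in_chunks <- function(model, chunk_size = 100L) {
  trees <- list()
  start <- 1L
  repeat {
    json <- lgb.dump(model, start_iteration = start, num_iteration = chunk_size)
    parsed <- jsonlite::fromJSON(json, simplifyVector = FALSE)
    # the JSON dump stores the per-tree structures under "tree_info"
    if (length(parsed$tree_info) == 0L) break
    trees <- c(trees, parsed$tree_info)
    start <- start + chunk_size
  }
  trees
}
```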
@jameslamb: Very neat! I can work on this after #6364 is merged.
Great, thank you! I'd like to merge #6364 soon, but we're blocked until I can get some help with #6316 (comment).
Has there been any progress on this? Is there anything I can do to help move this along?
It's being worked on in #6397; you can subscribe there.
Note on the status: we have finished #6397, and I have started to draft a solution for this issue, but it is not yet in the state I want it to be in. @jameslamb Is there a robust way to find out the number of boosting rounds in a model? I see there is a …

Pseudo-code (written here as runnable-style R; the `start_iteration` argument is the one proposed above and did not exist in `lgb.model.dt.tree()` at the time):

```r
m <- 1L
chunk_size <- 100L
list_of_chunks <- list()
finished <- FALSE
while (!finished) {
  chunk <- lgb.model.dt.tree(model, start_iteration = m, num_iteration = chunk_size)
  list_of_chunks[[length(list_of_chunks) + 1L]] <- chunk
  m <- m + chunk_size
  # fewer trees returned than requested means we reached the end of the model
  if (data.table::uniqueN(chunk$tree_index) < chunk_size) {
    finished <- TRUE
  }
}
all_iterations <- data.table::rbindlist(list_of_chunks)
# ... then calculate importance from all_iterations
```
I believe you could call these methods in the C API (line 2121 and line 2114 at commit ad1237d). And for the number of boosting rounds, see lines 545 to 546 at ad1237d.
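For what it's worth, on the R side the booster object already exposes the completed iteration count through the `lgb.Booster` R6 method `current_iter()`, which may be enough to bound the chunked loop above. A small sketch (whether this matches the C-side quantities referenced here is an assumption):

```r
# model is assumed to be a trained lgb.Booster
num_rounds <- model$current_iter()
# iterate over the model in chunks of 100 trees
chunk_starts <- seq.int(1L, num_rounds, by = 100L)
```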
Ahh, actually, I once had a look at these functions in c_api.cpp! We can add these and a … Would you make two PRs? One that brings these methods into the R package, and then another one that closes the issue?
I think 2 PRs would be better, yes please.
Description
When models or data sets reach a certain level of complexity, `lgb.dump()` will cause an error in R: `Error: R character strings are limited to 2^31-1 bytes`.

Reproducible example
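The original reproducible example was not captured here; the following is a hedged sketch of the kind of setup that can trigger the error. The sizes and parameters are illustrative only; the goal is simply a model whose JSON dump exceeds R's 2^31 - 1 byte limit on a single character string.

```r
library(lightgbm)

# illustrative sizes only; larger values may be needed to trigger the error
set.seed(1)
X <- matrix(rnorm(1e6 * 100), ncol = 100)
y <- rnorm(1e6)

dtrain <- lgb.Dataset(X, label = y)
model <- lgb.train(
  params = list(objective = "regression", num_leaves = 1024L)
  , data = dtrain
  , nrounds = 5000L
)

# with a sufficiently large model this fails with:
# Error: R character strings are limited to 2^31-1 bytes
tree_dt <- lgb.model.dt.tree(model)  # calls lgb.dump() internally
```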
A potential solution may be to dump the data directly to a temporary file, then stream the data in from that temporary file (https://rdrr.io/cran/jsonlite/man/stream_in.html).
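A hedged sketch of that idea, assuming the dump could be written straight to disk from the C++ side (which the R package does not currently expose). Note that `jsonlite::stream_in()` expects newline-delimited JSON, so the dump would either need to be written one tree per line or parsed with `jsonlite::fromJSON()` instead:

```r
json_path <- tempfile(fileext = ".json")
# hypothetical step: write the model's JSON dump straight to json_path from
# the C++ side, without materializing it as a single R character string
# write_model_json(model, json_path)  # not a real lightgbm function

# if each tree were written as one JSON object per line (NDJSON),
# jsonlite could then read it back incrementally:
trees <- jsonlite::stream_in(file(json_path))
unlink(json_path)
```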
Environment info
Session info:
Additional Comments