[feature] Streaming data allocation #3995
Comments
Also, I'd like to add automatic C++ CI tests for this. Do we already have C++ tests in the rest of the code? There are some low-level parts, such as this one, which would be useful to unit-test directly in C++. I was looking into it, and as of 2021 Google Test still seems to be one of the best solutions. The other alternative was Catch2, but even though it makes writing tests easier, apparently it limits debugger usage and has slower compile times. Any comments/preferences?
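For illustration, a minimal sketch of what a Google Test unit test for a low-level component like this could look like (the `ChunkedArray` name and its `add`/`get_add_count` methods follow the API discussed later in this thread; the include path is an assumption):

```cpp
#include <gtest/gtest.h>
#include <cstdint>
#include "chunked_array.hpp"  // assumed header for the proposed class

TEST(ChunkedArrayTest, TracksInsertedValueCount) {
  ChunkedArray<int32_t> arr(128);  // chunks of 128 elements each
  for (int32_t i = 0; i < 1000; ++i) {
    arr.add(i);  // crosses several chunk boundaries
  }
  EXPECT_EQ(arr.get_add_count(), 1000u);
}
```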
I'm definitely +1 for this feature! The described use case seems reasonable to me. As a "feature request for the feature request", I'd like to see this in the C API, so everyone will be able to use it in their codebases with LightGBM 🙂 . Speaking about C++ tests, unfortunately we don't have any right now 😞 . There is only a very old feature request for them: #261 (and #3841 as a sub-issue) and a very promising ongoing pull request which should help enable Google Test for the LightGBM repo on our CI services: #3555.
+1, this would be very useful for lightgbm in the mmlspark codebase, which currently aggregates all of the streaming data as a large Java array, converts that to a large native array, and passes that into the lightgbm native codebase. I think this would reduce memory usage by 1/3 (I think 1/3 is used by Java, then 1/3 by the temporary native array created from Java, and another 1/3 by the dataset in lightgbm).
Hmm, I'm actually not entirely sure how much memory is used on the Java side in mmlspark. It might actually reduce memory usage by 50%, since I think we make a deep copy of the Java data when reading it from the DataFrame in the mapPartitions function. So 1/4 of the data is loaded by the user in spark, 1/4 is currently aggregated and turned into a native array, another 1/4 of memory is taken up by that native array, and finally the last 1/4 is taken by the lightgbm dataset (I think it makes a copy of the native array passed in). Streaming, if implemented correctly, would cut out the two middle quarters. In any case, this enhancement would definitely reduce memory usage substantially.
Great analysis @imatiach-msft :) In our case with the Java provider, the input Java Dataset was taking ~70% of the total memory (due to the overhead of the per-value Java pointers, which don't exist at all in C++).
Now we read data in streaming into the `ChunkedArray` instead. It's great to see we're all on the same page, and thanks for the input!
@AlbertoEAF the PR looks amazing! Do you have plans to add sparse support as well? LGBM_DatasetCreateFromMats is only for the dense case. We really need sparse support for mmlspark, as it's used in many use cases as well.
Thanks @imatiach-msft :D Well I didn't, but as soon as ChunkedArray is merged, it should be trivial to support collecting data in streaming for the sparse case as well from MMLSpark's code. There are two parts to it:
**About Coalescing**

As each `ChunkedArray` stores its values in fixed-size chunks, the data must first be coalesced into a single contiguous array before it can be handed to LightGBM:

```scala
val coalescedArray = lightgbmlib.new_intArray(chunkedArray.get_add_count());
// Copy data to the coalescedArray:
chunkedArray.coalesce_to(coalescedArray); // Memory spike: 2 copies of data in C++.
chunkedArray.release(); // Spike gone.
...
```

**Getting rid of all memory spikes**

Ideally, in the future a dataset-creation call could consume the `ChunkedArray` directly, so no coalesced copy would ever be needed and the remaining memory spike would disappear.

It's important to remember here that the memory usage of a dataset copy is much higher in Java than in C++, as Java's memory representation is less efficient.

**Summary**

After `ChunkedArray` is merged, MMLSpark can collect data in streaming fashion, substantially reducing peak memory usage.

What do you think @imatiach-msft ?
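To make the spike concrete, here is a rough C++ sketch (not LightGBM's actual implementation) of what the `coalesce_to` call in the Scala snippet boils down to on the native side:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: copying chunked storage into one contiguous output buffer.
// While this runs, both the chunks and the output buffer are alive,
// which is exactly the temporary "2 copies of data" spike above.
void coalesce_to(const std::vector<std::vector<int32_t>>& chunks,
                 int32_t* out) {
  std::size_t pos = 0;
  for (const auto& chunk : chunks) {
    for (int32_t v : chunk) {
      out[pos++] = v;  // copy; the chunks are freed only afterwards
    }
  }
}
```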
Through the new class ChunkedArray it's possible to collect data in Java in a streaming fashion. Rationale described at microsoft#3995.
gentle ping @imatiach-msft :)
@AlbertoEAF yes, sure, sounds great to me. I guess I will just need to add sparse support sometime in the future. Maybe for now, if I upgrade, I can just use the new optimized method for dense data but keep the sparse path on the memory-intensive code.
Closed in favor of #2302. We decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.
Motivation
Although LightGBM supports creating a dataset from an array of matrices through `LGBM_DatasetCreateFromMats`, there's no support for loading data from a data stream in a single-pass fashion without knowing its size beforehand. For bigger datasets in distributed systems this can be a performance limitation.

This feature adds support for it in SWIG, although it is not Java-specific and could be exposed in the rest of the code too.
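For context, a rough sketch of that entry point's declaration (parameter names and order are approximate and may not match the exact `c_api.h` header; `DatasetHandle` is a `void*` typedef there):

```cpp
#include <stdint.h>

typedef void* DatasetHandle;  /* as in LightGBM's c_api.h */

/* Approximate signature of the existing dense entry point. Note that
 * nrow must be known for every matrix before the call, which is
 * exactly what a single-pass stream cannot provide. */
int LGBM_DatasetCreateFromMats(int32_t nmat,       /* number of matrices  */
                               const void** data,  /* one pointer per mat */
                               int data_type,      /* float32/64 enum     */
                               int32_t* nrow,      /* rows per matrix     */
                               int32_t ncol,
                               int is_row_major,
                               const char* parameters,
                               const DatasetHandle reference,
                               DatasetHandle* out);
```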
Proposed Solution
Problem:
Assume the input dataset size is not known (stream) and you want to copy the data in a single pass.
Solution:
It is possible to implement such a feature in C++ with a dynamic "chunked array" - i.e., an array of arrays (chunks), where all chunks have a fixed size. The sequence is the following (a sketch follows the list):

1. Values are added one at a time to the current (last) chunk.
2. When the current chunk is full, a new chunk is allocated and insertion continues there.
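A minimal sketch of the idea (the `ChunkedArray` name and the `add`/`get_add_count`/`coalesce_to`/`release` methods match the API discussed in this thread, but this simplified implementation is illustrative, not LightGBM's actual code):

```cpp
#include <cstddef>
#include <vector>

// Illustrative chunked array: appends in amortized O(1) time and never
// moves previously stored values (unlike a growing std::vector).
template <typename T>
class ChunkedArray {
 public:
  explicit ChunkedArray(std::size_t chunk_size) : chunk_size_(chunk_size) {
    new_chunk();
  }

  // Step 1: add values one at a time to the current chunk.
  // Step 2: when the current chunk is full, allocate a new one.
  void add(T value) {
    if (chunks_.back().size() == chunk_size_) {
      new_chunk();
    }
    chunks_.back().push_back(value);
  }

  // Total number of values added, computable in O(1).
  std::size_t get_add_count() const {
    return chunk_size_ * (chunks_.size() - 1) + chunks_.back().size();
  }

  // Copy everything into a caller-provided contiguous buffer.
  void coalesce_to(T* out) const {
    std::size_t pos = 0;
    for (const auto& chunk : chunks_) {
      for (const T& v : chunk) out[pos++] = v;
    }
  }

  // Free all chunks, leaving one empty chunk ready for reuse.
  void release() {
    chunks_.clear();
    new_chunk();
  }

 private:
  void new_chunk() {
    chunks_.emplace_back();
    chunks_.back().reserve(chunk_size_);
  }

  std::size_t chunk_size_;
  std::vector<std::vector<T>> chunks_;
};
```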
After reading the dataset, its size can be computed in O(1): size = chunk_size * (num_chunks - 1) + count_in_last_chunk.
This array of arrays can now be used in the `LGBM_DatasetCreateFromMats` call.

Patch notes
Example usage:
In C++:
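(The original example code did not survive in this thread; the following is a hedged reconstruction against the sketch class above, using the `add`, `get_add_count`, `coalesce_to`, and `release` methods referenced throughout the discussion.)

```cpp
#include <cstddef>
#include <vector>
#include "chunked_array.hpp"  // assumed location of the class

int main() {
  ChunkedArray<double> values(1024);  // fixed chunk size of 1024

  // Single-pass consumption of a stream of unknown length
  // (simulated here by a small fixed buffer):
  const double stream[] = {0.5, 1.5, 2.5, 3.5};
  for (double v : stream) {
    values.add(v);  // new chunks are allocated on demand
  }

  // Size is known only after the stream is exhausted, in O(1):
  std::size_t n = values.get_add_count();

  // Coalesce into one contiguous buffer if a flat array is needed,
  // e.g. to feed LGBM_DatasetCreateFromMats:
  std::vector<double> flat(n);
  values.coalesce_to(flat.data());
  values.release();  // chunks freed; the temporary spike is gone
  return 0;
}
```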
This was properly wrapped in SWIG, so it works as a regular Java class too.
Here's the patch diff, and I'd really like to see LGBM gain this support. Can I submit an MR with it?
Special request for input @imatiach-msft @guolinke @StrikerRUS :)