Performance improvements #81
Fascinating performance/speed improvements for inference. On a separate note: if I were to run either training or inference workloads on my GPU, is there currently an easy way to expose a switch to go from "device=cpu" to "device=gpu" within the lightgbm4j API? Thanks a lot!
My case is different: I need to make predictions in a single thread as fast as possible, so I can't give informed advice about the parallel scenario.
Hey @vad0, thanks for your ideas on latency! I've exposed fast-predict methods in the API and added a section in the docs on the latency implications of the predict-mat, predict-row and predict-fast methods. As for exposing SWIG buffers externally to avoid allocations, I'm on the fence about this idea: it puts you one step closer to an accidental JVM segfault if you do something wrong with these data structures. Feel free to reopen the ticket (or propose a PR with a safe way of avoiding allocations altogether).
As for single-row latency, the recently released 4.4 has a fix related to your issue: microsoft/LightGBM#6024
Is it possible that the published Maven package lightgbm4j-4.3.0-1, which is referenced in the docs, does not contain these latest changes yet? For example, I don't seem to find LGBMBooster.FastConfig exposed there.
@jornd13 Yes, these changes were made only on the master branch (so previous artifacts were unchanged), but I'm preparing a new release now.
Here is the release: https://github.com/metarank/lightgbm4j/releases/tag/4.4.0-1. Maven artifacts may take up to 8 hours to become available.
Thanks a lot for the clear examples of using the SWIG APIs. I tried to use this library in a low-latency environment and had to do a bunch of things to make it faster. Some of them can be described in the documentation, which would save time for users; others can be implemented in code. My task is to make predictions as fast as possible. Here are the benchmarks.
The base case is the `predictDefault` benchmark, which calls the `predictForMat` function. It takes almost 1 ms. Why? Because by default LightGBM tries to parallelize computations: it creates threads under the hood, executes jobs asynchronously, waits for results, and so on. None of this pays off when the job is a single prediction. So I added the following argument: "num_threads=1 device=cpu". The result can be seen in the `predictSingleThread` benchmark: only 12 µs. I suggest adding the same argument by default when making predictions.

Then I found out that there is a method which predicts only a single row and runs in 7.9 µs on my machine. Please mention it in the docs! 7.9 µs is good, but it can be further improved.
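The single-threaded setup described above can be sketched as follows. This is a hedged sketch, not verbatim lightgbm4j code: it assumes a `predictForMat` overload that accepts a LightGBM parameter string (mirroring the `parameter` argument of the C API's `LGBM_BoosterPredictForMat`), and `"model.txt"` is a placeholder path; check the lightgbm4j javadoc for the exact signature in your version.

```java
import io.github.metarank.lightgbm4j.LGBMBooster;
import io.github.metarank.lightgbm4j.PredictionType;

public class SingleThreadPredict {
    public static void main(String[] args) throws Exception {
        // Load a trained model; "model.txt" is a placeholder path.
        LGBMBooster booster = LGBMBooster.createFromModelfile("model.txt");

        double[] features = {0.1, 0.2, 0.3, 0.4};

        // Passing "num_threads=1 device=cpu" skips LightGBM's internal
        // thread pool, whose dispatch overhead dominates the latency of a
        // single-row prediction.
        double[] prediction = booster.predictForMat(
            features, /* rows */ 1, /* cols */ features.length,
            /* rowMajor */ true,
            PredictionType.C_API_PREDICT_NORMAL,
            "num_threads=1 device=cpu");

        System.out.println(prediction[0]);
    }
}
```

The same parameter string can equally be set once at model load time if all predictions in the process should be single-threaded.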
The method `predictForMatSingleRow` allocates a bunch of off-heap structures under the hood, makes the prediction, then frees those structures. If we reuse these structures, we save some time: the `predictSingleRowNoAllocation` benchmark runs in 6.9 µs. I would suggest overloading `predictForMatSingleRow` to take `dataBuffer`, `outBuffer` and `outLength` as parameters.

Still, we can do better. LightGBM allows making the preparations once and then issuing many predictions, by calling `LGBM_BoosterPredictForMatSingleRowFastInit` and `LGBM_BoosterPredictForMatSingleRowFast` respectively. lightgbm4j doesn't expose this functionality. However, if we make predictions this way, the time per call drops further, to 2.8 µs. Async-profiler shows that LightGBM does not call malloc/free at all in this case. These methods should probably be added to the friendly API.

And finally, we can do even better. JNI is not the fastest way to pass data to native memory. Since SWIG gives us raw addresses, we can read and write at those addresses using Unsafe, or use the friendlier UnsafeBuffer from Agrona. This saves another 0.29 µs, which is a 10% speed-up at this point. Here is my repo with the experiments: https://github.com/vad0/lightgbm. If anyone is interested in doing some of the things I described, I will be ready to help.