-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue with call to c_api Predict functions in multi threaded way from golang #3751
Comments
I will try to take a look if I take some time but not sure if I will be able soon |
From the description it seems to be the thread_id the source of the problem then? So can it be that this is more of a preexisting issue that #2760 just exposed? |
Could this be related to https://forum.openmp.org//viewtopic.php?f=3&t=653&start=0 ? |
@JoanFM Let me check with a sample c++ code if my issue is also the same as given in the link. I will get back to you on this soon. |
Hi @AlbertoEAF, Can you check this thread. That one was an older thread which I had posted before I did any observations |
Did u manage to reproduce the issue as in the thread @tarunreddy1018 ? |
Hi @JoanFM , I have not yet tried reproducing the issue, I will first try to check running few sample c++ code as mentioned in the link. This will help in giving a clear picture. If the thread_id (tid) still comes out to be 0, then the issue was something related to our openmp library linking and not really with c api. But will check that and get back to you on the same thread. Mean while can you let me know, starting from v3.0.0 there were changes in locking strategy right. Having "shared_locks" and "unique_locks". So, seeing "shared_locks" in the predict function call made me believe that we can call predict functions in a parallel way. This should give performance improvement. So, were the predict functions written with an intent to make them call in a parallel way? I mean was there any test done, which calls predict function in multi threaded way and that ensures thread safety. If any one else also faced this issue or its just me who is facing this issue. There are 2 things that could be the potential issues here
|
Hey @tarunreddy1018 , I did add the shared and unique locks under the assumption that const functions have no risk to change the object while keeping unique locks for non-const functions. So from this point of view I do not see why it would be unsafe to run it in a multithreaded way, however I am not so sure about the design and the robustness that was put from the beginning into it. When I opened the PR I had the intention to use that in a project but I did not get to the point of using it in a multithreaded way so did not fully test the behavior. |
Thanks for the very detailed report @tarunreddy1018 ;)
Great catch @StrikerRUS! I was looking for that
I think @JoanFM might be dead right, as in my experience lightgbmlib 2.3.150 (which corresponds to some commit between v2.3.1 and v2.3.2) this random scores already existed when using Java. Such is the case that we had to use synchronization on our Java provider code. Here is the proof, we had to wrap our provider's Predict calls (line 147) with a single-access region (no parallelism): and we used at the time lightgbmlib 2.3.150 (https://github.com/feedzai/feedzai-openml-java/blob/24a62dca18ddd5cf538256f1acc621cb3b092220/openml-lightgbm/pom.xml#L26) If we didn't do the synchronization (no parallelism) we would have random scores. I think @tarunreddy1018 's observations are dead right too regarding those thread id's and could be a good source for investigation. I never looked into LightGBM's multi-threading implementation, so here are my 2 cents - take it with a grain of salt: In the end it means that we should probably go back to unique locks in the meantime until the issue is fixed as a stop-gap if that fixes it, and use shared locks only on regions that truly do read-only calls. It might be that this stop-gap is useless if there are previous issues like 2.3.150 seemed to show though. This thread gives more info that further proves the issue with predictions and multi-threading: #3675 (comment). |
A method being const in C++ can still manipulate a lot of data. In LightGBM on top of that we do a lot of pointer manipulation so all guarantees of constness go out the door. |
@tarunreddy1018 can you just reformat your original question to have the code block encased by a line starting with 3 backquotes and cpp (i.e. "```cpp"), followed by the code block and then terminated with a line with 3 backquotes: "```"? Thanks :) |
I remember doing looking at the code and seemed safe but sure I may have missed something. |
@AlbertoEAF reformatted |
Yeah and from my point of view, the problems start before that even. I don't have time to look into this but at least @tarunreddy1018 's thread id observations might be a hint for someone to explore. In the meanwhile I'd suggest @tarunreddy1018 to put a lock in your code at least for now if you have some urgency as I don't think this can easily be fixed soon. It's still faster to use v3.1 than v2.3.1 due to other improvements behind the scenes and the new APIs such as the *Fast() prediction methods. |
@AlbertoEAF So, you would suggest to make a change in c api myself to have a lock in predict call. And build the library from that. And use this library right? This should give some performance improvements compared to 2.3.1 |
@AlbertoEAF Sure, I will check with thread id thing as soon as I get some time. |
@tarunreddy1018 I was suggesting doing the lock on the golang side around predicts, much easier :) Of course you can always modify the code and re-compile lightgbm but I don't think that will get you more performance than the solution above and will let you later try out different versions with and without your lock, meaning that when a new version comes out fixed you only need to remove your lock in golang ;) |
Yes Thanks, Will keep things posted as soon as I find any observations. That should help @AlbertoEAF :) |
I had a little closer look at the code, and it seems that this My question is how it was this multithreading using OMP was being used if there was a lock guard protecting the c_api. I understand u are calling at this function, am I right? void Predict(int start_iteration, int num_iteration, int predict_type, int nrow, int ncol,
std::function<std::vector<std::pair<int, double>>(int row_idx)> get_row_fun,
const Config& config,
double* out_result, int64_t* out_len) const {
SHARED_LOCK(mutex_);
auto predictor = CreatePredictor(start_iteration, num_iteration, predict_type, ncol, config);
bool is_predict_leaf = false;
bool predict_contrib = false;
if (predict_type == C_API_PREDICT_LEAF_INDEX) {
is_predict_leaf = true;
} else if (predict_type == C_API_PREDICT_CONTRIB) {
predict_contrib = true;
}
int64_t num_pred_in_one_row = boosting_->NumPredictOneRow(start_iteration, num_iteration, is_predict_leaf, predict_contrib);
auto pred_fun = predictor.GetPredictFunction();
OMP_INIT_EX();
#pragma omp parallel for schedule(static)
for (int i = 0; i < nrow; ++i) {
OMP_LOOP_EX_BEGIN();
auto one_row = get_row_fun(i);
auto pred_wrt_ptr = out_result + static_cast<size_t>(num_pred_in_one_row) * i;
pred_fun(one_row, pred_wrt_ptr);
OMP_LOOP_EX_END();
}
OMP_THROW_EX();
*out_len = num_pred_in_one_row * nrow;
} And the code snippet you shared is coming from the predictor. Since this predictor seems to live in the stack of every thread, I do not see why the I am sure I must be missing something ... Could it be that somehow 'regular' multithreading vs 'OpenMP' multithreading do not play well along together? I am sorry @tarunreddy1018, but I finally did not implement and apply this feature in productio after all. |
Another observation I found is, It feels that what OpenMP is trying to do and what the OpenMP is trying to parallelize one single prediction as much as possible around multiple trees, thus I guess optimizing for the speed of a single Could it be a hint of the possible problem? |
Hi, @JoanFM , The methods that we have been trying were Based on your observation it seems correct that each thread creates a new Predictor object as seen from this code This is the function call stack how the Predictor object is being created by the above two mentioned methods
This clearly shows that the above creation of new SingleRowPredictor is happening only once, Because in the code it looks like it is only creating if it is not created before. This makes all threads across different prediction calls use same Predictor object. Which in turn makes |
So maybe there is a problem that some of the Predictions can be thread safe and other can't. I do think that the OMP num_threads are supposed to work for a single prediction, but I have not much expertise on that |
I was thinking can we have the same logic as we had in other predict functions which are thread-safe. I mean to say that can we also have the other 2 predict functions also create a new Predictor object for each call as they do in other predict function. I believe this makes the other two thread safe as well. Right now the state of those 2 predict functions are such that we can only call one at a time. I am not completely sure if this is something related to bug or that was the intent from the beginning. Because I believe if other predict functions create a new Predictor object for each call why not these 2 predict do the same? |
I would agree, but I am not surr if allocating the buffer at each call may degrade the performance of single threaded too much? but anyhow it would benefit from more consistency. But as you I am not sure if it had some specific intentions |
Which ones create a Predictor all the time? The versions that are not "SingleRow" ? I don't mean it can't be tried, but it would have to be benchmarked, and the single-thread scenario too. Maybe it's really the case that the SingleRow versions are not meant for thread-safety, as it would require too many memory allocations / unit of work. @guolinke can we have your thoughts on the matter or of someone who has dealt with this in the past? |
@AlbertoEAF yes for example
This makes me suspect that all these predict functions create a new Predict object per thread call. And this Predictor Object has What I believe is, these methods are meant to be used for larger datasets when there are many rows of input. And lets says the number of threads is 8 that we specify and it creates 8 omp threads and all these rows will be divided across these 8 omp threads to speed up the computation. So, the amount of performance gain that we get by dividing these large set of rows across limited number of threads is more as compared to calling the Also I think if we want to have |
Hi, I have just verified and the predictions were consistent and as expected in multi threaded way with predict functions other than these two |
Yes exactly @tarunreddy1018 :)
True you can always use the "batch" prediction methods with a single instance if you want to do parallel calls ;). And yes, the single instance versions are optimized variants that minimize allocation overheads. I think it would be nice to have confirmation from @guolinke, @StrikerRUS, or another of the main devs that this is not a bug. If it's really the case that they were never meant to be thread-safe, we should add that to the C API documentation as you're not the first person having this issue ;) |
Hi, @AlbertoEAF , are all your benchmarks based on putting a unique_lock in the predict function? I had 2 suggestions that we can try:
I believe 2nd approach is good, because as per the c api code the These are just my thoughts. Because the performance boost that we get with having a shared lock and making it thread safe looks more promising.
|
By using a unique lock instead of the shared lock the timings are very similar, but predictions are correct. Even so, by designing a small C++ benchmark with a very simple LGBM model,more threads on a simple model are slower than the single-thread case. This is probably due to very small work units, the lock contention overhead increases. We should in the future benchmark with more complex models to see if supporting threading on these calls is worth it in performance gains. If not, then we could choose to not to provide thread-safety and remove the locks altogether for maximal throughput. See microsoft#3751 for timings. See gist for benchmark code: https://gist.github.com/AlbertoEAF/5972db15a27c294bab65b97e1bc4c315
maybe the best way is to add unique lock for now as quick patch and then maybe we can add a set of functions for multithreaded usage? or that would be too much mantainance overhead? |
@JoanFM Yes, agree. We can make it shared_lock or may be as you said have a seperate function for multi threaded usage. If doing so gives good performance sometime later. For now may be I will try out with shared lock approach and try having a new vector for each call and test it out for my use case. If it looks good then may be I will report later. So, that a thought can be given on that direction. |
Hi @tarunreddy1018 @JoanFM :) As you can see from the PR #3771 description and the benchmark code the change to unique lock at the Notice that neither The lock is only at the So a shared_lock wouldn't help us here, except if we go down one extra level of lock granularity, but this is already pretty tricky as we have virtual functions etc and in terms of maintenance would be pretty hard to maintain. Does it make sense? Or did I miss something in your points? Regarding @tarunreddy1018 's suggestion of combining |
@AlbertoEAF In the above statement, you meant to say that the performance gains that you got with shared lock and unique lock are almost similar right. So, shared lock might not help here right? |
I think that the measure that needs to be counted is the amount of predictions we can do in a unit of time and not the speed of the prediction right? |
I meant doing locking only inside the predictor exactly where needed in the inner calls of SingleRowPredict, so every statement that is parallelizable could be parallelized.
Exactly, as I do 3M predictions, if it had a higher throughput the program would finish sooner, thus measuring total wall time is what we want here. |
Ok I see @AlbertoEAF thanks! |
Thanks @AlbertoEAF @JoanFM |
Hi again @tarunreddy1018 :) I tried your idea of merging those two functions. Approach 1 was the laziest, turning the ..._helper() to a function that returns a function (basically computes the offset vs the data pointer), this allows us to have a single representation (pairs (int, value)): 706f2af...feedzai:single-row-predict-fix-lock-faster-RowPairFunctionFromDenseMatric2 You can check the code above. I got exactly the same timings as having the two intermediate vectors and going from the 1st vector to the pairs vector. I don't think there's much chance of improvement here anymore, the bottleneck is somewhere else. In fact when I was designing the *Fast() API, initially you could only use a fixed *data pointer for prediction that you set at the ...FastInit() call, and you replaced the *data values inplace. Even then, that alternative which didn't allocate memory per call ran with very little performance differences to the current version which allows using different data adresses per prediction (< 5% execution time savings). True @tarunreddy1018 , we can still try the other way around, getting rid of the pairs by changing code inside the Predictor but I'm afraid of that because that changes a lot of code, and I don't want to go into all that hassle :) But if you're up for it I can help you get the benchmark going and you can try changing it yourself ;) |
Ok I just profiled the benchmark, and it's normal that those changes have little/no effect. All the time is spent in internal calls already. Consider as 100% execution time the part of the code with the C API predict call I used valgrind, so I am not accounting for the mutex contention time that we've seen is not negligible as we increased number of threads. |
Ok, after reading the code over and over I can say 2 things:
|
@AlbertoEAF Right, I agree with that. I will try out as you suggested. Thanks for the information. I guess this is fine for now. |
@AlbertoEAF Hi, I have tried with the And also by doing so, the performance is really great for us. And also creating one extra vector for each call did not had much memory overhead as well. This was the only change that I had to do, rest everything the This was the code which I have tweaked in Before Change:
After Change:
Probably we can try this approach, This does not alter any other parts of the code also with minimal change and good performance with not so much memory overhead. Each of these parallel calls uses |
Hello @tarunreddy1018, First of all, interesting that you only need a new But did you try using the regular implementation with the lock and without setting I ask this because we might have different "optimal" implementations & use cases. For instance, if the model is not trivially simple, unless I'm mistaken a prediction is already parallelized internally, and hence doing several predictions in parallel with 1 thread vs 1 prediction at a time but with multiple threads each. I'm not sure this is better, and it might vary according to # trees, average tree depth and # features. In this implemenation you're removing that opportunity of parallelism inside the predict call right? Also take into account LightGBM is actually optimized for prediction in batch, which also relies on this code, thus this might degrade its performance and probably is not acceptable by the core devs. By the way, due to C++'s RAII, predict_buf_ will be deleted automatically after the closing brace, so no need to call clear & shrink_to_fit. ;) |
I agree with @AlbertoEAF I think these new predict functions are suited to have multiple predictions with different threads, while OpenMP tries to have one prediction with multiple threads. Maybe these 2 distinctions can be made when choosing the predict function to use, but the mantainance overhead may be a little too much |
@AlbertoEAF yes, agree with that. Probably In my use case having |
By using a unique lock instead of the shared lock the timings are very similar, but predictions are correct. Even so, by designing a small C++ benchmark with a very simple LGBM model,more threads on a simple model are slower than the single-thread case. This is probably due to very small work units, the lock contention overhead increases. We should in the future benchmark with more complex models to see if supporting threading on these calls is worth it in performance gains. If not, then we could choose to not to provide thread-safety and remove the locks altogether for maximal throughput. See #3751 for timings. See gist for benchmark code: https://gist.github.com/AlbertoEAF/5972db15a27c294bab65b97e1bc4c315
@tarunreddy1018 can we close this issue? If you want to see improvements regarding threading maybe open a new issue like a feature request regarding threading support with no locking and mention this issue and #3675 so we know all the details we discussed. |
@AlbertoEAF Sure, you can close this issue for now. Thanks |
@tarunreddy1018 I don't have permissions, must be you closing it :) |
Fixed via #3771. |
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
How we are using LightGBM:
We are using the LightGBM c_api in our service. But this c_api is being called from golang. Basically we have a golang wrapper around the c_api. And we call the c_api functions from golang using cgo.
We get the “lib_lightgbm.so” library file from the Github release section and use them.
Version of LightGBM being used:
3.1.1 (Observed with all versions >= 3.0.0)
LightGBM component:
C++ Library
Environment info:
Operating System: Observed on both Linux and MacOS
Architecture: x86_64
CPU model: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
C++ compiler version: gcc version 7.2.0 (Debian 7.2.0-1)
CMake version: 3.9.0
GLIBC version: ldd (Debian GLIBC 2.28-10) 2.28
Context:
Starting with LightGbm 3.0.0 version, there is an update in Predict functions where only the necessary parts of the predict function have been given with "unique_lock", and rest of the parts where they just do computation were given with "shared_lock".
Unlike the previous version like 2.3.1, where a "lock_guard" was used in the beginning of the predict function itself. This led to the situation where only one thread can execute this predict function at a time. This could not scale better in production environments where we do many predictions in a parallel way.
So, we decided to upgrade to a newer version of LightGbm 3.1.1. Which really improved the performance. Since now multiple threads can execute the predict function in a parallel way because of the "shared_lock".
Issue:
But, Now we saw that the predictions were not consistent. That is if there are many threads which are invoking this predict function in a parallel way on a large scale, And if we give the same input multiple times to the service, it turns out that we are getting different predictions each time we make a call.
This was not the issue with earlier LightGBM version that we used 2.3.1. All the predictions were consistent with that version. It could probably be because of the lock that it had in the predict function because of which we did not notice this issue.
To test this out quickly, we had to put a lock before we made a call to predict function. And now all the predictions were as expected and consistent.
So, we thought that there could be some race condition in the code or LightGbm was not meant to do predictions in a parallel way. To investigate it further we started looking into the c_api of predict function. And found out that it has a vector of vectors from which each thread gets its vector using thread id (tid) and copies the input data into that vector and passes it to predict function. Once the predict call is completed it clears the vector.
Definition of Vector of Vectors:
std::vector<std::vector<double, Common::AlignmentAllocator<double, kAlignedSize>>> predict_buf_
Code where Vector of Vectors is used:
So, we suspected that the problem could be arising from this vector of vectors and tried logging the tid into a file and surprisingly we saw that all of the calls were logging "tid" as "0".
This means that all the threads were using the same vector with "tid" as "0". Which led to a race condition.
Could you please let us know if the latest update was not meant to do predictions in a parallel way or Could it be some issue with our golang code calling c_api (Probably with openMP)?
The text was updated successfully, but these errors were encountered: