You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The API LGBM_DatasetPushRows (and CSR version) are not thread safe for sparse data. Internally, there is some thread safety for the OpenMP parallelism, since each thread gets its own buffer in a SparseBin. However, if multiple threads each call LGBM_DatasetPushRows, OpenMP will return the same set of tid's (i.e. 0 to num_omp_threads-1) for each external thread. Hence, external thread A and external thread B will both get an internal tid of 0, and eventually collide when appending to the same sparse buffer.
Reproducible example
Simply call LGBM_DatasetPushRows on a Dataset with sparse bins from 2 different threads, and you will eventually get an access violation when 2 threads both try to add a record to the same element of a SparseBin push_buffers_ array.
Environment info
N/A
Additional Comments
A fix for this issue has been added to the PR #5299, since we require thread safety to stream between multiple Spark Tasks (this issue was discovered from sporadic test failures with the existing PR). The fix as first pushed only applies to the new streaming push APIs in that PR (e.g. LGBM_DatasetPushRowsWithMetadata), but it could be added to the existing PushRows APIs as well if so desired. The fix (to be approved still) involves initializing the bins with the number of external threads that will be used, thus allowing the SparseBin to add extra push buffers for thread isolation.
The text was updated successfully, but these errors were encountered:
@svotaw Thanks for working on this. Yes. I think LGBM_DatasetPushRows wasn't designed for multithreading from outside the API before. I'll review the PR this weekend.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Description
The API
LGBM_DatasetPushRows
(and CSR version) are not thread safe for sparse data. Internally, there is some thread safety for the OpenMP parallelism, since each thread gets its own buffer in aSparseBin
. However, if multiple threads each callLGBM_DatasetPushRows
, OpenMP will return the same set of tid's (i.e. 0 to num_omp_threads-1) for each external thread. Hence, external thread A and external thread B will both get an internal tid of 0, and eventually collide when appending to the same sparse buffer.Reproducible example
Simply call
LGBM_DatasetPushRows
on aDataset
with sparse bins from 2 different threads, and you will eventually get an access violation when 2 threads both try to add a record to the same element of aSparseBin
push_buffers_
array.Environment info
N/A
Additional Comments
A fix for this issue has been added to the PR #5299, since we require thread safety to stream between multiple Spark Tasks (this issue was discovered from sporadic test failures with the existing PR). The fix as first pushed only applies to the new streaming push APIs in that PR (e.g. LGBM_DatasetPushRowsWithMetadata), but it could be added to the existing PushRows APIs as well if so desired. The fix (to be approved still) involves initializing the bins with the number of external threads that will be used, thus allowing the SparseBin to add extra push buffers for thread isolation.
The text was updated successfully, but these errors were encountered: