[Async] Change std::async to tbb #705
base: main
Conversation
This commit resolves a segfault in the H2O benchmark on the 1e9 dataset. Currently a `std::system_error` is thrown during checksum calculation. According to cppreference (https://en.cppreference.com/w/cpp/thread/async), this error is thrown when there are not enough resources to create a new thread:

```
std::system_error with error condition std::errc::resource_unavailable_try_again, if policy == std::launch::async and the implementation is unable to start a new thread. If policy is std::launch::async | std::launch::deferred or has additional bits set, it will fall back to deferred invocation or the implementation-defined policies in this case.
```

To avoid this situation, I rewrote this method to use TBB, which should automatically schedule the work on the available threads in its thread pool instead of spawning a new OS thread per task.

Signed-off-by: Dmitrii Makarenko <dmitrii.makarenko@intel.com>
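A minimal sketch of the idea, not the actual patch: the function name, `first_row`/`last_row`, `granularity`, and the per-row work are placeholders, assuming the method iterates over a row range and `tbb::parallel_for` with a grain size replaces the per-worker `std::async` calls:

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>

// Before: one std::async task per worker could fail with
// std::errc::resource_unavailable_try_again when the OS cannot start a thread.
// After: tbb::parallel_for splits [first_row, last_row) into chunks of at
// least `granularity` rows and runs them on TBB's fixed-size thread pool,
// so no extra OS threads are created per call.
void fetch_rows_parallel(size_t first_row, size_t last_row, size_t granularity) {
  tbb::parallel_for(
      tbb::blocked_range<size_t>(first_row, last_row, granularity),
      [&](const tbb::blocked_range<size_t>& r) {
        for (size_t row = r.begin(); row != r.end(); ++row) {
          // per-row fetch / checksum work goes here (placeholder)
        }
      });
}
```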
Why is using tbb better than adjusting the worker count by some metric similar to the granularity parameter you introduced? When I did performance characterization with tbb, the cost of creating the tbb context and task group was higher than simply launching async threads.
As far as I understand, tbb here should check the thread pool and run slices of rows no smaller than the granularity. If we use a raw thread vector and launch it ourselves, there is a danger of a system_error due to exhausted system resources. We could catch and retry here, but I decided that tbb is the better option.
result_set::use_parallel_algorithms(rows) ? cpu_threads() : 1;
// Heuristics, should be tuned somehow
size_t granularity = (last_row - first_row) / (worker_count * 3);
granularity = std::max(granularity, static_cast<size_t>(10));
When we work with TBB algorithms, we should ignore the number of actual cores. Granularity should give us tasks of a proper size. Proper size means the task's execution time is not too long, to allow good load balancing (in this case it shouldn't exceed several ms), and not too short, to keep scheduler overhead small (execution time shouldn't be less than 5us). E.g. we can use a small batch size (like 64-256KB) and divide it by the number of bytes fetched per row, or use another similar heuristic.
I just don't know how to estimate the data size; the method can memcpy some data or just pass it through. I think it's better to have some granularity than none, but I am not sure which estimate should be used here: a pessimistic one assuming everything is memcpy'd, or something else? Also it's a new heuristic, so if the approach is fine, I will need some statistics on data access.
You don't have to develop the perfect heuristic from the beginning. Any would most probably be better than the current code. What I suggest now is to compute the number of bytes we copy for a single row and then set granularity to `(128 << 10) / bytes_per_row`. This would roughly make 128KB tasks.
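A sketch of that suggestion, where `bytes_per_row` is a hypothetical input that would have to be derived from the fetched columns:

```cpp
#include <algorithm>
#include <cstddef>

// Target roughly 128KB of copied data per task, independent of core count.
// bytes_per_row is a hypothetical estimate of the data touched per row.
inline size_t compute_granularity(size_t bytes_per_row) {
  constexpr size_t kTargetTaskBytes = 128 << 10;  // ~128KB per task
  return std::max<size_t>(1, kTargetTaskBytes / std::max<size_t>(1, bytes_per_row));
}
```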