[Async] Change std::async to tbb #705
base: main
Conversation
This commit resolves a segfault in the H2O benchmark on the 1e9 dataset. Currently a `std::system_error` is thrown during checksum calculation. According to cppreference (https://en.cppreference.com/w/cpp/thread/async), this error is thrown when there are not enough resources to create a new thread:

```
std::system_error with error condition std::errc::resource_unavailable_try_again, if policy == std::launch::async and the implementation is unable to start a new thread. If policy is std::launch::async | std::launch::deferred or has additional bits set, it will fall back to deferred invocation or the implementation-defined policies in this case.
```

To avoid this situation, I rewrote this method to use TBB, which should automatically schedule the work on the available threads in its thread pool instead of spawning a new OS thread per task.

Signed-off-by: Dmitrii Makarenko <dmitrii.makarenko@intel.com>
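A minimal sketch of the idea, not the actual patch: the function name, `first_row`/`last_row`, `granularity`, and the per-row work are placeholders, assuming the method iterates over a row range and `tbb::parallel_for` with a grain size replaces the per-worker `std::async` calls:

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>

// Before: one std::async task per worker could fail with
// std::errc::resource_unavailable_try_again when the OS cannot start a thread.
// After: tbb::parallel_for splits [first_row, last_row) into chunks of at
// least `granularity` rows and runs them on TBB's fixed-size thread pool,
// so no extra OS threads are created per call.
void fetch_rows_parallel(size_t first_row, size_t last_row, size_t granularity) {
  tbb::parallel_for(
      tbb::blocked_range<size_t>(first_row, last_row, granularity),
      [&](const tbb::blocked_range<size_t>& r) {
        for (size_t row = r.begin(); row != r.end(); ++row) {
          // per-row fetch / checksum work goes here (placeholder)
        }
      });
}
```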
Why is using tbb better than adjusting the worker count by some metric similar to the granularity parameter you introduced? When I did performance characterization with tbb, the cost of creating the tbb context and task group was higher than simply launching async threads.
As far as I understand, tbb here should check the thread pool and run slices of rows no smaller than the granularity. If we use a raw thread vector and launch it ourselves, there is a danger of a system_error due to exhausted system resources. We could catch and retry here, but I decided that tbb is the better option.
result_set::use_parallel_algorithms(rows) ? cpu_threads() : 1;
// Heuristics, should be tuned somehow
size_t granularity = (last_row - first_row) / (worker_count * 3);
granularity = std::max(granularity, static_cast<size_t>(10));
When we work with TBB algorithms, we should ignore the number of actual cores. Granularity should give us tasks of a proper size. Proper size means the task's execution time is not too long, to allow good load balancing (in this case it shouldn't exceed several ms), and not too short, to keep scheduler overhead small (execution time shouldn't be less than 5us). E.g. we can use a small batch size (like 64-256KB) and divide it by the number of bytes fetched per row, or use another similar heuristic.
I just don't know how to estimate the data size; the method can memcpy some data or just pass it through. I think it's better to have some granularity than none, but I am not sure which estimate should be used here: a pessimistic one assuming everything is memcpy'd, or something else? Also it's a new heuristic, so if the approach is fine, I will need some statistics on data access.
You don't have to develop the perfect heuristic from the beginning. Any would most probably be better than the current code. What I suggest now is to compute the number of bytes we copy for a single row and then set granularity to `(128 << 10) / bytes_per_row`. This would roughly make 128KB tasks.
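A sketch of that suggestion, where `bytes_per_row` is a hypothetical input that would have to be derived from the fetched columns:

```cpp
#include <algorithm>
#include <cstddef>

// Target roughly 128KB of copied data per task, independent of core count.
// bytes_per_row is a hypothetical estimate of the data touched per row.
inline size_t compute_granularity(size_t bytes_per_row) {
  constexpr size_t kTargetTaskBytes = 128 << 10;  // ~128KB per task
  return std::max<size_t>(1, kTargetTaskBytes / std::max<size_t>(1, bytes_per_row));
}
```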