static_thread_pool is not work-conserving #1305
Comments
From my understanding, other threads will be notified if the local bwos-queue has at least one stealable batch. You are correct that each bwos-configuration has its own pathological setups. Is there a particular defect in the current implementation of the algorithm where siblings are not woken up when they should, according to the algorithm?
Here I agree with you. It was a quick solution which I wanted to address in another pass, which never happened. My focus shifted away from the thread pool. I think in the original BWoS papers the authors used a simple pool-wide queue to take work from.
The constructor of the pool should take the bwos parameters as an optional argument. If you set the block size to 1, you should get a work-conserving scheduler.
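For illustration, a minimal sketch of what that could look like, assuming the constructor accepts an exec::bwos_params argument with a blockSize member (the exact type and field names are an assumption; check the actual header):

```cpp
#include <exec/static_thread_pool.hpp>

int main() {
  // Assumed parameter type/field name (exec::bwos_params with a blockSize member).
  exec::bwos_params params{};
  params.blockSize = 1;  // block size 1: every task should be individually stealable
  exec::static_thread_pool pool{8, params};
  auto sched = pool.get_scheduler();
  (void) sched;
}
```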
With this:
the deadlock still repros. The other example gets resolved, but I can recreate the pathological behavior by tweaking the example a bit and scheduling the tasks on a chain: note that the sleep happens after the scheduling, so all the tasks should start executing almost instantaneously; instead, they run sequentially.
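A rough sketch of the chained variant (my reconstruction, not the exact code from the gist): each task schedules its successor before starting its long sleep, so all of them should become runnable almost immediately.

```cpp
// Reconstruction (assumed): each task first schedules the next one and only
// then does its long-running work, yet they still run one after another.
#include <exec/async_scope.hpp>
#include <exec/static_thread_pool.hpp>
#include <stdexec/execution.hpp>
#include <chrono>
#include <functional>
#include <thread>

int main() {
  exec::static_thread_pool pool{8};
  auto sched = pool.get_scheduler();
  exec::async_scope scope;

  std::function<void(int)> spawn_chain = [&](int remaining) {
    scope.spawn(stdexec::schedule(sched) | stdexec::then([&, remaining] {
      if (remaining > 1) {
        spawn_chain(remaining - 1);                         // schedule the next task first...
      }
      std::this_thread::sleep_for(std::chrono::seconds(1)); // ...then do the "work"
    }));
  };
  spawn_chain(8);
  stdexec::sync_wait(scope.on_empty());
}
```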
As far as I understand, that happens in
But I was incorrect here, since it also happens on
I have to take a deeper look, but that sounds like a bug to me. I expect this configuration to work in your examples.

```cpp
inline void static_thread_pool_::thread_state::push_local(task_base* task) {
  // If the bounded local (stealable) queue is full, fall back to the
  // worker-private pending queue.
  if (!local_queue_.push_back(task)) {
    pending_queue_.push_back(task);
  }
}
```

Currently this "pending" queue is a non-stealable, worker-private queue. If you block one worker thread forever, this could again lead to work not being processed forever. So having a finite queue for stealable work plus a worker-private queue seems like a bad combination. I will reconsider this idea.
And there should always be a thread that is looking for a chance to steal a work item. IIRC, if a thread goes to sleep, it checks whether it's the last thief and wakes up a sleeping one if necessary. But again, I will take a look ASAP, probably tomorrow.
I think this is a red herring: the local queue in these cases does not get full. The problem here is that "number of tasks" is not a good heuristic for running time when individual tasks are unbounded, so there is no guarantee of how frequently the sibling threads can be woken. I think there is an implicit assumption that tasks are small, so both the inability to steal part of a block and the possibly infrequent opportunities to steal are not a large problem. But if we have arbitrarily large tasks, some pathologies can emerge.
I might have missed this. So does the polling continue indefinitely, even if all the queues are empty?
Yes, IIRC. The invariant was: if possible, there is always at least one thief. See

```cpp
inline void static_thread_pool_::thread_state::clear_stealing() {
  // If this was the last thread actively stealing, wake a sleeping worker so
  // that (if possible) at least one thief remains.
  if (pool_->numThiefs_.fetch_sub(1, std::memory_order_relaxed) == 1) {
    notify_one_sleeping();
  }
}
```
I understand this. But I think even if we fix the bug and your example works for block size = 1, the worker-private pending queue is still a source of problems.
Yes, I think the whole BWoS idea optimizes for lots of small tasks. Optimal usage requires choosing the bwos parameters to fit the application.
I think the problem is that when I execute the task it is already dequeued, so there is room for its continuation in the unstealable slot. I will change that order to first execute, then dequeue.
I see, but this means that we're constantly spinning through the threads even when there is no work to do. I just verified that a program that creates the pool and then sleeps without enqueuing any tasks runs at 100% CPU utilization (one core's worth of CPU). Is that desirable?
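A minimal sketch of such a program (assumed; not necessarily the exact test used): construct the pool, schedule nothing, and sleep, then watch CPU usage.

```cpp
// Assumed minimal reproduction: create the pool, enqueue nothing, and sleep.
// Per the observation above, CPU usage stays at roughly one core.
#include <exec/static_thread_pool.hpp>
#include <chrono>
#include <thread>

int main() {
  exec::static_thread_pool pool{8};
  std::this_thread::sleep_for(std::chrono::seconds(30));
}
```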
A totally empty pool should probably be put to sleep. :-) I'm not so sure about the other cases, but I'm happy to revisit the spin loops in any case. They should at least emit pause instructions and yield. All suggestions, or even PRs, are highly appreciated.
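As an illustration of that suggestion (generic code, not taken from the pool's implementation), a spin loop can escalate from a CPU pause hint to yielding and finally to a short sleep:

```cpp
#include <chrono>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#endif

inline void spin_backoff(unsigned iteration) {
  if (iteration < 64) {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_pause();                    // hint to the CPU that we are spinning
#else
    std::this_thread::yield();
#endif
  } else if (iteration < 256) {
    std::this_thread::yield();      // give other threads a chance to run
  } else {
    std::this_thread::sleep_for(std::chrono::microseconds(50));  // back off harder
  }
}
```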
I wish I had suggestions 😄 I noticed this because I'm looking at a few work-stealing implementations to see how they deal with sibling wake-ups, and all I've seen so far is different variations of polling, possibly throttled to trade off some latency for overhead.
The current implementation of static_thread_pool might use only a subset of the available threads even when there are tasks scheduled. The work-stealing mechanism only runs on threads that are running and would otherwise transition to idle; if a thread is already idle, nothing will wake it up even if other threads have non-empty queues (EDIT: sibling threads can also be notified in between tasks, but that depends on waiting for the current task to complete). The result is that the scheduler is not work-conserving; this can cause both correctness issues and pathological performance.
For an example of incorrectness, the following code should terminate, but instead it deadlocks:
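(The original snippet is not reproduced here; the following is a reconstruction of the kind of program described, see the gist linked below for the full repro.)

```cpp
// Assumed reconstruction: an outer task blocks in sync_wait() on a child task
// scheduled on the same pool.
#include <exec/static_thread_pool.hpp>
#include <stdexec/execution.hpp>

int main() {
  exec::static_thread_pool pool{2};
  auto sched = pool.get_scheduler();

  stdexec::sync_wait(stdexec::schedule(sched) | stdexec::then([&] {
    // Runs on a pool thread; the child goes into this thread's local queue,
    // no sibling is woken, and this inner sync_wait() never returns.
    stdexec::sync_wait(stdexec::schedule(sched) | stdexec::then([] {}));
  }));
}
```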
This happens because the child task is scheduled from a scheduler thread, so it goes directly into that thread's local queue without waking up any of the other threads. Note that the same code does not deadlock with libunifex, which uses a more primitive scheduler (a single locked queue) that is, however, work-conserving.
For an example of pathological performance, the following code spawns N long-running tasks on a pool of N threads, so they should all run in parallel:
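(Again a reconstruction based on the description, with N = 8 and each task sleeping one second; the exact code is in the linked gist.)

```cpp
// Assumed sketch: 8 long-running tasks on a pool of 8 threads.
#include <exec/async_scope.hpp>
#include <exec/static_thread_pool.hpp>
#include <stdexec/execution.hpp>
#include <chrono>
#include <thread>

int main() {
  exec::static_thread_pool pool{8};
  auto sched = pool.get_scheduler();
  exec::async_scope scope;
  for (int i = 0; i < 8; ++i) {
    scope.spawn(stdexec::schedule(sched) | stdexec::then([] {
      std::this_thread::sleep_for(std::chrono::seconds(1));  // long-running "task"
    }));
  }
  stdexec::sync_wait(scope.on_empty());  // ~1s if the tasks actually run in parallel
}
```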
Instead, they run sequentially, so the program takes 8 seconds instead of the expected 1 second (libunifex correctly takes 1 second).
Generally, work-stealing schedulers have some mechanism to wake up siblings when a thread queue is non-empty. However, it is tricky to do this efficiently, because this requires frequent cross-thread synchronization; if done naively, the local thread and a sibling will compete for each newly scheduled task. So it may not be easy to fix without affecting performance.
Full repro: https://gist.github.com/ot/64712433d33dab75559731f84dba5304
On an unrelated note, the remote queue management seems inefficient: if I am understanding correctly, each remote thread will create a queue for each thread in the scheduler. So if we have two schedulers with N threads scheduling tasks into each other, each will have N^2 remote queues, which seems expensive to poll when collecting remote tasks.