
Multi-threading the Rack engine

Let's break down the current implementation of multi-threading and see what it offers and why it is how it is.

Use the little rocket ship to choose how many threads you want.

There's no other way to put this - this is terrible. A user should not have to do this. Limiting the number of threads might be a configuration setting but it should not be a requirement. It's a very technical implementation detail to expose to a user.

What are the alternatives and why were they not chosen?

Alternative 1: Start a thread per core and sleep until needed.

Sleeps kill the rack engine.

This is the general form for a processor-intensive workload: a queue of work is provided and multiple threads take jobs off the queue; if the queue is empty, they sleep until a job is available. The problem is that the job queue is only one sample deep, meaning each module must have its step method called exactly once per sample. The job queue therefore empties every sample and every thread goes to sleep. Experimentally, the time it takes to sleep a thread and wake it again is not workable at audio rates of over 44k times a second.

A sample rate of 44.1kHz gives us a maximum budget of about 22µs per sample (less at higher sample rates). A wake may take anywhere from 0.1-30µs depending on cache hits, core migration, etc. Early experiments went down this route and gave worse performance than a single-threaded engine.
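To make the problem concrete, here is a minimal sketch of the rejected sleep-based worker pattern described above, using a condition variable. None of these names exist in the engine; the point is only to show where the per-sample wake-up cost is paid.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Hypothetical sketch of the rejected sleep-based worker pool.
std::mutex gQueueMutex;
std::condition_variable gWorkAvailable;
std::queue<std::function<void()>> gJobs;
bool gShutdown = false;

void sleepingWorker() {
    while (true) {
        std::function<void()> job;
        {
            std::unique_lock<std::mutex> lock(gQueueMutex);
            // The job queue is only ever one sample deep, so this wait is hit
            // on every sample; the 0.1-30µs wake-up cost eats the ~22µs budget.
            gWorkAvailable.wait(lock, [] { return gShutdown || !gJobs.empty(); });
            if (gShutdown)
                return;
            job = std::move(gJobs.front());
            gJobs.pop();
        }
        job();  // e.g. one module's step() call
    }
}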

Alternative 2: Add threads based on how long it takes to process samples.

This solution suffers from a number of small issues and one large one.

First, what threshold do we use for starting a new thread? If a patch uses 80% of our capacity and we start a new thread, we're now using 40% of the new capacity. If the patch is finished and doesn't grow, we're wasting 60% of our capacity.

Second, how do we measure processor use? Modules do not take the same amount of time to process a sample on every call. Many modules batch their work, even in small batches, which means most calls are very fast but some take significant time. Other modules need different amounts of processing depending on the signals going into them, or otherwise vary with their internal state.

This behaviour means the engine can't look at the time it takes to produce one sample and decide to start or stop a thread. Even if it could, starting and waking threads takes significant time: if we're at the start of a sample and decide we need more threads, they won't be available to us by the end of that sample.

We need to average processor use over time. This introduces the big problem:

The audio module expands to fill all the processor time. This means that as far as the engine is concerned, it takes on average at least 22µs to produce a sample at 44.1kHz, regardless of how many modules are in the patch. The engine doesn't know what the audio module is. We can't simply exclude the audio module from our measurements because: a) there is no identifier to signal which module is the audio module, and b) the audio module needs significant processing time of its own.

Alternative 3: Add a thread when stuttering occurs

This is bad for a number of reasons, the main one being that the problem has already occurred before we correct it. We also don't know the reason for the stuttering: it might be that other applications on the system are starving us of CPU, in which case adding a thread will make matters worse. If we've got a single badly behaving module, no amount of extra threads will help and, again, will only make matters worse.

This also doesn't let us remove threads that aren't needed anymore. We could measure usage and drop threads below a certain threshold, but then we've got all the usage measurement issues noted before - if we could do that, we wouldn't need this approach at all. We could periodically drop old threads until stuttering happens again, but now we're just sabotaging what was a working system.

Decision

Since I couldn't find any other solutions to this problem, I went with asking the user how many threads to use. They can make informed decisions based on their environment and the behaviour of Rack. But this is far from ideal and it's worth working towards a more sophisticated solution that doesn't require this.

A last decision: where should this configuration be stored? Should it be stored at all, or reset to 1 on every Rack load? Reset to 1 on every patch load? Stored as a system setting? Stored as a patch setting? In the end, I chose to store it as a system setting to avoid polluting patches in the wild with random information that isn't used by the mainstream version of Rack.

Distributing the workload

Now that we know how many threads we're working with, we need to distribute the workload.

The engine cycle works in 2 steps:

  1. Have each module produce its next output values.
  2. Shuffle the output values to the corresponding input values using wires.

We cannot perform step 2 until step 1 is completely finished, which gives us a processing barrier between steps 1 and 2. Step 2 is a very lightweight, predictable process that can be done in a single thread, at least for now. But technically there is also a processing barrier between steps 2 and 1. Each operation must be completed in its entirety, guaranteed, before proceeding to the next.

Note this applies only to that particular engine implementation. The barriers can be defined more precisely: a module step cannot be performed until all inbound and outbound wires have been stepped (inputs have been set and outputs taken). A wire step cannot be performed until the modules at both ends of the wire have been stepped (the output is ready and the previous input has been consumed). To get things going, wires have an initial value of 0 and all module initial outputs are considered to have been taken. A module with no wires should have no effect on the patch; however, it should still be stepped in sync with the other modules, although the exact nature of that sync is ambiguous.

This gives us the following task: distribute the calls to each module->step() across all the available threads. Once all the step methods have returned, call all the wire->step() methods. Once they're all done, repeat.

I chose the following implementation for the worker threads:

State:

  • running - flag whether a thread should be running at all, set to false to kill the thread safely.
  • stepping - flag whether the thread is ok to run module steps. If false, we're stepping wires and should wait.
  • moduleIndex - shared atomic, the index of the last module that was processed.
  • completedProcessors - shared atomic, the number of processors that have completed their work.
while (running) {
  block_waiting_for_stepping_to_be_true(); // or running to be false, edge case.
  increment_module_index_to_get_next_module_to_process();
  while (our_next_module_to_process_exists()) {
    step_next_module();
    increment_module_index_to_get_next_module_to_process();
  }
  set_stepping_to_false();
  signal_thread_complete();
}

https://github.com/Rcomian/Rack/blob/a0df3f3e12c2770b7659abd403bd67e08c4190fb/src/engine.cpp#L92

Note: we can't use moduleIndex alone to determine if the work is complete. This is because the module index is incremented to its final value before setting the stepping value to false. This sets up a race condition with the main thread, which may set the stepping value to true just before we set it to false. If this happens, the next sample is processed without this thread, as we overwrite that true value to false, then spin waiting for it to be true again. We could add an extra increment after we set stepping to false, but then we just obscure what we're trying to say with no real benefits.

This implementation distributes the work using a single shared atomic value. This works because when multiple threads increment an atomic value, each thread receives a unique value & all values are given to the calling threads with no gaps. Only atomics offer this feature without requiring locks.

The work is synchronised using the completedProcessors atomic. Once that value hits the number of processing threads we have, we know all work is complete and we're safe to step the wires.
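Putting the pieces together, here is a minimal C++ sketch of that worker loop. The names are illustrative rather than the actual identifiers in engine.cpp.

#include <atomic>
#include <vector>

constexpr int MAX_THREADS = 16;  // illustrative limit, not from the engine

struct Module { virtual void step() = 0; };

std::vector<Module*> gModules;              // all modules in the patch
std::atomic<bool> gRunning{true};           // set false to kill the workers safely
std::atomic<bool> gStepping[MAX_THREADS];   // per-thread "ok to step modules" flag
std::atomic<int> gModuleIndex{-1};          // index of the last module claimed
std::atomic<int> gCompletedProcessors{0};   // threads that have finished this sample

void workerLoop(int threadId) {
    while (gRunning.load()) {
        // Spin until the main thread signals that module stepping may begin
        // (or until the engine is shutting down - the edge case noted above).
        while (!gStepping[threadId].load() && gRunning.load()) {}

        // Claim modules one at a time; fetch_add hands each thread a unique
        // index with no gaps, which is what makes the distribution work.
        int i = gModuleIndex.fetch_add(1) + 1;
        while (i < (int)gModules.size()) {
            gModules[i]->step();
            i = gModuleIndex.fetch_add(1) + 1;
        }

        gStepping[threadId].store(false);
        gCompletedProcessors.fetch_add(1);  // signal that this thread is done
    }
}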

These threads need some orchestration, which is done by the main engine thread. This thread:

  1. Sets the initial states so that the shared module index points before the first module and the number of completed processors is 0.
  2. Sets each processing thread to "stepping".
  3. Spin-waits for the completedProcessors atomic to reach the number of processing threads we have.

https://github.com/Rcomian/Rack/blob/a0df3f3e12c2770b7659abd403bd67e08c4190fb/src/engine.cpp#L205

Since the main thread has nothing better to do than wait for the other processors to complete, we have a spare thread, so we might as well put it to use processing modules too. We don't need to start spinning on completedProcessors until we've run the moduleIndex off the end of the array, so there's very little wasted work here.
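As a sketch, the main engine thread might look like the following, reusing the illustrative names from the worker-loop sketch above; stepWires() stands in for the single-threaded wire pass.

#include <atomic>
#include <vector>

struct Module { virtual void step() = 0; };  // as in the sketch above
extern std::vector<Module*> gModules;
extern std::atomic<bool> gStepping[];
extern std::atomic<int> gModuleIndex;
extern std::atomic<int> gCompletedProcessors;

void stepWires();  // single-threaded wire stepping (not shown)

void engineStepSample(int numWorkers) {
    // 1. Reset shared state: the index points before the first module and
    //    no worker has reported completion yet.
    gModuleIndex.store(-1);
    gCompletedProcessors.store(0);

    // 2. Release every worker thread.
    for (int t = 0; t < numWorkers; t++)
        gStepping[t].store(true);

    // The main thread has nothing better to do than wait, so it claims
    // modules from the same shared index as the workers.
    int i = gModuleIndex.fetch_add(1) + 1;
    while (i < (int)gModules.size()) {
        gModules[i]->step();
        i = gModuleIndex.fetch_add(1) + 1;
    }

    // 3. Spin until every worker has finished, then step the wires.
    while (gCompletedProcessors.load() < numWorkers) {}
    stepWires();
}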

This implementation is actually relatively efficient, despite the spin waits. This is because the main thread only waits as long as it takes the last module to process on each worker thread. Unless a module on a worker takes an extraordinary amount of time to process, the main thread should not spend very long waiting for other threads at all. (Narrator's voice: there is one module that takes an extraordinary amount of time to process.)

The wasted busy spins are all on the worker threads in a well defined location, not the main engine thread.

There is an advantage to this implementation: when there are no additional worker threads, the performance and behaviour of the single-threaded engine are unchanged from stock (save for some minimal bookkeeping). No spins are performed at all, since all the busy loops terminate immediately.

Weaknesses with this distribution method

Busy waits

The main weakness here is that the additional threads are busy waiting. This isn't a huge problem, as they should only be busy waiting while waiting for the other threads to finish their final modules and for the wires to step. But since there's always one module that takes an extraordinary amount of time to process, we end up busy waiting during the engine time sync.

Atomic contention

There is also some contention over the atomic moduleIndex. Whilst it's lock-free, an atomic isn't actually free: the cache line holding the value is effectively locked for each atomic operation (on older hardware it was the whole bus). Whilst faster than a thread lock, this still has performance implications as all cores need to be synchronised. If we've got multiple CPUs, we potentially have inter-socket communication involved too. When only light modules are used, a lot of the processing time goes into contention on the atomic rather than processing the light modules. However, we are still talking very high processing rates, so it isn't much of an issue.

An alternative would be to have each thread know which modules it should process ahead of time by assigning modules to threads. This way, there are no contention issues during the actual module processing and we have only the one call per thread to the completedProcessors atomic.

For this to be efficient, each thread needs its module set to be perfectly balanced with the others to avoid them spinning uselessly whilst there's work that could be getting done.

This becomes an impossible problem to solve, however, since we can't tell the relative weight of the modules ahead of time to see if the sets are balanced - any module at any time could decide to take a long time to process a sample, leaving the following modules in that thread unprocessed while the other threads spin uselessly. And there's always one module that takes a long time to process a sample.

In practice, the atomics contention probably isn't an issue. I have data to show it exists, but only with light modules and processing rates far above what is reasonable for a patch. But it's worth noting that with thousands of light (NOOP) modules, we do get diminishing returns with extra threads due to contention on the atomics.

Cache busting

One less thought-about problem is the CPU cache. The most efficient way to process a set of N samples is to call the same instance of a module N times in a tight loop, then call the next module N times, and so on. This keeps both the instruction and data caches for the module hot.

The next most efficient method is to call all modules of the same type together in the same thread. If there are many modules of the same type it is still necessary to distribute the workload across threads, but any one thread should group all the modules of the same type together as it iterates through its workload. This keeps the instruction cache hot.

The least efficient method possible from a caching perspective is to call modules randomly on a thread exactly once and then go and do something else. This is exactly what the current algorithm does, evicting the CPU caches immediately after a module is processed.
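Ignoring wire dependencies for the sake of illustration, the difference between the two extremes looks something like this. This is a hypothetical comparison, not engine code.

#include <vector>

struct Module { virtual void step() = 0; };

// Cache-friendly: each module's instructions and data stay hot for N calls.
void processBatched(std::vector<Module*>& modules, int N) {
    for (Module* m : modules)
        for (int n = 0; n < N; n++)
            m->step();
}

// Roughly what the current engine does: every module is visited once per
// sample, so each step tends to evict the previous module's cache lines.
void processPerSample(std::vector<Module*>& modules, int N) {
    for (int n = 0; n < N; n++)
        for (Module* m : modules)
            m->step();
}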

I do not have data on the impact of this, but it's a next-level optimisation that is available to the Rack engine, at least in principle. It's worth considering the impacts of cache busting when considering new processing topologies, together with metrics to determine the size of the problem. Caches will always tend to increase in size, so the problem will tend to shrink over time; however, cached vs uncached performance differences are often immense, and when we're squeezing every ounce out of every thread, as we are in Rack, it should probably be a consideration. I think it's unlikely that a large patch which requires multiple threads to run will fit its modules' instruction and data requirements into the L1 caches.

It's true that a large module may not fit into the L1 caches anyway. Such a module will be slower to run than otherwise expected, but its impact on the engine goes beyond its own performance: it will evict the entire contents of the L1 cache every time it runs. If it's sandwiched between two modules of the same type, those modules' instructions will generate cache misses, whereas a smaller module placed between them would have let both modules' code fit in the L1 cache.

https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips

https://blogs.msdn.microsoft.com/oldnewthing/20181205-00/?p=100405

Sleeping the engine

The main engine thread needs to sync with wall time. We need to produce enough samples to keep the audio (or other) buffers full and no more. The audio hardware needs to consume the samples at a predictable rate (and generally will).

This means that our engine is going to spend some time doing nothing. With a single thread of execution, we can just block. This kind of blocking is fine because it only happens when there's nothing to do by definition. There is a wake-up cost but it's amortised across all the samples that are processed until the next blocking operation.

With multiple threads we could block too. The difficulty is that we need to communicate this blocking operation to those threads.

Blocking in VCV Rack is not done by the engine, but by the audio module. This makes it impossible for the engine to reason about its workload or plan resource allocation. The main engine may block whilst holding locks in the rest of the system and it doesn't get the opportunity to drop them.

Single engine mutex

In this implementation, each thread tries to lock a single shared mutex during its busy loop. Most of the time it will succeed and so has to unlock it again. The audio engine can then lock the mutex to cause all the other threads to sleep. This might be reasonably efficient with a shared lock implementation: https://en.cppreference.com/w/cpp/thread/shared_mutex
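A rough sketch of how that could look, assuming std::shared_mutex were available; the names are illustrative.

#include <shared_mutex>  // C++17

std::shared_mutex gEngineSleepMutex;

void workerSpinIteration() {
    // Each busy-wait iteration briefly takes the lock in shared mode; many
    // workers can hold it at once, so this is normally cheap.
    std::shared_lock<std::shared_mutex> lock(gEngineSleepMutex);
}

void audioModuleSleepSection() {
    // Taking the lock exclusively blocks every worker's next shared_lock,
    // putting them all to sleep while the audio module blocks.
    std::unique_lock<std::shared_mutex> lock(gEngineSleepMutex);
    // ... potentially blocking audio I/O happens here ...
}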

However, this mutex is only available in C++17, which wasn't readily available at the time I was working on this. (How was this not a feature in C++11?)

I didn't want to bring in boost or some random implementation.

This would work with a simple exclusive lock, but it would be very inefficient: the worker threads would randomly block each other for no reason, and the wake-up costs would delay their ability to process the next sample. You also get issues with lock convoys, starvation and priority inversion, and you're reliant on the internal fairness of the lock's implementation to ensure things work at all.

Send a message to sleep all the threads

The audio threads could spin waiting for a message telling them to go to sleep. The problem is that the audio module as implemented doesn't actually know whether it needs to sleep until it does. So the message needs to be sent every time the audio module might sleep. This puts all the spinning threads to sleep immediately, causing them to pay the wake-up cost every sample - we're back to our lock-per-sample situation, which is worse than single-threaded.

Individual thread locks

The solution currently used is that each thread has its own mutex which it locks and unlocks repeatedly when spinning. Since no other threads are normally locking this mutex, the operation is relatively cheap and doesn't cause the thread to be rescheduled. It could be made marginally cheaper by only locking every N iterations around the loop, but I couldn't decide on a sensible value for N.

When the audio module hits an area where it might sleep, it first calls engineSleep(). This code locks all the audio thread mutexes in the engine. It then calls engineWake() once the potentially blocking operation has passed. This still has the potential to artificially block any spinning processing threads, but it only happens once (or twice) per sample, so it's less likely than having all the threads lock the same mutex during their spin. There are other mitigations we can make if it becomes an issue.
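A minimal sketch of this scheme, with illustrative names rather than the actual engine identifiers:

#include <mutex>
#include <vector>

struct EngineWorker {
    std::mutex sleepMutex;  // locked by engineSleep() to park this worker

    void spinIteration() {
        // Each pass around the busy-wait loop briefly takes and releases our
        // own mutex. Normally uncontended and cheap; if the audio module is
        // holding it, we block here until engineWake() releases it.
        sleepMutex.lock();
        sleepMutex.unlock();
    }
};

std::vector<EngineWorker*> gWorkers;

void engineSleep() {
    // Called by the audio module before it might block: park every worker.
    for (EngineWorker* w : gWorkers)
        w->sleepMutex.lock();
}

void engineWake() {
    // Called by the same thread once the potentially blocking operation has passed.
    for (EngineWorker* w : gWorkers)
        w->sleepMutex.unlock();
}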

With the audio processing threads sleeping when the audio module does, we reduce the wasted overhead of the spinlocks. The overhead is now just the time between the first thread finishing its workload and the last thread finishing its workload. And the difference between these times is at most one module (and not the audio module) - if there was another module to process, we'd go and process it rather than spin.

Final optimisations

There are 2 final optimisations added to the engine, and to be honest they're probably not worth it, given the extra complexity they add. They should be removed, but they demonstrate interesting issues.

Reversed processing order

The first is a change to the module processing order. The worst case scenario is to have a heavy module processed last. If we've got 2 threads processing 50 modules, one of which is heavy, we can run through 49 modules very quickly, then 1 thread spins uselessly while the other churns through the heavy module. The best case would be to process the heavy module first; that way, the other thread can churn through all the light modules whilst the heavy module is being worked on. This minimises the wasted spin time. If we can find a way to process heavy modules first, we win.

The engine as it currently is cannot assign a reasonable weight to the modules so as to process them in any particular order, certainly not on a sample-by-sample basis. We might be able to use the CPU meter features to do this, but currently this isn't done. So the engine can't tell what a heavy module is, but a user might. However, a user can't put a module at the start of the processing list, which is what they need to do. But they can put it at the end of the list - by duplicating and rewiring the module, then deleting the original. So if we process the list in reverse order, we give the user the option to dig themselves out of a hole by duplicating and rewiring a heavy module. They don't need to know why this hack works; it just becomes an extra option.

It's not a great feature, I'll admit, but the underlying problem of module processing order is real.

Grouping modules

To reduce contention on the atomic, I've tried grouping the modules so that a thread has to increment the central counter fewer times on each run. This was more a proof of concept and I have no numbers to back up this optimisation in the real world. It made a marginal difference in artificial benchmarks and was probably still not worth it given the complexity it adds to the engine. It has a significant downside: by grouping heavy modules together in a group tied to a thread, we're unable to use any threads that come free to help process them. This can lead to additional useless spinning and pressures the engine to use smaller groups to minimise it, which in turn reduces the benefit gained from incrementing the atomic less often.

Burn this one with fire.

Future tech

IO Buffers

One thing that keeps coming up is that the engine is completely blindsided when it comes to keeping timing sync. It might be possible to mitigate this by giving the engine a concept of IO buffers. An audio module would then register an IO buffer with the engine and ask it to keep it full or consumed (depending on direction). When the engine wakes, it can then look at all its IO buffer states and determine how many samples it needs to generate. It can then generate the samples as fast as possible with no blocking, then determine whether it needs to sleep. Sleep in this case can be orderly - first dropping global locks and sleeping the audio threads, then sleeping for exactly the right amount of time according to the IO buffer states.

This is speculation at this point; I've no implementation to back it up. It just feels as if this change would make it easier to reason about the engine's behaviour.
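Purely as a thought experiment, the interface might look something like this; none of it exists today and the names are made up.

// Speculative sketch only - no such API exists in the engine.
struct EngineIOBuffer {
    enum Direction { ENGINE_FILLS, ENGINE_DRAINS };  // keep it full or consumed
    Direction direction;
    int capacity;   // in samples
    int level;      // samples currently buffered
};

// An audio module would register its buffer; on waking, the engine would look
// at every registered buffer to decide how many samples to generate before
// deciding whether, and for how long, to sleep.
void engineRegisterIOBuffer(EngineIOBuffer* buffer);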

Dependency walking

This would be a major change to the engine algorithm. We can determine at any point in time which modules are available to be processed. A module with no wires going into it can be called in a tight loop, optimising cache usage. If a downstream module then has N samples queued for it, we can also call it in a tight loop N times, and so on. The problem with this algorithm is that the available work changes as we proceed through the network. At one time, there might only be one module up for processing, and thus only opportunity for one thread to do any useful work, the others spinning uselessly. At other times, there may be a glut of work encouraging us to spawn more threads. Balancing the thread load here might be possible to do automatically, especially if the engine is in control of the IO buffers and so can tell the state of the system.

The major issue with this algorithm is that work distribution is much more complex than with the simple algorithm we currently have. The overheads would therefore be higher, meaning we'd really need to experiment to see if it produces a net positive overall. This kind of algorithm would be one way to take advantage of CPU caching.

Blocking threads

Spinning threads is an ugly and wasteful solution to the synchronisation problem. If we can find a lock that is lightweight enough to run at audio rates while still providing most or all of the additional processing time that we get with an extra thread, it would definitely be worth putting it in place.

Given that the multithreaded engine is fairly stable now, it might be worth trying a new solution with standard mutexes to see what happens.