Optimising CPU cache
A baseline benchmark was taken by running a test patch on the current engine with 1 thread.
This technique increased single-threaded performance by 3x. Multi-threaded performance also scaled better than it does on the current engine.
With the current engine, adding 8 threads increases performance by about 3x compared to 1 thread. The new technique alone increases performance slightly more than adding 8 threads does. Adding 8 threads on top of this technique takes performance over 13x compared to the baseline.
More precisely: 8 threads with the current engine are 3.03x faster than 1 thread, while 8 threads with this improvement are 4.45x faster than 1 thread with this improvement. Since the single-threaded improvement is itself about 3x, the gains multiply to just over 13x.
With this improvement, performance also continues to scale slightly when thread counts go beyond the number of physical cores into hyperthreading.
On a large patch, calling multiple modules one after the other means that the instructions and data for each module are loaded into the CPU cache and then immediately discarded.
If you have 4 different modules A, B, C & D, and need to call each of them 4 times, it should be faster to call each module 4 times in turn than to call each module one after the other and repeat that sequence 4 times.
Fast call order:
A-A-A-A-B-B-B-B-C-C-C-C-D-D-D-D
Slow call order:
A-B-C-D-A-B-C-D-A-B-C-D-A-B-C-D
In both instances above, each module gets called the same number of times, but one way of calling the modules should be faster than the other. This is because the first time we call A, all the instructions and data for A are loaded into the CPU's cache. When we call A again, everything is already there and we just re-use everything.
With the second ordering, when we call A we load the cache, then we load B and throw away everything that we were using for A. Then we throw away what we had for B to call C, then we call D. When we come to call A again, there might be nothing at all left in the CPU cache from the original call. Also, because A is not likely to be called again on the same physical core, the cache won't help for that reason either: if A was cached on core 1 and we run A again on core 2, core 2's cache knows nothing about module A and everything has to be fetched from slower memory.
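As a rough illustration (hypothetical code, not the real Rack engine), the two orderings amount to nothing more than swapping two loops:

```cpp
// Hypothetical sketch, not the actual Rack engine API: the same total work,
// ordered two different ways.
#include <vector>

struct Module {
    virtual void process() = 0;    // one sample of work for this module
    virtual ~Module() = default;
};

// Fast order (A-A-A-A-B-B-B-B-...): each module's code and data stay hot
// in the cache while it is being called repeatedly.
void processBatched(const std::vector<Module*>& modules, int frames) {
    for (Module* m : modules)
        for (int i = 0; i < frames; i++)
            m->process();
}

// Slow order (A-B-C-D-A-B-C-D-...): each module's working set is likely
// evicted by the other modules before it gets called again.
void processInterleaved(const std::vector<Module*>& modules, int frames) {
    for (int i = 0; i < frames; i++)
        for (Module* m : modules)
            m->process();
}
```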
The purpose of this investigation is to determine how much performance could theoretically be gained by changing how the modules are called and whether it's worth pursuing this technique.
To test this, the VCV Rack engine was modified into a test bench. A large patch was used, with the audio module removed so that the engine would not block during testing. I then rigged the engine to produce 100,000 samples as fast as possible.
These samples didn't produce any audio, the UI was not updated, and the engine stopped after producing them and logged how long the test took.
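A minimal sketch of that kind of harness, assuming an engine type with a step() method that produces one sample (names are illustrative, not the real Rack API):

```cpp
// Hypothetical timing harness: generate N samples as fast as possible and
// log how long it took. EngineT::step() is an assumed method that produces
// one sample with no audio output and no UI updates.
#include <chrono>
#include <cstdio>

template <typename EngineT>
void runBenchmark(EngineT& engine, int n = 100000) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; i++)
        engine.step();
    auto end = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(end - start).count();
    std::printf("%d samples in %.3f seconds\n", n, seconds);
}
```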
On the test patch with 1 thread, 100,000 samples took ~2.5 seconds to generate.
The test patch with 8 threads took 0.83 seconds to generate the same 100,000 samples (my test machine has 8 physical cores).
The engine was then modified in the following ways:
- wires were given a buffer of 128 samples, initialised with 64 zero values
- each module was called 64 times in a loop, each call consumed a sample from the inbound wires and pushed any new samples onto the outbound wires
- the previous wire stepping loop was removed as it was now irrelevant
This meant that each wire introduced a 64-sample delay instead of the usual 1-sample delay. The purpose of this test was not to create a new engine, but to get an idea of the kind of improvements that were available.
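A minimal sketch of the shape of that change, using simplified Wire and Module types of my own rather than the real engine structures:

```cpp
// Simplified sketch with made-up Wire/Module types: wires carry a small FIFO
// of samples, and each module is stepped 64 times in a row so its code and
// data stay in the CPU cache.
#include <array>
#include <cstddef>
#include <vector>

struct Wire {
    std::array<float, 128> buffer{};   // 128-sample ring buffer
    std::size_t readPos = 0;
    std::size_t writePos = 64;         // pre-filled with 64 zeros => 64-sample delay

    float pop() {
        float v = buffer[readPos];
        readPos = (readPos + 1) % buffer.size();
        return v;
    }
    void push(float v) {
        buffer[writePos] = v;
        writePos = (writePos + 1) % buffer.size();
    }
};

struct Module {
    std::vector<Wire*> inputs;
    std::vector<Wire*> outputs;

    // Consume one sample from each inbound wire, push one sample to each
    // outbound wire. The "DSP" here is just a pass-through mix placeholder.
    void processOne() {
        float acc = 0.f;
        for (Wire* in : inputs)
            acc += in->pop();
        for (Wire* out : outputs)
            out->push(acc);
    }
};

// One engine step: instead of calling every module once per sample,
// call each module 64 times in a loop before moving on to the next.
void stepEngine(const std::vector<Module*>& modules) {
    for (Module* m : modules)
        for (int i = 0; i < 64; i++)
            m->processOne();
}
```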
However, the improvement was an immediate 3x over what we had before.
I integrated the changes fully into the engine and ran my patch manually with the UI. I was able to play the patch fully on 1 thread, where before I needed 7 threads to play without glitches at 96kHz. The delays didn't cause any noticeable artefacts in the audio output. This gave me confidence that the tests were real and that I wasn't shortcutting the engine in some fundamental way.
As a second test, I wanted to know what effect buffer size had on engine performance, so I looped over buffer sizes from 1 sample to 64 and tabulated the results. This test was run without the CPU meters turned on, so the results are not directly comparable to the previous test. The maximum improvement was just under 3x.
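The sweep itself was just a loop over buffer sizes; a sketch of the idea, assuming a callable that rebuilds the engine with a given wire buffer size and returns the time taken:

```cpp
// Hypothetical sweep: time the 100,000-sample run for every wire buffer size
// from 1 to 64 and print a table. The runBenchmark callable is an assumption,
// standing in for "rebuild the engine with this buffer size and time it".
#include <cstdio>
#include <functional>

void sweepBufferSizes(const std::function<double(int)>& runBenchmark) {
    for (int size = 1; size <= 64; size++) {
        double seconds = runBenchmark(size);
        std::printf("buffer size %3d: %.3f s\n", size, seconds);
    }
}
```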
There is an overhead to the implementation. Running the original engine without CPU meters took ~2.0 seconds. With the new implementation and a buffer size of 1, it took ~2.4 seconds (single-threaded).
The tests were then repeated with a larger ring buffer and buffer sizes between 65 & 256.
The results followed an exponential decay curve, with the inflexion point at around a buffer size of 13. I'd probably recommend a buffer size of 32; beyond that, gains were minimal.