Reimplement CSemaphore using WaitOnAddress() for better performance under contention #13
Conversation
The existing implementation uses a spinlock that falls back to a kernel semaphore. When contention is high and cannot be alleviated by the spinlock, this approach fails to deliver adequate throughput. The probable cause is the general slowness of the semaphore object created with `CreateSemaphore()`. The new implementation is based on the `WaitOnAddress()` primitive and delivers up to a 20x improvement in synthetic benchmarks. Since CSemaphore is used as a building block for other synchronization primitives, such as CCriticalSection, this translates into up to a ~5.4x throughput improvement in a benchmark with multiple concurrent db writers.

The benchmarks below were conducted in an environment with an i9-9900K CPU and an NVMe SSD.

Synthetic benchmark based on the adjusted semaphoreperf.cpp test:

- 1 thread: 93850 KOps/Thread → 93950 KOps/Thread
- 2 threads: 20445 KOps/Thread → 30165 KOps/Thread (+48 %)
- 3 threads: 6083 KOps/Thread → 10897 KOps/Thread (+79 %)
- 8 threads: 102 KOps/Thread → 2058 KOps/Thread (+1917 %)
- 16 threads: 45 KOps/Thread → 719 KOps/Thread (+1498 %)

Real benchmark with multiple concurrent db writers (the numbers represent the number of completed "write sequences" and are meaningful only as a relative measure of throughput):

- 1 writer: 8443 Sequences → 8562 Sequences
- 2 writers: 14934 Sequences → 15331 Sequences
- 3 writers: 17434 Sequences → 17386 Sequences
- 4 writers: 16583 Sequences → 17997 Sequences
- 5 writers: 15924 Sequences → 17431 Sequences
- 6 writers: 8437 Sequences → 17223 Sequences (+104 %)
- 7 writers: 5020 Sequences → 18145 Sequences (+261 %)
- 8 writers: 4131 Sequences → 17670 Sequences (+339 %)
- 9 writers: 4168 Sequences → 17438 Sequences (+318 %)
- 10 writers: 3670 Sequences → 19709 Sequences (+437 %)

On Win64, the size of CSemaphore remains unchanged (8 bytes). On Win32, the size of CSemaphore changes from 4 to 8 bytes, so the patch updates the size assertions for structures like PIB.
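For readers unfamiliar with the primitive, here is a minimal sketch of the core idea, with illustrative names rather than the actual ESE code: the count lives in ordinary user-mode memory, a thread that finds it at zero parks on that address with `WaitOnAddress()`, and a releaser bumps the count and wakes a parked thread with `WakeByAddressSingle()`, all without ever creating a kernel semaphore object.

```cpp
#include <windows.h>

#pragma comment( lib, "Synchronization.lib" )   //  home of WaitOnAddress() ( Windows 8+ )

//  sketch only, not the ESE implementation

class CSketchSemaphore
{
    public:
        explicit CSketchSemaphore( LONG cInit ) : m_cAvail( cInit ) { }

        void Acquire()
        {
            for ( ;; )
            {
                LONG cAvail = m_cAvail;

                //  fast path: try to decrement a positive count

                if ( cAvail > 0 &&
                     InterlockedCompareExchange( &m_cAvail, cAvail - 1, cAvail ) == cAvail )
                {
                    return;
                }

                //  slow path: park on the address until the count changes

                if ( cAvail == 0 )
                {
                    WaitOnAddress( &m_cAvail, &cAvail, sizeof( m_cAvail ), INFINITE );
                }
            }
        }

        void Release()
        {
            InterlockedIncrement( &m_cAvail );
            WakeByAddressSingle( (PVOID)&m_cAvail );
        }

    private:
        volatile LONG m_cAvail;
};
```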
Allowing other threads to steal the semaphore from the threads that are being woken up introduces a fairness issue and can lead to starvation. The underlying `WakeByAddressSingle()` primitive by itself allows for a fair implementation, because it guarantees that the threads are woken up in FIFO order [1]. However, if we don't block the other threads from stealing the semaphore from the waiters right after waking them up, those waiters may keep getting starved out. This behavior is demonstrated by the new regression test, CSemaphoreFairnessTest.

[1] https://docs.microsoft.com/windows/win32/api/synchapi/nf-synchapi-wakebyaddresssingle
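A rough sketch of what such a regression test can check, using a hypothetical harness rather than the actual CSemaphoreFairnessTest: hammer one semaphore from several threads for a while, then verify that every thread made progress. With a steal-prone implementation like the earlier CSketchSemaphore, some of the per-thread counters can stay at or near zero under heavy contention.

```cpp
#include <windows.h>
#include <atomic>
#include <thread>
#include <vector>

//  assumes the CSketchSemaphore from the earlier sketch

std::atomic<bool> g_fStop{ false };

void FairnessWorker( CSketchSemaphore* psem, std::atomic<long long>* pcAcquired )
{
    while ( !g_fStop.load( std::memory_order_relaxed ) )
    {
        psem->Acquire();
        psem->Release();
        pcAcquired->fetch_add( 1, std::memory_order_relaxed );
    }
}

int main()
{
    const int        cThreads = 8;
    CSketchSemaphore sem( 1 );

    std::vector<std::atomic<long long>> rgcAcquired( cThreads );
    std::vector<std::thread>            rgThreads;

    for ( int i = 0; i < cThreads; i++ )
    {
        rgThreads.emplace_back( FairnessWorker, &sem, &rgcAcquired[ i ] );
    }

    Sleep( 5000 );      //  let the threads contend for a while
    g_fStop = true;

    for ( auto& thread : rgThreads )
    {
        thread.join();
    }

    //  with a fair semaphore, no thread should be starved out entirely

    for ( int i = 0; i < cThreads; i++ )
    {
        if ( rgcAcquired[ i ].load() == 0 )
        {
            return 1;   //  starvation detected
        }
    }
    return 0;
}
```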
Simplify the code by switching to a single loop instead of using a nested loop for the OS-level wait. Ensure that when returning from the OS-level wait, we try to acquire the semaphore as soon as possible, without any intermediate actions such as adjusting the remaining wait time, since this is better from the standpoint of working under contention.
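A sketch of the loop shape this describes, simplified and with hypothetical names (INFINITE handling is omitted): the retry of the acquire is the very first thing done after `WaitOnAddress()` returns, and the remaining-time bookkeeping happens only if that retry fails.

```cpp
#include <windows.h>

#pragma comment( lib, "Synchronization.lib" )

//  single loop: retry the acquire first, adjust the remaining time second

BOOL FAcquireWithTimeoutSketch( volatile LONG* pcAvail, DWORD cmsecTimeout )
{
    const DWORD tickStart = GetTickCount();

    for ( ;; )
    {
        //  the retry is the very first thing we do after waking up

        LONG cAvail = *pcAvail;
        if ( cAvail > 0 &&
             InterlockedCompareExchange( pcAvail, cAvail - 1, cAvail ) == cAvail )
        {
            return TRUE;
        }

        //  only if the retry failed do we account for the time already spent

        const DWORD cmsecElapsed = GetTickCount() - tickStart;
        if ( cmsecElapsed >= cmsecTimeout )
        {
            return FALSE;   //  timed out
        }

        //  the OS-level wait is folded into the same loop

        cAvail = 0;
        WaitOnAddress( pcAvail, &cAvail, sizeof( *pcAvail ), cmsecTimeout - cmsecElapsed );
    }
}
```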
Within the CSemaphore::_Release() method, use `const INT` for an immutable local variable instead of a `LONG`, to match the return type of CSemaphoreState::CWait().
The new implementation inadvertently removed the check that allowed skipping the full acquire-release cycle when the semaphore is not contended. So let's bring it back.
Within that method, check the condition against a local copy of the state, as otherwise the state might change between the two parts of the condition.
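As a self-contained illustration of the pattern, assuming a hypothetical 32+32 state layout rather than ESE's actual CSemaphoreState: the shared 64-bit state is loaded once, and both halves of the condition are evaluated against that single snapshot.

```cpp
#include <windows.h>
#include <cstring>

//  hypothetical layout of the packed state
struct SSemStateSketch
{
    LONG cAvail;    //  available counts
    LONG cWait;     //  waiting threads
};

//  "no waiters and a count is available", checked against one snapshot;
//  reading the shared state once per half of the condition could observe
//  two different states, a local copy cannot

BOOL FUncontendedSketch( volatile LONGLONG* pqwState )
{
    //  one atomic 64-bit read of the whole state
    const LONGLONG qwStateCur = InterlockedCompareExchange64( pqwState, 0, 0 );

    SSemStateSketch stateCur;
    memcpy( &stateCur, &qwStateCur, sizeof( stateCur ) );

    return stateCur.cWait == 0 && stateCur.cAvail > 0;
}
```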
The new implementation of CSemaphore is OS-dependent, so it is located in the bottom half of the file. Let's keep CSemaphore's constructor and destructor together with that implementation in the file.
Use a slightly more descriptive name `cSpinRemaining` instead of `cSpinCount`. That also makes the new name more distinct from the method's parameter, `cSpin`.
The original code used the latter name, so let's not unnecessarily change that.
The original code had such explicit casts, and they are also used everywhere within sync.cxx, so let's not unnecessarily change that.
This changeset reworks the existing comments and adds the missing bits to match the style and the documentation level that is used in sync.hxx and sync.cxx. No functional changes intended.
Properly handle the case where this method is called in a state where no one is waiting on the semaphore. This should be a no-op. Add a regression test for this behavior, SyncSemaphoreNoopReleaseAllWaiters().
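A rough sketch of the check, assuming ESE's CSemaphore interface (ReleaseAllWaiters/CAvail); the actual test is SyncSemaphoreNoopReleaseAllWaiters() in the ESE test suite.

```cpp
#include <cassert>

//  sketch only: CSemaphore comes from ESE's sync.hxx

void TestNoopReleaseAllWaitersSketch( CSemaphore& sem )
{
    const INT cAvailBefore = sem.CAvail();

    //  no thread is waiting on the semaphore, so this must be a no-op

    sem.ReleaseAllWaiters();

    assert( sem.CAvail() == cAvailBefore );
}
```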
```cpp
volatile LONG m_cAvail;     //  0 <= m_cAvail <= ( 1 << 31 ) - 1

//  Mode 1:  waiters

volatile QWORD m_qwState;
```
It should be possible to keep the state as a LONG in order to continue supporting 32-bit, e.g. ARM.
Could you please elaborate on the "continue supporting 32-bit" part? Does it refer to the lack of support for certain 64-bit atomics on some platforms? Or is the incompatibility caused by changing the size of the state from 4 to 8 bytes under 32 bits?

The former doesn't look like a problem, at least based on what I could grasp from the code. I see that various parts of ESE rely on the presence of such atomics, for example in the implementation of CInvasiveConcurrentModSet<> or in LOG_WRITE_BUFFER::ErrLGScheduleWrite().

If it's the latter, then the "continue supporting" part is still somewhat unclear to me. I think CSemaphore does not participate in the on-disk format and is not a part of the ABI. So I was hoping that this change doesn't do anything beyond increasing the in-memory size of certain structures under 32 bits.
In general, I don't yet see an acceptable approach where we get to keep the state in 32 bits. With WaitOnAddress() we require a value for the semaphore's count to wait on, but we also have to keep a separate record of the number of waiting threads. So unless there is a trick that I haven't thought of, I see two options:

- Make those two values 32-bit, forming a 64-bit state
- Make those two values 16-bit, forming a 32-bit state

Option 1 is what the patch does (a sketch follows below).

Option 2 doesn't seem to be practically feasible, because CSemaphore currently accepts an INT as its semaphore count. Switching that to 16 bits would narrow the actual range of the accepted values, and I think that would constitute an unsafe change that requires validating and updating every usage of the semaphore.
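To make option 1 concrete, here is a minimal sketch of such a packed state, with hypothetical field names and layout rather than the patch's actual code: two 32-bit values share one 64-bit word, transitions update the whole word with a 64-bit compare-and-swap, and the count is laid out first so its address can be handed to WaitOnAddress().

```cpp
#include <windows.h>

//  hypothetical packed state: cAvail and cWait share one CAS-able 64-bit word
//  ( the anonymous struct inside a union is a widespread MS extension )

union USemStateSketch
{
    struct
    {
        LONG cAvail;    //  available counts; WaitOnAddress() watches this field
        LONG cWait;     //  number of threads currently waiting
    };
    LONGLONG qwState;   //  the whole state, for 64-bit atomics
};

//  example transition: atomically consume one available count

BOOL FTryAcquireSketch( volatile LONGLONG* pqwShared )
{
    USemStateSketch stateCur, stateNew;

    //  atomic 64-bit read of the current state
    stateCur.qwState = InterlockedCompareExchange64( pqwShared, 0, 0 );

    if ( stateCur.cAvail <= 0 )
    {
        return FALSE;
    }

    stateNew.qwState = stateCur.qwState;
    stateNew.cAvail  = stateCur.cAvail - 1;

    //  publish the transition only if no one raced us
    return InterlockedCompareExchange64( pqwShared, stateNew.qwState, stateCur.qwState )
               == stateCur.qwState;
}
```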
I thought it would be easier to apply changes internally after the comments are added, but I was mistaken :D
dev/ese/src/sync/sync.cxx (outdated):

```cpp
if ( State().FChange( stateCur, stateNew ) )
{
    volatile void *pv = State().PAvail();
```
It's a bit strange to expose this; maybe WakeByAddressAll could be made a member of CSemaphoreState? Same as WaitOnAddress. It seems that CSemaphore::_FOSWait has to improperly know a bit about the underlying type of CAvail in addition to having to call PAvail. On the other hand, it seems improper for CSemaphoreState to be waiting and waking...
> On the other hand, it seems improper for CSemaphoreState to be waiting and waking...

I agree with that: it's the semaphore that should be waiting for its state to change, not the state itself.

For the problem itself, I committed a reworked version of that code in 161b0bcb. The new version (hopefully) breaks the dependency between PAvail() and CAvail() by using a typed pointer for the address that we wait on. We also make the CompareAddress point to a value of the same type and use a static_assert for the sizes of the two values.

How does that sound?
Adjust the CSemaphoreState so that it returns a typed pointer to the semaphore count value. In _FOSWait(), accept a value of the same type. Assert that the values behind `Address` and `CompareAddress` have the same size when performing a WaitOnAddress().
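A minimal sketch of that shape, with hypothetical names rather than the code from 161b0bcb: the state hands out a typed pointer to the count, the wait helper accepts a compare value of the same type, and a static_assert pins the two sizes together before calling WaitOnAddress().

```cpp
#include <windows.h>

#pragma comment( lib, "Synchronization.lib" )

struct CStateSketch
{
    volatile LONG m_cAvail = 0;

    //  typed pointer: callers no longer need to know the underlying type
    volatile LONG* PAvail() { return &m_cAvail; }
};

//  the wait side takes a compare value of the matching type

BOOL FOSWaitSketch( CStateSketch* pstate, LONG cAvailCompare, DWORD cmsecTimeout )
{
    volatile LONG* const pcAvail = pstate->PAvail();

    static_assert( sizeof( *pcAvail ) == sizeof( cAvailCompare ),
                   "Address and CompareAddress must point at same-sized values" );

    return WaitOnAddress( pcAvail, &cAvailCompare, sizeof( *pcAvail ), cmsecTimeout );
}
```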
Cast to `void *` instead of `PVOID`, to be consistent with other similar casts around.
Since the semaphore count returned by CSemaphoreState::CAvail() and m_cAvail were originally signed, let's not unnecessarily switch to an unsigned type. Staying signed also allows us to remove some type casts.
Just wanted to check up on the status of this pull request. Is there something I can do to improve it? I think that the problem is still present in the latest versions of ESE. And since such write-heavy usage patterns involving ESE could be quite common, it would be great if there was a way to move forward with fixing it. Thanks!
Some time ago I found out that ESE's write throughput is severely capped by the number of concurrent writers. The attached jet_concurrent_write_stall.zip example illustrates this. Detailed benchmarks are below, but in short, all cases with more than 5 concurrent writers work even slower than a single writer. Apparently, the throughput is limited by the current implementation of CSemaphore.

The existing implementation uses a spinlock that falls back to a (slow!) kernel semaphore. When contention is high and cannot be alleviated by the spinlock, this approach fails to deliver adequate throughput. CSemaphore is a building block for other primitives such as CCriticalSection, which are in turn used in multiple performance-critical parts such as the log buffer writer.

This PR reimplements CSemaphore using WaitOnAddress()/WakeByAddress() as the underlying primitive. The new implementation delivers up to a 20x improvement in synthetic benchmarks. Since CSemaphore is used as a building block for other synchronization primitives, such as CCriticalSection, this translates into up to a ~5.4x throughput improvement in a benchmark with multiple concurrent db writers.

The benchmarks below were conducted in an environment with an i9-9900K CPU and an NVMe SSD.

Synthetic benchmark based on the adjusted semaphoreperf.cpp test:

- 1 thread: 93850 KOps/Thread → 93950 KOps/Thread
- 2 threads: 20445 KOps/Thread → 30165 KOps/Thread (+48 %)
- 3 threads: 6083 KOps/Thread → 10897 KOps/Thread (+79 %)
- 8 threads: 102 KOps/Thread → 2058 KOps/Thread (+1917 %)
- 16 threads: 45 KOps/Thread → 719 KOps/Thread (+1498 %)

Real benchmark with multiple concurrent db writers (the numbers represent the number of completed "write sequences" and are meaningful only as a relative measure of throughput):

- 1 writer: 8443 Sequences → 8562 Sequences
- 2 writers: 14934 Sequences → 15331 Sequences
- 3 writers: 17434 Sequences → 17386 Sequences
- 4 writers: 16583 Sequences → 17997 Sequences
- 5 writers: 15924 Sequences → 17431 Sequences
- 6 writers: 8437 Sequences → 17223 Sequences (+104 %)
- 7 writers: 5020 Sequences → 18145 Sequences (+261 %)
- 8 writers: 4131 Sequences → 17670 Sequences (+339 %)
- 9 writers: 4168 Sequences → 17438 Sequences (+318 %)
- 10 writers: 3670 Sequences → 19709 Sequences (+437 %)

The patch passes all relevant tests in my environment. CSemaphore is thoroughly tested in semaphore.cxx and indirectly by various other tests.