
Reimplement CSemaphore using WaitOnAddress() for better performance under contention #13

Open · wants to merge 21 commits into main
Conversation

evgenykotkov (Contributor) commented:

Some time ago I found out that ESE's write throughput is severely degraded
as the number of concurrent writers grows.

The attached jet_concurrent_write_stall.zip example illustrates this.
Detailed benchmarks are below, but in short, every case with more than 5
concurrent writers is even slower than a single writer.

Apparently, the throughput is limited by the current implementation of
CSemaphore, which uses a spinlock that falls back to a slow kernel semaphore
created with `CreateSemaphore()`. When contention is high and cannot be
alleviated by the spinlock, this approach fails to deliver adequate
throughput. CSemaphore is a building block for other primitives such as
CCriticalSection, which are in turn used in performance-critical parts
such as the log buffer writer.

This PR reimplements CSemaphore using WaitOnAddress() / WakeByAddress()
as the underlying primitive.

The new implementation delivers up to a 20x improvement in a synthetic
benchmark. Since CSemaphore is used as a building block for other
synchronization primitives, such as CCriticalSection, this translates into
up to a ~5.4x throughput improvement in a benchmark with multiple concurrent
db writers.

The benchmarks below were conducted in an environment with an i9-9900K
CPU and an NVMe SSD.

Synthetic benchmark based on the adjusted semaphoreperf.cpp test:

 1 thread:    93850 KOps/Thread  →  93950 KOps/Thread
 2 threads:   20445 KOps/Thread  →  30165 KOps/Thread  (+48 %)
 3 threads:    6083 KOps/Thread  →  10897 KOps/Thread  (+79 %)
 8 threads:     102 KOps/Thread  →   2058 KOps/Thread  (+1917 %)
16 threads:      45 KOps/Thread  →    719 KOps/Thread  (+1498 %)

Real benchmark with multiple concurrent db writers:

(The numbers represent the number of complete "write sequences"
and are meaningful only as a relative measure of throughput)

1 writer:     8443 Sequences  →   8562 Sequences
2 writers:   14934 Sequences  →  15331 Sequences
3 writers:   17434 Sequences  →  17386 Sequences
4 writers:   16583 Sequences  →  17997 Sequences
5 writers:   15924 Sequences  →  17431 Sequences
6 writers:    8437 Sequences  →  17223 Sequences  (+104 %)
7 writers:    5020 Sequences  →  18145 Sequences  (+261 %)
8 writers:    4131 Sequences  →  17670 Sequences  (+339 %)
9 writers:    4168 Sequences  →  17438 Sequences  (+318 %)
10 writers:   3670 Sequences  →  19709 Sequences  (+437 %)

The patch passes all relevant tests in my environment.
CSemaphore is thoroughly tested in semaphore.cxx and indirectly by various other tests.

On Win64, the size of CSemaphore remains unchanged (8 bytes).

On Win32, the size of CSemaphore changes from 4 to 8 bytes, so the
patch updates the size assertions for structures like PIB.
ghost commented Mar 18, 2021:

CLA assistant check: all CLA requirements met.

@evgenykotkov evgenykotkov changed the title Reimplement CSemaphore using WaitOnAddress() for better contended performance Reimplement CSemaphore using WaitOnAddress() for better performance under contention Mar 18, 2021
@2BitSalute 2BitSalute self-requested a review March 19, 2021 18:10
Allowing other threads to steal the semaphore from the threads that are
woken up introduces a fairness issue and can lead to starvation.

The underlying `WakeByAddressSingle()` primitive by itself allows for a fair
implementation, because it guarantees that the threads are woken up in
FIFO order [1].  However, if we don't block other threads from stealing
the semaphore from the waiters right after they are woken up, those
waiters may keep getting starved.

This behavior is shown by the new regression test, CSemaphoreFairnessTest.

[1] https://docs.microsoft.com/windows/win32/api/synchapi/nf-synchapi-wakebyaddresssingle
Simplify the code by switching to a single loop, instead of using a
nested loop for the OS-level wait.  Ensure that when returning from
the OS-level wait, we try to acquire the semaphore as soon as possible,
without any intermediate actions such as adjusting the remaining time,
as this behaves better under contention.
Within the CSemaphore::_Release() method, use `const INT` for an
immutable local variable instead of a `LONG`, to match the return
type of the CSemaphoreState::CWait().
The new implementation inadvertently removed the check that allowed
skipping the full acquire-release cycle when the semaphore is not
contended.  So let's bring it back.
Within that method, check the condition against a local copy of the
state, as otherwise the state might change between the two parts of
the condition.
The new implementation of the CSemaphore is OS dependent, so it is
located in the bottom half of the file.  Let's keep the CSemaphore's
constructor and destructor together with that implementation in the
file.
Use a slightly more descriptive name `cSpinRemaining` instead of
`cSpinCount`.  That also makes the new name more distinct from
the method's parameter, `cSpin`.
The original code used the latter name, so let's not unnecessarily
change that.
The original code had such explicit casts, and they are also used
everywhere within sync.cxx, so let's not unnecessarily change that.
This changeset reworks the existing comments and adds the missing bits
to match the style and the documentation level that is used in sync.hxx
and sync.cxx.  No functional changes intended.
Properly handle the case where this method is called in a state
where no one is waiting on the semaphore.  This should be a
no-op.  Add a regression test for this behavior,
SyncSemaphoreNoopReleaseAllWaiters().
volatile LONG m_cAvail; // 0 <= m_cAvail <= ( 1 << 31 ) - 1

// Mode 1: waiters
volatile QWORD m_qwState;
Collaborator commented:

It should be possible to keep the state as a LONG in order to continue supporting 32-bit, e.g. ARM.

evgenykotkov (Contributor, author) commented:

Could you please elaborate on the "continue supporting 32-bit" part? Does it refer to the lack of support for certain 64-bit atomics on some platforms? Or is the incompatibility caused by changing the size of the state from 4 to 8 bytes under 32 bits?

The former doesn't look like a problem, at least based on what I could grasp from the code. I see that various parts of ESE rely on the presence of such atomics — for example in the implementation of the CInvasiveConcurrentModSet<> or in LOG_WRITE_BUFFER::ErrLGScheduleWrite().

If it's the latter, then the "continue supporting" part is still somewhat unclear to me. I think the CSemaphore does not participate in the on-disk format and is not a part of the ABI. So I was hoping that this change doesn't do anything beyond increasing the in-memory size of certain structures under 32 bits.

In general, I don't yet see an acceptable approach where we get to keep the state in 32 bits. With WaitOnAddress() we require a value for the semaphore's count to wait on, but we also have to keep a separate record of the number of waiting threads. So unless there is a trick that I haven't thought of, I see two options:

  1. Make those two values 32-bit, forming a 64-bit state
  2. Make those two values 16-bit, forming a 32-bit state

Option 1. is what the patch does.

Option 2. doesn't seem to be practically feasible, because the CSemaphore currently accepts an INT as its semaphore count. Switching that to 16 bits would narrow the actual range of the accepted values, and I think that would constitute an unsafe change that requires validating and updating every usage of the semaphore.
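A minimal sketch of option 1, with hypothetical field and method names (the real CSemaphoreState differs): the two 32-bit values are packed into one 64-bit word so that both can be read and updated together with a single atomic compare-and-swap.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: a 64-bit semaphore state packing the available count
// (low 32 bits) and the waiter count (high 32 bits), so a single CAS can
// atomically transition both, e.g. "one waiter leaves, one count consumed".
struct SemaphoreState
{
    std::atomic<uint64_t> m_qwState{ 0 };

    static int32_t CAvail( uint64_t qw ) { return (int32_t)( qw & 0xFFFFFFFF ); }
    static int32_t CWait( uint64_t qw )  { return (int32_t)( qw >> 32 ); }

    static uint64_t QwState( int32_t cAvail, int32_t cWait )
    {
        return ( (uint64_t)(uint32_t)cWait << 32 ) | (uint32_t)cAvail;
    }

    // Attempt an atomic state transition; fails if another thread raced us.
    bool FChange( uint64_t qwOld, uint64_t qwNew )
    {
        return m_qwState.compare_exchange_strong( qwOld, qwNew );
    }
};
```

This is why the 32-bit builds see the size grow from 4 to 8 bytes: on platforms without 64-bit atomics the same layout would not work, which appears to be the substance of the reviewer's concern.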

@2BitSalute 2BitSalute self-assigned this Jul 6, 2021
2BitSalute (Collaborator) commented:

I thought it would be easier to apply changes internally after the comments are added, but I was mistaken :D


if ( State().FChange(stateCur, stateNew ) )
{
volatile void *pv = State().PAvail();
Collaborator commented:

It's a bit strange to expose this - maybe WakeByAddressAll could be made a member of CSemaphoreState? Same as WaitOnAddress. It seems that CSemaphore::_FOSWait has to improperly know a bit about the underlying type of CAvail in addition to having to call PAvail. On the other hand, it seems improper for CSemaphoreState to be doing the waiting and waking...

evgenykotkov (Contributor, author) commented Jul 7, 2021:

On the other hand, it seems improper for CSemaphoreState to be waiting and waking...

I agree with that: it's the semaphore that should be waiting for its state to change, not the state itself.

For the problem itself, I committed a reworked version of that code in 161b0bcb. The new version (hopefully) breaks the dependency between PAvail() and CAvail() by using a typed pointer for the address that we wait on. We also make CompareAddress point to a value of the same type and use a static_assert for the sizes of the two values.

How does that sound?

Adjust the CSemaphoreState so that it returns a typed pointer to the
semaphore count value.  In _FOSWait(), accept a value of the same type.
Assert that the values in `Address` and `CompareAddress` have the same
size when performing a WaitOnAddress().
Cast to `void *` instead of `PVOID`, to be consistent with other
similar casts around.
As the semaphore count returned by CSemaphoreState::CAvail() and
m_cAvail have originally been signed, let's not unnecessarily switch
to an unsigned type.  Doing so also allows us to remove some type casts.
evgenykotkov (Contributor, author) commented:

@2BitSalute @MsftBrettShirley

Just wanted to check up on the status of this pull request. Is there something I can do to improve it?

I think that the problem is still present in the latest versions of ESE. And since such write-heavy usage patterns involving ESE could be quite common, it would be great if there was a way to move forward with fixing it.

Thanks!
