
Reimplement CSemaphore using WaitOnAddress() for better performance under contention #13

Open · wants to merge 21 commits into main
Conversation

evgenykotkov (Contributor) commented:

Some time ago I found out that ESE's write throughput is severely degraded
as the number of concurrent writers grows.

The attached jet_concurrent_write_stall.zip example illustrates this.
Detailed benchmarks are below, but in short, every case with more than 5
concurrent writers is even slower than a single writer.

Apparently, the throughput is limited by the current implementation of
CSemaphore, which uses a spinlock that falls back to a slow kernel semaphore
created with `CreateSemaphore()`. When contention is high and cannot be
alleviated by the spinlock, this approach fails to deliver adequate
throughput. CSemaphore is a building block for other primitives such as
CCriticalSection, which are in turn used in performance-critical parts
such as the log buffer writer.

This PR reimplements CSemaphore using WaitOnAddress() / WakeByAddress()
as the underlying primitive.

The new implementation delivers up to a 20x improvement in a synthetic
benchmark. Since CSemaphore is used as a building block for other
synchronization primitives, such as CCriticalSection, this translates into
up to a ~5.4x throughput improvement in a benchmark with multiple concurrent
db writers.

The benchmarks below were conducted in an environment with an i9-9900K
CPU and an NVMe SSD.

Synthetic benchmark based on the adjusted semaphoreperf.cpp test:

 1 thread:    93850 KOps/Thread  →  93950 KOps/Thread
 2 threads:   20445 KOps/Thread  →  30165 KOps/Thread  (+48 %)
 3 threads:    6083 KOps/Thread  →  10897 KOps/Thread  (+79 %)
 8 threads:     102 KOps/Thread  →   2058 KOps/Thread  (+1917 %)
16 threads:      45 KOps/Thread  →    719 KOps/Thread  (+1498 %)

Real benchmark with multiple concurrent db writers:

(The numbers represent the number of complete "write sequences"
and are meaningful only as a relative measure of throughput)

1 writer:     8443 Sequences  →   8562 Sequences
2 writers:   14934 Sequences  →  15331 Sequences
3 writers:   17434 Sequences  →  17386 Sequences
4 writers:   16583 Sequences  →  17997 Sequences
5 writers:   15924 Sequences  →  17431 Sequences
6 writers:    8437 Sequences  →  17223 Sequences  (+104 %)
7 writers:    5020 Sequences  →  18145 Sequences  (+261 %)
8 writers:    4131 Sequences  →  17670 Sequences  (+339 %)
9 writers:    4168 Sequences  →  17438 Sequences  (+318 %)
10 writers:   3670 Sequences  →  19709 Sequences  (+437 %)

The patch passes all relevant tests in my environment.
CSemaphore is thoroughly tested in semaphore.cxx and indirectly by various other tests.

On Win64, the size of CSemaphore remains unchanged (8 bytes).

On Win32, the size of CSemaphore changes from 4 to 8 bytes, so the
patch updates the size assertions for structures like PIB.
ghost commented Mar 18, 2021:

CLA assistant check: all CLA requirements met.

@evgenykotkov evgenykotkov changed the title Reimplement CSemaphore using WaitOnAddress() for better contended performance Reimplement CSemaphore using WaitOnAddress() for better performance under contention Mar 18, 2021
@2BitSalute 2BitSalute self-requested a review March 19, 2021 18:10
Allowing other threads to steal the semaphore from the threads that are
woken up introduces a fairness issue and can lead to starvation.

The underlying `WakeByAddressSingle()` primitive by itself allows for a fair
implementation, because it guarantees that the threads are woken up in
FIFO order [1].  However, if we don't block other threads from stealing
the semaphore from the waiters right after they are woken up, those
waiters may keep getting starved.

This behavior is shown by the new regression test, CSemaphoreFairnessTest.

[1] https://docs.microsoft.com/windows/win32/api/synchapi/nf-synchapi-wakebyaddresssingle
Simplify the code by switching to a single loop, instead of using a
nested loop for the OS-level wait.  Ensure that when returning from
the OS-level wait, we try to acquire the semaphore as soon as possible,
without any intermediate actions such as adjusting the remaining time,
as this behaves better under contention.
Within the CSemaphore::_Release() method, use `const INT` for an
immutable local variable instead of a `LONG`, to match the return
type of the CSemaphoreState::CWait().
The new implementation inadvertently removed the check that allowed
skipping the full acquire-release cycle when the semaphore is not
contended.  So let's bring it back.
Within that method, check the condition against a local copy of the
state, as otherwise the state might change between the two parts of
the condition.
The new implementation of the CSemaphore is OS dependent, so it is
located in the bottom half of the file.  Let's keep the CSemaphore's
constructor and destructor together with that implementation in the
file.
Use a slightly more descriptive name `cSpinRemaining` instead of
`cSpinCount`.  That also makes the new name more distinct from
the method's parameter, `cSpin`.
The original code used the latter name, so let's not unnecessarily
change that.
The original code had such explicit casts, and they are also used
everywhere within sync.cxx, so let's not unnecessarily change that.
This changeset reworks the existing comments and adds the missing bits
to match the style and the documentation level that is used in sync.hxx
and sync.cxx.  No functional changes intended.
Properly handle the case where this method is called in a state
where no one is waiting on the semaphore.  This should be a
no-op.  Add a regression test for this behavior,
SyncSemaphoreNoopReleaseAllWaiters().
volatile LONG m_cAvail; // 0 <= m_cAvail <= ( 1 << 31 ) - 1

// Mode 1: waiters
volatile QWORD m_qwState;
Collaborator commented:

It should be possible to keep the state as a LONG in order to continue supporting 32-bit, e.g. ARM.

evgenykotkov (Contributor, author) commented:

Could you please elaborate on the "continue supporting 32-bit" part? Does it refer to the lack of support for certain 64-bit atomics on some platforms? Or is the incompatibility caused by changing the size of the state from 4 to 8 bytes under 32 bits?

The former doesn't look like a problem, at least based on what I could grasp from the code. I see that various parts of ESE rely on the presence of such atomics — for example in the implementation of the CInvasiveConcurrentModSet<> or in LOG_WRITE_BUFFER::ErrLGScheduleWrite().

If it's the latter, then the "continue supporting" part is still somewhat unclear to me. I think the CSemaphore does not participate in the on-disk format and is not a part of the ABI. So I was hoping that this change doesn't do anything beyond increasing the in-memory size of certain structures under 32 bits.

In general, I don't yet see an acceptable approach where we get to keep the state in 32 bits. With WaitOnAddress() we require a value for the semaphore's count to wait on, but we also have to keep a separate record of the number of waiting threads. So unless there is a trick that I haven't thought of, I see two options:

  1. Make those two values 32-bit, forming a 64-bit state
  2. Make those two values 16-bit, forming a 32-bit state

Option 1. is what the patch does.

Option 2. doesn't seem to be practically feasible, because the CSemaphore currently accepts an INT as its semaphore count. Switching that to 16 bits would narrow the actual range of the accepted values, and I think that would constitute an unsafe change that requires validating and updating every usage of the semaphore.
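A minimal sketch of option 1, with hypothetical field and method names (the real CSemaphoreState differs): the two 32-bit values are packed into one 64-bit word so that both can be read and updated together with a single atomic compare-and-swap.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: a 64-bit semaphore state packing the available count
// (low 32 bits) and the waiter count (high 32 bits), so a single CAS can
// atomically transition both, e.g. "one waiter leaves, one count consumed".
struct SemaphoreState
{
    std::atomic<uint64_t> m_qwState{ 0 };

    static int32_t CAvail( uint64_t qw ) { return (int32_t)( qw & 0xFFFFFFFF ); }
    static int32_t CWait( uint64_t qw )  { return (int32_t)( qw >> 32 ); }

    static uint64_t QwState( int32_t cAvail, int32_t cWait )
    {
        return ( (uint64_t)(uint32_t)cWait << 32 ) | (uint32_t)cAvail;
    }

    // Attempt an atomic state transition; fails if another thread raced us.
    bool FChange( uint64_t qwOld, uint64_t qwNew )
    {
        return m_qwState.compare_exchange_strong( qwOld, qwNew );
    }
};
```

This is why the 32-bit builds see the size grow from 4 to 8 bytes: on platforms without 64-bit atomics the same layout would not work, which appears to be the substance of the reviewer's concern.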

@2BitSalute 2BitSalute self-assigned this Jul 6, 2021
2BitSalute (Collaborator) commented:

I thought it would be easier to apply changes internally after the comments are added, but I was mistaken :D


if ( State().FChange(stateCur, stateNew ) )
{
volatile void *pv = State().PAvail();
Collaborator commented:

It's a bit strange to expose this - maybe WakeByAddressAll could be made a member of CSemaphoreState? Same as WaitOnAddress. It seems that CSemaphore::_FOSWait has to improperly know a bit about the underlying type of CAvail in addition to having to call PAvail. On the other hand, it seems improper for CSemaphoreState to be doing the waiting and waking...

evgenykotkov (Contributor, author) commented Jul 7, 2021:

On the other hand, it seems improper for CSemaphoreState to be waiting and waking...

I agree with that: it's the semaphore that should be waiting for its state to change, not the state itself.

For the problem itself, I committed a reworked version of that code in 161b0bcb. The new version (hopefully) breaks the dependency between PAvail() and CAvail() by using a typed pointer for the address that we wait on. We also make CompareAddress point to a value of the same type and use a static_assert for the sizes of the two values.

How does that sound?

Adjust the CSemaphoreState so that it returns a typed pointer to the
semaphore count value.  In _FOSWait(), accept a value of the same type.
Assert that the values in `Address` and `CompareAddress` have the same
size when performing a WaitOnAddress().
Cast to `void *` instead of `PVOID`, to be consistent with other
similar casts around.
As the semaphore count returned by CSemaphoreState::CAvail() and
m_cAvail have originally been signed, let's not unnecessarily switch
to an unsigned type.  Doing so also allows us to remove some type casts.
evgenykotkov (Contributor, author) commented:

@2BitSalute @MsftBrettShirley

Just wanted to check up on the status of this pull request. Is there something I can do to improve it?

I think that the problem is still present in the latest versions of ESE. And since such write-heavy usage patterns involving ESE could be quite common, it would be great if there was a way to move forward with fixing it.

Thanks!
