CP-52709: use timeslices shorter than 50ms #6177

edwintorok · 2024-12-12T22:40:17Z

Changing the default OCaml thread switch timeslice from 50ms

The default OCaml 4.x timeslice for switching between threads is 50ms: if more than one OCaml thread is active, each is allowed to run for up to 50ms, and then (at one of various safepoints) the runtime can switch to another runnable thread.
When the runtime lock is released (while C code or syscalls run), another OCaml thread, if any is runnable, is allowed to run immediately.

However, 50ms is too long: it adds large latencies to the handling of API calls.

On the other hand, if the timeslice is too short we waste CPU time:

overhead of Thread.yield and the system calls it makes, and the cost of switching threads at the OS level
potentially higher L1/L2 cache misses if we switch on the same CPU between multiple OCaml threads
potentially losing branch predictor history
potentially higher L3 cache misses (but on a hypervisor with VMs running, L3 will mostly be taken up by the VMs anyway, so we can only rely on L1/L2 staying with us)

A microbenchmark has shown that timeslices as small as 0.5ms might strike an optimal balance between latency and overhead: lower values lose performance due to increased overhead, and higher values lose performance due to increased latency.

(The microbenchmark measures the number of CPU cycles spent simulating an API call with various working set sizes and timeslice settings.)

This is all hardware-dependent though, and a future PR will introduce an autotune service that measures the yield overhead and the L1/L2 cache refill overhead, and calculates an optimal timeslice for that particular hardware/Xen/kernel combination.
(While we're at it, we can also tweak the minor heap size to match about half of the CPU's L2 cache.)

Timeslice change mechanism

Initially I used Unix.setitimer with virtual timers, to switch a thread only when it has been actively using the CPU for too long. However, that relies on delivering a signal to the process, and XAPI is very bad at handling signals.
In fact XAPI is not allowed to receive any signals, because it doesn't handle EINTR well (a typical problem that sometimes affects C programs too). Although this is a well-understood problem (described in the OCaml Unix book), only some areas of XAPI make an effort to handle it; others just assert that they never receive a signal. Fixing that would require changes in all of XAPI (and its dependencies).

So instead I don't use signals at all, but rely on Statmemprof to trigger a hook that is executed "periodically": not based purely on time, but on allocation activity (i.e. at places where the GC could run). The hook checks the time elapsed since it was last called, and if that is too long it calls Thread.yield.
Yield is smart enough to be a no-op if there aren't any other runnable OCaml threads.
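
As a rough illustration of the mechanism described above, here is a minimal sketch and not the code from this PR. The names `timeslice`, `last_yield`, `periodic_hook`, `tracker` and `start` are hypothetical, the sampling rate is an arbitrary placeholder, and the sketch measures time since the last yield, which is an assumption. The clock and the yield function are passed in as arguments so the decision logic can be exercised without the threads library.

```ocaml
(* Sketch only: all names here are hypothetical, not the PR's code. *)

(* Timeslice in seconds; 5ms is the default discussed in this PR. *)
let timeslice = ref 0.005

(* Timestamp (in seconds) of the last yield, or of start-up. *)
let last_yield = ref 0.0

(* Yield if at least one timeslice has elapsed since the last yield.
   [now] returns the current time in seconds; [yield] would be
   Thread.yield in a threaded program. *)
let periodic_hook ~now ~yield () =
  let t = now () in
  if t -. !last_yield >= !timeslice then begin
    last_yield := t ;
    yield ()
  end

(* A Memprof tracker that runs [hook] on every sampled allocation and
   returns None, so no block is actually tracked. *)
let tracker hook =
  let on_alloc (_ : Gc.Memprof.allocation) = hook () ; None in
  {Gc.Memprof.null_tracker with alloc_minor= on_alloc; alloc_major= on_alloc}

(* Install the hook. The sampling rate (samples per allocated word) is
   an arbitrary placeholder. The result is ignored because the return
   type of Gc.Memprof.start differs between OCaml versions. *)
let start ~now ~yield () =
  last_yield := now () ;
  ignore
    (Gc.Memprof.start ~sampling_rate:1e-4 (tracker (periodic_hook ~now ~yield)))
```

In a threaded program one would call something like `start ~now:Unix.gettimeofday ~yield:Thread.yield ()` once at startup, which needs the unix and threads libraries. A monotonic clock is a better fit than wall-clock time here.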

Yielding isn't always beneficial for latency though: if we are holding a lock, yielding just increases latency for everyone waiting on that lock.
So a mechanism is introduced to notify the periodic function when any highly contended lock (e.g. the XAPI DB lock) is held, and the yield is skipped in that case.
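
One way such a notification could look is sketched below, assuming a simple process-wide counter of held high-priority locks. The names `held`, `with_high_priority_lock` and `may_yield` are hypothetical and the PR's actual interface may differ. The wrapped function stands in for "take the lock, run the critical section, release the lock".

```ocaml
(* Sketch only: hypothetical names, not the PR's interface. *)

(* Number of highly contended locks currently held.
   Atomic is available from OCaml 4.12. *)
let held = Atomic.make 0

(* Run [f] while advertising that a high-priority lock is held.
   Fun.protect guarantees the counter is decremented even if [f]
   raises. *)
let with_high_priority_lock f =
  Atomic.incr held ;
  Fun.protect ~finally:(fun () -> Atomic.decr held) f

(* The periodic function would consult this before yielding. *)
let may_yield () = Atomic.get held = 0
```

A counter rather than a boolean lets nested or multiple high-priority locks compose. A per-thread flag would be more precise than this global count, at the cost of more bookkeeping.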

Plotting code

This PR only includes a very simplified version of the microbenchmark; a separate PR will introduce the full cache plotting code (which is useful for development/troubleshooting purposes but won't be needed at runtime).

Default timeslice value

Set to 5ms for now, just a bit above 4ms = 1/HZ in our Dom0 kernel; the autotuner from a future PR can change this to a more appropriate value.
(The autotuner needs more validation on a wider range of hardware.)

Results

The cache measurements need to be repeated on a wider variety of hardware, but the timeslice changes here have already proven useful in reducing XAPI DB lock hold times (together with other optimizations).

ocaml/libs/timeslice/timeslice.ml

ocaml/xapi-idl/lib/xcp_service.ml

lindig · 2024-12-16T09:12:24Z

ocaml/xapi-idl/lib/xcp_service.ml

@@ -236,6 +255,11 @@ let common_options =
    , (fun () -> !config_dir)
    , "Location of directory containing configuration file fragments"
    )
+  ; ( "timeslice"
+    , Arg.Set_float timeslice
+    , (fun () -> Printf.sprintf "%.3f" !timeslice)


Not sure seconds is the best unit here when we mostly talk about milliseconds. However, this has enough resolution down to 1ms.

edwintorok · 2025-01-02

I think with the statmemprof approach we could actually go below 1ms too (for testing purposes).
I'll need to run some tests on some of our newer hardware to find the smallest timeslice that would work, so I may need to increase this to %.4f or %.5f.
But in practice I think it'd be good to avoid switching more often than every 1ms, just to avoid introducing too much overhead.
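
To make the resolution point concrete, here is a small standalone sketch. It assumes a plain `-timeslice` command-line flag parsed with the standard Arg module, whereas the real option lives in xcp_service's `common_options`. The names `specs`, `parse` and `show` are hypothetical.

```ocaml
(* Sketch only: a standalone stand-in for the real option table. *)

(* Timeslice in seconds, as in the diff above. *)
let timeslice = ref 0.005

(* A plain Arg spec list; the flag name is an assumption. *)
let specs = [("-timeslice", Arg.Set_float timeslice, "timeslice in seconds")]

(* Parse an explicit argv so the example does not depend on Sys.argv. *)
let parse argv =
  Arg.parse_argv ~current:(ref 0) argv specs (fun _ -> ()) "usage: prog [-timeslice SECONDS]"

(* Render the current value with a given number of decimals. *)
let show ~decimals = Printf.sprintf "%.*f" decimals !timeslice

let () =
  parse [|"prog"; "-timeslice"; "0.00025"|] ;
  (* With 3 decimals a 0.25ms timeslice prints as 0.000;
     5 decimals are needed to show it as 0.00025. *)
  Printf.printf "%s %s\n" (show ~decimals:3) (show ~decimals:5)
```

Three decimals are enough down to 1ms, which is why printing anything shorter than that needs `%.4f` or `%.5f`.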

lindig · 2024-12-17T09:03:08Z

So instead I don't use signals at all, but rely on Statmemprof to trigger a hook to be executed "periodically", but not based purely on time, but on allocation activity (i.e. at places the GC could run). The hook checks the elapsed time since the last time it got called, and if too much then calls Thread.yield.

This seems to be crucial as the mechanism for inserting additional yields. Could you elaborate on how Statmemprof works such that a thread periodically executes a hook that may yield? As this is not xapi-specific, is there a way to release this as an opam package to engage the OCaml community?

Uses Gc.Memprof to run a hook function periodically. This checks whether we are holding any locks, and if not and sufficient time has elapsed since the last, then we yield. POSIX timers wouldn't work here, because they deliver signals, and there are too many places in XAPI that don't handle EINTR properly. Signed-off-by: Edwin Török <[email protected]>

edwintorok · 2025-01-02T11:32:23Z

The official documentation is at https://ocaml.org/manual/5.2/api/Gc.Memprof.html, and an OPAM package that uses it for something other than profiling is at
https://guillaume.munch.name/software/ocaml/memprof-limits/index.html; https://guillaume.munch.name/software/ocaml/memprof-limits/statistical.html has more details on how these hooks can be used.

I'm not using it for either profiling or memory allocation limits, but for executing code ~periodically.

The way the Memprof hooks work is that the hooks are run roughly every N allocated words, since the sampling is statistical (the original goal is to allow profiling memory usage, but other uses are possible, such as limiting memory usage or, as in our case, running some code).

You have to be a bit careful about what code you run in the hook, because it runs asynchronously to other code. For example, you don't want to raise any exceptions: that could break code that doesn't expect exceptions to be raised in certain places, and it can make try/finally code less robust.

In our case I only run a Thread.yield there, which stops running the current thread (the thread we switch to might raise exceptions, but that is fine, they will be raised in the context of that thread as usual). And we have to trust that none of the Mtime functions raise exceptions.

Perhaps to be safe I should catch and ignore all exceptions (without logging them, because logging would run so much code that it wouldn't be wise to call it from this context).
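
A minimal sketch of that idea, assuming the hook has type `unit -> unit`; the wrapper name `ignore_exceptions` is hypothetical:

```ocaml
(* Sketch only: wrap a hook so that no exception can escape from it.
   Nothing is logged on purpose, since logging would run a lot of code
   in this context. *)
let ignore_exceptions hook () = try hook () with _ -> ()
```

Note that a catch-all handler like this also swallows exceptions such as Out_of_memory and Stack_overflow, so it is a trade-off rather than a clear win.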

We could release this as a package on OPAM, but it is only relevant for OCaml 4; I think it will be obsolete for OCaml 5.

edwintorok force-pushed the pr/CP-52709 branch from b24f60b to 572559c Compare December 12, 2024 22:43

edwintorok mentioned this pull request Dec 12, 2024

CP-49141: mark the DB lock as high priority #6180

Draft

robhoes reviewed Dec 13, 2024

ocaml/libs/timeslice/timeslice.ml (resolved)

robhoes reviewed Dec 13, 2024

ocaml/xapi-idl/lib/xcp_service.ml (outdated, resolved)

lindig approved these changes Dec 16, 2024

edwintorok force-pushed the pr/CP-52709 branch from 572559c to 969c3a4 Compare January 2, 2025 11:33

edwintorok added 4 commits January 2, 2025 11:43

CP-52709: add timeslice configuration to all services

7e42f49

And apply on startup. Signed-off-by: Edwin Török <[email protected]>

CP-52709: add simple measurement code

3ad905e

Signed-off-by: Edwin Török <[email protected]>

CP-52709: recommended measurement

38e1ad8

Signed-off-by: Edwin Török <[email protected]>

CP-52709: Enable timeslice setting during unit tests by default

93f85be

Signed-off-by: Edwin Török <[email protected]>

edwintorok force-pushed the pr/CP-52709 branch from 969c3a4 to 93f85be Compare January 2, 2025 11:43

mg12 approved these changes Jan 10, 2025

edwintorok merged commit 9c5c8dd into xapi-project:feature/perf Jan 13, 2025
15 checks passed
