From ea61e26ff330e720b011a85a2f4b9040e54da870 Mon Sep 17 00:00:00 2001 From: Benjamin Bolm <74359358+bennibolm@users.noreply.github.com> Date: Mon, 5 Feb 2024 11:50:25 +0100 Subject: [PATCH] Add section about false sharing problems to documentation (#1819) * Add section to docs about false sharing * Fix typos * Fix typo * Implement suggestions --- docs/src/performance.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/docs/src/performance.md b/docs/src/performance.md index df66f451b79..82d7f501f63 100644 --- a/docs/src/performance.md +++ b/docs/src/performance.md @@ -267,3 +267,14 @@ requires. It can thus be seen as a proxy for "energy used" and, as an extension, timing result, you need to set the analysis interval such that the `AnalysisCallback` is invoked at least once during the course of the simulation and discard the first PID value. + +## Performance issues with multi-threaded reductions +[False sharing](https://en.wikipedia.org/wiki/False_sharing) is a known performance issue +for systems with distributed caches. It also occurred for the implementation of a thread +parallel bounds checking routine for the subcell IDP limiting +in [PR #1736](https://github.com/trixi-framework/Trixi.jl/pull/1736). +After some [testing and discussion](https://github.com/trixi-framework/Trixi.jl/pull/1736#discussion_r1423881895), +it turned out that initializing a vector of length `n * Threads.nthreads()` and only using every +n-th entry instead of a vector of length `Threads.nthreads()` fixes the problem. +Since there are no processors with caches over 128B, we use `n = 128B / size(uEltype)`. +Now, the bounds checking routine of the IDP limiting scales as hoped.