MPL-v0.5 Release Notes (March 4, 2024)
In this release, we implement automatic parallelism management [1], which helps address the granularity control problem. In short, the impact of this update is as follows:
- The cost of `ForkJoin.par` is reduced by multiple orders of magnitude. Because of the reduced cost, we can now use `ForkJoin.par` much more liberally to write fine-grained parallel code.
- The compiler and run-time system automatically decide when and where to create parallel tasks, guaranteeing low overhead and good scalability, even when the programmer does not control granularity.
This makes it possible to write very fine-grained parallel code with good performance in practice.
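For a sense of what this enables, here is a minimal sketch (illustrative, not from the release itself) of fully fine-grained code: a parallel Fibonacci in which every recursive call is exposed via `ForkJoin.par` and no granularity control appears anywhere.

```sml
(* Deliberately fine-grained: every recursive call is a potential
 * parallel task. Under MPL-v0.5, the automatic parallelism management
 * system decides at run time which calls actually become tasks. *)
fun fib (n: int) : int =
  if n < 2 then n
  else
    let
      val (a, b) = ForkJoin.par (fn () => fib (n - 1), fn () => fib (n - 2))
    in
      a + b
    end
```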
In many other languages, if you write parallel code that is too fine-grained, you risk a slowdown of as much as 10-100x, or worse. This slowdown is due to the overhead of creating and managing parallel tasks, and this overhead can outweigh the benefits of parallelism if you are not careful.
In contrast, MPL-v0.5 ensures that: (1) the overhead of creating and managing parallel tasks remains low, and also (2) scalability is preserved. This is accomplished with a provably efficient scheduling algorithm (described in [1]), which decides when and where to create tasks while the program is executing. The algorithm is provably efficient in the sense that it preserves both work and span up to small constant factors.
Note that, with MPL-v0.5, tuning grain sizes can still improve performance, but the performance gap between tuned and untuned code is much smaller: in our paper [1], we measure a gap of less than 2x on average across a range of benchmarks, with a worst case of approximately 5-6x. In other words, with MPL-v0.5, the penalty for ignoring granularity control is only a small constant factor overhead, much better than the 10-100x penalty you would previously have risked.
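For comparison with the fine-grained version above, here is what manual tuning typically looks like. This is an illustrative sketch; the grain size of 20 is an arbitrary choice, not a recommendation.

```sml
(* Traditional manual granularity control: below a hand-picked grain
 * size, fall back to sequential code. Under MPL-v0.5 this tuning is
 * optional; the cutoff here is illustrative only. *)
val grain = 20

fun fibSeq n =
  if n < 2 then n else fibSeq (n - 1) + fibSeq (n - 2)

fun fibGran (n: int) : int =
  if n <= grain then fibSeq n
  else
    let
      val (a, b) =
        ForkJoin.par (fn () => fibGran (n - 1), fn () => fibGran (n - 2))
    in
      a + b
    end
```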
For benchmarks with no granularity control, we have measured that MPL-v0.5 is significantly faster than MPL-v0.4: up to 50x in some cases. For code that already has manual granularity control, the performance of v0.5 should generally be similar to v0.4. (Any changes here are likely due to differences in whole-program compiler optimizations.)
Acknowledgements + References
The changes in this release were implemented primarily by @shwestrick and @MatthewFluet. The design of the automatic parallelism management system is due to @shwestrick, @MatthewFluet, @mikerainey, and @umutacar, and is described in more detail in our paper [1]. Thank you to @typerSniper for help integrating with the entanglement management system! We also made a few bugfixes along the way. See "Summary of PRs and commits" below for more details.
[1] Automatic Parallelism Management. Sam Westrick, Matthew Fluet, Mike Rainey, and Umut A. Acar. POPL 2024.
Example: N-Queens with and without granularity control
As an example, here is a performance comparison of an N-queens (Wikipedia) benchmark, with and without granularity control, between MPL versions 0.4 and 0.5. (Full code available here. These experiments were run with N=13.)
The classic N-Queens algorithm is essentially an irregular parallel tree traversal, e.g., as illustrated here. It's not immediately obvious how best to do manual granularity control for this algorithm. One idea would be to switch to a sequential tree traversal after getting a few levels deep in the tree. As long as N is big enough, this should work well, because (a) the tree has large fan-out, exposing lots of parallelism at the top of the tree, and (b) the top of the tree is mostly regular... it only becomes irregular as we get deeper into the search. This granularity control strategy is implemented here, where we switch to sequential at depth 3.
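To make the depth-cutoff strategy concrete, here is a self-contained sketch in the spirit of the linked benchmark (illustrative code, not the benchmark's actual source):

```sml
val n = 13      (* board size, as in the experiments (N=13) *)
val cutoff = 3  (* switch from parallel to sequential at this depth *)

(* `placed` lists the columns of queens placed so far, most recent row
 * first. `safe` checks a candidate column against every placed queen:
 * conflict if same column, or same diagonal (column distance equals
 * row distance). *)
fun safe (col: int, placed: int list) : bool =
  let
    fun ok (_, []) = true
      | ok (dist, c :: rest) =
          col <> c andalso Int.abs (col - c) <> dist
          andalso ok (dist + 1, rest)
  in
    ok (1, placed)
  end

fun count (depth: int, placed: int list) : int =
  if depth = n then 1
  else
    let
      val cols =
        List.filter (fn c => safe (c, placed)) (List.tabulate (n, fn c => c))
      fun child c = count (depth + 1, c :: placed)
      (* fork the children as parallel tasks, divide-and-conquer style *)
      fun parSum [] = 0
        | parSum [c] = child c
        | parSum cs =
            let
              val k = List.length cs div 2
              val (a, b) = ForkJoin.par (fn () => parSum (List.take (cs, k)),
                                         fn () => parSum (List.drop (cs, k)))
            in
              a + b
            end
    in
      if depth >= cutoff
      then List.foldl (fn (c, acc) => acc + child c) 0 cols  (* sequential *)
      else parSum cols                                       (* parallel *)
    end

val numSolutions = count (0, [])
```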
Results for this benchmark are shown below. A few observations:
- With manual granularity control, using MPL-v0.5 instead of MPL-v0.4 doesn't affect performance much. This is good: it means that we preserve the performance of code that has already been manually tuned.
- Without granularity control, the performance of v0.5 is significantly better: up to 50x faster than v0.4. (MPL-v0.4 requires manual granularity control to perform well, but MPL-v0.5 relaxes this requirement.)
- In MPL-v0.5, the overhead of no granularity control is only approximately 2.5x, and this overhead is consistent across processor counts. In other words: the scalability of MPL-v0.5 is unaffected by the lack of granularity control.
+-------+-----------------------------------+-----------------------------------+-------------------------------+
|       |                                   |                                   |  OVERHEAD OF NO GRAN CONTROL  |
|       |        MANUAL GRAN CONTROL        |          NO GRAN CONTROL          | (ratio: no_gran/manual_gran)  |
+-------+-----------------------------------+-----------------------------------+-------------------------------+
|       |                        ratio      |                        ratio      |                               |
| procs |   v0.4     v0.5    (v0.4 / v0.5)  |   v0.4     v0.5    (v0.4 / v0.5)  |    v0.4        v0.5           |
+-------+-----------------------------------+-----------------------------------+-------------------------------+
|   1   |  1.64s    1.56s        1.05x      |  13.8s    3.75s        3.7x       |    8.4x        2.4x           |
|  10   | 0.178s   0.169s        1.05x      |  11.6s   0.452s        26x        |    65x         2.7x           |
|  20   | 0.097s   0.097s        1.0x       |  11.8s   0.240s        46x        |   121x         2.5x           |
|  30   | 0.067s   0.067s        1.0x       |  2.96s   0.163s        18x        |    44x         2.4x           |
|  40   | 0.051s   0.050s        1.02x      |  3.82s   0.125s        31x        |    75x         2.5x           |
|  50   | 0.041s   0.040s        1.03x      | 0.888s   0.099s        9.0x       |    22x         2.5x           |
|  60   | 0.035s   0.034s        1.03x      |  1.55s   0.083s        19x        |    44x         2.4x           |
|  72   | 0.031s   0.029s        1.07x      | 0.997s   0.071s        14x        |    32x         2.4x           |
+-------+-----------------------------------+-----------------------------------+-------------------------------+
New controls
In MPL-v0.5, there are now three new `@mpl ... --` runtime args. These control the automatic parallelism management system, which is based on a heartbeat scheduling strategy. Please see the paper [1] for more details!
- `heartbeat-us <N>`, which controls how many microseconds should pass between each heartbeat. (Default 500.)
- `heartbeat-tokens <C>`, which controls how many tokens each active thread should receive on each heartbeat. (Default 30.)
- `heartbeat-relayer-threshold <T>`, which controls the heartbeat delivery strategy in the run-time system. (Default 16.) When executing on `P` worker threads (`@mpl procs P --`), if `P > T`, then MPL will designate one worker thread as the heartbeat relayer, whose only job is to deliver heartbeats at a regular rate to all other workers. When `P <= T`, the run-time system instead uses a regular `SIGALRM` signal from the OS to deliver heartbeats.
Tokens are spent to perform promotions, where each promotion creates one task. With the default settings (N=500us, C=30), this allows for creating one task approximately every 500/30 ≈ 17 microseconds on average. Increasing C results in more parallelism but potentially higher overhead; increasing N results in less parallelism and potentially less overhead.
The default settings should work well across a range of hardware and modern OSes.
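For example, a run that sets all three knobs explicitly to their default values might look as follows (the program name is illustrative):

```
$ ./nqueens @mpl procs 8 heartbeat-us 500 heartbeat-tokens 30 heartbeat-relayer-threshold 16 --
```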
Summary of PRs and commits
- (#180) the main update, with changes to the run-time system and compiler to support the PCall calling convention and dynamic promotions, as described in the automatic parallelism management paper [1]. This update also includes:
  - integration with entanglement detection and management (approximately the commit range c0b1508...249ec8f)
- (d1646cf) bugfix for a long-standing LGC bug (#156)
- (6a1da46...57e4027) fix up the tracing runtime configuration. Tracing can be enabled with the compile-time options `-trace true -trace-runtime true`, and can then be used to output an execution trace compatible with Perfetto. See `mltrace/NOTES.md` for more info. (A usage sketch follows this list.)
- (8d32f9b, 317caea, 27116ab, 67f419d, 249ec8f) other small bugfixes
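As a hedged sketch of enabling tracing (the program name is illustrative; the flags are the ones named above, and `mltrace/NOTES.md` documents how to collect the trace output):

```
$ mpl -trace true -trace-runtime true nqueens.mlb   # compile with tracing enabled
```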
Full Changelog: v0.4...v0.5