
Releases: MPLLang/mpl

v0.5


MPL-v0.5 Release Notes (March 4, 2024)

In this release, we implement automatic parallelism management [1], which helps address the granularity control problem. In short, the impact of this update is as follows:

  • The cost of ForkJoin.par is reduced by multiple orders of magnitude. Because of the reduced cost, we can now use ForkJoin.par much more liberally to write fine-grained parallel code.
  • The compiler and run-time system automatically decide when and where to create parallel tasks, guaranteeing low overhead and good scalability, even when the programmer does not control granularity.

This makes it possible to write very fine-grained parallel code with good performance in practice.
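To give a concrete sense of what this enables, here is a minimal sketch of fully fine-grained code written with ForkJoin.par and no granularity control at all. (This is just an illustrative example, not one of our benchmarks.)

    (* Naive parallel Fibonacci: every recursive call forks two tasks.
       With MPL-v0.5, even this untuned code runs with low overhead. *)
    fun fib n =
      if n < 2 then n
      else
        let
          val (a, b) = ForkJoin.par (fn _ => fib (n - 1), fn _ => fib (n - 2))
        in
          a + b
        end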

In many other languages, if you write parallel code that is too fine-grained, you risk a slowdown of as much as 10-100x, or worse. This slowdown is due to the overhead of creating and managing parallel tasks, and this overhead can outweigh the benefits of parallelism if you are not careful.

In contrast, MPL-v0.5 ensures that: (1) the overhead of creating and managing parallel tasks remains low, and also (2) scalability is preserved. This is accomplished with a provably efficient scheduling algorithm (described in [1]), which decides when and where to create tasks while the program is executing. The algorithm is provably efficient in the sense that it preserves both work and span up to small constant factors.

Note that, with MPL-v0.5, tuning grain sizes can still improve performance, but the performance gap between tuned and untuned code is much smaller: in our paper [1], we measure a gap of less than 2x on average across a range of benchmarks, with a worst case of approximately 5-6x. In other words, with MPL-v0.5, the performance penalty for ignoring granularity control is only a small constant factor. This is much better than the 10-100x penalty you would previously have risked by ignoring granularity control.
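For comparison, manual granularity control typically means adding a sequential cutoff by hand, as in the following sketch of the same function. (The threshold 20 here is an arbitrary value chosen for illustration.)

    (* Hand-tuned version: below the cutoff, fall back to sequential code. *)
    fun fibSeq n = if n < 2 then n else fibSeq (n - 1) + fibSeq (n - 2)

    fun fib n =
      if n <= 20 then fibSeq n  (* hand-picked grain size *)
      else
        let
          val (a, b) = ForkJoin.par (fn _ => fib (n - 1), fn _ => fib (n - 2))
        in
          a + b
        end

With MPL-v0.5, the untuned version is at most a small constant factor slower than a tuned version like this one in our measurements.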

For benchmarks with no granularity control, we have measured that MPL-v0.5 is significantly faster than MPL-v0.4: up to 50x in some cases. For code that already has manual granularity control, the performance of v0.5 should generally be similar to v0.4. (Any changes here are likely due to differences in whole-program compiler optimizations.)

Acknowledgements + References

The changes in this release were implemented primarily by @shwestrick and @MatthewFluet. The design of the automatic parallelism management system is due to @shwestrick, @MatthewFluet, @mikerainey, and @umutacar, and is described in more detail in our paper [1]. Thank you to @typerSniper for help integrating with the entanglement management system! We also made a few bugfixes along the way. See "Summary of PRs and commits" below for more details.

[1] Automatic Parallelism Management. Sam Westrick, Matthew Fluet, Mike Rainey, and Umut A. Acar. POPL 2024.

Example: N-Queens with and without granularity control

As an example, here is a performance comparison of an N-queens (wikipedia link) benchmark, with and without granularity control, between MPL versions 0.4 and 0.5. (Full code available here. These experiments were run with N=13.)

The classic N-Queens algorithm is essentially an irregular parallel tree traversal, e.g., as illustrated here. It's not immediately obvious how best to do manual granularity control for this algorithm. One idea would be to switch to a sequential tree traversal after getting a few levels deep in the tree. As long as N is big enough, this should work well, because (a) the tree has large fan-out, exposing lots of parallelism at the top of the tree, and (b) the top of the tree is mostly regular... it only becomes irregular as we get deeper into the search. This granularity control strategy is implemented here, where we switch to sequential at depth 3.
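As a rough illustration of this strategy (not the actual benchmark code; isComplete and children are placeholders for the real board logic), the traversal might look like the following:

    (* Parallel tree search with a depth-based cutoff: fork tasks near the
       root of the tree, then switch to a sequential traversal at depth 3. *)
    fun countSeq board =
      if isComplete board then 1
      else List.foldl (fn (b, acc) => acc + countSeq b) 0 (children board)

    fun countPar depth board =
      if depth >= 3 then countSeq board
      else if isComplete board then 1
      else
        let
          (* parallel sum over the children via binary fork-join *)
          fun sum [] = 0
            | sum [b] = countPar (depth + 1) b
            | sum bs =
                let
                  val half = List.length bs div 2
                  val (x, y) =
                    ForkJoin.par (fn _ => sum (List.take (bs, half)),
                                  fn _ => sum (List.drop (bs, half)))
                in
                  x + y
                end
        in
          sum (children board)
        end

Deleting the depth test (i.e., always taking the parallel branch) roughly corresponds to the "no granularity control" version used in the comparison below.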

Results for this benchmark are shown below. A few observations:

  • With manual granularity control, using MPL-v0.5 instead of MPL-v0.4 doesn't affect performance much. This is good: it means that we preserve the performance of code that has already been manually tuned.
  • Without granularity control, the performance of v0.5 is significantly better: up to 50x faster than v0.4. (MPL-v0.4 requires manual granularity control to perform well, but MPL-v0.5 relaxes this requirement.)
  • In MPL-v0.5, the overhead of no granularity control is only approximately 2.5x, and this is consistent across processor counts. In other words, scalability is preserved even without granularity control.
+-------+-----------------------------------+-----------------------------------+-------------------------------+
|       |                                   |                                   |  OVERHEAD OF NO GRAN CONTROL  |
|       |        MANUAL GRAN CONTROL        |          NO GRAN CONTROL          |  (ratio: no_gran/manual_gran) |
+-------+-----------------------------------+-----------------------------------+-------------------------------+
|       |                     ratio         |                     ratio         |                               |
| procs |   v0.4     v0.5     (v0.4 / v0.5) |   v0.4     v0.5     (v0.4 / v0.5) |    v0.4    v0.5               |
+-------+-----------------------------------+-----------------------------------+-------------------------------+
|     1 |   1.64s    1.56s    1.05x         |   13.8s    3.75s    3.7x          |    8.4x    2.4x               |
|    10 |   0.178s   0.169s   1.05x         |   11.6s    0.452s   26x           |    65x     2.7x               |
|    20 |   0.097s   0.097s   1.0x          |   11.8s    0.240s   46x           |    121x    2.5x               |
|    30 |   0.067s   0.067s   1.0x          |   2.96s    0.163s   18x           |    44x     2.4x               |
|    40 |   0.051s   0.050s   1.02x         |   3.82s    0.125s   31x           |    75x     2.5x               |
|    50 |   0.041s   0.040s   1.03x         |   0.888s   0.099s   9.0x          |    22x     2.5x               |
|    60 |   0.035s   0.034s   1.03x         |   1.55s    0.083s   19x           |    44x     2.4x               |
|    72 |   0.031s   0.029s   1.07x         |   0.997s   0.071s   14x           |    32x     2.4x               |
+-------+-----------------------------------+-----------------------------------+-------------------------------+

New controls

In MPL-v0.5, there are now three new @mpl ... -- runtime args. These control the automatic parallelism management system, which is based on a heartbeat scheduling strategy. Please see the paper [1] for more details!

  • heartbeat-us <N>, which controls how many microseconds should pass between each heartbeat. (Default 500.)
  • heartbeat-tokens <C>, which controls how many tokens each active thread should receive on each heartbeat. (Default 30.)
  • heartbeat-relayer-threshold <T>, which controls the heartbeat delivery strategy in the run-time system. (Default 16.)
    • When executing on P worker threads (@mpl procs P --), if P > T, then MPL will designate one worker thread as the heartbeat relayer. The relayer's only job is to deliver heartbeats at a regular rate to all other workers. When P <= T, the run-time system instead uses a regular SIGALRM signal from the OS to deliver heartbeats.

Tokens are spent to perform promotions, where each promotion creates one task. With the default settings (N=500us, C=30), each active thread can therefore create approximately one task every 500/30 ≈ 17 microseconds on average. Increasing C results in more parallelism but potentially higher overhead; increasing N results in less parallelism and potentially less overhead.

The default settings should work well across a range of hardware and modern OSes.
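For example, assuming a compiled program named ./my-program (a placeholder name), these parameters are passed between @mpl and -- on the command line; the values shown here are simply the defaults:

    $ ./my-program @mpl procs 16 heartbeat-us 500 heartbeat-tokens 30 --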

Summary of PRs and commits

  • (#180) the main update, with changes to the run-time system and compiler to support the PCall calling convention and dynamic promotions, as described in the automatic parallelism management paper [1].

Full Changelog: v0.4...v0.5

v0.4


MPL-v0.4 Release Notes (August 22, 2023)

Another year, another major release. In addition to multiple bugfixes and improvements (see below), this release features a milestone:

MPL now has full support for in-place updates.

Previously, MPL was only capable of running disentangled programs, and included a dynamic detector (similar to a data-race detector) that would exit the program with an error when entanglement occurred. With this update, MPL now automatically manages entanglement on-the-fly, allowing the program to continue safely and efficiently. This work was presented recently at PLDI 2023 in our paper Efficient Parallel Functional Programming with Effects [1].

Summary of PRs and contributions

  • (#173) the main update, especially the implementation of entanglement management (approximately the commit range c6a61f7..83b4b68 as well as #168)
  • (#163) additional controls for CGC collection
  • (#166) a low-overhead logging/profiling mechanism for investigating memory usage of MPL's block allocator
  • (#172) bugfixes for Real.{toString,fmt,...}: make gdtoa thread-safe
  • (#175) bugfixes for Int.{fmt,toString}, Word.{fmt,toString}, Real.{split,toManExp}, etc.: update One structure in basis library to be thread-safe
  • (4d343e4) bugfix for CGC and exception handling
  • (5a98337) add support for Real32 type for primitive CAS
  • (ba59f03) disable splitTypes{1,2} passes for now (these will need to be updated to support primitive CAS)

The changes in this release were implemented primarily by Jatin Arora (@typerSniper) and Sam Westrick (@shwestrick). The design of the entanglement manager [1] is by Jatin Arora, Sam Westrick, and Umut Acar (@umutacar). As always, thank you to @MatthewFluet for his constant support of the project!

References

[1] Efficient Parallel Functional Programming with Effects. Jatin Arora, Sam Westrick, and Umut A. Acar. PLDI 2023.

v0.3


MPL-v0.3 Release Notes (June 26, 2022)

Over the past year and a half, we have made a number of improvements to MPL. In addition to various bug-fixes and performance improvements, there are three major changes: entanglement detection, CGC-chaining, and a new block allocator.

Entanglement detection

In v0.3, the MPL runtime system now dynamically monitors reads and writes to check for entanglement. If entanglement is detected, then MPL terminates the program with an error message.

Entanglement detection ensures that MPL is safe for all programs: if it type-checks, then it is safe to run. Specifically, when you run a program, MPL v0.3 guarantees that either (a) the program runs to completion in a disentangled manner, or (b) the program terminates with an entanglement error. The entanglement detector is precise: no false alarms!

With entanglement detection, you can now safely use in-place updates to improve efficiency. You can even use in-place updates in a non-deterministic manner, which is essential for good performance in some cases. In particular, many well-known parallel programming techniques improve efficiency by interleaving atomic in-place updates and accesses to shared memory. These techniques are inherently non-deterministic, but desirable for performance optimization. A key advantage of disentanglement (in comparison to purely functional programming, where in-place updates are disallowed) is that it allows for in-place updates to be utilized in this manner.

For example, with v0.3, you can safely use atomic read-modify-write operations to efficiently traverse a graph and reduce contention, or use deterministic reservations to ensure determinism via non-deterministic speculation, or implement lock-free data structures to allow threads to communicate on-the-fly. We have multiple benchmarks using these techniques. Our breadth-first search benchmark interleaves atomic CAS operations to visit vertices, and our low-diameter decomposition uses a similar technique to compute the boundary between adjacent clusters. Betweenness centrality aggregates sums of weights via priority updates. In Delaunay triangulation, triangles atomically reserve access to their neighbors to resolve update conflicts.
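As a tiny illustration of the CAS idiom, claiming a vertex during a parallel graph traversal might look roughly like the following. (This is a hypothetical sketch, not taken from the benchmarks; casArray stands in for an atomic compare-and-swap on array elements, whose exact name and signature are not shown in these notes.)

    (* Atomically mark vertex v as visited. Returns true iff this call won the
       race, i.e., the vertex was previously unvisited. We assume that
       casArray (a, i, old, new) atomically replaces a[i] with new when a[i]
       currently equals old, returning the value it observed. *)
    fun tryVisit (visited : int array) (v : int) : bool =
      casArray (visited, v, 0, 1) = 0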

Using in-place updates for efficiency, our benchmarks are able to compete in performance with low-level code written in C/C++. For example, our breadth-first search and Delaunay triangulation benchmarks are only ~2x slower than state-of-the-art PBBS implementations, while using a similar amount of memory.

It is important to note that, because MPL allows for non-determinism, entanglement detection is execution-dependent. If an execution of a program exhibits entanglement, then MPL v0.3 will detect the entanglement and terminate. But on the exact same program—even on the same input—it is possible that a different execution might not exhibit entanglement, in which case MPL will allow the execution to complete. This is possible because the outcome of a race condition might determine whether or not entanglement occurs during execution. Race conditions are not necessarily bugs, however. All of the non-deterministic programming techniques we described above have race conditions, but these race conditions are "benign" in the sense of being intentional, useful for efficiency, and disentangled (with some care).

In terms of time and space overhead, don't worry: entanglement detection has nearly zero overhead. See the table below. The space overhead is essentially zero across the board (one exception: 20% for tinykaboom on 1 processor, but the overall memory usage of this benchmark is low). The largest time overhead is 7%, a majority of benchmarks have a time overhead of 2% or less, and the geomean across all benchmarks is approximately 1%.

When there is overhead, it is predictable and easy to optimize away. The only operations affected are mutable dereference operations, e.g. Array.sub and Ref.!, so it is easy to avoid overhead from entanglement detection by removing unnecessary mutation. Purely functional code has no overhead.

Overhead of Entanglement Detection
                  TIME          SPACE
--------------------------------------------
     Benchmark   P=1    P=72   P=1     P=72
============================================
      bfs-tree  +3.0%  +2.1%  +0.0%   -0.1%
    centrality  -1.2%  -1.4%  -0.0%   -0.0%
 dedup-strings  +3.2%  +7.0%  -0.0%   +4.8%
      delaunay  +5.3%  +3.7%  -0.1%   +2.6%
  dense-matmul  +0.4%  +0.0%  -2.1%   +0.2%
  game-of-life  -4.4%  +0.0%  -0.0%   +0.1%
          grep  -1.9%  +0.0%  -0.0%   +0.1%
  low-d-decomp  +2.6%  -3.4%  -0.0%   -10.5%
 msort-strings  +2.6%  -3.3%  +0.1%   +1.5%
  nearest-nbrs  +0.0%  +2.2%  +0.0%   -0.1%
       nqueens  +0.6%  +3.6%  -0.3%   +0.6%
    palindrome  +4.3%  +3.2%  +2.5%   +0.2%
        primes  +1.9%  -2.4%  -0.3%   -0.0%
     quickhull  +5.6%  +6.8%  +0.0%   +2.9%
   range-query  +6.9%  -0.8%  -0.0%   +3.0%
     raytracer  -5.7%  +0.0%  +1.2%   +2.1%
        reverb  -2.2%  -2.4%  +0.0%   -0.1%
    seam-carve  +0.0%  +2.4%  -0.1%   -0.2%
       skyline  +3.6%  +0.4%  +0.1%   -1.2%
  suffix-array  +3.6%  +1.8%  -0.1%   -5.1%
    tinykaboom  +0.4%  +0.0%  +20.0%  +0.4%
        tokens  -5.1%  -2.3%  +0.0%   -0.0%
triangle-count  -3.6%  -0.5%  +0.0%   -2.2%

These benchmarks are available in our benchmark suite. To reproduce these experiments, see https://github.com/MPLLang/entanglement-detect/tree/icfp22-artifact

CGC-chaining

This change improves the CGC collector by placing collections off the critical path.

Some background: MPL's collector performs collections along spines in the task tree. Each spine collection is divided into two pieces, and the two pieces are collected with two different algorithms.

  • LGC: The lower portion of each spine is collected via a Cheney-style compacting copying collector. We call this "local GC" (LGC for short) because the lower portion of each spine is private to a single processor.
  • CGC: The upper portion of each spine might be shared between multiple processors. This portion is collected by a mark-and-sweep non-moving concurrent collector, so we refer to this algorithm as "concurrent GC", or CGC for short.

This approach requires almost no synchronization. Furthermore, GC is both parallel and concurrent: we get parallelism across spines, and we allow for GC and mutators to run concurrently. To perform a single spine collection, we only have to pause one mutator, to facilitate LGC. In this way, MPL has no stop-the-world pauses. This is the case for both v0.2 as well as v0.3.

The spines of the program can be unbalanced in their memory sizes. In particular, the spine that owns the root heap is often significantly larger than the other spines (which are short-lived). In v0.2, because spine collections are scheduled on the critical path, a single large collection might cause a noticeable decrease in throughput.

In v0.3, MPL spawns CGC tasks and schedules these exactly like normal tasks. Therefore, within a spine, multiple CGCs might be executed on different processors if the scheduler deems it profitable. To keep CGC'ed heaps separate from the main spine heaps, we extended each heap with a "chain" of secondary heaps that are undergoing CGC. The chain can grow as multiple CGCs are spawned on different portions of the same heap (as the program forks and joins concurrently with ongoing CGCs). When a CGC task completes, the corresponding heap is merged back into the spine. CGC tasks are spawned at forks in the program, and are prioritized for shallow heaps (which are generally larger). LGC is still performed along the critical path, but each LGC is fast and only needs to pause a single mutator.

With this approach, in v0.3, the collector is better parallelized and has less impact on the critical path of the computation.

Hoard-style block allocation

In #135, we developed a new block-allocation mechanism based in part on the Hoard algorithm. This is the central allocation mechanism in MPL: all major allocations performed by the rest of the system are backed by blocks.

The results are quite impressive, to say the least. Due to reduced mmap pressure and better reuse of memory, we get as much as 30% improvement in time and 80% improvement in max residency. See #135 for more details.

Summary of PRs and contributions

  • (#131) more efficient hierarchical heaps and union-find heap lookup
  • (#132) coins example (thanks @ckoparkar!)
  • (#135) Hoard-inspired block allocator
  • (#141) parallelization of "forgotten set" (snapshot reachability)
  • (#142) CGC-chaining (major performance improvement for CGC)
  • (#142) down-pointer pinning
  • (#153) upstream merge from MLton (thanks @MatthewFluet!)
  • (#147, #150, #160) CC work list
  • (#142, #145, #154, #159) entanglement detection

Most of the implementation work was done by @shwestrick and @typerSniper. @larry98 helped develop the order maintenance component of the entanglement detector, and @ckoparkar ported a new example from Haskell.


v0.2


MPL-v0.2 Release Notes (December 16, 2020)

It's been about a year since we released MPL-v0.1, and now we're ready for round 2! Here's what's changed.

Concurrent, Parallel GC is here! (#129) This is the biggest change, and has been in the works for over a year. Awesome work Jatin @typerSniper! Many of the details are published in our recent POPL'21 paper: Provably Space-Efficient Parallel Functional Programming.

  • Concurrent: heaps are collected while the mutator(s) continue running.
  • Parallel: multiple heaps are collected simultaneously.
  • Based on a provably efficient algorithm, described in the paper.
  • Consists of two cooperating forms of collection:
    • Non-moving mark-sweep collections of shallow internal heaps.
    • Compactifying (copying) collections of deep processor-local heaps.
  • Utilizes a novel snapshot-at-fork technique which efficiently computes root-sets.
    • Closely integrated with the scheduler.
    • No interference with parallelism! The GC essentially requires no additional synchronization. (Never stop-the-world, even to determine roots.)

General performance improvements

  • (#119) Scheduler-allocated data is now subject to GC, including thread objects (and their stacks) allocated to execute stolen work, as well as any data allocated during idle loops. This is accomplished by generalizing the heap hierarchy, as discussed in 02c9b3e.
  • (#120) Stacks now resize in-place, by giving each stack its own chunk. For stack-stress benchmarks, the performance benefits can be significant. A secondary benefit is that stacks no longer "move around" due to GC, so it's easier to reason about stack placement in the hierarchy. (Previously, when the stack was resized, it would be copied to the current depth of the executing thread! Gross.)
  • (#124) Local collections now allocate chunks lazily, which avoids some fragmentation and simplifies the algorithm. (Previously, local collection relied on having a valid frontier ready for every depth in the to-space. But on each forwarding copy, we had to check that there was enough space for the object anyway, which sometimes would leave behind an empty chunk! It's much simpler and more efficient to assume that you don't have a valid frontier, and only allocate chunks when you need them.)

Upstream MLton merge (thanks @MatthewFluet!)

  • (#127) Merges new MLton features, including the static/dynamic heaps distinction.

New features

  • (#113) Primitives for fast parallel file I/O based on mmap/munmap. The existing basis library I/O functions were essentially entirely sequential. By mmaping a file, we can then load it into a heap array in parallel.
  • (#118) Primitives for GC statistics.

New examples

General maintenance

  • (#121) Refactor and simplify examples.
  • (#125) Removed MLton-specific files, docs, unsupported libraries, etc.
  • (#128) More consistent style for higher-order runtime functions.

Bug fixes

  • (#110, #111) Fixes an error in the SimplifyTypes SSA optimization pass.
  • (19d0c63) Prevent scheduler crashes by sequentializing instead of exceeding the deque capacity.