
Fix triggering of Worker Wrapper optimization on Stack #5468

Merged: 13 commits into trunk on Dec 9, 2024

Conversation

@ChrisPenner (Contributor) commented Nov 26, 2024

Overview

Worker/wrapper was failing to trigger, and thus failed to unbox Stack, which resulted in us re-allocating the Stack wrapper object on every machine instruction, wasting a ton of allocation time and triggering a ton of garbage collection in the process.

This PR gets the optimization firing again, and includes a test to ensure it remains active.
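For context, worker/wrapper is the GHC optimization that splits a function strict in a boxed argument into a small wrapper that unpacks the fields and a worker that loops over the raw unboxed values. A minimal sketch, using a hypothetical `Stack'` stand-in rather than the runtime's actual Stack type:

```haskell
-- A minimal sketch of worker/wrapper, with a hypothetical Stack'
-- stand-in (not the runtime's real Stack type):
module WWSketch where

data Stack' = Stack' !Int !Int

-- A loop strict in a boxed Stack'. With -O, GHC's worker/wrapper
-- splits this into a wrapper that unpacks the fields once and a
-- worker over raw Int# values, so the loop runs without allocating
-- a fresh Stack' on every iteration -- the same re-allocation this
-- PR eliminates for the real Stack.
loop :: Stack' -> Int -> Int
loop (Stack' a b) 0 = a + b
loop (Stack' a b) n = loop (Stack' (a + n) b) (n - 1)
```

When the transformation fails to fire, every recursive call allocates a new `Stack'` box, which is exactly the symptom described above.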

Implementation details

  • Changed all dynamic function calls which accept a Stack to instead accept an unboxed stack, labelled XStack. GHC can't rewrite these dynamic calls at optimization time, so this is a necessary step.
  • There's some weirdness going on related to precise exceptions, so I've needed to switch some calls from `throwIO` to `error`, and have also inserted a few unreachable errors; click through to the inline comments for more on why that might be necessary.
  • Added export lists; these affect GHC inlining, since if GHC knows for sure that a definition is only used inside its module it will inline it more aggressively. They can also improve compile times, since GHC won't even generate or optimize code for a definition that is only ever fully inlined and isn't exported.
  • Added an inspection-testing clause, triggered only in CI, which checks that no Stack type is mentioned within eval0. This should be sufficient to detect if worker-wrapper starts to fail again.
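The inspection-testing clause can be sketched roughly like this; the module and definitions below are stand-ins, since the real clause targets unison's actual `eval0` and `Stack`:

```haskell
{-# LANGUAGE TemplateHaskell #-}
{-# OPTIONS_GHC -O -fplugin Test.Inspection.Plugin #-}
module WWCheck where

import Test.Inspection

-- Stand-ins for the real Stack and eval0; the actual clause is
-- written against unison's runtime definitions.
data Stack = Stack !Int

eval0 :: Int -> Int
eval0 n = go (Stack n)
  where
    go (Stack i) = i * 2

-- Fails compilation if any mention of Stack survives in eval0's
-- optimized Core, i.e. if worker/wrapper stops unboxing it.
inspect $ 'eval0 `hasNoType` ''Stack
```

Because the check runs at compile time, a regression shows up as a CI build failure rather than a silent performance loss.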

Benchmarks

For the 'count to a billion' tight loop this reduces allocations significantly, from 88.8 GB to 16.8 GB (a 5.3x improvement), but we're still allocating much more than we should. I found a few more spots to look into next.

trunk

  88,824,025,056 bytes allocated in the heap
      91,832,104 bytes copied during GC
       1,614,624 bytes maximum residency (3 sample(s))
         142,560 bytes maximum slop
              44 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     21362 colls,     0 par    0.000s   0.177s     0.0000s    0.0048s
  Gen  1         3 colls,     2 par    0.008s   0.005s     0.0017s    0.0031s

  Parallel GC work balance: 86.81% (serial 0%, perfect 100%)

  TASKS: 22 (1 bound, 21 peak workers (21 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.003s elapsed)
  MUT     time   20.584s  ( 21.152s elapsed)
  GC      time    0.008s  (  0.182s elapsed)
  EXIT    time    0.001s  (  0.004s elapsed)
  Total   time   20.597s  ( 21.341s elapsed)

  Alloc rate    4,315,222,223 bytes per MUT second

  Productivity  99.9% of total user, 99.1% of total elapsed

unison-trunk-o run.compiled billion.uc +RTS -sstderr  20.45s user 0.19s system 96% cpu 21.370 total

this branch

  16,824,376,680 bytes allocated in the heap
      14,581,512 bytes copied during GC
       1,533,640 bytes maximum residency (2 sample(s))
         121,144 bytes maximum slop
              44 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4069 colls,     0 par    0.000s   0.042s     0.0000s    0.0023s
  Gen  1         2 colls,     1 par    0.004s   0.039s     0.0194s    0.0382s

  Parallel GC work balance: 75.01% (serial 0%, perfect 100%)

  TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.004s  (  0.006s elapsed)
  MUT     time   15.584s  ( 15.951s elapsed)
  GC      time    0.004s  (  0.081s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time   15.593s  ( 16.037s elapsed)

  Alloc rate    1,079,610,001 bytes per MUT second

  Productivity  99.9% of total user, 99.5% of total elapsed

stack exec unison -- run.compiled billion2.uc +RTS -sstderr  15.99s user 0.15s system 97% cpu 16.597 total

trunk -> this branch:

fib1
554.553µs -> 391.667µs

fib2
2.58437ms -> 2.37125ms

fib3
2.91811ms -> 2.801885ms

Decode Nat
416ns -> 369ns

Generate 100 random numbers
255.557µs -> 219.953µs

List.foldLeft
2.196607ms -> 2.127142ms

Count to 1 million
202.4926ms -> 154.5432ms

Json parsing (per document)
256.432µs -> 269.922µs

Count to N (per element)
277ns -> 220ns

Count to 1000
277.599µs -> 221.255µs

Mutate a Ref 1000 times
482.914µs -> 424.884µs

CAS an IO.ref 1000 times
669.477µs -> 627.869µs

List.range (per element)
376ns -> 343ns

List.range 0 1000
398.279µs -> 365.65µs

Set.fromList (range 0 1000)
2.092761ms -> 1.76602ms

Map.fromList (range 0 1000)
1.468527ms -> 1.294153ms

NatMap.fromList (range 0 1000)
6.12309ms -> 5.378384ms

Map.lookup (1k element map)
3.314µs -> 3.111µs

Map.insert (1k element map)
8.584µs -> 7.669µs

List.at (1k element list)
375ns -> 312ns

Text.split /
32.839µs -> 34.036µs

dependencies:
- inspection-testing
- condition: flag(dumpcore)
ghc-options: -ddump-simpl -ddump-stg-final -ddump-to-file -dsuppress-coercions -dsuppress-idinfo -dsuppress-module-prefixes -ddump-str-signatures -ddump-simpl-stats # -dsuppress-type-applications -dsuppress-type-signatures
@ChrisPenner commented:

I'm going to leave these commented out flags here because they're sometimes useful and it's annoying to go look them up every time I need them.

@@ -136,15 +165,6 @@ refNumTm cc r =
(M.lookup r -> Just w) -> pure w
_ -> die $ "refNumTm: unknown reference: " ++ show r

refNumTy :: CCache -> Reference -> IO Word64
@ChrisPenner commented:

Several methods ended up being unused when I added the export lists.

If @dolio is attached to any of these we can certainly revive them :)

@ChrisPenner ChrisPenner marked this pull request as ready for review November 27, 2024 18:47
@ChrisPenner ChrisPenner requested a review from a team as a code owner November 27, 2024 18:47
@ChrisPenner ChrisPenner requested a review from dolio November 27, 2024 18:47
@pchiusano (Member) commented:

Just curious - I read the precise exceptions page, but I don't know if I get it. It seems like if the function is strict in all its arguments, it shouldn't matter whether you use throw or error. Is there a function we're calling which isn't fully strict where the distinction matters? Or is it somehow still relevant even if the function is fully strict?

@ChrisPenner commented Dec 6, 2024:

To be honest, I don't fully understand what's happening here or why, aside from the fact that the linked issue does seem relevant. All I know is that, at least in theory, worker-wrapper should trigger with the existing setup, but the Core reflects that it doesn't, and this change fixes it.

I could spend more time digging in if we think it's important, but changing from a throw to an error seems like a pretty reasonable alternative, and adding the test at least makes sure it won't suddenly regress without us noticing :)
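For what it's worth, the distinction as I understand it can be sketched like this (my summary of GHC's precise-exceptions behaviour, not code from this PR):

```haskell
-- Sketch of the throwIO-vs-error distinction (a summary of GHC's
-- precise-exceptions behaviour, not code from this PR):
module PreciseSketch where

import Control.Exception (ErrorCall (..), throwIO)

-- error raises an *imprecise* exception: the call is plain bottom,
-- so strictness analysis can treat callers as diverging and
-- worker/wrapper can still fire around it.
dieImprecise :: String -> a
dieImprecise = error

-- throwIO raises a *precise* exception: GHC must preserve exactly
-- when it is thrown relative to other IO effects, which makes the
-- analysis more conservative and can block transformations like
-- worker/wrapper.
diePrecise :: String -> IO a
diePrecise msg = throwIO (ErrorCall msg)
```

So even for a fully strict function, a `throwIO` in one branch can make GHC more cautious than the equivalent `error` would.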

@aryairani (Contributor) left a comment:

Tests ftw

@dolio (Contributor) left a comment:

Looks good.

A couple of those removed functions might become relevant again with different calling conventions. But they can be added back (and maybe a more systematic approach would be better).

The testing library you're using for this sounds interesting.

@ChrisPenner ChrisPenner added ready-to-merge Apply this to a PR and it will get merged automatically once CI passes and 1 reviewer has approved and removed ready-to-merge Apply this to a PR and it will get merged automatically once CI passes and 1 reviewer has approved labels Dec 9, 2024
@ChrisPenner ChrisPenner changed the base branch from trunk to cp/fix-caching December 9, 2024 17:56
@ChrisPenner ChrisPenner changed the base branch from cp/fix-caching to trunk December 9, 2024 17:56
@ChrisPenner ChrisPenner added the ready-to-merge Apply this to a PR and it will get merged automatically once CI passes and 1 reviewer has approved label Dec 9, 2024
@ChrisPenner ChrisPenner merged commit fe3a9e8 into trunk Dec 9, 2024
47 checks passed
@ChrisPenner ChrisPenner deleted the cp/worker-wrapper branch December 9, 2024 18:32