
Fix triggering of Worker Wrapper optimization on Stack #5468

Merged: 13 commits into trunk on Dec 9, 2024

Conversation

@ChrisPenner (Contributor) commented Nov 26, 2024

Overview

Worker/wrapper was failing to trigger, and thus failed to unbox Stack, which resulted in us re-allocating the Stack wrapper object on every machine instruction, wasting a ton of allocation time and triggering a ton of garbage collection in the process.

This PR gets the optimization firing again, and includes a test to ensure it remains active.
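For context, worker/wrapper is the GHC optimization that splits a function strict in a boxed argument into a small wrapper that unpacks the fields and a worker that loops over the raw unboxed values. A minimal sketch, using a hypothetical `Stack'` stand-in rather than the runtime's actual Stack type:

```haskell
-- A minimal sketch of worker/wrapper, with a hypothetical Stack'
-- stand-in (not the runtime's real Stack type):
module WWSketch where

data Stack' = Stack' !Int !Int

-- A loop strict in a boxed Stack'. With -O, GHC's worker/wrapper
-- splits this into a wrapper that unpacks the fields once and a
-- worker over raw Int# values, so the loop runs without allocating
-- a fresh Stack' on every iteration -- the same re-allocation this
-- PR eliminates for the real Stack.
loop :: Stack' -> Int -> Int
loop (Stack' a b) 0 = a + b
loop (Stack' a b) n = loop (Stack' (a + n) b) (n - 1)
```

When the transformation fails to fire, every recursive call allocates a new `Stack'` box, which is exactly the symptom described above.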

Implementation details

  • Changed all dynamic function calls which accept a Stack to instead accept an unboxed stack, labelled XStack. GHC can't rewrite these dynamic calls at optimization time, so this is a necessary step.
  • There's some weirdness going on related to precise exceptions, so I've needed to switch some calls from `throwIO` to `error`, and have also inserted a few unreachable errors; click through to the inline comments for more on why that might be necessary.
  • Added export lists; these affect GHC inlining, since if GHC knows for sure that a definition is only used inside its module it will inline it more aggressively. They can also improve compile times, since GHC won't even generate or optimize code for a definition that is only ever fully inlined and isn't exported.
  • Added an inspection-testing clause, triggered only in CI, which checks that no Stack type is mentioned within eval0. This should be sufficient to detect if worker-wrapper starts to fail again.
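The inspection-testing clause can be sketched roughly like this; the module and definitions below are stand-ins, since the real clause targets unison's actual `eval0` and `Stack`:

```haskell
{-# LANGUAGE TemplateHaskell #-}
{-# OPTIONS_GHC -O -fplugin Test.Inspection.Plugin #-}
module WWCheck where

import Test.Inspection

-- Stand-ins for the real Stack and eval0; the actual clause is
-- written against unison's runtime definitions.
data Stack = Stack !Int

eval0 :: Int -> Int
eval0 n = go (Stack n)
  where
    go (Stack i) = i * 2

-- Fails compilation if any mention of Stack survives in eval0's
-- optimized Core, i.e. if worker/wrapper stops unboxing it.
inspect $ 'eval0 `hasNoType` ''Stack
```

Because the check runs at compile time, a regression shows up as a CI build failure rather than a silent performance loss.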

Benchmarks

For the 'count to a billion' tight loop this reduces allocations significantly, from 88.8 GB to 16.8 GB (a 5.3x improvement), but we're still allocating much more than we should. I found a few more spots to look into next.

trunk

  88,824,025,056 bytes allocated in the heap
      91,832,104 bytes copied during GC
       1,614,624 bytes maximum residency (3 sample(s))
         142,560 bytes maximum slop
              44 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     21362 colls,     0 par    0.000s   0.177s     0.0000s    0.0048s
  Gen  1         3 colls,     2 par    0.008s   0.005s     0.0017s    0.0031s

  Parallel GC work balance: 86.81% (serial 0%, perfect 100%)

  TASKS: 22 (1 bound, 21 peak workers (21 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.003s elapsed)
  MUT     time   20.584s  ( 21.152s elapsed)
  GC      time    0.008s  (  0.182s elapsed)
  EXIT    time    0.001s  (  0.004s elapsed)
  Total   time   20.597s  ( 21.341s elapsed)

  Alloc rate    4,315,222,223 bytes per MUT second

  Productivity  99.9% of total user, 99.1% of total elapsed

unison-trunk-o run.compiled billion.uc +RTS -sstderr  20.45s user 0.19s system 96% cpu 21.370 total

this branch

  16,824,376,680 bytes allocated in the heap
      14,581,512 bytes copied during GC
       1,533,640 bytes maximum residency (2 sample(s))
         121,144 bytes maximum slop
              44 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4069 colls,     0 par    0.000s   0.042s     0.0000s    0.0023s
  Gen  1         2 colls,     1 par    0.004s   0.039s     0.0194s    0.0382s

  Parallel GC work balance: 75.01% (serial 0%, perfect 100%)

  TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.004s  (  0.006s elapsed)
  MUT     time   15.584s  ( 15.951s elapsed)
  GC      time    0.004s  (  0.081s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time   15.593s  ( 16.037s elapsed)

  Alloc rate    1,079,610,001 bytes per MUT second

  Productivity  99.9% of total user, 99.5% of total elapsed

stack exec unison -- run.compiled billion2.uc +RTS -sstderr  15.99s user 0.15s system 97% cpu 16.597 total

trunk -> this branch:

fib1
554.553µs -> 391.667µs

fib2
2.58437ms -> 2.37125ms

fib3
2.91811ms -> 2.801885ms

Decode Nat
416ns -> 369ns

Generate 100 random numbers
255.557µs -> 219.953µs

List.foldLeft
2.196607ms -> 2.127142ms

Count to 1 million
202.4926ms -> 154.5432ms

Json parsing (per document)
256.432µs -> 269.922µs

Count to N (per element)
277ns -> 220ns

Count to 1000
277.599µs -> 221.255µs

Mutate a Ref 1000 times
482.914µs -> 424.884µs

CAS an IO.ref 1000 times
669.477µs -> 627.869µs

List.range (per element)
376ns -> 343ns

List.range 0 1000
398.279µs -> 365.65µs

Set.fromList (range 0 1000)
2.092761ms -> 1.76602ms

Map.fromList (range 0 1000)
1.468527ms -> 1.294153ms

NatMap.fromList (range 0 1000)
6.12309ms -> 5.378384ms

Map.lookup (1k element map)
3.314µs -> 3.111µs

Map.insert (1k element map)
8.584µs -> 7.669µs

List.at (1k element list)
375ns -> 312ns

Text.split /
32.839µs -> 34.036µs

dependencies:
- inspection-testing
- condition: flag(dumpcore)
ghc-options: -ddump-simpl -ddump-stg-final -ddump-to-file -dsuppress-coercions -dsuppress-idinfo -dsuppress-module-prefixes -ddump-str-signatures -ddump-simpl-stats # -dsuppress-type-applications -dsuppress-type-signatures
@ChrisPenner commented:

I'm going to leave these commented out flags here because they're sometimes useful and it's annoying to go look them up every time I need them.

@@ -136,15 +165,6 @@ refNumTm cc r =
(M.lookup r -> Just w) -> pure w
_ -> die $ "refNumTm: unknown reference: " ++ show r

refNumTy :: CCache -> Reference -> IO Word64
@ChrisPenner commented:

Several methods ended up being unused when I added the export lists.

If @dolio is attached to any of these we can certainly revive them :)

@ChrisPenner ChrisPenner marked this pull request as ready for review November 27, 2024 18:47
@ChrisPenner ChrisPenner requested a review from a team as a code owner November 27, 2024 18:47
@ChrisPenner ChrisPenner requested a review from dolio November 27, 2024 18:47
@pchiusano (Member) commented:

Just curious - I read the precise exceptions page, but I don't know if I get it. It seems like if the function is strict in all its arguments, it shouldn't matter whether you use throw or error. Is there a function we're calling which isn't fully strict where the distinction matters? Or is it somehow still relevant even if the function is fully strict?

@ChrisPenner commented Dec 6, 2024:

To be honest, I don't fully understand what's happening here or why, aside from the fact that the linked issue does seem relevant. All I know is that, at least in theory, worker-wrapper should trigger with the existing setup, but the Core reflects that it doesn't, and this change fixes it.

I could spend more time digging in if we think it's important, but changing from a throw to an error seems like a pretty reasonable alternative, and adding the test at least makes sure it won't suddenly regress without us noticing :)
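For what it's worth, the distinction as I understand it can be sketched like this (my summary of GHC's precise-exceptions behaviour, not code from this PR):

```haskell
-- Sketch of the throwIO-vs-error distinction (a summary of GHC's
-- precise-exceptions behaviour, not code from this PR):
module PreciseSketch where

import Control.Exception (ErrorCall (..), throwIO)

-- error raises an *imprecise* exception: the call is plain bottom,
-- so strictness analysis can treat callers as diverging and
-- worker/wrapper can still fire around it.
dieImprecise :: String -> a
dieImprecise = error

-- throwIO raises a *precise* exception: GHC must preserve exactly
-- when it is thrown relative to other IO effects, which makes the
-- analysis more conservative and can block transformations like
-- worker/wrapper.
diePrecise :: String -> IO a
diePrecise msg = throwIO (ErrorCall msg)
```

So even for a fully strict function, a `throwIO` in one branch can make GHC more cautious than the equivalent `error` would.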

@aryairani (Contributor) left a comment:

Tests ftw

@dolio (Contributor) left a comment:

Looks good.

A couple of those removed functions might become relevant again with different calling conventions. But they can be added back (and maybe a more systematic approach would be better).

The testing library you're using for this sounds interesting.

@ChrisPenner ChrisPenner added ready-to-merge Apply this to a PR and it will get merged automatically once CI passes and 1 reviewer has approved and removed ready-to-merge Apply this to a PR and it will get merged automatically once CI passes and 1 reviewer has approved labels Dec 9, 2024
@ChrisPenner ChrisPenner changed the base branch from trunk to cp/fix-caching December 9, 2024 17:56
@ChrisPenner ChrisPenner changed the base branch from cp/fix-caching to trunk December 9, 2024 17:56
@ChrisPenner ChrisPenner added the ready-to-merge Apply this to a PR and it will get merged automatically once CI passes and 1 reviewer has approved label Dec 9, 2024
@ChrisPenner ChrisPenner merged commit fe3a9e8 into trunk Dec 9, 2024
47 checks passed
@ChrisPenner ChrisPenner deleted the cp/worker-wrapper branch December 9, 2024 18:32