safekeeper: refactor WalAcceptor to be event-driven #9462

Merged
erikgrinaker merged 3 commits into main from erik/wal-acceptor-event-driven on Oct 28, 2024

Conversation

@erikgrinaker (Contributor)

Problem

The WalAcceptor main loop currently uses two nested loops to consume inbound messages. This makes it hard to slot in periodic events like metrics collection (e.g. #9328). It also duplicates the event-processing code and assumes all steady-state messages are AppendRequests (other message types may be dropped if they follow an AppendRequest).

Extracted from #9450.

Summary of changes

Refactor the WalAcceptor loop to be event-driven.
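
To make the shape concrete, here is a minimal sketch of an event-driven loop of this kind, assuming a tokio runtime. The names (ProposerMessage, handle_message, wal_acceptor_loop, the 10-second metrics period) are illustrative stand-ins, not the actual safekeeper types: the point is that a single select! loop gives every message type one processing path, and periodic work such as metrics collection (#9328) becomes just another branch.

```rust
use tokio::sync::mpsc;
use tokio::time::{interval, Duration};

// Illustrative stand-in for the real protocol messages.
enum ProposerMessage {
    AppendRequest, // steady-state WAL appends
    Other,         // handshakes, elections, etc.
}

// Hypothetical handler: every message type goes through this one path.
async fn handle_message(_msg: ProposerMessage) {}

async fn wal_acceptor_loop(mut msg_rx: mpsc::Receiver<ProposerMessage>) {
    let mut metrics_ticker = interval(Duration::from_secs(10)); // assumed period
    loop {
        tokio::select! {
            msg = msg_rx.recv() => match msg {
                Some(msg) => handle_message(msg).await,
                None => break, // sender gone, shut down
            },
            _ = metrics_ticker.tick() => {
                // Periodic events (e.g. metrics collection) are just
                // another select! branch instead of a special case.
            }
        }
    }
}
```

Compared to the nested-loop version, adding another periodic event is one more select! arm rather than a second copy of the message-handling code.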

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with the /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist

@erikgrinaker requested a review from a team as a code owner on October 21, 2024 09:22
@erikgrinaker (Contributor, Author)

We should run some benchmarks to ensure this doesn't cause a performance regression. I don't expect it will.

I'll see if we have any suitable benchmarks already, otherwise I'll add some as part of #9339.

github-actions bot commented Oct 21, 2024

5300 tests run: 5078 passed, 0 failed, 222 skipped (full report)


Flaky tests (5)

Postgres 17

Postgres 16

  • test_timeline_offloading[False]: release-arm64
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-timeline-delete-before-rm]: release-x86-64

Code coverage* (full report)

  • functions: 31.3% (7690 of 24580 functions)
  • lines: 48.7% (60460 of 124104 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
Last update: e228d08 at 2024-10-28T17:28:16.769Z

@erikgrinaker (Contributor, Author) commented Oct 21, 2024

Have seen two CI failures on test_project_switch_primary_two_branches in E2E tests, investigating.

Edit: known flake, see https://neondb.slack.com/archives/C059ZC138NR/p1729524890795969.

@arssher (Contributor) left a comment

Likely the perf difference won't be noticeable, but it would be curious to check; unfortunately there is no good infra to do that quickly.

@erikgrinaker (Contributor, Author)

TFTR! I'm adding some benchmark infra for #9339 anyway. I'll throw together a simple WAL ingest benchmark and try it out on this PR.
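
For context, the wal_acceptor/fsync=.../n=... results quoted in the next comments could come from a criterion benchmark shaped roughly like this sketch. The grouping and the ingest_n_records helper are hypothetical, inferred only from the output format, not taken from the actual benchmark PR:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

// Hypothetical stand-in: feed n AppendRequests through a WalAcceptor.
fn ingest_n_records(_n: usize, _fsync: bool) {}

fn bench_wal_acceptor(c: &mut Criterion) {
    let mut group = c.benchmark_group("wal_acceptor");
    for fsync in [false, true] {
        for n in [1usize, 100, 10_000] {
            // Produces IDs like "wal_acceptor/fsync=false/n=100".
            group.bench_with_input(
                BenchmarkId::new(format!("fsync={fsync}"), format!("n={n}")),
                &n,
                |b, &n| b.iter(|| ingest_n_records(n, fsync)),
            );
        }
    }
    group.finish();
}

criterion_group!(benches, bench_wal_acceptor);
criterion_main!(benches);
```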

@erikgrinaker (Contributor, Author) commented Oct 25, 2024

Looks like this comes with a pretty hefty performance penalty. I'll have a look at why. Will submit a PR for the benchmarks in a bit.

wal_acceptor/fsync=false/n=1
  time:   [1.2000 ms 1.2028 ms 1.2052 ms]
  change: [+6765.7% +6811.1% +6861.2%] (p = 0.00 < 0.05)

wal_acceptor/fsync=false/n=100
  time:   [2.1809 ms 2.1880 ms 2.1954 ms]
  change: [+123.88% +125.47% +127.12%] (p = 0.00 < 0.05)

wal_acceptor/fsync=false/n=10000
  time:   [98.030 ms 98.687 ms 99.451 ms]
  change: [+3.5012% +4.2182% +4.9586%] (p = 0.00 < 0.05)

wal_acceptor/fsync=true/n=1
  time:   [5.2277 ms 5.2782 ms 5.3313 ms]
  change: [+27.396% +29.414% +31.128%] (p = 0.00 < 0.05)

wal_acceptor/fsync=true/n=100
  time:   [6.3553 ms 6.4275 ms 6.5018 ms]
  change: [+13.648% +15.290% +16.958%] (p = 0.00 < 0.05)

wal_acceptor/fsync=true/n=10000
  time:   [106.57 ms 107.00 ms 107.45 ms]
  change: [+3.8936% +4.4429% +4.9226%] (p = 0.00 < 0.05)

@erikgrinaker (Contributor, Author) commented Oct 28, 2024

Ok, so it appears that reset_immediately() is not actually immediately ready -- it waits out the minimum timer resolution (1 millisecond) before firing. This explains why the added latency is constant rather than linear in the number of messages: for fsync=false/n=1, the ~17 µs baseline plus a fixed 1 ms delay lands almost exactly on the regressed ~1.2 ms.

Submitted tokio-rs/tokio#6940 for this (edit: dupe of tokio-rs/tokio#6866).

I'll add a workaround shortly.
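
For reference, a minimal reproduction of the behavior described above might look like the following, assuming tokio >= 1.30 (where Interval::reset_immediately was added); the elapsed time prints as roughly 1 ms rather than near zero:

```rust
use tokio::time::{interval, Duration, Instant};

#[tokio::main]
async fn main() {
    let mut ticker = interval(Duration::from_secs(10));
    ticker.tick().await; // an interval's first tick completes immediately

    let start = Instant::now();
    ticker.reset_immediately();
    ticker.tick().await;
    // Expected: near zero. Observed: ~1 ms (the minimum timer resolution).
    println!("tick after reset_immediately: {:?}", start.elapsed());
}
```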

@erikgrinaker (Contributor, Author)

Added a workaround, performance is now basically equivalent.

wal_acceptor/fsync=false/n=1
  time:   [17.700 µs 17.789 µs 17.890 µs]
  change: [+4.4851% +7.2364% +11.308%] (p = 0.00 < 0.05)

wal_acceptor/fsync=false/n=100
  time:   [942.83 µs 946.26 µs 950.40 µs]
  change: [-1.0171% +0.4377% +2.6128%] (p = 0.64 > 0.05)

wal_acceptor/fsync=false/n=10000
  time:   [94.838 ms 95.170 ms 95.635 ms]
  change: [+0.0756% +0.5036% +1.0851%] (p = 0.03 < 0.05)

wal_acceptor/fsync=true/n=1
  time:   [4.1999 ms 4.2401 ms 4.2852 ms]
  change: [+2.3733% +3.9625% +5.4735%] (p = 0.00 < 0.05)

wal_acceptor/fsync=true/n=100
  time:   [5.2680 ms 5.3039 ms 5.3447 ms]
  change: [-6.0064% -4.8629% -3.8137%] (p = 0.00 < 0.05)

wal_acceptor/fsync=true/n=10000
  time:   [99.223 ms 99.501 ms 99.811 ms]
  change: [-3.2774% -2.8797% -2.4867%] (p = 0.00 < 0.05)
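
As a rough illustration of the kind of workaround that keeps the timer off the hot path (a sketch of the idea only, not necessarily the exact fix in this PR): drain everything already queued with try_recv(), and only fall back to select! -- and thus to the timer machinery -- once the channel is empty. This reuses the hypothetical ProposerMessage and handle_message stand-ins from the earlier sketch.

```rust
use tokio::sync::mpsc::{self, error::TryRecvError};
use tokio::time::{interval, Duration};

async fn wal_acceptor_loop(mut msg_rx: mpsc::Receiver<ProposerMessage>) {
    let mut flush_ticker = interval(Duration::from_millis(10)); // assumed period
    'outer: loop {
        // Fast path: drain every message that is already queued without
        // touching any timer, so back-to-back appends never pay the 1 ms.
        loop {
            match msg_rx.try_recv() {
                Ok(msg) => handle_message(msg).await,
                Err(TryRecvError::Empty) => break,
                Err(TryRecvError::Disconnected) => break 'outer,
            }
        }
        // Idle path: only now arm the timer alongside the blocking recv.
        tokio::select! {
            msg = msg_rx.recv() => match msg {
                Some(msg) => handle_message(msg).await,
                None => break 'outer,
            },
            _ = flush_ticker.tick() => {
                // periodic flush / housekeeping while idle
            }
        }
    }
}
```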

@erikgrinaker enabled auto-merge (squash) on October 28, 2024 14:28
@erikgrinaker enabled auto-merge (squash) on October 28, 2024 15:21
@erikgrinaker merged commit 248558d into main on Oct 28, 2024
80 checks passed
@erikgrinaker deleted the erik/wal-acceptor-event-driven branch on October 28, 2024 17:18