Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

safekeeper: block deletion on protocol handler shutdown #9364

Merged
merged 26 commits into from
Nov 20, 2024

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Oct 11, 2024

Problem

Two recently observed log errors indicate safekeeper tasks for a timeline running after that timeline's deletion has started.

These code paths do not have a mechanism that coordinates task shutdown with the overall shutdown of the timeline.

Summary of changes

  • Add a Gate to Timeline
  • Take the gate as part of resident timeline guard: any code that holds a guard over a timeline staying resident should also hold a guard over the timeline's total lifetime.
  • Take the gate from the wal removal task
  • Respect Timeline::cancel in WAL send/recv code, so that we do not block shutdown indefinitely.
  • Add a test that deletes timelines with open pageserver+compute connections, to check these get torn down as expected.

There is some risk to introducing gates: if there is code holding a gate which does not properly respect a cancellation token, it can cause shutdown hangs. The risk of this for safekeepers is lower in practice than it is for other services, because in a healthy timeline deletion, the compute is shutdown first, then the timeline is deleted on the pageserver, and finally it is deleted on the safekeepers -- that makes it much less likely that some protocol handler will still be running.

Closes: #8972
Closes: #8974

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added c/storage/safekeeper Component: storage: safekeeper a/tech_debt Area: related to tech debt labels Oct 11, 2024
@jcsp jcsp requested a review from arssher October 11, 2024 14:49
Copy link

github-actions bot commented Oct 11, 2024

5534 tests run: 5308 passed, 0 failed, 226 skipped (full report)


Flaky tests (1)

Postgres 17

Code coverage* (full report)

  • functions: 31.5% (7935 of 25227 functions)
  • lines: 49.6% (62952 of 126943 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
96ae2bb at 2024-11-20T10:33:23.074Z :recycle:

@jcsp jcsp force-pushed the jcsp/issue-8974-safekeeper-shutdown branch 2 times, most recently from d2587ef to 1f3847b Compare October 14, 2024 15:21
@jcsp jcsp marked this pull request as ready for review October 15, 2024 08:16
@jcsp jcsp requested a review from a team as a code owner October 15, 2024 08:16
@jcsp jcsp requested a review from erikgrinaker October 15, 2024 08:16
safekeeper/src/timeline.rs Outdated Show resolved Hide resolved
safekeeper/src/timeline_manager.rs Show resolved Hide resolved
safekeeper/src/timeline_manager.rs Show resolved Hide resolved
safekeeper/src/wal_backup.rs Show resolved Hide resolved
safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/send_wal.rs Show resolved Hide resolved
safekeeper/src/timeline.rs Outdated Show resolved Hide resolved
safekeeper/src/wal_backup.rs Show resolved Hide resolved
@jcsp
Copy link
Contributor Author

jcsp commented Oct 22, 2024

Addressed some comments & rebased on main, a few still changes are still oustanding

@jcsp jcsp force-pushed the jcsp/issue-8974-safekeeper-shutdown branch from 1f3847b to fa50815 Compare October 22, 2024 19:06
@arssher
Copy link
Contributor

arssher commented Oct 23, 2024

One more note: it would be reasonable to add GateGuard to Manager as well, otherwise deletion doesn't wait for manager exit.

@jcsp
Copy link
Contributor Author

jcsp commented Nov 8, 2024

One more note: it would be reasonable to add GateGuard to Manager as well, otherwise deletion doesn't wait for manager exit.

Added GateGuard in the task where we run Manager in ba40e2c

@jcsp
Copy link
Contributor Author

jcsp commented Nov 8, 2024

Sorry for the high-latency turnaround. I've been through all the comments now, although haven't done a full re-self-review to make sure it all still makes sense.

erikgrinaker

This comment was marked as outdated.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move the send/receive WAL cancellation up to the task level and just drop the futures, otherwise this LGTM.

Much simpler this time around, appreciate the cleanup.

safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/send_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/wal_backup.rs Outdated Show resolved Hide resolved
safekeeper/src/wal_backup.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this cleaned up pretty nicely.

safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/receive_wal.rs Show resolved Hide resolved
Copy link
Contributor

@arssher arssher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except +1's on Eric's comments.

safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/receive_wal.rs Show resolved Hide resolved
safekeeper/src/receive_wal.rs Outdated Show resolved Hide resolved
safekeeper/src/receive_wal.rs Show resolved Hide resolved
@jcsp jcsp enabled auto-merge (squash) November 19, 2024 17:20
@jcsp jcsp merged commit 33dce25 into main Nov 20, 2024
79 checks passed
@jcsp jcsp deleted the jcsp/issue-8974-safekeeper-shutdown branch November 20, 2024 11:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/safekeeper Component: storage: safekeeper
Projects
None yet
3 participants