-
Notifications
You must be signed in to change notification settings - Fork 490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
safekeeper: block deletion on protocol handler shutdown #9364
Conversation
5534 tests run: 5308 passed, 0 failed, 226 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
96ae2bb at 2024-11-20T10:33:23.074Z :recycle: |
d2587ef
to
1f3847b
Compare
Addressed some comments & rebased on main, a few still changes are still oustanding |
1f3847b
to
fa50815
Compare
One more note: it would be reasonable to add |
Added GateGuard in the task where we run Manager in ba40e2c |
Sorry for the high-latency turnaround. I've been through all the comments now, although haven't done a full re-self-review to make sure it all still makes sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should move the send/receive WAL cancellation up to the task level and just drop the futures, otherwise this LGTM.
Much simpler this time around, appreciate the cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this cleaned up pretty nicely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except +1's on Eric's comments.
Problem
Two recently observed log errors indicate safekeeper tasks for a timeline running after that timeline's deletion has started.
WAL removal task failed: No such file or directory
#8972WAL sender: failed to open WAL file
#8974These code paths do not have a mechanism that coordinates task shutdown with the overall shutdown of the timeline.
Summary of changes
Gate
toTimeline
There is some risk to introducing gates: if there is code holding a gate which does not properly respect a cancellation token, it can cause shutdown hangs. The risk of this for safekeepers is lower in practice than it is for other services, because in a healthy timeline deletion, the compute is shutdown first, then the timeline is deleted on the pageserver, and finally it is deleted on the safekeepers -- that makes it much less likely that some protocol handler will still be running.
Closes: #8972
Closes: #8974
Checklist before requesting a review
Checklist before merging