-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve management of systemd stop timeout to avoid SIGKILL while making progress #6661
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Looks like a nice improvement. One take it or leave it comment inline.
doc/man7/flux-broker-attributes.rst
Outdated
The systemd TimeoutStopSec value (in RFC 23 Flux Standard Duration format), | ||
which is used by the broker to extend the stop timeout while it is making | ||
progress towards shutdown. This is set in the Flux systemd unit file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean broker.sd-stop-timeout
needs to be the same as TimeoutStopSec
? If so, it might be clearer to just explicitly state that in this documentation. Maybe just state what the broker does with the value (which I can't remember even though I just reviewed the code 🤦 ), then state "This value should be set equal to the TimeoutStopSec
set for the flux.service
unit." or similar.
Problem: the debian control file does not require libsystemd-dev and if it's not installed, configuration fails. Add it as a requirement in the debian control file.
Problem: the broker unconditionally calls sd_notify(3) READY=1 but all other notifications are conditional on a broker attribute. Make that one conditional too.
Problem: if a system instance is shut down with 'systemctl stop flux', as opposed to flux-shutdown(1), the whole stop process is under a 90s timeout ending in SIGKILL, which sometimes is not enough time to stop a large system instance cleanly. To mitigate this, CLEANUP state has a 45s timeout, and in FINALIZE, flux-dump(1) periodically extends the systemd stop timeout by 10s. However, if CLEANUP uses up the entire 45s, only 45s is left to split among the remaining states, more or less. Experience has shown that 'systemctl stop flux' will be used, so we need to improve management of the systemd stop timeout, and perhaps even let it apply to flux-shutdown(1), so the behavior at shutdown is not dependent on the method used. As an initial step towards this goal, add a function that effectively resets the systemd stop timer by calling sd_notify() EXTEND_TIMEOUT_USEC with a value equal to TimeoutStopSec, which is passed in from the unit file via broker attribute. Call this function on entry to each of the CLEANUP, SHUTDOWN, FINALIZE, and GOODBYE states. The state transitions are indicative of progress, and we treat the systemd stop timeout as a timeout on progress rather than on the entire stop sequence. This increases the maximum time allowed to flux for stopping, but ensures the broker won't be killed while making progress.
Problem: runat.c calls sd_notify() directly but state_machine.c is now coordinating systemd notifications. Add state_machine_sd_notify() and register that as a callback in runat.c instead of directly calling sd_notify(). As a side effect, each time runat starts a new process, the stop timer is reset (if stopping).
Problem: flux-dump(1) calls sd_notify(3) to update status and extend the systemd stop timeout, but the stop timeout management should be coordinated with the broker. Add a state-machine.sd-notify RPC. It updates status and, if stopping, extends the timeout.
Problem: flux-dump calls sd_notify() directly but this is coordinated through the broker now. Use the broker RPC instead.
Problem: flux-restore calls sd_notify() directly but this is coordinated through the broker now. Use the broker RPC instead.
Problem: flux-dump activates systemd notifications when the broker.sd-notify attribute is set, but that might result in unintended notifications outside of rc3. Activate systemd notifications via a new --sd-notify option. Use this option in rc3. Update flux-dump(1) man page.
Problem: flux-restore activates systemd notifications when the broker.sd-notify attribute is set, but that might result in unintended notifications outside of rc1. Activate systemd notifications via a new --sd-notify option. Use this option in rc1. Update flux-restore(1) man page.
Problem: some flux-restore progress reports are duplicated. Add a check to suppress duplicate progress reports.
Problem: some flux-dump progress reports are duplicated. Add a check to suppress duplicate progress reports.
Problem: the broker.sd-notify and broker.sd-stop-timeout attributes are undocumented. Add them to the man page.
Problem: the broker's systemd unit file allows any process to use sd_notify(3) to update its state. Now that all updates are coordinated through the broker, this is no longer needed. Set NotifyAccess=main (main process only).
Problem: NOTIFY_SOCKET is passed on to some broker subprocesses, but they are no longer permitted to call sd_notify(3). Add NOTIFY_SOCKET to the runat blocklist to prevent it from being passed on to the rc scripts and initial program.
Problem: subprocess server environment manipulation is unreadable as one conditional block. Spread the code out so it is easier to follow. Test for cmd_set_env() < 0 to indicate failure rather than "true". Add some comments.
Problem: NOTIFY_SOCKET might leak into the environment of a remote process from the subprocess server environment or the command environment. Unset that variable in the subprocess server.
Problem: NOTIFY_SOCKET might leak into the environment of a remote process from the subprocess server environment or the command environment. Unset that variable in the sdexec subprocess server.
7a7a533
to
7a8b91d
Compare
Thanks, made the suggested changes to the man page. Also, while experimenting with I wasted a couple hours (ok I was being dense) tracking down the fact that NOTIFY_SOCKET had leaked from the broker's environment to IMP. So I tacked on some commits to prevent inheritance of this environment variable from the broker and (because we can now) restrict notifications to the main process. LMK if this still looks good @grondo. |
Yeah, looks good! |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6661 +/- ##
==========================================
- Coverage 83.89% 83.89% -0.01%
==========================================
Files 533 533
Lines 88536 88592 +56
==========================================
+ Hits 74281 74324 +43
- Misses 14255 14268 +13
|
Problem: the subprocess server ignored failure of cmd_set_env(). This was introduced by flux-framework#6661 which was just merged. Oops! Fix error path.
Problem: the subprocess server ignores failure of cmd_set_env(). This was introduced by flux-framework#6661 which was just merged. Oops! Fix error path.
Problem: systemd imposes a 90s fixed timeout on shutdown when flux is stopped with
systemctl stop flux
, but that is not always enough time to cleanly stop flux on a big system. Data loss likely results if the timeout runs out and the broker is killed.This PR causes the stop timeout to be reset each time the broker transitions into one of the CLEANUP, SHUTDOWN, FINALIZE, or GOODBYE states. It also coordinates all stop time extension through the broker state machine rather than having
sd_notify()
calls in several locations.As a result, each of those states can take up to 90s - actually more for CLEANUP and FINALIZE (rc3) where the timeout is reset as progress is made within the state.
This is just a first step toward making
systemctl stop flux
more robust. In the future we'll need to add some more state timeouts (SHUTDOWN in particular).