improve management of systemd stop timeout to avoid SIGKILL while making progress #6661

garlick · 2025-02-24T21:05:13Z

Problem: systemd imposes a 90s fixed timeout on shutdown when flux is stopped with systemctl stop flux, but that is not always enough time to cleanly stop flux on a big system. Data loss likely results if the timeout runs out and the broker is killed.

This PR causes the stop timeout to be reset each time the broker transitions into one of the CLEANUP, SHUTDOWN, FINALIZE, or GOODBYE states. It also coordinates all stop time extension through the broker state machine rather than having sd_notify() calls in several locations.

As a result, each of those states can take up to 90s - actually more for CLEANUP and FINALIZE (rc3) where the timeout is reset as progress is made within the state.

This is just a first step toward making systemctl stop flux more robust. In the future we'll need to add some more state timeouts (SHUTDOWN in particular).

src/broker/state_machine.c

grondo

LGTM! Looks like a nice improvement. One take it or leave it comment inline.

grondo · 2025-02-25T00:30:03Z

doc/man7/flux-broker-attributes.rst

+   The systemd TimeoutStopSec value (in RFC 23 Flux Standard Duration format),
+   which is used by the broker to extend the stop timeout while it is making
+   progress towards shutdown.  This is set in the Flux systemd unit file.


Does this mean broker.sd-stop-timeout needs to be the same as TimeoutStopSec? If so, it might be clearer to just explicitly state that in this documentation. Maybe just state what the broker does with the value (which I can't remember even though I just reviewed the code 🤦 ), then state "This value should be set equal to the TimeoutStopSec set for the flux.service unit." or similar.

Problem: the debian control file does not require libsystemd-dev and if it's not installed, configuration fails. Add it as a requirement in the debian control file.

Problem: the broker unconditionally calls sd_notify(3) READY=1 but all other notifications are conditional on a broker attribute. Make that one conditional too.

Problem: if a system instance is shut down with 'systemctl stop flux', as opposed to flux-shutdown(1), the whole stop process is under a 90s timeout ending in SIGKILL, which sometimes is not enough time to stop a large system instance cleanly. To mitigate this, CLEANUP state has a 45s timeout, and in FINALIZE, flux-dump(1) periodically extends the systemd stop timeout by 10s. However, if CLEANUP uses up the entire 45s, only 45s is left to split among the remaining states, more or less. Experience has shown that 'systemctl stop flux' will be used, so we need to improve management of the systemd stop timeout, and perhaps even let it apply to flux-shutdown(1), so the behavior at shutdown is not dependent on the method used. As an initial step towards this goal, add a function that effectively resets the systemd stop timer by calling sd_notify() EXTEND_TIMEOUT_USEC with a value equal to TimeoutStopSec, which is passed in from the unit file via broker attribute. Call this function on entry to each of the CLEANUP, SHUTDOWN, FINALIZE, and GOODBYE states. The state transitions are indicative of progress, and we treat the systemd stop timeout as a timeout on progress rather than on the entire stop sequence. This increases the maximum time allowed to flux for stopping, but ensures the broker won't be killed while making progress.

Problem: runat.c calls sd_notify() directly but state_machine.c is now coordinating systemd notifications. Add state_machine_sd_notify() and register that as a callback in runat.c instead of directly calling sd_notify(). As a side effect, each time runat starts a new process, the stop timer is reset (if stopping).

Problem: flux-dump(1) calls sd_notify(3) to update status and extend the systemd stop timeout, but the stop timeout management should be coordinated with the broker. Add a state-machine.sd-notify RPC. It updates status and, if stopping, extends the timeout.

Problem: flux-dump calls sd_notify() directly but this is coordinated through the broker now. Use the broker RPC instead.

Problem: flux-restore calls sd_notify() directly but this is coordinated through the broker now. Use the broker RPC instead.

Problem: flux-dump activates systemd notifications when the broker.sd-notify attribute is set, but that might result in unintended notifications outside of rc3. Activate systemd notifications via a new --sd-notify option. Use this option in rc3. Update flux-dump(1) man page.

Problem: flux-restore activates systemd notifications when the broker.sd-notify attribute is set, but that might result in unintended notifications outside of rc1. Activate systemd notifications via a new --sd-notify option. Use this option in rc1. Update flux-restore(1) man page.

Problem: some flux-restore progress reports are duplicated. Add a check to suppress duplicate progress reports.

Problem: some flux-dump progress reports are duplicated. Add a check to suppress duplicate progress reports.

Problem: the broker.sd-notify and broker.sd-stop-timeout attributes are undocumented. Add them to the man page.

Problem: the broker's systemd unit file allows any process to use sd_notify(3) to update its state. Now that all updates are coordinated through the broker, this is no longer needed. Set NotifyAccess=main (main process only).

Problem: NOTIFY_SOCKET is passed on to some broker subprocesses, but they are no longer permitted to call sd_notify(3). Add NOTIFY_SOCKET to the runat blocklist to prevent it from being passed on to the rc scripts and initial program.

Problem: subprocess server environment manipulation is unreadable as one conditional block. Spread the code out so it is easier to follow. Test for cmd_set_env() < 0 to indicate failure rather than "true". Add some comments.

Problem: NOTIFY_SOCKET might leak into the environment of a remote process from the subprocess server environment or the command environment. Unset that variable in the subprocess server.

Problem: NOTIFY_SOCKET might leak into the environment of a remote process from the subprocess server environment or the command environment. Unset that variable in the sdexec subprocess server.

garlick · 2025-02-25T15:58:45Z

Thanks, made the suggested changes to the man page.

Also, while experimenting with

multiuser job with unkillable tasks causes job to hang indefinitely in RUN state #6656

I wasted a couple hours (ok I was being dense) tracking down the fact that NOTIFY_SOCKET had leaked from the broker's environment to IMP. So I tacked on some commits to prevent inheritance of this environment variable from the broker and (because we can now) restrict notifications to the main process. LMK if this still looks good @grondo.

grondo · 2025-02-25T16:10:42Z

Yeah, looks good!

codecov · 2025-02-25T16:20:28Z

Codecov Report

Attention: Patch coverage is 84.76190% with 16 lines in your changes missing coverage. Please review.

Project coverage is 83.89%. Comparing base (9762f63) to head (7a8b91d).
Report is 18 commits behind head on master.

Files with missing lines	Patch %	Lines
src/broker/state_machine.c	62.50%	12 Missing ⚠️
src/common/libsubprocess/server.c	66.66%	3 Missing ⚠️
src/broker/broker.c	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6661      +/-   ##
==========================================
- Coverage   83.89%   83.89%   -0.01%     
==========================================
  Files         533      533              
  Lines       88536    88592      +56     
==========================================
+ Hits        74281    74324      +43     
- Misses      14255    14268      +13

Files with missing lines	Coverage Δ
src/broker/runat.c	`80.60% <100.00%> (+1.00%)`	⬆️
src/cmd/builtin/dump.c	`88.17% <100.00%> (+2.82%)`	⬆️
src/cmd/builtin/restore.c	`90.17% <100.00%> (+2.67%)`	⬆️
src/modules/sdexec/sdexec.c	`71.17% <100.00%> (+0.30%)`	⬆️
src/broker/broker.c	`77.35% <66.66%> (ø)`
src/common/libsubprocess/server.c	`79.36% <66.66%> (-1.07%)`	⬇️
src/broker/state_machine.c	`81.06% <62.50%> (-0.96%)`	⬇️

... and 4 files with indirect coverage changes

Problem: the subprocess server ignored failure of cmd_set_env(). This was introduced by flux-framework#6661 which was just merged. Oops! Fix error path.

Problem: the subprocess server ignores failure of cmd_set_env(). This was introduced by flux-framework#6661 which was just merged. Oops! Fix error path.

github-advanced-security bot found potential problems Feb 24, 2025

View reviewed changes

src/broker/state_machine.c Dismissed Show dismissed Hide dismissed

grondo approved these changes Feb 25, 2025

View reviewed changes

garlick added 17 commits February 25, 2025 06:48

deb: require libsystemd-dev

6634d85

Problem: the debian control file does not require libsystemd-dev and if it's not installed, configuration fails. Add it as a requirement in the debian control file.

broker: fix conditional sd_notify()

05e42db

Problem: the broker unconditionally calls sd_notify(3) READY=1 but all other notifications are conditional on a broker attribute. Make that one conditional too.

flux-dump: use state-machine.sd-notify

0b0cf78

Problem: flux-dump calls sd_notify() directly but this is coordinated through the broker now. Use the broker RPC instead.

flux-restore: use state-machine.sd-notify

39dd8e2

Problem: flux-restore calls sd_notify() directly but this is coordinated through the broker now. Use the broker RPC instead.

flux-restore: avoid duplicate notifications

f797ca5

Problem: some flux-restore progress reports are duplicated. Add a check to suppress duplicate progress reports.

flux-dump: avoid duplicate notifications

1892c40

Problem: some flux-dump progress reports are duplicated. Add a check to suppress duplicate progress reports.

flux-broker-attributes(7): add missing attrs

2d1a7f6

Problem: the broker.sd-notify and broker.sd-stop-timeout attributes are undocumented. Add them to the man page.

systemd: set NotifyAccess=main

7b98c0e

Problem: the broker's systemd unit file allows any process to use sd_notify(3) to update its state. Now that all updates are coordinated through the broker, this is no longer needed. Set NotifyAccess=main (main process only).

broker: add NOTIFY_SOCKET to runat blocklist

a03a7c1

Problem: NOTIFY_SOCKET is passed on to some broker subprocesses, but they are no longer permitted to call sd_notify(3). Add NOTIFY_SOCKET to the runat blocklist to prevent it from being passed on to the rc scripts and initial program.

libsubprocess: make server env handling readable

982d7ac

Problem: subprocess server environment manipulation is unreadable as one conditional block. Spread the code out so it is easier to follow. Test for cmd_set_env() < 0 to indicate failure rather than "true". Add some comments.

libsubprocess: don't leak NOTIFY_SOCKET into env

dd1efac

Problem: NOTIFY_SOCKET might leak into the environment of a remote process from the subprocess server environment or the command environment. Unset that variable in the subprocess server.

sdexec: don't leak NOTIFY_SOCKET into env

7a8b91d

Problem: NOTIFY_SOCKET might leak into the environment of a remote process from the subprocess server environment or the command environment. Unset that variable in the sdexec subprocess server.

garlick force-pushed the stop_timeout branch from 7a7a533 to 7a8b91d Compare February 25, 2025 15:54

garlick added the merge-when-passing label Feb 25, 2025

mergify bot merged commit 4b0a520 into flux-framework:master Feb 25, 2025
35 checks passed

garlick deleted the stop_timeout branch February 25, 2025 17:17

garlick added a commit to garlick/flux-core that referenced this pull request Feb 25, 2025

libsubprocess: fix error path bug

1254d78

Problem: the subprocess server ignored failure of cmd_set_env(). This was introduced by flux-framework#6661 which was just merged. Oops! Fix error path.

garlick added a commit to garlick/flux-core that referenced this pull request Feb 25, 2025

libsubprocess: fix error path bug

ddbb39b

Problem: the subprocess server ignores failure of cmd_set_env(). This was introduced by flux-framework#6661 which was just merged. Oops! Fix error path.

garlick mentioned this pull request Feb 25, 2025

libsubprocess: fix error path bug #6663

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve management of systemd stop timeout to avoid SIGKILL while making progress #6661

improve management of systemd stop timeout to avoid SIGKILL while making progress #6661

garlick commented Feb 24, 2025

grondo left a comment

grondo Feb 25, 2025

garlick commented Feb 25, 2025

grondo commented Feb 25, 2025

codecov bot commented Feb 25, 2025

improve management of systemd stop timeout to avoid SIGKILL while making progress #6661

improve management of systemd stop timeout to avoid SIGKILL while making progress #6661

Conversation

garlick commented Feb 24, 2025

grondo left a comment

Choose a reason for hiding this comment

grondo Feb 25, 2025

Choose a reason for hiding this comment

garlick commented Feb 25, 2025

grondo commented Feb 25, 2025

codecov bot commented Feb 25, 2025

Codecov Report