Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve management of systemd stop timeout to avoid SIGKILL while making progress #6661

Merged
merged 17 commits into from
Feb 25, 2025

Conversation

garlick
Copy link
Member

@garlick garlick commented Feb 24, 2025

Problem: systemd imposes a 90s fixed timeout on shutdown when flux is stopped with systemctl stop flux, but that is not always enough time to cleanly stop flux on a big system. Data loss likely results if the timeout runs out and the broker is killed.

This PR causes the stop timeout to be reset each time the broker transitions into one of the CLEANUP, SHUTDOWN, FINALIZE, or GOODBYE states. It also coordinates all stop time extension through the broker state machine rather than having sd_notify() calls in several locations.

As a result, each of those states can take up to 90s - actually more for CLEANUP and FINALIZE (rc3) where the timeout is reset as progress is made within the state.

This is just a first step toward making systemctl stop flux more robust. In the future we'll need to add some more state timeouts (SHUTDOWN in particular).

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Looks like a nice improvement. One take it or leave it comment inline.

Comment on lines 146 to 148
The systemd TimeoutStopSec value (in RFC 23 Flux Standard Duration format),
which is used by the broker to extend the stop timeout while it is making
progress towards shutdown. This is set in the Flux systemd unit file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean broker.sd-stop-timeout needs to be the same as TimeoutStopSec? If so, it might be clearer to just explicitly state that in this documentation. Maybe just state what the broker does with the value (which I can't remember even though I just reviewed the code 🤦 ), then state "This value should be set equal to the TimeoutStopSec set for the flux.service unit." or similar.

Problem: the debian control file does not require libsystemd-dev
and if it's not installed, configuration fails.

Add it as a requirement in the debian control file.
Problem: the broker unconditionally calls sd_notify(3) READY=1
but all other notifications are conditional on a broker attribute.

Make that one conditional too.
Problem: if a system instance is shut down with 'systemctl stop flux',
as opposed to flux-shutdown(1), the whole stop process is under a 90s
timeout ending in SIGKILL, which sometimes is not enough time to stop a
large system instance cleanly.

To mitigate this, CLEANUP state has a 45s timeout, and in FINALIZE,
flux-dump(1) periodically extends the systemd stop timeout by 10s.
However, if CLEANUP uses up the entire 45s, only 45s is left to split
among the remaining states, more or less.

Experience has shown that 'systemctl stop flux' will be used, so we need
to improve management of the systemd stop timeout, and perhaps even let
it apply to flux-shutdown(1), so the behavior at shutdown is not dependent
on the method used.

As an initial step towards this goal, add a function that effectively
resets the systemd stop timer by calling sd_notify() EXTEND_TIMEOUT_USEC
with a value equal to TimeoutStopSec, which is passed in from the unit
file via broker attribute.  Call this function on entry to each of the
CLEANUP, SHUTDOWN, FINALIZE, and GOODBYE states.  The state transitions
are indicative of progress, and we treat the systemd stop timeout as a
timeout on progress rather than on the entire stop sequence.  This
increases the maximum time allowed to flux for stopping, but ensures the
broker won't be killed while making progress.
Problem: runat.c calls sd_notify() directly but state_machine.c is
now coordinating systemd notifications.

Add state_machine_sd_notify() and register that as a callback in
runat.c instead of directly calling sd_notify().

As a side effect, each time runat starts a new process, the stop
timer is reset (if stopping).
Problem: flux-dump(1) calls sd_notify(3) to update status and
extend the systemd stop timeout, but the stop timeout management
should be coordinated with the broker.

Add a state-machine.sd-notify RPC.
It updates status and, if stopping, extends the timeout.
Problem: flux-dump calls sd_notify() directly but this is coordinated
through the broker now.

Use the broker RPC instead.
Problem: flux-restore calls sd_notify() directly but this is coordinated
through the broker now.

Use the broker RPC instead.
Problem: flux-dump activates systemd notifications when the
broker.sd-notify attribute is set, but that might result in
unintended notifications outside of rc3.

Activate systemd notifications via a new --sd-notify option.
Use this option in rc3.
Update flux-dump(1) man page.
Problem: flux-restore activates systemd notifications when the
broker.sd-notify attribute is set, but that might result in
unintended notifications outside of rc1.

Activate systemd notifications via a new --sd-notify option.
Use this option in rc1.
Update flux-restore(1) man page.
Problem: some flux-restore progress reports are duplicated.

Add a check to suppress duplicate progress reports.
Problem: some flux-dump progress reports are duplicated.

Add a check to suppress duplicate progress reports.
Problem: the broker.sd-notify and broker.sd-stop-timeout attributes
are undocumented.

Add them to the man page.
Problem: the broker's systemd unit file allows any process to
use sd_notify(3) to update its state.

Now that all updates are coordinated through the broker, this
is no longer needed.  Set NotifyAccess=main (main process only).
Problem: NOTIFY_SOCKET is passed on to some broker subprocesses,
but they are no longer permitted to call sd_notify(3).

Add NOTIFY_SOCKET to the runat blocklist to prevent it from
being passed on to the rc scripts and initial program.
Problem: subprocess server environment manipulation is unreadable
as one conditional block.

Spread the code out so it is easier to follow.
Test for cmd_set_env() < 0 to indicate failure rather than "true".
Add some comments.
Problem: NOTIFY_SOCKET might leak into the environment of a remote
process from the subprocess server environment or the command
environment.

Unset that variable in the subprocess server.
Problem: NOTIFY_SOCKET might leak into the environment of a remote
process from the subprocess server environment or the command
environment.

Unset that variable in the sdexec subprocess server.
@garlick
Copy link
Member Author

garlick commented Feb 25, 2025

Thanks, made the suggested changes to the man page.

Also, while experimenting with

I wasted a couple hours (ok I was being dense) tracking down the fact that NOTIFY_SOCKET had leaked from the broker's environment to IMP. So I tacked on some commits to prevent inheritance of this environment variable from the broker and (because we can now) restrict notifications to the main process. LMK if this still looks good @grondo.

@grondo
Copy link
Contributor

grondo commented Feb 25, 2025

Yeah, looks good!

@mergify mergify bot merged commit 4b0a520 into flux-framework:master Feb 25, 2025
35 checks passed
Copy link

codecov bot commented Feb 25, 2025

Codecov Report

Attention: Patch coverage is 84.76190% with 16 lines in your changes missing coverage. Please review.

Project coverage is 83.89%. Comparing base (9762f63) to head (7a8b91d).
Report is 18 commits behind head on master.

Files with missing lines Patch % Lines
src/broker/state_machine.c 62.50% 12 Missing ⚠️
src/common/libsubprocess/server.c 66.66% 3 Missing ⚠️
src/broker/broker.c 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6661      +/-   ##
==========================================
- Coverage   83.89%   83.89%   -0.01%     
==========================================
  Files         533      533              
  Lines       88536    88592      +56     
==========================================
+ Hits        74281    74324      +43     
- Misses      14255    14268      +13     
Files with missing lines Coverage Δ
src/broker/runat.c 80.60% <100.00%> (+1.00%) ⬆️
src/cmd/builtin/dump.c 88.17% <100.00%> (+2.82%) ⬆️
src/cmd/builtin/restore.c 90.17% <100.00%> (+2.67%) ⬆️
src/modules/sdexec/sdexec.c 71.17% <100.00%> (+0.30%) ⬆️
src/broker/broker.c 77.35% <66.66%> (ø)
src/common/libsubprocess/server.c 79.36% <66.66%> (-1.07%) ⬇️
src/broker/state_machine.c 81.06% <62.50%> (-0.96%) ⬇️

... and 4 files with indirect coverage changes

@garlick garlick deleted the stop_timeout branch February 25, 2025 17:17
garlick added a commit to garlick/flux-core that referenced this pull request Feb 25, 2025
Problem: the subprocess server ignored failure of cmd_set_env().

This was introduced by flux-framework#6661 which was just merged.  Oops!

Fix error path.
garlick added a commit to garlick/flux-core that referenced this pull request Feb 25, 2025
Problem: the subprocess server ignores failure of cmd_set_env().

This was introduced by flux-framework#6661 which was just merged.  Oops!

Fix error path.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants