multiuser job with unkillable tasks causes job to hang indefinitely in RUN state #6656

Closed

grondo opened this issue Feb 21, 2025 · 14 comments

@grondo
Contributor

grondo commented Feb 21, 2025

This is actually a combined flux-core/flux-security issue.

In flux-framework/flux-security#187, the IMP was modified to wait until the job cgroup is empty before it exits. This is supposed to ensure that all job tasks have exited before starting the epilog, entering CLEANUP, etc.

However, there is one big drawback to this approach. When there is an unkillable process and the IMP lingers waiting for the cgroup to be empty, the job execution system thinks the job is still running, since it is waiting for the IMP to exit. This means there is no indication of a problem with job cleanup until the job hits its timeout.

Ideally, there would be some way to tell the job execution system that the main job process has exited and the IMP is now just trying to clean up and exit, so that users and admins are aware that there may be a problem.

@grondo
Contributor Author

grondo commented Feb 21, 2025

Part of the question here is whether a job should be expected to post a finish event once all user tasks have exited, or whether that should be delayed until all processes created by the job have gone away. With the current IMP configuration, Flux is doing the second thing.

This does run counter to some of the statements in the documentation, such as this one from flux-alloc(1):

When COMMAND exits, the Flux subinstance exits, resources are released to the enclosing Flux instance, and flux alloc returns.

In the case here, the script passed to flux alloc terminated, but the job continues in the RUN state until timeout because of some unkillable processes that were not even direct descendants of the initial program (they are leftover processes from flux run invocations that eventually timed out). I can sympathize with a user who questions why their job did not terminate when COMMAND did.

This is also a difference in behavior between multiuser jobs, in which the finish event is delayed until the cgroup is empty, and Flux jobs in all other situations, where the finish event is posted as soon as the job shell exits, which occurs when all tasks have exited.

This leads me to prefer a solution where the IMP either exits after some attempts to SIGKILL all processes in the cgroup, deferring the wait on the cgroup to housekeeping or some other place that allows partial release, or somehow notifies sdexec that the "main" user tasks have exited so that it can notify the job exec module.

Either way, the transient unit should probably stick around while there are active user processes, perhaps including the IMP, though that doesn't seem necessary since housekeeping also runs with privilege. I'm not sure if this is a feasible approach.

@garlick
Member

garlick commented Feb 22, 2025

Either way, the transient unit should probably stick around while there are active user processes, perhaps including the IMP, though that doesn't seem necessary since housekeeping also runs with privilege. I'm not sure if this is a feasible approach.

Agreed, IMHO it gives us more to work with if the unit remains running in this case, although I think that if we kill the IMP the unit currently transitions to inactive.dead even with the cgroup non-empty. I should confirm that, though.

So if I understand correctly, the job with the hung processes and the alloc instance that ran it actually terminate, so in the unit, where we once had

imp->shell->broker->shell->task(s)

now we have

imp->task(s)

@garlick
Member

garlick commented Feb 22, 2025

A quick thought: maybe the IMP, as the main process of the unit, could inform systemd when the shell (or any process it directly spawns through run) exits by calling sd_notify() with STOPPING=1.

https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html

That might be something that could be tracked in sdexec and perhaps start a timer that eventually lets it give up on a still-running unit.
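
For concreteness, here is a minimal sketch of that idea (not the actual flux-imp code), assuming the unit runs with Type=notify and the main process links against libsystemd; the exec'd command is a placeholder:

/* notify-sketch.c - hedged illustration; build with: cc notify-sketch.c -lsystemd
 * The main process of a Type=notify unit reports READY=1 once its child
 * is launched and STOPPING=1 when the child exits, which transitions the
 * unit from active to deactivating.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <systemd/sd-daemon.h>

int main (void)
{
    pid_t pid = fork ();
    if (pid == 0) {
        execlp ("sleep", "sleep", "5", (char *) NULL); /* stand-in for flux-shell */
        _exit (127);
    }
    sd_notify (0, "READY=1");       /* unit: activating -> active */
    waitpid (pid, NULL, 0);         /* "shell" (here: sleep) has exited */
    sd_notify (0, "STOPPING=1");    /* unit: active -> deactivating */
    /* ...the real IMP would now wait for the job cgroup to empty... */
    return 0;
}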

@grondo
Contributor Author

grondo commented Feb 22, 2025

Yes! That is a great idea. I was hoping there would be support for something like that.

@grondo
Contributor Author

grondo commented Feb 22, 2025

So if I understand correctly, the job with the hung processes and the alloc instance that ran it actually terminate, so in the unit, where we once had

imp->shell->broker->shell->task(s)

now we have

imp->task(s)

Yes, though the task(s) are not direct children of the IMP; it is more like:

imp
├── task1
└── task2

where task1 and task2 may even have been from different jobs.

Edit: I also verified via stack trace that the IMP is in cgroup_wait_for_empty()
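
For reference, that wait-for-empty mechanism can be approximated as follows. This is only a hedged sketch of the general technique on cgroup v2, not flux-security's actual cgroup_wait_for_empty(), and the unit path is illustrative:

/* cgroup-wait-sketch.c - poll the cgroup v2 "populated" flag until 0 */
#include <stdio.h>
#include <unistd.h>

static int cgroup_populated (const char *events_path)
{
    char line[256];
    int populated = -1;
    FILE *f = fopen (events_path, "r");
    if (!f)
        return -1;
    while (fgets (line, sizeof (line), f)) {
        if (sscanf (line, "populated %d", &populated) == 1)
            break;
    }
    fclose (f);
    return populated;
}

int main (void)
{
    /* illustrative path, not necessarily where Flux puts job cgroups */
    const char *path = "/sys/fs/cgroup/system.slice/"
                       "imp-shell-example.service/cgroup.events";
    while (cgroup_populated (path) > 0)
        sleep (1);  /* the real implementation could use inotify instead */
    return 0;
}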

@garlick
Member

garlick commented Feb 22, 2025

I'll see if I can do an experiment:

  • change the transient unit for jobs from Type=exec to Type=notify
  • add sd_notify() calls to the right places in the imp for READY=1 and STOPPING=1
  • run a job like flux run nohup sleep inf or similar (tasks detached from shell)
  • kill the shell and watch for unit state change

Actually, for learning purposes, it might be a good idea to try setting STOPPING=1 in the broker when it starts shutting down and observe the effect. I think right now systemctl status doesn't show when the broker is shutting down unless systemctl stop is used.

@garlick
Member

garlick commented Feb 22, 2025

OK, I did the latter quick test and verified that sending STOPPING=1 transitions the unit state to "deactivating".
This is the status line from systemctl status flux after running flux shutdown with the above changes.

Active: deactivating (stop-sigterm) since Sat 2025-02-22 10:31:10 PST; 2s...

That state transition is definitely available to sdexec.
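
(For anyone who wants to observe this outside of Flux, something along the following lines should work; this is a hypothetical reproduction, assuming systemd-run and systemd-notify are available. NotifyAccess=all is needed because the notifying process is not the unit's main PID:)

# hypothetical standalone reproduction, not the actual flux test
systemd-run --user --unit=notify-test --service-type=notify \
    -p NotifyAccess=all \
    /bin/sh -c 'systemd-notify --ready; sleep 5; systemd-notify STOPPING=1; sleep 300'
# after ~5s the unit should report "deactivating" until systemd's
# stop timeout machinery kicks in
systemctl --user status notify-test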

@garlick
Member

garlick commented Feb 25, 2025

OK, I tried the following

  • modify flux to run the job as a Type=notify unit
  • modify the IMP to notify READY=1 after launching the shell and STOPPING=1 when the shell terminates
  • do a flux run nohup sleep inf
  • manually kill -9 the flux-shell

Summary of results

  • the unit transitions to deactivating state as planned. Job is in R state and sleep is running.
  • after exactly 3m, the job transitions to failed state. Job is in R state and sleep is running.
  • sdexec resets the failed unit which transitions it to inactive state. Job is in R state and sleep is running
  • allocation times out, job is in C state and sleep is running
  • after manual kill of sleep, job becomes inactive

Here are some annotated logs

# job is starting
[ +36.631023] sdexec.debug[0]: watch imp-shell-0-fBT5yq9rqqy.service
[ +36.631075] sdexec.debug[0]: start imp-shell-0-fBT5yq9rqqy.service
[ +36.634718] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: unknown.unknown
[ +36.634795] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: inactive.dead
[ +36.635232] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: activating.start
# IMP notifies it started flux-shell
[ +36.654010] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: active.running
# shell is killed manually
[Feb25 08:35] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: deactivating.unknown
# unit fails after 3m delay
[Feb25 08:38] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: failed.failed
# sdexec notes the failure and "reaps" the transient unit
[  +0.000019] sdexec.debug[0]: reset-failed imp-shell-0-fBT5yq9rqqy.service
[  +0.001584] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: inactive.dead
# allocation expires after 10m limit reached
[Feb25 08:44] job-exec.info[0]: job-exception: id=ƒBT5yq9rqqy: resource allocation expired
# bunch o kills directed at nonexistent unit, probably ignored by systemd
[  +0.000566] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 14)
[Feb25 08:45] job-exec.debug[0]: Sending SIGKILL to job ƒBT5yq9rqqy
[  +5.003356] job-exec.debug[0]: Sending SIGKILL to job ƒBT5yq9rqqy
[ +10.001343] job-exec.debug[0]: Sending SIGKILL to job ƒBT5yq9rqqy
[ +15.003699] job-exec.debug[0]: Sending SIGKILL to job ƒBT5yq9rqqy
[ +20.003052] job-exec.debug[0]: Sending SIGKILL to job ƒBT5yq9rqqy
[ +20.003094] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[ +20.003294] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[ +30.010557] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[ +30.010704] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[ +50.027882] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[ +50.028034] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[Feb25 08:46] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[  +0.000141] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[Feb25 08:47] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[  +0.000190] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[Feb25 08:50] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[  +0.000160] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[Feb25 08:55] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[  +0.000217] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
# manual kill of sleep process - presumably closes stdout/stderr
[Feb25 08:56] sdexec.debug[0]: unwatch imp-shell-0-fBT5yq9rqqy.service
# the job finally ends
[Feb25 08:57] job-manager.debug[0]: housekeeping: ƒBT5yq9rqqy started
[  +0.423997] job-list.debug[0]: purged 1 inactive jobs
[  +0.425075] job-manager.debug[0]: purged 1 inactive jobs
[ +30.057748] job-manager.debug[0]: housekeeping: ƒBT5yq9rqqy complete

@grondo
Contributor Author

grondo commented Feb 25, 2025

Great!

Just some (probably dumb) questions:

after exactly 3m, the job transitions to failed state. Job is in R state and sleep is running.

Do we know where the 3m timeout is coming from? (Also I assume you mean the unit transitions to failed state?)

sdexec resets the failed unit which transitions it to inactive state. Job is in R state and sleep is running

If the unit is in 'inactive' state, is the cgroup still present with the IMP and sleep contained? Will your sdmon in #6616 still find inactive transient units after a restart so that nodes with unkillable processes do not join a restarting instance?

allocation times out, job is in C state and sleep is running
after manual kill of sleep, job becomes inactive

Sorry, it should probably be obvious, but what's keeping the job in C state at this point?

@garlick
Member

garlick commented Feb 25, 2025

The 3m timeout must be coming from systemd, triggered by the transition to deactivating state. It happens to be TimeoutStopSec * 2, so maybe there is an initial SIGTERM delivery followed by SIGKILL, although I don't see evidence of the SIGTERM.

I believe the cgroup is still present but I think it only contains the sleep after the unit transitions to inactive. I think the only reason sdexec doesn't close out the subprocess and get the job out of C state is that the sleep has inherited a copy of the stdio file descriptors. As soon as I killed the sleep, sdexec stopped monitoring the unit ("unwatch") and terminated the sdexec.exec RPC, which let the job transition to inactive.

So it seems like there are two things to fix:

  • disable the systemd stop timeout so the IMP is not killed by systemd. Ideally, as long as there are processes in the cgroup, the IMP should stay around and keep the unit in deactivating state. The IMP is the only entity with permission to signal those processes, which run as the user, and the unit is the only way sdmon has to detect orphan processes.
  • sdexec isn't doing anything with the transition to deactivating state. It should start a timer and deliver signals to the IMP, starting with SIGTERM, then SIGUSR1 (the SIGKILL proxy), then give up and fail the subprocess (see the sketch after this list).
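
Here is a hedged sketch of that second item, using the public libflux reactor API; the kill_unit() helper is hypothetical and stands in for sdexec asking systemd to signal the unit, and the 30s interval mirrors the discussion above:

/* stop-timer-sketch.c - escalation logic sketch, not sdexec's code.
 * Build against flux-core: cc stop-timer-sketch.c -lflux-core
 */
#include <signal.h>
#include <stdio.h>
#include <flux/core.h>

static int attempts = 0;

static void kill_unit (int signum)  /* hypothetical stand-in for KillUnit */
{
    printf ("ask systemd to send signal %d to the IMP\n", signum);
}

static void stop_timer_cb (flux_reactor_t *r,
                           flux_watcher_t *w,
                           int revents,
                           void *arg)
{
    switch (attempts++) {
        case 0:
            kill_unit (SIGTERM);    /* first, ask nicely */
            break;
        case 1:
            kill_unit (SIGUSR1);    /* the IMP's proxy for SIGKILL */
            break;
        default:
            /* give up: abandon the unit and fail the subprocess */
            flux_watcher_stop (w);
            flux_reactor_stop (r);
            break;
    }
}

int main (void)
{
    flux_reactor_t *r = flux_reactor_create (0);
    /* in sdexec this would be armed when the unit enters deactivating */
    flux_watcher_t *w = flux_timer_watcher_create (r, 30., 30.,
                                                   stop_timer_cb, NULL);
    flux_watcher_start (w);
    flux_reactor_run (r, 0);
    flux_watcher_destroy (w);
    flux_reactor_destroy (r);
    return 0;
}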

@garlick
Member

garlick commented Feb 25, 2025

Oh, wait a minute, why would the sleep have a copy of the stdio file descriptors? Its stdio should have been connected to the shell via socketpairs. Plus, I nohupped it. I will add some debugging to help understand why sdexec thinks it is not done as long as the sleep process is running.

@grondo
Contributor Author

grondo commented Feb 25, 2025

Ah, I also missed that the job transitioned to CLEANUP only after it reached its time limit and a timeout exception was raised.

However, this is all really good info and seems promising, thanks for spending the time to look into it!

@garlick
Member

garlick commented Feb 25, 2025

Confirmed: it's waiting for stdout. Why, I do not know. Anyway, that won't be a problem if we implement the logic proposed above (on deactivating, send some signals, then give up and terminate the sdexec RPC).

@grondo
Contributor Author

grondo commented Feb 25, 2025

Hm, I wonder if we're leaking the shell's stdio fds to the tasks somehow.

garlick added a commit to garlick/flux-core that referenced this issue Feb 26, 2025
Problem: an sdexec imp-shell unit can run into the following problem:
- flux-shell is killed/terminates
- there are unkillable children of flux-shell
- the IMP won't exit until the cgroup is empty
- the job appears to be running until the IMP exits with the shell exit code

This deploys some new techniques to manage such processes, ensuring
that the job can complete and release resources, and that orphaned
processes retain a systemd unit as a handle for monitoring and
management.

Specifically:

- Set KillMode=process so only the IMP is signaled.
- Use Type=notify in conjunction with IMP calling sd_notify(3) so
  the unit transitions to deactivating when the shell exits.
- Set TimeoutStopUsec=infinity to disable systemd's stop timeout.
- Enable sdexec's stop timeout which is armed at deactivating,
  delivers SIGUSR1 (proxy for SIGKILL) after 30s, then abandons
  the unit and terminates the exec RPC after another 30s.

Behavior is configurable via the [exec] table.

Fixes flux-framework#6656
garlick added a commit to garlick/flux-core that referenced this issue Feb 27, 2025
Problem: jobs remain in R state when the flux-shell exits with
unkillable processes.

Run imp-shell units with Type=notify and the new sdexec stop timer.
Disable systemd's stop timer by setting TimeoutStopUsec=infinity.
This assumes the IMP has been modified to call sd_notify(3) at
appropriate transitions.

The stop timer, which is enabled by default with a timeout of 30s
and signal of SIGUSR1, may be configured or disabled via the TOML
[exec] table.

Fixes flux-framework#6656
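
For illustration, such configuration would live in the TOML [exec] table, along the lines of the following sketch; the key names here are hypothetical, so consult flux-config-exec(5) for the real ones:

[exec]
# hypothetical key names, for illustration only
sdexec-stop-timer-sec = 30       # delay before each escalation step
sdexec-stop-timer-signal = 10    # SIGUSR1, the IMP's proxy for SIGKILL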
@mergify mergify bot closed this as completed in 058387d Feb 28, 2025