multiuser job with unkillable tasks causes job to hang indefinitely in RUN state #6656
Part of the question here is whether a job should be expected to post a […]. This does run counter to some of the statements in the documentation, such as this from flux-alloc(1): […]
In the case here, the script passed to […]. This is also a difference in behavior between multiuser jobs, in which the […]. This leads me to prefer a solution that has the IMP either exit after some attempts to SIGKILL all processes in the cgroup (deferring the wait on the cgroup to housekeeping or some other place that allows partial release), or somehow notify sdexec that the "main" user tasks have exited and have it notify the job exec module. Either way, the transient unit should probably stick around while there are active user processes, perhaps including the IMP, though that doesn't seem necessary since housekeeping also runs with privilege. I'm not sure if this is a feasible approach.
Agreed, IMHO it gives us more to work with if the unit remains running in this case. Although I think if we kill the IMP, the unit currently transitions to inactive.dead even with the cgroup non-empty. I should confirm that though. So if I understand, the job with the hung processes and the alloc instance that ran it actually terminate, so in the unit, where we once had […]
now we have […]
A quick thought is that maybe the IMP, as the main process of the unit, could inform systemd when the shell (or any process it directly spawns through run) exits by calling sd_notify(3): https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html
That might be something that could be tracked in sdexec, and perhaps start a timer that eventually lets it give up on a still-running unit.
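For illustration, here is a minimal sketch of that idea, assuming the transient unit is created with Type=notify (or NotifyAccess=main) so systemd accepts notifications from the IMP as the main process. This is not actual flux-security code; wait_shell_and_notify() and its surrounding structure are hypothetical.

```c
/* Hypothetical sketch of IMP-side notification, assuming Type=notify.
 * sd_notify(3) comes from libsystemd (link with -lsystemd).
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <systemd/sd-daemon.h>

/* Called after the IMP has forked and exec'd flux-shell as 'shell_pid'. */
static int wait_shell_and_notify (pid_t shell_pid)
{
    int status;

    /* Tell systemd the payload (flux-shell) is up and running. */
    sd_notify (0, "READY=1");

    /* When the shell exits, announce that we are stopping, even though
     * the IMP may linger waiting for the job cgroup to become empty.
     * systemd should move the unit from active.running to deactivating.
     */
    if (waitpid (shell_pid, &status, 0) < 0)
        return -1;
    sd_notify (0, "STOPPING=1");

    return status;
}
```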
Yes! That is a great idea. I was hoping there would be support for something like that.
Yes, though […]
where task1 and task2 may even have been from different jobs. Edit: and I did verify by stack trace that the IMP is in […]
I'll see if I can do an experiment: […]
Actually, for learning purposes, it might be a good idea to try setting STOPPING=1 in the broker when it starts shutting down and observe the effect. I think right now systemd status doesn't show when the broker is shutting down unless […]
OK, I did the latter quick test and verified that sending STOPPING=1 transitions the unit state to "deactivating".
That state transition is definitely available to sdexec.
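As a standalone illustration (not sdexec's actual code, and assuming the unit runs under the systemd user manager), the ActiveState/SubState pair that shows "deactivating" can be read over D-Bus with sd-bus:

```c
/* Sketch: read a unit's ActiveState and SubState over D-Bus, e.g. to
 * observe the "deactivating" transition.  Link with -lsystemd.
 */
#include <stdio.h>
#include <stdlib.h>
#include <systemd/sd-bus.h>

int main (int argc, char **argv)
{
    const char *unit = argc > 1 ? argv[1] : "imp-shell-0-example.service";
    sd_bus *bus = NULL;
    char *path = NULL;
    char *active = NULL;
    char *sub = NULL;
    int rc = 1;

    if (sd_bus_default_user (&bus) < 0)
        goto done;
    /* A unit's object path is its escaped name under this prefix. */
    if (sd_bus_path_encode ("/org/freedesktop/systemd1/unit", unit, &path) < 0)
        goto done;
    if (sd_bus_get_property_string (bus, "org.freedesktop.systemd1", path,
                                    "org.freedesktop.systemd1.Unit",
                                    "ActiveState", NULL, &active) < 0
        || sd_bus_get_property_string (bus, "org.freedesktop.systemd1", path,
                                       "org.freedesktop.systemd1.Unit",
                                       "SubState", NULL, &sub) < 0)
        goto done;
    printf ("%s: %s.%s\n", unit, active, sub);
    rc = 0;
done:
    free (active);
    free (sub);
    free (path);
    sd_bus_unref (bus);
    return rc;
}
```

sdexec itself watches for property-change signals rather than polling, but the same state strings are what appear in the annotated debug logs below as active.running, deactivating.unknown, and so on.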
OK, I tried the following: […]
Summary of results: […]
Here are some annotated logs:
# job is starting
[ +36.631023] sdexec.debug[0]: watch imp-shell-0-fBT5yq9rqqy.service
[ +36.631075] sdexec.debug[0]: start imp-shell-0-fBT5yq9rqqy.service
[ +36.634718] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: unknown.unknown
[ +36.634795] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: inactive.dead
[ +36.635232] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: activating.start
# IMP notifies it started flux-shell
[ +36.654010] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: active.running
# shell is killed manually
[Feb25 08:35] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: deactivating.unknown
# unit fails after 3m delay
[Feb25 08:38] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: failed.failed
# sdexec notes the failure and "reaps" the transient unit
[ +0.000019] sdexec.debug[0]: reset-failed imp-shell-0-fBT5yq9rqqy.service
[ +0.001584] sdexec.debug[0]: imp-shell-0-fBT5yq9rqqy.service: inactive.dead
# allocation expires after 10m limit reached
[Feb25 08:44] job-exec.info[0]: job-exception: id=ƒBT5yq9rqqy: resource allocation expired
# bunch o kills directed at nonexistent unit, probably ignored by systemd
[ +0.000566] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 14)
[Feb25 08:45] job-exec.debug[0]: Sending SIGKILL to job ƒBT5yq9rqqy
[ +5.003356] job-exec.debug[0]: Sending SIGKILL to job ƒBT5yq9rqqy
[ +10.001343] job-exec.debug[0]: Sending SIGKILL to job ƒBT5yq9rqqy
[ +15.003699] job-exec.debug[0]: Sending SIGKILL to job ƒBT5yq9rqqy
[ +20.003052] job-exec.debug[0]: Sending SIGKILL to job ƒBT5yq9rqqy
[ +20.003094] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[ +20.003294] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[ +30.010557] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[ +30.010704] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[ +50.027882] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[ +50.028034] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[Feb25 08:46] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[ +0.000141] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[Feb25 08:47] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[ +0.000190] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[Feb25 08:50] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[ +0.000160] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
[Feb25 08:55] job-exec.debug[0]: Sending SIGUSR1 to IMP for job ƒBT5yq9rqqy
[ +0.000217] sdexec.debug[0]: kill main imp-shell-0-fBT5yq9rqqy.service (signal 10)
# manual kill of sleep process - presumably closes stdout/stderr
[Feb25 08:56] sdexec.debug[0]: unwatch imp-shell-0-fBT5yq9rqqy.service
# the job finally ends
[Feb25 08:57] job-manager.debug[0]: housekeeping: ƒBT5yq9rqqy started
[ +0.423997] job-list.debug[0]: purged 1 inactive jobs
[ +0.425075] job-manager.debug[0]: purged 1 inactive jobs
[ +30.057748] job-manager.debug[0]: housekeeping: ƒBT5yq9rqqy complete
Great! Just some (probably dumb) questions:
Do we know where the 3m timeout is coming from? (Also, I assume you mean the unit transitions to the failed state?)
If the unit is in 'inactive' state, is the cgroup still present with the IMP and sleep contained? Will your […]?
Sorry, it should probably be obvious, but what's keeping the job in C state at this point?
The 3m timeout must be coming from systemd, triggered by the transition to the deactivating state. It happens to be TimeoutStopSec * 2, so maybe there is an initial SIGTERM delivery followed by SIGKILL, although I don't see evidence of the SIGTERM. I believe the cgroup is still present, but I think it only contains the sleep after the unit transitions to inactive. I think the only reason sdexec doesn't close out the subprocess and get the job out of C state is that the sleep has inherited a copy of the stdio file descriptors. As soon as I killed the sleep, sdexec stopped monitoring the unit ("unwatch") and terminated the sdexec.exec RPC, which let the job transition to inactive. So it seems like there are two things to fix: […]
Oh, wait a minute, why would the sleep have a copy of the stdio file descriptors? Its stdio should have been connected to the shell via socketpairs. Plus I nohupped it. I will add some debug to help understand why sdexec thinks it is not done as long as the sleep process is running.
Ah, I also missed that the job transitioned to CLEANUP only after it reached its time limit and a timeout exception was raised. However, this is all really good info and seems promising, thanks for spending the time to look into it!
Confirmed: it's waiting for stdout. Why, I do not know. Anyway, that won't be a problem if we implement the logic proposed above (on deactivating, send some signals, then give up and terminate the sdexec RPC).
Hm, I wonder if we're leaking the shell's stdio fds to the tasks somehow.
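If that turns out to be the case, one generic mitigation is to mark the shell-side descriptors close-on-exec so exec'd tasks cannot inherit them. A minimal sketch (not necessarily how flux-shell manages its fds today):

```c
/* Sketch: mark an fd close-on-exec so exec'd children do not inherit it. */
#include <fcntl.h>

static int set_cloexec (int fd)
{
    int flags = fcntl (fd, F_GETFD);

    if (flags < 0)
        return -1;
    return fcntl (fd, F_SETFD, flags | FD_CLOEXEC);
}
```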
Problem: an sdexec imp-shell unit can run into the following problem:
- flux-shell is killed/terminates
- there are unkillable children of flux-shell
- the IMP won't exit until the cgroup is empty
- the job appears to be running until the IMP exits with the shell exit code

This deploys some new techniques to manage such processes so that the job can complete and release resources, and orphaned processes retain a systemd unit as a handle for monitoring and management. Specifically:
- Set KillMode=process so only the IMP is signaled.
- Use Type=notify in conjunction with the IMP calling sd_notify(3) so the unit transitions to deactivating when the shell exits.
- Set TimeoutStopUsec=infinity to disable systemd's stop timeout.
- Enable sdexec's stop timeout, which is armed at deactivating, delivers SIGUSR1 (a proxy for SIGKILL) after 30s, then abandons the unit and terminates the exec RPC after another 30s. Behavior is configurable via the [exec] table.

Fixes flux-framework#6656
Problem: jobs remain in R state when the flux-shell exits with unkillable processes.

Run imp-shell units with Type=notify and the new sdexec stop timer. Disable systemd's stop timer by setting TimeoutStopUsec=infinity. This assumes the IMP has been modified to call sd_notify(3) at the appropriate transitions. The stop timer, which is enabled by default with a timeout of 30s and a signal of SIGUSR1, may be configured or disabled via the TOML [exec] table.

Fixes flux-framework#6656
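To make the proposed flow concrete, here is an illustrative sketch of a stop timer armed when the unit enters deactivating, written against the flux-core reactor API. It is not the actual sdexec implementation; kill_unit() and abandon_unit() are hypothetical placeholders for sdexec's real signal and abandon paths, and the timeout value is whatever the [exec] configuration supplies.

```c
/* Illustrative sketch only, not the actual sdexec implementation:
 * a stop timer armed when a unit enters "deactivating".
 */
#include <signal.h>
#include <stdbool.h>
#include <syslog.h>
#include <flux/core.h>

struct stopctx {
    flux_t *h;
    const char *unit;
    bool signaled;              /* SIGUSR1 already delivered? */
};

/* hypothetical placeholder: ask systemd to signal the unit's main pid */
static void kill_unit (flux_t *h, const char *unit, int signum)
{
    flux_log (h, LOG_INFO, "would signal %s with %d", unit, signum);
}

/* hypothetical placeholder: stop watching the unit, fail the exec RPC */
static void abandon_unit (flux_t *h, const char *unit)
{
    flux_log (h, LOG_INFO, "would abandon %s and end the exec RPC", unit);
}

static void stop_timer_cb (flux_reactor_t *r, flux_watcher_t *w,
                           int revents, void *arg)
{
    struct stopctx *ctx = arg;

    if (!ctx->signaled) {
        /* first expiration (e.g. 30s): signal what is left in the unit;
         * the IMP treats SIGUSR1 as a proxy for SIGKILL on the cgroup */
        kill_unit (ctx->h, ctx->unit, SIGUSR1);
        ctx->signaled = true;
    }
    else {
        /* second expiration: give up, leaving the orphaned unit behind
         * as a handle for later monitoring and cleanup */
        abandon_unit (ctx->h, ctx->unit);
        flux_watcher_stop (w);
    }
}

/* Arm on the active.running -> deactivating transition. */
static flux_watcher_t *arm_stop_timer (flux_reactor_t *r,
                                       struct stopctx *ctx,
                                       double timeout)
{
    flux_watcher_t *w;

    if (!(w = flux_timer_watcher_create (r, timeout, timeout,
                                         stop_timer_cb, ctx)))
        return NULL;
    flux_watcher_start (w);
    return w;
}
```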
This is actually a combination flux-core/flux-security issue.
In flux-framework/flux-security#187, the IMP was modified to wait until the job cgroup is empty before it exits. This is supposed to ensure that all job tasks have exited before starting the epilog, entering CLEANUP, etc.
However, there is one big drawback to this approach. When there is an unkillable process and the IMP lingers waiting for the cgroup to be empty, the job execution system thinks the job is still running, since it is waiting for the IMP to exit. This means there is no indication of a problem with job cleanup until the job hits its timeout.
Ideally, there would be some way to tell the job execution system that the main process in the job has exited and the job is in the process of trying to exit, so that users and admins are aware that there may be a problem.