Run administrative epilog even if job is canceled before starting #6055
As noted in this comment in flux-core/src/modules/job-manager/plugins/perilog.c (lines 24 to 26 in 3e6103f):
It does seem like this is an oversight: if the prolog runs, even partially, there may be some things in an epilog that should run to undo actions in the prolog. I'm not certain, though, if there's a clean way to ensure an

Of course, as mentioned offline, if housekeeping support is merged (#5818), I think the housekeeping script will be executed any time resources are released, so that will be a more reliable way to do this kind of thing.
This became an issue on rzadams today. A job was canceled while the administrative prolog was running, but after rabbit file systems had mounted and the
Ok, we should try to get this fixed in the next release. The idea would be for the epilog to be triggered by
If the prolog action described in flux-framework/flux-coral2#166 goes into production, it will make changes to the compute nodes which must be undone by a matching epilog action. However, if the job is canceled or fails before the application begins to execute, the epilog action doesn't run. This means a node can be left in a bad state, because the changes made by the prolog are never undone by the epilog.
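To make the gap concrete, here is a minimal sketch of the two policies as a toy model. It is not the perilog.c implementation and does not use any flux-core API; `JobRecord`, `needs_epilog_current`, and `needs_epilog_proposed` are hypothetical names introduced only for illustration. The "current" policy reflects the behavior described above (no epilog unless the application began to execute), and the "proposed" policy assumes the fix is to run the epilog whenever the prolog was started.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job bookkeeping, not a flux-core data structure."""
    prolog_started: bool = False       # the administrative prolog began running
    application_started: bool = False  # the application began to execute


def needs_epilog_current(job: JobRecord) -> bool:
    # Behavior described in this issue: the epilog is skipped when the job
    # is canceled or fails before the application begins to execute.
    return job.application_started


def needs_epilog_proposed(job: JobRecord) -> bool:
    # Assumed fix: if the prolog ran, even partially, run the epilog so it
    # can undo whatever the prolog changed on the node.
    return job.prolog_started or job.application_started


# A job canceled while the prolog was still running.
canceled_in_prolog = JobRecord(prolog_started=True, application_started=False)
print(needs_epilog_current(canceled_in_prolog))   # epilog skipped today
print(needs_epilog_proposed(canceled_in_prolog))  # epilog would run
```

A job canceled before the prolog ever starts needs no epilog under either policy, since there is nothing to undo.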