Run administrative epilog even if job is canceled before starting #6055

Closed
jameshcorbett opened this issue Jun 25, 2024 · 3 comments · Fixed by #6249

Comments

@jameshcorbett
Member

If the prolog action described in flux-framework/flux-coral2#166 goes into production, it will make changes to the compute nodes which must be undone by a matching epilog action. However, if the job is canceled or fails before the application begins to execute, the epilog action doesn't run. This means a node can be left in a bad state, with the changes made by the prolog never undone by the epilog.

@grondo
Contributor

grondo commented Jun 25, 2024

As noted in this comment in plugins/perilog.c, the epilog is only executed on a finish event:

* - The epilog is started as a result of a "finish" event,
* and therefore the job manager epilog is only run if
* job shells are actually started.

It does seem like this is an oversight: if the prolog runs, even partially, there may be some things in an epilog that should run to undo actions in the prolog. I'm not certain, though, if there's a clean way to ensure an epilog-start event is emitted in time when the job transitions to CLEANUP via exception before the start event. 🤔

Of course, as mentioned offline, if housekeeping support is merged (#5818), I think the housekeeping script will be executed any time resources are released, so that will be a more guaranteed way to do this kind of thing.

@jameshcorbett
Member Author

This became an issue on rzadams today. A job was canceled while the administrative prolog was running, but after rabbit file systems had mounted and the nnf-clientmount daemon had stopped. The job was never given a finish event so the job-manager.epilog (which turns back on nnf-clientmount) never started. However, the dws-epilog action started, holding the job until the file systems were cleaned up. At that point the job-manager.epilog was needed to turn the nnf-clientmount daemon back on to unmount the file systems and release the dws-epilog action.

1724871544.451634 submit userid=54987 urgency=16 flags=0 version=1
1724871544.475686 validate
1724871544.489844 dependency-add description="dws-create"
1724871544.502098 memo rabbit_workflow="fluxjob-295403862288761856"
1724871546.027610 dependency-remove description="dws-create"
1724871546.027660 depend
1724871546.027710 priority priority=16
1724871546.044127 alloc annotations={"user":{"rabbit_workflow":"fluxjob-295403862288761856"}}
1724871546.044192 prolog-start description="cray-pals-port-distributor"
1724871546.044197 prolog-start description="dws-setup"
1724871546.044476 prolog-start description="job-manager.prolog"
1724871546.044630 cray_port_distribution ports=[11943,11942] random_integer=216365323720312173
1724871546.044669 prolog-finish description="cray-pals-port-distributor" status=0
1724871547.027156 memo rabbits="rzadams207"
1724871561.024485 dws_environment variables={"DW_JOB_scrcache":"/mnt/nnf/84af6c15-7f96-4c41-a537-8ea248cbd5ed-0","DW_WORKFLOW_NAME":"fluxjob-295403862288761856","DW_WORKFLOW_NAMESPACE":"default"} rabbits={"rzadams207":"rzadams[1097-1112]"} copy_offload=true
1724871561.024619 prolog-finish description="dws-setup" status=0
1724873710.261341 exception type="cancel" severity=0 note="interrupted by ctrl-C" userid=54987
1724873710.261561 epilog-start description="dws-epilog"
1724873710.313850 exception type="prolog" severity=0 note="prolog killed by signal 15 (timeout or job canceled)" userid=764
1724873710.313942 prolog-finish description="job-manager.prolog" status=36608
1724873760.862896 exception type="cancel" severity=0 note="interrupted by ctrl-C" userid=54987
1724873760.874724 exception type="cancel" severity=0 note="interrupted by ctrl-C" userid=54987
1724873760.908778 exception type="cancel" severity=0 note="interrupted by ctrl-C" userid=54987
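
For anyone who wants to check other jobs for the same pattern, here is a rough Python sketch that scans eventlog text like the above and reports whether job-manager.prolog started without job-manager.epilog ever starting. The helper names (`parse_event`, `field`, `missing_epilog`, `prolog_status`, `exit_code`) are hypothetical and not part of flux-core, and the parsing is a simple regex over the printed text rather than a real eventlog decoder. It also assumes the `status` value is a wait-status-style encoding with the exit code in the second byte; under that assumption, 36608 decodes to 143, which is 128 + 15 and lines up with the "killed by signal 15" note.

```python
import re

# Short excerpt of the eventlog above, copied verbatim.
SAMPLE = """\
1724871546.044476 prolog-start description="job-manager.prolog"
1724873710.261341 exception type="cancel" severity=0 note="interrupted by ctrl-C" userid=54987
1724873710.261561 epilog-start description="dws-epilog"
1724873710.313850 exception type="prolog" severity=0 note="prolog killed by signal 15 (timeout or job canceled)" userid=764
1724873710.313942 prolog-finish description="job-manager.prolog" status=36608
""".splitlines()


def parse_event(line):
    """Split one printed eventlog line into (timestamp, name, context text)."""
    timestamp, name, *rest = line.split(maxsplit=2)
    return float(timestamp), name, rest[0] if rest else ""


def field(context, key):
    """Return the value of key=value or key="value" in the context, else None."""
    match = re.search(rf'\b{re.escape(key)}=(?:"([^"]*)"|(\S+))', context)
    if match is None:
        return None
    return match.group(1) if match.group(1) is not None else match.group(2)


def missing_epilog(lines, prolog="job-manager.prolog", epilog="job-manager.epilog"):
    """True if the named prolog started but the named epilog never did."""
    prolog_started = False
    epilog_started = False
    for line in lines:
        _, name, context = parse_event(line)
        description = field(context, "description")
        if name == "prolog-start" and description == prolog:
            prolog_started = True
        elif name == "epilog-start" and description == epilog:
            epilog_started = True
    return prolog_started and not epilog_started


def prolog_status(lines, prolog="job-manager.prolog"):
    """Return the integer status from the named prolog's prolog-finish event."""
    for line in lines:
        _, name, context = parse_event(line)
        if name == "prolog-finish" and field(context, "description") == prolog:
            return int(field(context, "status"))
    return None


def exit_code(status):
    """Assumption: status is wait-status encoded, exit code in bits 8-15."""
    return (status >> 8) & 0xFF


if __name__ == "__main__":
    print("epilog missing:", missing_epilog(SAMPLE))
    print("prolog exit code:", exit_code(prolog_status(SAMPLE)))
```

Note that dws-epilog does start in this log, so the check has to match on the description and not just on the presence of any epilog-start event.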

@grondo
Contributor

grondo commented Aug 29, 2024

Ok, we should try to get this fixed in the next release. The idea would be that the epilog is triggered by prolog-finish if the prolog was canceled.
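
To make the proposed rule concrete, here is a toy Python model of the two trigger policies: epilog on "finish" only, versus epilog also on the prolog-finish of a canceled prolog. This is only a sketch of the idea, not how perilog.c or #6249 implements it. The function name `epilog_trigger` and the flag `start_on_canceled_prolog` are made up for illustration, and "the prolog was canceled" is approximated here as "an exception event was seen before job-manager.prolog's prolog-finish", which is an assumption on my part.

```python
def epilog_trigger(events, start_on_canceled_prolog):
    """Return the name of the event that would start the epilog, or None.

    events is a list of (name, context) pairs in eventlog order.
    Hypothetical model only; not the flux-core implementation.
    """
    exception_seen = False
    for name, context in events:
        if name == "exception":
            exception_seen = True
        elif name == "finish":
            # Existing behavior: the epilog starts on a finish event.
            return "finish"
        elif (
            name == "prolog-finish"
            and start_on_canceled_prolog
            and exception_seen
            and context.get("description") == "job-manager.prolog"
        ):
            # Proposed behavior: a canceled prolog also starts the epilog.
            return "prolog-finish"
    return None


# Simplified version of the rzadams job: canceled during the prolog.
CANCELED = [
    ("prolog-start", {"description": "job-manager.prolog"}),
    ("exception", {"type": "cancel"}),
    ("exception", {"type": "prolog"}),
    ("prolog-finish", {"description": "job-manager.prolog", "status": 36608}),
]

# A job that runs normally and reaches finish.
NORMAL = [
    ("prolog-start", {"description": "job-manager.prolog"}),
    ("prolog-finish", {"description": "job-manager.prolog", "status": 0}),
    ("start", {}),
    ("finish", {"status": 0}),
]

if __name__ == "__main__":
    for label, events in (("canceled", CANCELED), ("normal", NORMAL)):
        for flag in (False, True):
            print(label, flag, epilog_trigger(events, flag))
```

With the finish-only policy the canceled job never gets an epilog, which is the hang described above; with the extra rule it starts at prolog-finish, while the normal job is unaffected.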

@grondo grondo added this to the flux-core-0.66.0 milestone Aug 29, 2024
@mergify mergify bot closed this as completed in #6249 Sep 3, 2024