Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run epilog even if job prolog fails or is canceled #6249

Merged
merged 5 commits into from
Sep 3, 2024

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Aug 31, 2024

This PR addresses #6055 by triggering the job epilog when the prolog completes if it fails, e.g. due to timeout or if it is canceled due to a job exception.

Fixes #6055

Copy link
Member

@jameshcorbett jameshcorbett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks!

Problem: When a perilog process completes, the same two calls are
always made, causing duplicated code.

Combine the two calls into a new perilog_proc_finish() call.
Update callers.
Problem: In order to trigger an epilog as a prolog is finishing,
some extra data will need to be carried in the perilog proc
structure.

Add userid and R to struct perilog_proc for this purpose.
Problem: When the prolog fails for any reason, including a timeout,
cancelation, or a failure on one or more nodes, the associated job
is not started and therefore any configured epilog is not started.
This is problematic if the prolog did something that is assumed to
be undone by the epilog.

Start any configured epilog immediately when a prolog fails.
The epilog-start event is posted before the prolog-finish in order
to prevent the job from reaching INACTIVE state immediately.
Problem: No tests ensure that a job epilog runs even after a prolog
fails.

Add a test to t2274-manager-perilog-per-rank.t.
Problem: The description of when the job epilog is run is not correct
in flux-config-job-manager(5).

Amend the description to include that the epilog runs after prolog
failure as well as on the finish event.
@grondo
Copy link
Contributor Author

grondo commented Sep 3, 2024

Setting MWP here, Thanks!

@mergify mergify bot merged commit 7e179c1 into flux-framework:master Sep 3, 2024
32 checks passed
@grondo grondo deleted the issue#6055 branch September 3, 2024 21:16
Copy link

codecov bot commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 85.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 83.34%. Comparing base (598aa8d) to head (25942cb).
Report is 450 commits behind head on master.

Files with missing lines Patch % Lines
src/modules/job-manager/plugins/perilog.c 85.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6249      +/-   ##
==========================================
- Coverage   83.34%   83.34%   -0.01%     
==========================================
  Files         521      521              
  Lines       85658    85693      +35     
==========================================
+ Hits        71393    71418      +25     
- Misses      14265    14275      +10     
Files with missing lines Coverage Δ
src/modules/job-manager/plugins/perilog.c 81.35% <85.00%> (+0.26%) ⬆️

... and 10 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Run administrative epilog even if job is canceled before starting
2 participants