Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shell: add extra info to doom exceptions #6453

Merged
merged 2 commits into from
Dec 3, 2024

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Nov 21, 2024

Problem: Exceptions from the doom plugin do not specify what job failed. It can be confusing to users which job failed amongst multiple in a batch job, or if the broker from a batch job itself failed.

Add the job executable name and jobid to the doom exception outputs.

Fixes #6357


we could obviously debate the message format, I went back and forth and just ended up at this. Tweak suggestions welcome.

@chu11 chu11 force-pushed the issue6357_doom_exception branch from ce9d74d to 7f51a84 Compare November 21, 2024 19:30
@grondo
Copy link
Contributor

grondo commented Nov 22, 2024

Tweak suggestions welcome

We might be able to shorten it by just prefixing the existing exception note with jobid: arg0: . Also, I wonder if we should just include the basename of arg0? (This is just a hint so the user knows at what level the exception occurred).

Now that I'm thinking about it, I wonder if flux job attach should be the thing printing the jobid for job exceptions. Logging the jobid in the eventlog that's specific to a jobid seems redundant in retrospect.

@chu11
Copy link
Member Author

chu11 commented Nov 22, 2024

Also, I wonder if we should just include the basename of arg0?

Oh that's a good idea, that just slipped my mind.

@chu11 chu11 force-pushed the issue6357_doom_exception branch from 7f51a84 to c00ed88 Compare November 22, 2024 22:53
@chu11
Copy link
Member Author

chu11 commented Nov 22, 2024

re-pushed with tweaks per comments above

  • shell: only add basename to exception note
  • flux job attach: add jobid to output message

so now it looks like this (example from t2614-job-shell-doom)

1.105s: job.exception type=exec severity=0 jobid=f4m65MP5 test2.sh: shell rank 1 on host corona211 exited and exit-timeout=1s has expired

@grondo
Copy link
Contributor

grondo commented Nov 22, 2024

1.105s: job.exception type=exec severity=0 jobid=f4m65MP5 test2.sh: shell rank 1 on host corona211 exited and exit-timeout=1s has expired

Again, it would be shorter to just put the jobid as a prefix:

1.105s job.exception f4m65MP5 type=exec severity=0 test2.sh: shell rank1 on host corona211 exited and exit-timeout=1s has expired

Since there's no mistaking that's the jobid. However, that's pretty minor and probably not worth quibbling about.

Edit: moved jobid after job.exception in example output.

@chu11 chu11 force-pushed the issue6357_doom_exception branch 2 times, most recently from 9449739 to 69e5ceb Compare November 22, 2024 23:10
@chu11
Copy link
Member Author

chu11 commented Nov 22, 2024

ok, re-pushed with the output message tweak.

Also had to tweak two tests that had to be updated for the expected output.

Edit: output is now

0.106s: job.exception f5MVaghD type=exec severity=0 test2.sh: shell rank 1 on host corona211 failed and exit-on-error is set

src/shell/doom.c Outdated
Comment on lines 102 to 107
static char buf[PATH_MAX+1] = {0};
json_t *s = json_array_get (doom->shell->info->jobspec->command, 0);
const char *path = json_string_value (s);
char *rv;
snprintf (buf, PATH_MAX, "%s", path);
rv = basename (buf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be simpler to just do something like

const char *p = strrchr (s, "/");
const char *base = p ? p + 1 : s;

And avoid the copy? Maybe we should have our own implementation of basename() that doesn't modify its argument (I think some places in the codebase may already assume GNU basename(3), which doesn't modify its arg, but since we support alpine and someday BSD, that's probably not safe).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that would be better. When I read the manpage for basename(3) I was like "huh, it needs to modify the buffer?" Lets go with your simpler approach.

@grondo
Copy link
Contributor

grondo commented Nov 22, 2024

Ah, bummer. The change I suggested breaks some flux-sched checks which are matching flux job attach output exactly :-( (Since it affects all job exceptions, not just the exit-timeout)

Sorry if that was a bad suggestion.

@chu11
Copy link
Member Author

chu11 commented Nov 23, 2024

Ah, bummer. The change I suggested breaks some flux-sched checks which are matching flux job attach output exactly :-(

those tests are easily fixable (as is one I missed in t2304-sched-simple-alloc-check.t). Thinking we should keep the new style? Or are you thinking we should tweak?

@chu11 chu11 force-pushed the issue6357_doom_exception branch from 69e5ceb to 31800a8 Compare November 23, 2024 00:14
@grondo
Copy link
Contributor

grondo commented Nov 23, 2024

Up to you if you don't mind submitting a flux-sched PR

@chu11
Copy link
Member Author

chu11 commented Dec 2, 2024

rebased and re-pushed, should pass flux-sched now that flux-framework/flux-sched#1315 merged. Although unsure if @trws wants to comment on output format.

@trws
Copy link
Member

trws commented Dec 2, 2024

Nah, looks good to me, just helped to have the context. Thanks @chu11!

@chu11 chu11 changed the title shell: add extra info to doom exceptions WIP: shell: add extra info to doom exceptions Dec 2, 2024
@chu11
Copy link
Member Author

chu11 commented Dec 2, 2024

setting WIP, I'm not sure what happened to my branch w/ my tweak from comments above. Ugh, maybe i lost it during pre-thanksgiving break madness.

chu11 added 2 commits December 2, 2024 10:57
Problem: Exceptions from the doom plugin do not specify what
job failed.  It can be confusing to users which job failed amongst
multiple in a batch job, or if the broker from a batch job itself
failed.

Add the job executable name to the doom exception outputs.

Fixes flux-framework#6357
Problem: In a batch job, it may be difficult to discern which
job had an exception.

When flux job attach outputs an exception, also output the jobid
of the job.

Update tests in t2608-job-shell-log.t and t2304-sched-simple-alloc-check.t
for change in expected output.
@chu11 chu11 force-pushed the issue6357_doom_exception branch from 8be29d8 to 57f5124 Compare December 2, 2024 18:59
@chu11 chu11 changed the title WIP: shell: add extra info to doom exceptions shell: add extra info to doom exceptions Dec 2, 2024
@chu11
Copy link
Member Author

chu11 commented Dec 2, 2024

removed WIP, fixed things per comments above

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mergify mergify bot merged commit 3acf1d8 into flux-framework:master Dec 3, 2024
34 checks passed
Copy link

codecov bot commented Dec 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.63%. Comparing base (9c3bb07) to head (57f5124).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6453      +/-   ##
==========================================
+ Coverage   83.62%   83.63%   +0.01%     
==========================================
  Files         523      523              
  Lines       87547    87552       +5     
==========================================
+ Hits        73212    73226      +14     
+ Misses      14335    14326       -9     
Files with missing lines Coverage Δ
src/cmd/job/attach.c 85.60% <100.00%> (ø)
src/shell/doom.c 88.12% <100.00%> (+0.38%) ⬆️

... and 14 files with indirect coverage changes

@chu11 chu11 deleted the issue6357_doom_exception branch December 5, 2024 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

shell: doom timeout exception should include argv[0], jobid of affected job
3 participants