Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd: improve error reporting #336

Merged
merged 11 commits into from
Oct 17, 2024
Merged

cmd: improve error reporting #336

merged 11 commits into from
Oct 17, 2024

Conversation

kyleam
Copy link
Collaborator

@kyleam kyleam commented Oct 9, 2024

This series focuses on improving errors reported by bbi nonmem run {local,sge}.

The two main improvements make information more available for troubleshooting:

  • the underlying error is now included when reporting turnstile.ConcurrentErrors
  • even when there is an error, the standard output and standard error for the local and SGE command invocations are written to {model}.out

kyleam added 9 commits October 7, 2024 19:43
Files in cmd/ create turnstile.ConcurrentError instances with a mix of
newConcurrentError calls and struct literals.  Drop the helper in
favor of consistently using struct literals.
Some of the turnstile.ConcurrentError notes are capitalized.  That
doesn't play well with postWorkNotice, which embeds the notes in
"Details are %s".
postWorkNotice includes only the run identifier and notes when logging
the ConcurrentError list.  While the notes provide the overall context
for the error, the underlying errors often contain additional details
that are useful for troubleshooting.
If exec.LookPath returns an error, executeSGEJob aborts and returns a
turnstile.ConcurrentError, but it has around twenty lines in between
the call and error handling, none of which are related to the error.

Return the error right after the LookPath call.
generateScript and generateBbiScript _replace_ a template parsing or
execution error with a new one.  Theses spots are unlikely to error
given the templates are hard coded, but it's still useful to include
the underlying error to make troubleshooting tweaks easier.
executeLocalJob checks whether running a command leads to an
ExitError.  If so, it logs the exit code and error string
output. Including the error string just leads to a redundant message
like

    Exit code was 127, details were exit status 127

Instead combine the exit code message with the following one that
includes the combined stdout/stderr.
Like executeLocalJob, executeSGEJob runs a command with CombinedOutput
and returns a turnstile.ConcurrentError if that results in an error.
Unlike executeLocalJob, it doesn't relay the combined stdout/stderr
when it hits an ExitError.  The ConcurrentError doesn't include the
stdout/stderr, and, in the error case, the output isn't written to
disk, so the caller has no way to access these details.

Update executeSGEJob to log ExitErrors in the same way executeLocalJob
does.
executeLocalJob and executeSGEJob have nearly identical code for
running a command and handling the errors.  While the code is not too
complicated, it's worth consolidating because 1) it ensures that all
runners (local, SGE, and upcoming Slurm) behave in the same way (e.g.,
in terms of reporting errors and writing output files) and 2) the
logic is going to get a bit more involved with the next commit.
The new runModelCommand was extracted from executeLocalJob and
executeSGEJob and behaves as they did in terms of handling the output.
The output is collected with CombinedOutput.  If there's an error, the
output is logged before returning a ConcurrentError.  The output is
written to _disk_ only if there was not an error.

Capturing the output on disk is helpful for troubleshooting.  Rework
runModelCommand to send standard output and standard error directly to
a file.
Copy link
Collaborator

@seth127 seth127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes make sense to me.

Base automatically changed from exec-bit-cleanup to main October 17, 2024 18:02
@kyleam kyleam merged commit edc6102 into main Oct 17, 2024
4 checks passed
@kyleam kyleam deleted the cmd-errors branch October 17, 2024 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants