Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Slurm execution #338

Merged
merged 20 commits into from
Jan 13, 2025
Merged

Support Slurm execution #338

merged 20 commits into from
Jan 13, 2025

Conversation

kyleam
Copy link
Collaborator

@kyleam kyleam commented Dec 6, 2024

This series adds Slurm support via a new bbi nonmem run slurm subcommand.

Overview

[01] cmd: move errpanic from sge to root
[02] cmd: remove unnecessary appends
[03] cmd: move template error checks immediately after call sites
[04] cmd: drop extra spaces from nmfe command string
[05] sge: simplify an append

These are cleanup commits that prepare for pulling out code from sge.go that will also be used for Slurm.

[06] local, sge: decouple script templates
[07] local, sge: do shell quoting at template level
[08] sge: convert qsub options to template directives

These commits move the qsub options to the template, so that submission command is reduced to qsub {submission script} and switching to submission command to Slurm can be done by swapping "qsub" for "sbatch".

[09] sge: simplify the description of --grid_name_prefix
[10] sge: improve the description of --bbi_binary
[11] sge: drop sgeOperation struct
[12] sge: use consistent parameter name for SGEModel methods

More cleanup in preparation for splitting sge.go into shared code and SGE-specific bits.

[13] sge: decouple SGE-specific details from grid submission

This commit does the main work of the series, extracting the generic "submit to grid" code from sge.go to grid.go.

[14] run-integration-tests: adjust MPIEXEC_PATH for Metworx compatibility
[15] integration/nonmem: prune commented code in SGE tests
[16] integration/nonmem: drop test IDs from SGE tests
[17] integration/nonmem: extract SGE checks to test helpers
[18] vt-test: allow skipped tests

These commits prepare the integration tests for Slurm.

[19] nonmem run: support Slurm submission

This commit wires up and exposes bbi nonmem run slurm.

[20] slurm: include modquote test case

Adds a test case that works for Slurm but not for SGE.

kyleam added 20 commits December 6, 2024 09:05
errpanic is defined in cmd/sge.go but used throughout several cmd/
files, including ones that are conceptually "upstream" of sge.  Move
errpanic to root.go.
Several spots assign a slice of strings and then immediately and
unconditionally append to it.  Absorb the appended values into the
initial assignment.
buildNonMemCommandString includes empty strings that get mapped to
extra spaces in the final shell command (i.e. have no functional
consequence).
nonMemExecutionTemplate is used by sge for writing the grid.sh script
with the 'bbi nonmem run local ...' command and by local for writing
the {run}.sh script with the nmfe command.  local's use works, though
it's a bit odd that it includes a '#$ -wd ...' comment directive that
is only relevant for the SGE submission.

An upcoming commit will add even more directives, and another will add
a new template for Slurm.  Replace nonMemExecutionTemplate with a
dedicated template for local and SGE so that the local template
doesn't contain several noop SGE directives.
For local and SGE execution, various spots in the generated shell
scripts may need quoting.  The ShQuote helper [*] is selectively
applied to the working directory (for SGE the directive) and to
different pieces of a command before joining them into the final
command string.

The next commit will move all the qsub options into the SGE template.
That leads to more values that may need to be quoted (because they are
coming through at the script level rather than via exec.Command
arguments).

We could continue to handle all of the quoting outside the template,
but it's a better fit for the template because what needs quoted is
inherently tied to how the script is rendered (which is the
responsibility of the template to define).  Also, in the future we may
allow users to override the template, in which case they should be
able to control quoting rather than relying on some upstream and
inaccessible code to quote the right values.

Note that this new approach sends all of the command's pieces through
ShQuote.  That's relatively inexpensive and won't lead to superfluous
quoting because ShQuote first checks whether the item needs quoting.

[*] ShQuote was added in e71dc47 (run: quote tokens if needed when
    generating shell scripts, 2023-02-15).  See that commit's message
    for an overview of the quoting situation.
This has the advantage of capturing more of details about the run on
disk and will also make it easier to align the submission logic across
different batch systems because the call is reduced to a submission
command that takes the submission script as the positional argument.

And now that submission happens with a plain 'qsub {script}' call, it
opens up the possibility of allowing callers to override the template
(e.g., via a command-line argument).
 * The description of --bbi_binary mentions goroutines, but the key
   user-visible effect of this option is controlling the bbi
   executable that ends up in the submission script.

 * When 807cbf8 (sge: use process bbi executable in scripts unless
   overridden, 2022-07-30) changed --bbi_binary's default to "", it
   should have documented the behavior that "" maps to.
sgeOperation is used in only one spot and simply holds a slice of
SGEModel.  Just use the slice directly.
All *LocalModel and LocalModel methods use l as the parameter.
*SGEModel and SGEModel methods, on the other hand, use s for the
pointer methods and l for the value methods.  That inconsistency
(which is flagged by golint) is confusing and presumably due to
copying sge.go from local.go.

Switch the parameter to s in all cases.
The upcoming slurm subcommand is going to require a lot of the same
logic as the sge subcommand.  Move those shared parts to grid.go and
introduce a gridSpec struct that captures the scheduler-specific
details.
In the upcoming Metworx, /usr/bin/mpiexec won't exist.

We could set MPIEXEC_PATH to the default value of
/usr/local/mpich3/bin/mpiexec (or even drop the handling entirely).
The main downside is that tests could then no longer detect whether
overriding the default mpiexec path with --mpi_exec_path was
effective.

Instead set this to the _resolved_ path of the default location.  This
is a value that 1) works across Metworx versions and 2) is distinct
from the default bbi value because we can rely on it being a symlink.
(Another option would be to set it to the highest precedence mpiexec
in PATH.)

While at it, make it possible to choose a different location by
setting MPIEXEC_PATH in the calling environment.
Test IDs aren't used in the current validation approach.  Drop them
from the SGE tests to prepare for updating these tests for Slurm.
bbi_sge_test.go contains two 'bbi nonmem run sge ...' tests, one for a
call without --parallel and another for a call with --parallel.  Much
of the test logic is not specific to SGE.

Extract out the core checks so that they can be reused to test the
upcoming slurm subcommand.
By default, 'make vt-test' exits with a non-zero status (via fmttests)
if a test is skipped.  With the upcoming support for Slurm, any given
test run will only test SGE or Slurm (depending on the machine) and
skip the other.
The new 'bbi nonmem run slurm' command follows 'bbi nonmem run sge' in
terms of going through gridSpec, attempting to align the template
directives, and testing with the same helpers.

The one deliberate deviation in behavior is the output file name.  For
SGE, standard output and standard error are redirected to the default
file name of '{run name}.o{job id}'.

There's been a request (gh-312) to clean up these .o* and .po* files
(only the former is relevant for Slurm).  Cleaning them up is probably
a bit aggressive because the .o* file contains the 'bbi run local ...'
output, which is useful for troubleshooting.  However, I suspect a
core pain point with the these files is the _changing name_ when
overwriting a model for reasons that have to do with how SVN handles
deletions.  So, don't mimic this behavior and instead use a consistent
name to hopefully avoid this issue for Slurm.

Closes #303
The 'bbi nonmem run local ...' tests include a modquote case (model
named "m'd.ctl").  SGE doesn't include this case because it gets
tripped up by it and ends up in an Eqw state, as described in e71dc47
(run: quote tokens if needed when generating shell scripts,
2023-02-15).

Slurm handles it fine.
@kyleam
Copy link
Collaborator Author

kyleam commented Dec 6, 2024

make vt-test is passing for me on both SGE and Slurm workflows, with the caveat that both the SGE and Slurm runs fail (inconsistently) if a worker isn't primed (related discussion). I'm still not sure what's going on there, though I'm fairly convinced that the wait is in effect and that it's not a matter of increasing the wait time.

@kyleam kyleam requested a review from seth127 December 6, 2024 22:51
Nonmem *NonMemModel
Cancel chan bool
postworkInstructions *PostExecutionHookEnvironment
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this point, the code in this file is coming with light modifications from what was in sge.go. Reviewing the diff locally will likely be much easier to digest.

Similarly, most of the integration/nonmem/grid.go content is coming unchanged from bbi_sge_test.go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a comment so don't consider it blocking. This resides in the cmd directory, which is normally understood to contain command line actions or sub functions. It doesn't look like this returns a cobra command anywhere, so if this is just a type that we're using, it may make more sense to move it into internal somewhere.

I think before the implementations were in the sge file which was why it was ok, but it may make sense to just longer term move the actual interface implementations into something at the top level (if we need to expose them) or in internal.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shairozan Good point, I agree it makes sense to move grid.go from under cmd/.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a quick look, I think that restructuring would require a larger rework because grid.go depends on other things defined in cmd/ (e.g., NonMemModel defined cmd/nonmem.go and PostExecutionHookEnvironment defined in cmd/run.go). In the long run, I don't think that larger rework would be a bad thing, though for now I'm okay with introducing a cmd/ file that doesn't actually map to a CLI subcommand. (We could of course absorb the code into an existing file like cmd/run.go, though in my view that's not an improvement organization-wise.)

Copy link
Collaborator

@seth127 seth127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great to me, nice work.

I did some manual testing (using the bbr in metrumresearchgroup/bbr#724) on both SGE and Slurm and all worked as I expected. At a high-level, my manual tests used the 106 and 200 models from our Expo and covered the following (with both SGE and Slurm):

  • submit locally, both single-threaded and parallel=TRUE (threads=8)
  • submit to the grid, both single-threaded and parallel=TRUE (threads=8)
  • run a bootstrap of 200 replicates, submitted to the grid in batches of 100 (via bbr bootstrap)

Comment on lines +10 to +16
const slurmTemplate string = `#!/bin/bash
#SBATCH --job-name={{.JobName | shquote}}
#SBATCH --output=slurm.out
#SBATCH --export=ALL
{{- if .Config.Parallel}}
#SBATCH --ntasks={{.Config.Threads}}{{end}}
#SBATCH --chdir={{.WorkingDirectory | shquote}}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside the context of Metworx, I suspect callers may end up needing to set additional parameters. I'm not sure exactly how that should look (and am not proposing it for this PR), but, as mentioned in 4e7df05 (sge: convert qsub options to template directives, 2024-12-06), I think this series gets us closer to making that happen:

And now that submission happens with a plain 'qsub {script}' call, it
opens up the possibility of allowing callers to override the template
(e.g., via a command-line argument).

@kyleam kyleam merged commit eb49f40 into main Jan 13, 2025
4 checks passed
@kyleam kyleam deleted the slurm branch January 13, 2025 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants