-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Slurm execution #338
Conversation
errpanic is defined in cmd/sge.go but used throughout several cmd/ files, including ones that are conceptually "upstream" of sge. Move errpanic to root.go.
Several spots assign a slice of strings and then immediately and unconditionally append to it. Absorb the appended values into the initial assignment.
buildNonMemCommandString includes empty strings that get mapped to extra spaces in the final shell command (i.e. have no functional consequence).
nonMemExecutionTemplate is used by sge for writing the grid.sh script with the 'bbi nonmem run local ...' command and by local for writing the {run}.sh script with the nmfe command. local's use works, though it's a bit odd that it includes a '#$ -wd ...' comment directive that is only relevant for the SGE submission. An upcoming commit will add even more directives, and another will add a new template for Slurm. Replace nonMemExecutionTemplate with a dedicated template for local and SGE so that the local template doesn't contain several noop SGE directives.
For local and SGE execution, various spots in the generated shell scripts may need quoting. The ShQuote helper [*] is selectively applied to the working directory (for SGE the directive) and to different pieces of a command before joining them into the final command string. The next commit will move all the qsub options into the SGE template. That leads to more values that may need to be quoted (because they are coming through at the script level rather than via exec.Command arguments). We could continue to handle all of the quoting outside the template, but it's a better fit for the template because what needs quoted is inherently tied to how the script is rendered (which is the responsibility of the template to define). Also, in the future we may allow users to override the template, in which case they should be able to control quoting rather than relying on some upstream and inaccessible code to quote the right values. Note that this new approach sends all of the command's pieces through ShQuote. That's relatively inexpensive and won't lead to superfluous quoting because ShQuote first checks whether the item needs quoting. [*] ShQuote was added in e71dc47 (run: quote tokens if needed when generating shell scripts, 2023-02-15). See that commit's message for an overview of the quoting situation.
This has the advantage of capturing more of details about the run on disk and will also make it easier to align the submission logic across different batch systems because the call is reduced to a submission command that takes the submission script as the positional argument. And now that submission happens with a plain 'qsub {script}' call, it opens up the possibility of allowing callers to override the template (e.g., via a command-line argument).
* The description of --bbi_binary mentions goroutines, but the key user-visible effect of this option is controlling the bbi executable that ends up in the submission script. * When 807cbf8 (sge: use process bbi executable in scripts unless overridden, 2022-07-30) changed --bbi_binary's default to "", it should have documented the behavior that "" maps to.
sgeOperation is used in only one spot and simply holds a slice of SGEModel. Just use the slice directly.
All *LocalModel and LocalModel methods use l as the parameter. *SGEModel and SGEModel methods, on the other hand, use s for the pointer methods and l for the value methods. That inconsistency (which is flagged by golint) is confusing and presumably due to copying sge.go from local.go. Switch the parameter to s in all cases.
The upcoming slurm subcommand is going to require a lot of the same logic as the sge subcommand. Move those shared parts to grid.go and introduce a gridSpec struct that captures the scheduler-specific details.
In the upcoming Metworx, /usr/bin/mpiexec won't exist. We could set MPIEXEC_PATH to the default value of /usr/local/mpich3/bin/mpiexec (or even drop the handling entirely). The main downside is that tests could then no longer detect whether overriding the default mpiexec path with --mpi_exec_path was effective. Instead set this to the _resolved_ path of the default location. This is a value that 1) works across Metworx versions and 2) is distinct from the default bbi value because we can rely on it being a symlink. (Another option would be to set it to the highest precedence mpiexec in PATH.) While at it, make it possible to choose a different location by setting MPIEXEC_PATH in the calling environment.
Test IDs aren't used in the current validation approach. Drop them from the SGE tests to prepare for updating these tests for Slurm.
bbi_sge_test.go contains two 'bbi nonmem run sge ...' tests, one for a call without --parallel and another for a call with --parallel. Much of the test logic is not specific to SGE. Extract out the core checks so that they can be reused to test the upcoming slurm subcommand.
By default, 'make vt-test' exits with a non-zero status (via fmttests) if a test is skipped. With the upcoming support for Slurm, any given test run will only test SGE or Slurm (depending on the machine) and skip the other.
The new 'bbi nonmem run slurm' command follows 'bbi nonmem run sge' in terms of going through gridSpec, attempting to align the template directives, and testing with the same helpers. The one deliberate deviation in behavior is the output file name. For SGE, standard output and standard error are redirected to the default file name of '{run name}.o{job id}'. There's been a request (gh-312) to clean up these .o* and .po* files (only the former is relevant for Slurm). Cleaning them up is probably a bit aggressive because the .o* file contains the 'bbi run local ...' output, which is useful for troubleshooting. However, I suspect a core pain point with the these files is the _changing name_ when overwriting a model for reasons that have to do with how SVN handles deletions. So, don't mimic this behavior and instead use a consistent name to hopefully avoid this issue for Slurm. Closes #303
The 'bbi nonmem run local ...' tests include a modquote case (model named "m'd.ctl"). SGE doesn't include this case because it gets tripped up by it and ends up in an Eqw state, as described in e71dc47 (run: quote tokens if needed when generating shell scripts, 2023-02-15). Slurm handles it fine.
|
Nonmem *NonMemModel | ||
Cancel chan bool | ||
postworkInstructions *PostExecutionHookEnvironment | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this point, the code in this file is coming with light modifications from what was in sge.go
. Reviewing the diff locally will likely be much easier to digest.
Similarly, most of the integration/nonmem/grid.go
content is coming unchanged from bbi_sge_test.go
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More of a comment so don't consider it blocking. This resides in the cmd
directory, which is normally understood to contain command line actions or sub functions. It doesn't look like this returns a cobra command anywhere, so if this is just a type that we're using, it may make more sense to move it into internal
somewhere.
I think before the implementations were in the sge
file which was why it was ok, but it may make sense to just longer term move the actual interface implementations into something at the top level (if we need to expose them) or in internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shairozan Good point, I agree it makes sense to move grid.go from under cmd/
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking a quick look, I think that restructuring would require a larger rework because grid.go
depends on other things defined in cmd/
(e.g., NonMemModel
defined cmd/nonmem.go
and PostExecutionHookEnvironment
defined in cmd/run.go
). In the long run, I don't think that larger rework would be a bad thing, though for now I'm okay with introducing a cmd/
file that doesn't actually map to a CLI subcommand. (We could of course absorb the code into an existing file like cmd/run.go
, though in my view that's not an improvement organization-wise.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks great to me, nice work.
I did some manual testing (using the bbr
in metrumresearchgroup/bbr#724) on both SGE and Slurm and all worked as I expected. At a high-level, my manual tests used the 106 and 200 models from our Expo and covered the following (with both SGE and Slurm):
- submit locally, both single-threaded and parallel=TRUE (threads=8)
- submit to the grid, both single-threaded and parallel=TRUE (threads=8)
- run a bootstrap of 200 replicates, submitted to the grid in batches of 100 (via bbr bootstrap)
const slurmTemplate string = `#!/bin/bash | ||
#SBATCH --job-name={{.JobName | shquote}} | ||
#SBATCH --output=slurm.out | ||
#SBATCH --export=ALL | ||
{{- if .Config.Parallel}} | ||
#SBATCH --ntasks={{.Config.Threads}}{{end}} | ||
#SBATCH --chdir={{.WorkingDirectory | shquote}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside the context of Metworx, I suspect callers may end up needing to set additional parameters. I'm not sure exactly how that should look (and am not proposing it for this PR), but, as mentioned in 4e7df05 (sge: convert qsub options to template directives, 2024-12-06), I think this series gets us closer to making that happen:
And now that submission happens with a plain 'qsub {script}' call, it
opens up the possibility of allowing callers to override the template
(e.g., via a command-line argument).
This series adds Slurm support via a new
bbi nonmem run slurm
subcommand.Overview
These are cleanup commits that prepare for pulling out code from sge.go that will also be used for Slurm.
These commits move the qsub options to the template, so that submission command is reduced to
qsub {submission script}
and switching to submission command to Slurm can be done by swapping "qsub" for "sbatch".More cleanup in preparation for splitting
sge.go
into shared code and SGE-specific bits.This commit does the main work of the series, extracting the generic "submit to grid" code from
sge.go
togrid.go
.These commits prepare the integration tests for Slurm.
This commit wires up and exposes
bbi nonmem run slurm
.Adds a test case that works for Slurm but not for SGE.