Support Slurm execution #338

kyleam · 2024-12-06T22:37:09Z

This series adds Slurm support via a new bbi nonmem run slurm subcommand.

Overview

[01] cmd: move errpanic from sge to root
[02] cmd: remove unnecessary appends
[03] cmd: move template error checks immediately after call sites
[04] cmd: drop extra spaces from nmfe command string
[05] sge: simplify an append

These are cleanup commits that prepare for pulling out code from sge.go that will also be used for Slurm.

[06] local, sge: decouple script templates
[07] local, sge: do shell quoting at template level
[08] sge: convert qsub options to template directives

These commits move the qsub options to the template, so that submission command is reduced to qsub {submission script} and switching to submission command to Slurm can be done by swapping "qsub" for "sbatch".

[09] sge: simplify the description of --grid_name_prefix
[10] sge: improve the description of --bbi_binary
[11] sge: drop sgeOperation struct
[12] sge: use consistent parameter name for SGEModel methods

More cleanup in preparation for splitting sge.go into shared code and SGE-specific bits.

[13] sge: decouple SGE-specific details from grid submission

This commit does the main work of the series, extracting the generic "submit to grid" code from sge.go to grid.go.

[14] run-integration-tests: adjust MPIEXEC_PATH for Metworx compatibility
[15] integration/nonmem: prune commented code in SGE tests
[16] integration/nonmem: drop test IDs from SGE tests
[17] integration/nonmem: extract SGE checks to test helpers
[18] vt-test: allow skipped tests

These commits prepare the integration tests for Slurm.

[19] nonmem run: support Slurm submission

This commit wires up and exposes bbi nonmem run slurm.

[20] slurm: include modquote test case

Adds a test case that works for Slurm but not for SGE.

errpanic is defined in cmd/sge.go but used throughout several cmd/ files, including ones that are conceptually "upstream" of sge. Move errpanic to root.go.

Several spots assign a slice of strings and then immediately and unconditionally append to it. Absorb the appended values into the initial assignment.

buildNonMemCommandString includes empty strings that get mapped to extra spaces in the final shell command (i.e. have no functional consequence).

nonMemExecutionTemplate is used by sge for writing the grid.sh script with the 'bbi nonmem run local ...' command and by local for writing the {run}.sh script with the nmfe command. local's use works, though it's a bit odd that it includes a '#$ -wd ...' comment directive that is only relevant for the SGE submission. An upcoming commit will add even more directives, and another will add a new template for Slurm. Replace nonMemExecutionTemplate with a dedicated template for local and SGE so that the local template doesn't contain several noop SGE directives.

For local and SGE execution, various spots in the generated shell scripts may need quoting. The ShQuote helper [*] is selectively applied to the working directory (for SGE the directive) and to different pieces of a command before joining them into the final command string. The next commit will move all the qsub options into the SGE template. That leads to more values that may need to be quoted (because they are coming through at the script level rather than via exec.Command arguments). We could continue to handle all of the quoting outside the template, but it's a better fit for the template because what needs quoted is inherently tied to how the script is rendered (which is the responsibility of the template to define). Also, in the future we may allow users to override the template, in which case they should be able to control quoting rather than relying on some upstream and inaccessible code to quote the right values. Note that this new approach sends all of the command's pieces through ShQuote. That's relatively inexpensive and won't lead to superfluous quoting because ShQuote first checks whether the item needs quoting. [*] ShQuote was added in e71dc47 (run: quote tokens if needed when generating shell scripts, 2023-02-15). See that commit's message for an overview of the quoting situation.

This has the advantage of capturing more of details about the run on disk and will also make it easier to align the submission logic across different batch systems because the call is reduced to a submission command that takes the submission script as the positional argument. And now that submission happens with a plain 'qsub {script}' call, it opens up the possibility of allowing callers to override the template (e.g., via a command-line argument).

* The description of --bbi_binary mentions goroutines, but the key user-visible effect of this option is controlling the bbi executable that ends up in the submission script. * When 807cbf8 (sge: use process bbi executable in scripts unless overridden, 2022-07-30) changed --bbi_binary's default to "", it should have documented the behavior that "" maps to.

sgeOperation is used in only one spot and simply holds a slice of SGEModel. Just use the slice directly.

All *LocalModel and LocalModel methods use l as the parameter. *SGEModel and SGEModel methods, on the other hand, use s for the pointer methods and l for the value methods. That inconsistency (which is flagged by golint) is confusing and presumably due to copying sge.go from local.go. Switch the parameter to s in all cases.

The upcoming slurm subcommand is going to require a lot of the same logic as the sge subcommand. Move those shared parts to grid.go and introduce a gridSpec struct that captures the scheduler-specific details.

In the upcoming Metworx, /usr/bin/mpiexec won't exist. We could set MPIEXEC_PATH to the default value of /usr/local/mpich3/bin/mpiexec (or even drop the handling entirely). The main downside is that tests could then no longer detect whether overriding the default mpiexec path with --mpi_exec_path was effective. Instead set this to the _resolved_ path of the default location. This is a value that 1) works across Metworx versions and 2) is distinct from the default bbi value because we can rely on it being a symlink. (Another option would be to set it to the highest precedence mpiexec in PATH.) While at it, make it possible to choose a different location by setting MPIEXEC_PATH in the calling environment.

Test IDs aren't used in the current validation approach. Drop them from the SGE tests to prepare for updating these tests for Slurm.

bbi_sge_test.go contains two 'bbi nonmem run sge ...' tests, one for a call without --parallel and another for a call with --parallel. Much of the test logic is not specific to SGE. Extract out the core checks so that they can be reused to test the upcoming slurm subcommand.

By default, 'make vt-test' exits with a non-zero status (via fmttests) if a test is skipped. With the upcoming support for Slurm, any given test run will only test SGE or Slurm (depending on the machine) and skip the other.

The new 'bbi nonmem run slurm' command follows 'bbi nonmem run sge' in terms of going through gridSpec, attempting to align the template directives, and testing with the same helpers. The one deliberate deviation in behavior is the output file name. For SGE, standard output and standard error are redirected to the default file name of '{run name}.o{job id}'. There's been a request (gh-312) to clean up these .o* and .po* files (only the former is relevant for Slurm). Cleaning them up is probably a bit aggressive because the .o* file contains the 'bbi run local ...' output, which is useful for troubleshooting. However, I suspect a core pain point with the these files is the _changing name_ when overwriting a model for reasons that have to do with how SVN handles deletions. So, don't mimic this behavior and instead use a consistent name to hopefully avoid this issue for Slurm. Closes #303

The 'bbi nonmem run local ...' tests include a modquote case (model named "m'd.ctl"). SGE doesn't include this case because it gets tripped up by it and ends up in an Eqw state, as described in e71dc47 (run: quote tokens if needed when generating shell scripts, 2023-02-15). Slurm handles it fine.

kyleam · 2024-12-06T22:49:43Z

make vt-test is passing for me on both SGE and Slurm workflows, with the caveat that both the SGE and Slurm runs fail (inconsistently) if a worker isn't primed (related discussion). I'm still not sure what's going on there, though I'm fairly convinced that the wait is in effect and that it's not a matter of increasing the wait time.

kyleam · 2024-12-06T23:09:23Z

cmd/grid.go

+	Nonmem               *NonMemModel
+	Cancel               chan bool
+	postworkInstructions *PostExecutionHookEnvironment
+}


After this point, the code in this file is coming with light modifications from what was in sge.go. Reviewing the diff locally will likely be much easier to digest.

Similarly, most of the integration/nonmem/grid.go content is coming unchanged from bbi_sge_test.go.

More of a comment so don't consider it blocking. This resides in the cmd directory, which is normally understood to contain command line actions or sub functions. It doesn't look like this returns a cobra command anywhere, so if this is just a type that we're using, it may make more sense to move it into internal somewhere.

I think before the implementations were in the sge file which was why it was ok, but it may make sense to just longer term move the actual interface implementations into something at the top level (if we need to expose them) or in internal.

@shairozan Good point, I agree it makes sense to move grid.go from under cmd/.

Taking a quick look, I think that restructuring would require a larger rework because grid.go depends on other things defined in cmd/ (e.g., NonMemModel defined cmd/nonmem.go and PostExecutionHookEnvironment defined in cmd/run.go). In the long run, I don't think that larger rework would be a bad thing, though for now I'm okay with introducing a cmd/ file that doesn't actually map to a CLI subcommand. (We could of course absorb the code into an existing file like cmd/run.go, though in my view that's not an improvement organization-wise.)

seth127

This all looks great to me, nice work.

I did some manual testing (using the bbr in metrumresearchgroup/bbr#724) on both SGE and Slurm and all worked as I expected. At a high-level, my manual tests used the 106 and 200 models from our Expo and covered the following (with both SGE and Slurm):

submit locally, both single-threaded and parallel=TRUE (threads=8)
submit to the grid, both single-threaded and parallel=TRUE (threads=8)
run a bootstrap of 200 replicates, submitted to the grid in batches of 100 (via bbr bootstrap)

kyleam · 2025-01-10T22:08:59Z

cmd/slurm.go

+const slurmTemplate string = `#!/bin/bash
+#SBATCH --job-name={{.JobName | shquote}}
+#SBATCH --output=slurm.out
+#SBATCH --export=ALL
+{{- if .Config.Parallel}}
+#SBATCH --ntasks={{.Config.Threads}}{{end}}
+#SBATCH --chdir={{.WorkingDirectory | shquote}}


Outside the context of Metworx, I suspect callers may end up needing to set additional parameters. I'm not sure exactly how that should look (and am not proposing it for this PR), but, as mentioned in 4e7df05 (sge: convert qsub options to template directives, 2024-12-06), I think this series gets us closer to making that happen:

And now that submission happens with a plain 'qsub {script}' call, it
opens up the possibility of allowing callers to override the template
(e.g., via a command-line argument).

kyleam added 20 commits December 6, 2024 09:05

cmd: move errpanic from sge to root

8c5de8f

errpanic is defined in cmd/sge.go but used throughout several cmd/ files, including ones that are conceptually "upstream" of sge. Move errpanic to root.go.

cmd: remove unnecessary appends

9a81d33

Several spots assign a slice of strings and then immediately and unconditionally append to it. Absorb the appended values into the initial assignment.

cmd: move template error checks immediately after call sites

e0511c5

cmd: drop extra spaces from nmfe command string

9f5195f

buildNonMemCommandString includes empty strings that get mapped to extra spaces in the final shell command (i.e. have no functional consequence).

sge: simplify an append

dbf9c34

sge: simplify the description of --grid_name_prefix

7d07162

sge: drop sgeOperation struct

4076613

sgeOperation is used in only one spot and simply holds a slice of SGEModel. Just use the slice directly.

sge: decouple SGE-specific details from grid submission

cf4e835

The upcoming slurm subcommand is going to require a lot of the same logic as the sge subcommand. Move those shared parts to grid.go and introduce a gridSpec struct that captures the scheduler-specific details.

integration/nonmem: prune commented code in SGE tests

237de2b

integration/nonmem: drop test IDs from SGE tests

a466557

Test IDs aren't used in the current validation approach. Drop them from the SGE tests to prepare for updating these tests for Slurm.

vt-test: allow skipped tests

e31e3bb

By default, 'make vt-test' exits with a non-zero status (via fmttests) if a test is skipped. With the upcoming support for Slurm, any given test run will only test SGE or Slurm (depending on the machine) and skip the other.

kyleam mentioned this pull request Dec 6, 2024

Add submission mode for Slurm metrumresearchgroup/bbr#724

Merged

kyleam requested a review from seth127 December 6, 2024 22:51

kyleam commented Dec 6, 2024

View reviewed changes

seth127 approved these changes Jan 10, 2025

View reviewed changes

kyleam commented Jan 10, 2025

View reviewed changes

kyleam merged commit eb49f40 into main Jan 13, 2025
4 checks passed

kyleam deleted the slurm branch January 13, 2025 14:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Slurm execution #338

Support Slurm execution #338

kyleam commented Dec 6, 2024

kyleam commented Dec 6, 2024

kyleam Dec 6, 2024

shairozan Dec 9, 2024

kyleam Dec 9, 2024

kyleam Dec 9, 2024

seth127 left a comment

kyleam Jan 10, 2025

Support Slurm execution #338

Support Slurm execution #338

Conversation

kyleam commented Dec 6, 2024

Overview

kyleam commented Dec 6, 2024

kyleam Dec 6, 2024

Choose a reason for hiding this comment

shairozan Dec 9, 2024

Choose a reason for hiding this comment

kyleam Dec 9, 2024

Choose a reason for hiding this comment

kyleam Dec 9, 2024

Choose a reason for hiding this comment

seth127 left a comment

Choose a reason for hiding this comment

kyleam Jan 10, 2025

Choose a reason for hiding this comment