Add submission mode for Slurm #724

kyleam · 2024-12-06T22:41:36Z

This is the sister PR to metrumresearchgroup/bbi#338 and adds support for calling bbi nonmem run slurm by passing .mode = "slurm" to submit_model.

Most of the commits in this series are minor cleanup of the tests that check dry-run calls to submit_model. The seventh commit makes the main change, which boils down to registering the new mode and adjusting the check_mode_argument helper.

Reminder for testing: by default, bbr will fail if a developmental version of bbi is used. You can prevent that by setting the bbr.DEV_no_min_version option to TRUE.

T is a variable that can be overridden and is best avoided outside of interactive contexts.

Test IDs aren't used in the current validation approach. Drop them since an upcoming commit is going to touch all these lines anyway.

bbr.R takes care of the glue::glue import for shared use.

Switch from withr::with_options to withr::local_options to avoid needing to indent all of the tests. While touching these lines, clean up extra spaces and feed each file through styler::style_file.

The cmd_prefix variable in test-submit-model.R and test-submit-models.R helps keep the expected values a bit more manageable by capturing the prefix that's shared across most cases. Make that prefix apply to even more cases by dropping the default_mode from it. That means {default_mode} needs to be specified, but doing that is more readable than inlining the prefix.

kyleam · 2024-12-06T23:12:17Z

tests/testthat/test-submit-model.R

+  )
+})
+
+test_that("submit_model aborts if .mode='sge' is used with Slurm's qsub", {


The tests from here down are new.

kyleam · 2024-12-06T23:13:11Z

tests/testthat/test-submit-models.R

-withr::with_options(list(bbr.bbi_exe_path = read_bbi_path()), {
+withr::local_options(list(
+  bbr.bbi_exe_path = read_bbi_path(),
+  bbr.DEV_skip_system_mode_checks = TRUE


This option is new, but there is no new test content beyond the other_mode change on line 193.

seth127

I stopped by here first, on my way to metrumresearchgroup/bbi#338, and dropped a couple of comments.

All looking good overall, but I'm going to look at bbi a bit and do some manual testing before I come back around to approve this.

As usual, thank for the detailed commit messages and the nicely organized clean up.

seth127 · 2024-12-18T18:39:27Z

R/submit-model.R

+    # with a qsub shim. That leads to a confusing situation where 'bbi nonmem
+    # run sge ...' will not abort upfront, but the execution doesn't entirely
+    # match the expected behavior (especially under --parallel).
+    if (identical(qsub, "/opt/slurm/bin/qsub")) {


I like this idea, but I want to think through situations where a user may have been previously using bbr/bbi with Slurm via the qsub shim... If they now upgrade to this (upcoming) bbr release, will it be as simple as this?

They start seeing this error

They need to upgrade bbi and switch their code to explicitly call .mode = "slurm (or set the option)

It should work as it did before

It seems like there may be some nuance that I'm missing though. Maybe we don't need to try to go too far to cover these somewhat hacky cases, but I want to at least discuss it.

[...] I want to think through situations where a user may have been previously using bbr/bbi with Slurm via the qsub shim...

I agree this is worth thinking through. Here's the rationale from ba073b1 for this (admittedly aggressive) approach:

... Slurm includes a qsub shim that has some behavioral differences when used under 'bbi nonmem run sge ...' (e.g., output files going to the parent directory). That shim would very likely cause confusion when people on Slurm machines forget to override the "sge" default. Prevent that confusion on Metworx or other systems using ParallelCluster by aborting if "sge" is specified for .mode and qsub is Slurm's shim. That's aggressive and perhaps will be triggered by people that have already been making of the shim. However, it's better to be strict about it because it's not an option we intend to support.

To rephrase and expand a bit:

I'm concerned about users unintentionally selecting "sge" given the default

I was seeing some pretty wonky results during my (unsystematic) testing of bbi nonmme run sge with Slurm's qsub shim. Two things that come to mind (both with --parallel, I think) is 1) some generated files being placed outside the output directory and 2) the --threads value not being respected.

If they now upgrade to this (upcoming) bbr release, will it be as simple as this? [...]

I think that captures it, though "should work as it did before" will hopefully not include any weirdness like what I mentioned above.

Maybe we don't need to try to go too far to cover these somewhat hacky cases, but I want to at least discuss it.

Definitely worth discussing, and I'm open to other ideas here. As mentioned above, I would be concerned with removing the guard entirely.

If some users rely on this shim heavily, they could abuse the bbr.DEV_skip_system_mode_checks option to prevent the guard from firing, at least in the short term. If turns out that the shim is relied on more than we think, we can elevate that to a non-dev option and document it.

I thought about this a bit and we discussed some offline. I agree that it's unlikely any users will actually have negative consequences from this guard and, if they do, there are at least two known workarounds:

Use an older bbr to keep using the qsub shim for their Slurm submission

"Abuse" the bbr.DEV_skip_system_mode_checks option to bypass the guard

In summary, I think we should leave this in here, because of the benefit that it provides to users who would otherwise be using this shim unintentionally.

tests/testthat/test-submit-model.R

The upcoming bbi 3.4.0 adds a 'bbi nonmem run slurm ...' command. Register "slurm" as a valid mode. Keep "sge" as the default (on Linux) for now but add some guards to help catch cases where "sge" is specified on a machine that doesn't support SGE. Slurm includes a qsub shim that has some behavioral differences when used under 'bbi nonmem run sge ...' (e.g., output files going to the parent directory). That shim would very likely cause confusion when people on Slurm machines forget to override the "sge" default. Prevent that confusion on Metworx or other systems using ParallelCluster by aborting if "sge" is specified for .mode and qsub is Slurm's shim. That's aggressive and perhaps will be triggered by people that have already been making of the shim. However, it's better to be strict about it because it's not an option we intend to support.

The default mode is "sge" on Linux and "local" otherwise, so the test for inheriting bbr.bbi_exe_mode makes sure to use a non-default value so that it can detect that the override was effective. Looking forward, "sge" will eventually be removed, so rework the condition to instead check "local" and use "slurm" if that's the default. While touching these lines, convert the 'switch' (which sprawls across four lines when formatted with styler) to an arguably simpler 'if'.

The test that passing .mode overrides the bbr.bbi_exe_mode option doesn't account for bbr.bbi_exe_mode being "local" on non-Linux machines (i.e. the test would pass even if the argument didn't have an effect). Use the same approach that other tests do to explicitly choose a mode other than the default. Also extract the check and put it in its own test next to a similar check.

seth127

This all looks great to me, nice work.

I did some manual testing (using the bbi in metrumresearchgroup/bbi#338) on both SGE and Slurm and all worked as I expected. At a high-level, my manual tests used the 106 and 200 models from our Expo and covered the following (with both SGE and Slurm):

submit locally, both single-threaded and parallel=TRUE (threads=8)
submit to the grid, both single-threaded and parallel=TRUE (threads=8)
run a bootstrap of 200 replicates, submitted to the grid in batches of 100 (via bbr bootstrap)

kyleam · 2025-01-10T20:25:36Z

I did some manual testing [...]

Thanks for giving it thorough testing.

kyleam added 6 commits December 6, 2024 14:33

DESCRIPTION: bump roxygen2 version

ae9e105

test-submit-model{,s}.R: avoid T shorthand for TRUE

4b7df3e

T is a variable that can be overridden and is best avoided outside of interactive contexts.

test-submit-model{,s}.R: drop test IDs

3adfa57

Test IDs aren't used in the current validation approach. Drop them since an upcoming commit is going to touch all these lines anyway.

test-submit-model{,s}.R: don't qualify glue::glue

b740881

bbr.R takes care of the glue::glue import for shared use.

test-submit-model{,s}.R: reduce indentation with local_options

04ab211

Switch from withr::with_options to withr::local_options to avoid needing to indent all of the tests. While touching these lines, clean up extra spaces and feed each file through styler::style_file.

kyleam requested a review from seth127 December 6, 2024 22:51

kyleam commented Dec 6, 2024

View reviewed changes

seth127 reviewed Dec 18, 2024

View reviewed changes

kyleam added 3 commits December 19, 2024 13:05

kyleam force-pushed the slurm branch from 25a7f1c to 01f39e6 Compare December 19, 2024 18:07

seth127 mentioned this pull request Jan 7, 2025

batch_submit_bg() doesn't see options(bbr.DEV_no_min_version) #732

Open

seth127 added the next release Issues/PRs to be included on the next release label Jan 9, 2025

seth127 mentioned this pull request Jan 10, 2025

Support Slurm execution metrumresearchgroup/bbi#338

Merged

seth127 approved these changes Jan 10, 2025

View reviewed changes

kyleam merged commit de4cc8d into main Jan 13, 2025
8 checks passed

kyleam deleted the slurm branch January 13, 2025 14:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add submission mode for Slurm #724

Add submission mode for Slurm #724

kyleam commented Dec 6, 2024

kyleam Dec 6, 2024

kyleam Dec 6, 2024 •

edited

Loading

seth127 left a comment

seth127 Dec 18, 2024

kyleam Dec 19, 2024

seth127 Jan 10, 2025

seth127 left a comment

kyleam commented Jan 10, 2025

Add submission mode for Slurm #724

Add submission mode for Slurm #724

Conversation

kyleam commented Dec 6, 2024

kyleam Dec 6, 2024

Choose a reason for hiding this comment

kyleam Dec 6, 2024 • edited Loading

Choose a reason for hiding this comment

seth127 left a comment

Choose a reason for hiding this comment

seth127 Dec 18, 2024

Choose a reason for hiding this comment

kyleam Dec 19, 2024

Choose a reason for hiding this comment

seth127 Jan 10, 2025

Choose a reason for hiding this comment

seth127 left a comment

Choose a reason for hiding this comment

kyleam commented Jan 10, 2025

kyleam Dec 6, 2024 •

edited

Loading