Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add submission mode for Slurm #724

Merged
merged 9 commits into from
Jan 13, 2025
Merged

Add submission mode for Slurm #724

merged 9 commits into from
Jan 13, 2025

Conversation

kyleam
Copy link
Collaborator

@kyleam kyleam commented Dec 6, 2024

This is the sister PR to metrumresearchgroup/bbi#338 and adds support for calling bbi nonmem run slurm by passing .mode = "slurm" to submit_model.

Most of the commits in this series are minor cleanup of the tests that check dry-run calls to submit_model. The seventh commit makes the main change, which boils down to registering the new mode and adjusting the check_mode_argument helper.


Reminder for testing: by default, bbr will fail if a developmental version of bbi is used. You can prevent that by setting the bbr.DEV_no_min_version option to TRUE.

T is a variable that can be overridden and is best avoided outside of
interactive contexts.
Test IDs aren't used in the current validation approach.  Drop them
since an upcoming commit is going to touch all these lines anyway.
bbr.R takes care of the glue::glue import for shared use.
Switch from withr::with_options to withr::local_options to avoid
needing to indent all of the tests.

While touching these lines, clean up extra spaces and feed each file
through styler::style_file.
The cmd_prefix variable in test-submit-model.R and
test-submit-models.R helps keep the expected values a bit more
manageable by capturing the prefix that's shared across most cases.

Make that prefix apply to even more cases by dropping the default_mode
from it.  That means {default_mode} needs to be specified, but doing
that is more readable than inlining the prefix.
@kyleam kyleam requested a review from seth127 December 6, 2024 22:51
)
})

test_that("submit_model aborts if .mode='sge' is used with Slurm's qsub", {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests from here down are new.

withr::with_options(list(bbr.bbi_exe_path = read_bbi_path()), {
withr::local_options(list(
bbr.bbi_exe_path = read_bbi_path(),
bbr.DEV_skip_system_mode_checks = TRUE
Copy link
Collaborator Author

@kyleam kyleam Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This option is new, but there is no new test content beyond the other_mode change on line 193.

Copy link
Collaborator

@seth127 seth127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stopped by here first, on my way to metrumresearchgroup/bbi#338, and dropped a couple of comments.

All looking good overall, but I'm going to look at bbi a bit and do some manual testing before I come back around to approve this.

As usual, thank for the detailed commit messages and the nicely organized clean up.

# with a qsub shim. That leads to a confusing situation where 'bbi nonmem
# run sge ...' will not abort upfront, but the execution doesn't entirely
# match the expected behavior (especially under --parallel).
if (identical(qsub, "/opt/slurm/bin/qsub")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea, but I want to think through situations where a user may have been previously using bbr/bbi with Slurm via the qsub shim... If they now upgrade to this (upcoming) bbr release, will it be as simple as this?

  • They start seeing this error
  • They need to upgrade bbi and switch their code to explicitly call .mode = "slurm (or set the option)
  • It should work as it did before

It seems like there may be some nuance that I'm missing though. Maybe we don't need to try to go too far to cover these somewhat hacky cases, but I want to at least discuss it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] I want to think through situations where a user may have been previously using bbr/bbi with Slurm via the qsub shim...

I agree this is worth thinking through. Here's the rationale from ba073b1 for this (admittedly aggressive) approach:

...

Slurm includes a qsub shim that has some behavioral differences when
used under 'bbi nonmem run sge ...' (e.g., output files going to the
parent directory).  That shim would very likely cause confusion when
people on Slurm machines forget to override the "sge" default.

Prevent that confusion on Metworx or other systems using
ParallelCluster by aborting if "sge" is specified for .mode and qsub
is Slurm's shim.  That's aggressive and perhaps will be triggered by
people that have already been making of the shim.  However, it's
better to be strict about it because it's not an option we intend to
support.

To rephrase and expand a bit:

  1. I'm concerned about users unintentionally selecting "sge" given the default

  2. I was seeing some pretty wonky results during my (unsystematic) testing of bbi nonmme run sge with Slurm's qsub shim. Two things that come to mind (both with --parallel, I think) is 1) some generated files being placed outside the output directory and 2) the --threads value not being respected.

If they now upgrade to this (upcoming) bbr release, will it be as simple as this? [...]

I think that captures it, though "should work as it did before" will hopefully not include any weirdness like what I mentioned above.

Maybe we don't need to try to go too far to cover these somewhat hacky cases, but I want to at least discuss it.

Definitely worth discussing, and I'm open to other ideas here. As mentioned above, I would be concerned with removing the guard entirely.

If some users rely on this shim heavily, they could abuse the bbr.DEV_skip_system_mode_checks option to prevent the guard from firing, at least in the short term. If turns out that the shim is relied on more than we think, we can elevate that to a non-dev option and document it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this a bit and we discussed some offline. I agree that it's unlikely any users will actually have negative consequences from this guard and, if they do, there are at least two known workarounds:

  1. Use an older bbr to keep using the qsub shim for their Slurm submission
  2. "Abuse" the bbr.DEV_skip_system_mode_checks option to bypass the guard

In summary, I think we should leave this in here, because of the benefit that it provides to users who would otherwise be using this shim unintentionally.

tests/testthat/test-submit-model.R Outdated Show resolved Hide resolved
The upcoming bbi 3.4.0 adds a 'bbi nonmem run slurm ...' command.
Register "slurm" as a valid mode.

Keep "sge" as the default (on Linux) for now but add some guards to
help catch cases where "sge" is specified on a machine that doesn't
support SGE.

Slurm includes a qsub shim that has some behavioral differences when
used under 'bbi nonmem run sge ...' (e.g., output files going to the
parent directory).  That shim would very likely cause confusion when
people on Slurm machines forget to override the "sge" default.

Prevent that confusion on Metworx or other systems using
ParallelCluster by aborting if "sge" is specified for .mode and qsub
is Slurm's shim.  That's aggressive and perhaps will be triggered by
people that have already been making of the shim.  However, it's
better to be strict about it because it's not an option we intend to
support.
The default mode is "sge" on Linux and "local" otherwise, so the test
for inheriting bbr.bbi_exe_mode makes sure to use a non-default value
so that it can detect that the override was effective.

Looking forward, "sge" will eventually be removed, so rework the
condition to instead check "local" and use "slurm" if that's the
default.

While touching these lines, convert the 'switch' (which sprawls across
four lines when formatted with styler) to an arguably simpler 'if'.
The test that passing .mode overrides the bbr.bbi_exe_mode option
doesn't account for bbr.bbi_exe_mode being "local" on non-Linux
machines (i.e. the test would pass even if the argument didn't have an
effect).

Use the same approach that other tests do to explicitly choose a mode
other than the default.  Also extract the check and put it in its own
test next to a similar check.
Copy link
Collaborator

@seth127 seth127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great to me, nice work.

I did some manual testing (using the bbi in metrumresearchgroup/bbi#338) on both SGE and Slurm and all worked as I expected. At a high-level, my manual tests used the 106 and 200 models from our Expo and covered the following (with both SGE and Slurm):

  • submit locally, both single-threaded and parallel=TRUE (threads=8)
  • submit to the grid, both single-threaded and parallel=TRUE (threads=8)
  • run a bootstrap of 200 replicates, submitted to the grid in batches of 100 (via bbr bootstrap)

@kyleam
Copy link
Collaborator Author

kyleam commented Jan 10, 2025

I did some manual testing [...]

Thanks for giving it thorough testing.

@kyleam kyleam merged commit de4cc8d into main Jan 13, 2025
8 checks passed
@kyleam kyleam deleted the slurm branch January 13, 2025 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
next release Issues/PRs to be included on the next release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants