-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add submission mode for Slurm #724
Conversation
T is a variable that can be overridden and is best avoided outside of interactive contexts.
Test IDs aren't used in the current validation approach. Drop them since an upcoming commit is going to touch all these lines anyway.
bbr.R takes care of the glue::glue import for shared use.
Switch from withr::with_options to withr::local_options to avoid needing to indent all of the tests. While touching these lines, clean up extra spaces and feed each file through styler::style_file.
The cmd_prefix variable in test-submit-model.R and test-submit-models.R helps keep the expected values a bit more manageable by capturing the prefix that's shared across most cases. Make that prefix apply to even more cases by dropping the default_mode from it. That means {default_mode} needs to be specified, but doing that is more readable than inlining the prefix.
) | ||
}) | ||
|
||
test_that("submit_model aborts if .mode='sge' is used with Slurm's qsub", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests from here down are new.
withr::with_options(list(bbr.bbi_exe_path = read_bbi_path()), { | ||
withr::local_options(list( | ||
bbr.bbi_exe_path = read_bbi_path(), | ||
bbr.DEV_skip_system_mode_checks = TRUE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This option is new, but there is no new test content beyond the other_mode
change on line 193.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stopped by here first, on my way to metrumresearchgroup/bbi#338, and dropped a couple of comments.
All looking good overall, but I'm going to look at bbi
a bit and do some manual testing before I come back around to approve this.
As usual, thank for the detailed commit messages and the nicely organized clean up.
# with a qsub shim. That leads to a confusing situation where 'bbi nonmem | ||
# run sge ...' will not abort upfront, but the execution doesn't entirely | ||
# match the expected behavior (especially under --parallel). | ||
if (identical(qsub, "/opt/slurm/bin/qsub")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea, but I want to think through situations where a user may have been previously using bbr/bbi with Slurm via the qsub
shim... If they now upgrade to this (upcoming) bbr release, will it be as simple as this?
- They start seeing this error
- They need to upgrade
bbi
and switch their code to explicitly call.mode = "slurm
(or set the option) - It should work as it did before
It seems like there may be some nuance that I'm missing though. Maybe we don't need to try to go too far to cover these somewhat hacky cases, but I want to at least discuss it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] I want to think through situations where a user may have been previously using bbr/bbi with Slurm via the
qsub
shim...
I agree this is worth thinking through. Here's the rationale from ba073b1 for this (admittedly aggressive) approach:
...
Slurm includes a qsub shim that has some behavioral differences when
used under 'bbi nonmem run sge ...' (e.g., output files going to the
parent directory). That shim would very likely cause confusion when
people on Slurm machines forget to override the "sge" default.
Prevent that confusion on Metworx or other systems using
ParallelCluster by aborting if "sge" is specified for .mode and qsub
is Slurm's shim. That's aggressive and perhaps will be triggered by
people that have already been making of the shim. However, it's
better to be strict about it because it's not an option we intend to
support.
To rephrase and expand a bit:
-
I'm concerned about users unintentionally selecting "sge" given the default
-
I was seeing some pretty wonky results during my (unsystematic) testing of
bbi nonmme run sge
with Slurm's qsub shim. Two things that come to mind (both with--parallel
, I think) is 1) some generated files being placed outside the output directory and 2) the--threads
value not being respected.
If they now upgrade to this (upcoming) bbr release, will it be as simple as this? [...]
I think that captures it, though "should work as it did before" will hopefully not include any weirdness like what I mentioned above.
Maybe we don't need to try to go too far to cover these somewhat hacky cases, but I want to at least discuss it.
Definitely worth discussing, and I'm open to other ideas here. As mentioned above, I would be concerned with removing the guard entirely.
If some users rely on this shim heavily, they could abuse the bbr.DEV_skip_system_mode_checks
option to prevent the guard from firing, at least in the short term. If turns out that the shim is relied on more than we think, we can elevate that to a non-dev option and document it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this a bit and we discussed some offline. I agree that it's unlikely any users will actually have negative consequences from this guard and, if they do, there are at least two known workarounds:
- Use an older
bbr
to keep using theqsub
shim for their Slurm submission - "Abuse" the
bbr.DEV_skip_system_mode_checks
option to bypass the guard
In summary, I think we should leave this in here, because of the benefit that it provides to users who would otherwise be using this shim unintentionally.
The upcoming bbi 3.4.0 adds a 'bbi nonmem run slurm ...' command. Register "slurm" as a valid mode. Keep "sge" as the default (on Linux) for now but add some guards to help catch cases where "sge" is specified on a machine that doesn't support SGE. Slurm includes a qsub shim that has some behavioral differences when used under 'bbi nonmem run sge ...' (e.g., output files going to the parent directory). That shim would very likely cause confusion when people on Slurm machines forget to override the "sge" default. Prevent that confusion on Metworx or other systems using ParallelCluster by aborting if "sge" is specified for .mode and qsub is Slurm's shim. That's aggressive and perhaps will be triggered by people that have already been making of the shim. However, it's better to be strict about it because it's not an option we intend to support.
The default mode is "sge" on Linux and "local" otherwise, so the test for inheriting bbr.bbi_exe_mode makes sure to use a non-default value so that it can detect that the override was effective. Looking forward, "sge" will eventually be removed, so rework the condition to instead check "local" and use "slurm" if that's the default. While touching these lines, convert the 'switch' (which sprawls across four lines when formatted with styler) to an arguably simpler 'if'.
The test that passing .mode overrides the bbr.bbi_exe_mode option doesn't account for bbr.bbi_exe_mode being "local" on non-Linux machines (i.e. the test would pass even if the argument didn't have an effect). Use the same approach that other tests do to explicitly choose a mode other than the default. Also extract the check and put it in its own test next to a similar check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks great to me, nice work.
I did some manual testing (using the bbi
in metrumresearchgroup/bbi#338) on both SGE and Slurm and all worked as I expected. At a high-level, my manual tests used the 106 and 200 models from our Expo and covered the following (with both SGE and Slurm):
- submit locally, both single-threaded and parallel=TRUE (threads=8)
- submit to the grid, both single-threaded and parallel=TRUE (threads=8)
- run a bootstrap of 200 replicates, submitted to the grid in batches of 100 (via bbr bootstrap)
Thanks for giving it thorough testing. |
This is the sister PR to metrumresearchgroup/bbi#338 and adds support for calling
bbi nonmem run slurm
by passing.mode = "slurm"
tosubmit_model
.Most of the commits in this series are minor cleanup of the tests that check dry-run calls to
submit_model
. The seventh commit makes the main change, which boils down to registering the new mode and adjusting thecheck_mode_argument
helper.Reminder for testing: by default, bbr will fail if a developmental version of bbi is used. You can prevent that by setting the
bbr.DEV_no_min_version
option toTRUE
.