scontrol reconfigure fails if slurmd charm is deployed to machine with < 1G RealMemory in slurm.conf #64

Open
NucciTheBoss opened this issue Jan 16, 2025 · 0 comments
Labels
bug Something isn't working


Bug Description

scontrol reconfigure fails with the error message scontrol: error: NodeNames=juju-d566c2-1 MemSpecLimit=1024 is invalid, reset to 0 after a compute node is added to slurm.conf, if the machine the slurmd charm is deployed to has less than 1G of memory allocated to it.

The cause of this error is that the value of RealMemory can never be less than the value of MemSpecLimit. In the slurmd charm, however, RealMemory is dynamic, as it is reported by slurmd -C and parsed by the charm, while MemSpecLimit is hardcoded to the constant "1024":

# Requires the module-level imports `import subprocess` and `import logging`,
# with `_logger = logging.getLogger(__name__)`.
def get_slurmd_info() -> dict[str, str | list[str]]:
    """Get machine info as reported by `slurmd -C`.

    For details see: https://slurm.schedmd.com/slurmd.html
    """
    try:
        r = subprocess.check_output(["slurmd", "-C"], text=True).strip()
    except subprocess.CalledProcessError as e:
        _logger.error(e)
        raise

    info = {}
    # Parse the `key=value` tokens, skipping the final token (the UpTime field).
    for opt in r.split()[:-1]:
        k, v = opt.split("=")
        if k == "Gres":
            info[k] = v.split(",")
            continue
        info[k] = v

    return info

"MemSpecLimit": "1024",

slurmd -C can report that the machine's available RealMemory is less than 1G, but the slurmd charm does not handle this edge case. Since the bad node configuration is written to the slurm.conf file without any validation, the invalid value isn't caught until scontrol reconfigure is run to signal all the daemons to reload their configuration files.
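For illustration (the RealMemory value here is hypothetical, not taken from the logs below), a machine with roughly 900 MiB of RAM would have get_slurmd_info() return something like {'RealMemory': '928', ...}, which produces a node entry such as:

NodeName=juju-d566c2-1 RealMemory=928 MemSpecLimit=1024 ...

slurmctld rejects this line because MemSpecLimit exceeds RealMemory.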

The easiest fix here is to develop a heuristic for determining what MemSpecLimit should be when the machine has < 1G of memory available. Nodes this small really only show up in test and/or exploratory deployments, not production-level clusters, so we should probably also log a warning that the node's memory configuration is suitable for test/exploratory deployments but not optimal for production.
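A minimal sketch of one possible heuristic, assuming RealMemory has been parsed from the slurmd -C output as an integer number of MiB (the helper name and the 25% reservation are illustrative, not the charm's actual code):

import logging

_logger = logging.getLogger(__name__)

# Hypothetical sketch, not the charm's actual code. Values are MiB, matching
# slurm.conf's RealMemory/MemSpecLimit semantics.
DEFAULT_MEM_SPEC_LIMIT = 1024


def choose_mem_spec_limit(real_memory: int) -> str:
    """Pick a MemSpecLimit that never exceeds the node's reported RealMemory."""
    if real_memory > DEFAULT_MEM_SPEC_LIMIT:
        # Plenty of memory: keep the current 1G reservation.
        return str(DEFAULT_MEM_SPEC_LIMIT)

    # Machine has <= 1G of memory. Reserve a fraction instead so that
    # RealMemory >= MemSpecLimit still holds, and warn that this is only
    # suitable for test/exploratory deployments.
    _logger.warning(
        "node reports RealMemory=%d MiB; using a reduced MemSpecLimit. "
        "This configuration is suitable for test/exploratory deployments, "
        "not production clusters.",
        real_memory,
    )
    return str(real_memory // 4)  # e.g. reserve 25% for the OS and slurmd.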

To Reproduce

  1. Assume a Juju controller is bootstrapped on an LXD cloud...
  2. Assume the default LXD profile limits allocated virtual machine memory to < 1G (see the example command after this list)...
  3. juju deploy slurmd --base [email protected] --channel "edge" --constraints="virt-type=virtual-machine"
  4. juju deploy slurmctld --base [email protected] --channel "edge" --constraints="virt-type=virtual-machine"
  5. juju integrate slurmd slurmctld
  6. Wait for small cluster to become active...
  7. juju run slurmctld/leader resume <slurmd/0-hostname>
  8. See uncaught SlurmOpsError exception in juju debug-log --include slurmctld/0 --level ERROR
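
For reference, step 2 can be set up by lowering the memory limit on the default LXD profile before deploying, for example (the 768MiB value is arbitrary, just below 1G):

lxc profile set default limits.memory 768MiB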

Environment

  • LXD version: 5.21.2 LTS
  • Juju client version: 3.6.1-genericlinux-amd64
  • Juju controller version: 3.6.0
  • slurmctld revision number: 86
  • slurmd revision number: 107

Relevant log output

unit-slurmctld-0: 14:20:49 ERROR unit.slurmctld/0.juju-log slurmd:5: command ['scontrol', 'reconfigure'] failed with message scontrol: error: NodeNames=juju-d566c2-1 MemSpecLimit=1024 is invalid, reset to 0
slurm_reconfigure error: Socket timed out on send/recv operation

unit-slurmctld-0: 14:20:50 ERROR unit.slurmctld/0.juju-log slurmd:5: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/./src/charm.py", line 499, in <module>
    main.main(SlurmctldCharm)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/__init__.py", line 348, in main
    return _legacy_main.main(
           ^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/main.py", line 45, in main
    return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 543, in main
    manager.run()
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 529, in run
    self._emit()
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 518, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name, self._juju_context)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/_main.py", line 134, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 347, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 857, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 947, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/src/interface_slurmd.py", line 153, in _on_relation_changed
    self.on.partition_available.emit()
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 347, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 857, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 947, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/./src/charm.py", line 308, in _on_write_slurm_conf
    self._slurmctld.scontrol("reconfigure")
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 943, in scontrol
    return _call("scontrol", *args).stdout
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 153, in _call
    raise SlurmOpsError(f"command {cmd} failed. stderr:\n{result.stderr}")
charms.hpc_libs.v0.slurm_ops.SlurmOpsError: command ['scontrol', 'reconfigure'] failed. stderr:
scontrol: error: NodeNames=juju-d566c2-1 MemSpecLimit=1024 is invalid, reset to 0
slurm_reconfigure error: Socket timed out on send/recv operation

unit-slurmctld-0: 14:20:50 ERROR juju.worker.uniter.operation hook "slurmd-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

No response

NucciTheBoss added the bug label Jan 16, 2025