Running node-configured action on slurmd can cause Uncaught SlurmOpsError in charm code: command ['systemctl', 'reload-or-restart', 'slurmd'] failed #63

Open · dsloanm opened this issue Jan 16, 2025 · 2 comments
Labels: bug (Something isn't working), needs triage (Needs further investigation to determine cause and/or work required to implement fix/feature)

Comments

@dsloanm (Contributor) commented Jan 16, 2025

Bug Description

On a minimally deployed Charmed HPC cluster with a single slurmctld unit and a single slurmd unit, running juju run slurmd/0 node-configured can result in the error:

Uncaught SlurmOpsError in charm code: command ['systemctl', 'reload-or-restart', 'slurmd'] failed

See the attached video: error.webm

To Reproduce

Running on a bootstrapped AWS controller with an empty model:

  1. juju deploy --channel latest/edge --base ubuntu@24.04 slurmctld
  2. juju deploy --channel latest/edge --base ubuntu@24.04 slurmd --constraints="instance-type=g4dn.xlarge"
  3. juju integrate slurmctld:slurmd slurmd:slurmctld
  4. Wait for deployment to complete.
  5. Repeatedly run juju run slurmd/0 node-configured until the error occurs (a convenience loop is sketched after this list).
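
Since the failure is intermittent, a loop such as the following can stand in for step 5. This is only a convenience sketch, not part of the original report, and it assumes a failed action run surfaces as a non-zero exit status from juju run:

# Hypothetical convenience loop: re-run the action until a run fails,
# assuming `juju run` exits non-zero once the unit hits the uncaught
# SlurmOpsError.
while juju run slurmd/0 node-configured; do
    sleep 5
done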

Environment

Deploying latest/edge on AWS. The slurmd unit runs on a GPU-enabled g4dn.xlarge instance.

Relevant log output

unit-slurmd-1: 12:11:50 ERROR unit.slurmd/1.juju-log slurmctld:2: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-slurmd-1/charm/./src/charm.py", line 385, in <module>
    main.main(SlurmdCharm)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/__init__.py", line 348, in main
    return _legacy_main.main(
           ^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/main.py", line 45, in main
    return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/_main.py", line 543, in main
    manager.run()
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/_main.py", line 529, in run
    self._emit()
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/_main.py", line 518, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name, self._juju_context)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/_main.py", line 134, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 347, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 857, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 947, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/src/interface_slurmctld.py", line 121, in _on_relation_changed
    self.on.slurmctld_available.emit(**cluster_info)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 347, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 857, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 947, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/./src/charm.py", line 190, in _on_slurmctld_available
    self._slurmd.service.restart()
  File "/var/lib/juju/agents/unit-slurmd-1/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 411, in restart
    _systemctl("reload-or-restart", self._service.value)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 178, in _systemctl
    return _call("systemctl", *args).stdout
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmd-1/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 153, in _call
    raise SlurmOpsError(f"command {cmd} failed. stderr:\n{result.stderr}")
charms.hpc_libs.v0.slurm_ops.SlurmOpsError: command ['systemctl', 'reload-or-restart', 'slurmd'] failed. stderr:
Job for slurmd.service failed.


Jan 16 12:11:50 ip-172-31-27-227 (kill)[2652]: slurmd.service: Referenced but unset environment variable evaluates to an empty string: MAINPID
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]: Usage:
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  kill [options] <pid> [...]
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]: Options:
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  <pid> [...]            send signal to every <pid> listed
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -<signal>, -s, --signal <signal>
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:                         specify the <signal> to be sent
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -q, --queue <value>    integer value to be sent with the signal
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -l, --list=[<signal>]  list all signal names, or convert one to a name
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -L, --table            list all signal names in a nice table
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -h, --help     display this help and exit
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -V, --version  output version information and exit
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]: For more details see kill(1).
Jan 16 12:11:50 ip-172-31-27-227 systemd[1]: slurmd.service: Control process exited, code=exited, status=1/FAILURE
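
The journal output points at the unit's reload path: systemctl reload-or-restart invokes the unit's ExecReload= command, and the "Referenced but unset environment variable" warning means $MAINPID expanded to an empty string, so kill ran with no PID argument and printed its usage text. A unit fragment shaped like the upstream Slurm packaging, reconstructed here from the error message rather than copied from the machine, would fail in exactly this way:

[Service]
# Illustrative fragment only; not the actual slurmd.service contents.
Type=forking
PIDFile=/run/slurmd.pid
ExecStart=/usr/sbin/slurmd
# If no main PID is known when a reload is requested, $MAINPID expands
# to "" and kill exits 1, which systemd reports as a failed control
# process (status=1/FAILURE above).
ExecReload=/bin/kill -HUP $MAINPID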

Additional context

No response

@NucciTheBoss added the labels bug (Something isn't working) and needs triage (Needs further investigation to determine cause and/or work required to implement fix/feature) on Jan 16, 2025
@NucciTheBoss (Member) commented

Good catch. Adding the needs triage label because I'd like to understand what is causing this MAINPID issue. I'm wondering whether switching to systemctl restart ... will resolve the problem, or whether there is a larger issue with how the service file is written.
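
For reference, a minimal sketch of that suggestion against the restart helper shown in the traceback above (hypothetical, not a committed patch):

# lib/charms/hpc_libs/v0/slurm_ops.py -- sketch of the suggested change.
def restart(self) -> None:
    # A plain restart never runs ExecReload=, so an unset MAINPID
    # cannot make the control process fail.
    _systemctl("restart", self._service.value)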

@jedel1043 (Collaborator) commented Jan 16, 2025

Reproduced this. I also thought the machines could not ping each other (on AWS specifically), but ignore that; I was using the wrong IP address.
