Running node-configured action on slurmd can cause Uncaught SlurmOpsError in charm code: command ['systemctl', 'reload-or-restart', 'slurmd'] failed #63

Open · dsloanm opened this issue Jan 16, 2025 · 2 comments
Labels: bug (Something isn't working), needs triage (Needs further investigation to determine cause and/or work required to implement fix/feature)

Comments

@dsloanm (Contributor) commented Jan 16, 2025

Bug Description

On a minimally deployed Charmed HPC cluster with a single slurmctld unit and a single slurmd unit, running juju run slurmd/0 node-configured can result in the error:

Uncaught SlurmOpsError in charm code: command ['systemctl', 'reload-or-restart', 'slurmd'] failed

See the attached video: error.webm

To Reproduce

Running on a bootstrapped AWS controller with an empty model:

  1. juju deploy --channel latest/edge --base ubuntu@24.04 slurmctld
  2. juju deploy --channel latest/edge --base ubuntu@24.04 slurmd --constraints="instance-type=g4dn.xlarge"
  3. juju integrate slurmctld:slurmd slurmd:slurmctld
  4. Wait for deployment to complete.
  5. Repeatedly run juju run slurmd/0 node-configured until the error occurs (a convenience loop is sketched after this list).
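
Since the failure is intermittent, a loop such as the following can stand in for step 5. This is only a convenience sketch, not part of the original report, and it assumes a failed action run surfaces as a non-zero exit status from juju run:

# Hypothetical convenience loop: re-run the action until a run fails,
# assuming `juju run` exits non-zero once the unit hits the uncaught
# SlurmOpsError.
while juju run slurmd/0 node-configured; do
    sleep 5
done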

Environment

Deploying latest/edge on AWS. The slurmd unit runs on a GPU-enabled g4dn.xlarge instance.

Relevant log output

unit-slurmd-1: 12:11:50 ERROR unit.slurmd/1.juju-log slurmctld:2: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-slurmd-1/charm/./src/charm.py", line 385, in <module>
    main.main(SlurmdCharm)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/__init__.py", line 348, in main
    return _legacy_main.main(
           ^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/main.py", line 45, in main
    return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/_main.py", line 543, in main
    manager.run()
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/_main.py", line 529, in run
    self._emit()
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/_main.py", line 518, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name, self._juju_context)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/_main.py", line 134, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 347, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 857, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 947, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/src/interface_slurmctld.py", line 121, in _on_relation_changed
    self.on.slurmctld_available.emit(**cluster_info)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 347, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 857, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/venv/ops/framework.py", line 947, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/./src/charm.py", line 190, in _on_slurmctld_available
    self._slurmd.service.restart()
  File "/var/lib/juju/agents/unit-slurmd-1/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 411, in restart
    _systemctl("reload-or-restart", self._service.value)
  File "/var/lib/juju/agents/unit-slurmd-1/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 178, in _systemctl
    return _call("systemctl", *args).stdout
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmd-1/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 153, in _call
    raise SlurmOpsError(f"command {cmd} failed. stderr:\n{result.stderr}")
charms.hpc_libs.v0.slurm_ops.SlurmOpsError: command ['systemctl', 'reload-or-restart', 'slurmd'] failed. stderr:
Job for slurmd.service failed.


Jan 16 12:11:50 ip-172-31-27-227 (kill)[2652]: slurmd.service: Referenced but unset environment variable evaluates to an empty string: MAINPID
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]: Usage:
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  kill [options] <pid> [...]
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]: Options:
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  <pid> [...]            send signal to every <pid> listed
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -<signal>, -s, --signal <signal>
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:                         specify the <signal> to be sent
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -q, --queue <value>    integer value to be sent with the signal
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -l, --list=[<signal>]  list all signal names, or convert one to a name
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -L, --table            list all signal names in a nice table
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -h, --help     display this help and exit
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]:  -V, --version  output version information and exit
Jan 16 12:11:50 ip-172-31-27-227 kill[2652]: For more details see kill(1).
Jan 16 12:11:50 ip-172-31-27-227 systemd[1]: slurmd.service: Control process exited, code=exited, status=1/FAILURE
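
The journal output points at the unit's reload path: systemctl reload-or-restart invokes the unit's ExecReload= command, and the "Referenced but unset environment variable" warning means $MAINPID expanded to an empty string, so kill ran with no PID argument and printed its usage text. A unit fragment shaped like the upstream Slurm packaging, reconstructed here from the error message rather than copied from the machine, would fail in exactly this way:

[Service]
# Illustrative fragment only; not the actual slurmd.service contents.
Type=forking
PIDFile=/run/slurmd.pid
ExecStart=/usr/sbin/slurmd
# If no main PID is known when a reload is requested, $MAINPID expands
# to "" and kill exits 1, which systemd reports as a failed control
# process (status=1/FAILURE above).
ExecReload=/bin/kill -HUP $MAINPID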

Additional context

No response

@NucciTheBoss added the labels bug (Something isn't working) and needs triage (Needs further investigation to determine cause and/or work required to implement fix/feature) on Jan 16, 2025
@NucciTheBoss (Member) commented

Good catch. Adding the needs triage label because I'd like to understand what is causing this MAINPID issue. I'm wondering whether switching to systemctl restart ... will resolve the problem, or whether there is a larger issue with how the service file is written.
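
For reference, a minimal sketch of that suggestion against the restart helper shown in the traceback above (hypothetical, not a committed patch):

# lib/charms/hpc_libs/v0/slurm_ops.py -- sketch of the suggested change.
def restart(self) -> None:
    # A plain restart never runs ExecReload=, so an unset MAINPID
    # cannot make the control process fail.
    _systemctl("restart", self._service.value)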

@jedel1043 (Collaborator) commented Jan 16, 2025

Reproduced this. I also thought the machines could not ping each other (on AWS specifically), but ignore that; I was using the wrong IP address.
