Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(slurmctld): fix node-configured action #50

Merged
merged 2 commits into from
Dec 9, 2024

Conversation

dsloanm
Copy link
Contributor

@dsloanm dsloanm commented Dec 6, 2024

This PR splits the argument passed to _slurmctld.scontrol in _resume_nodes to fix a subsequent erroneous subprocess.run call in hpc-libs. Specifically, subprocess.run was being passed:

['scontrol', 'update nodename=<nodes> state=resume']

rather than (the now corrected):

['scontrol', 'update', 'nodename=<nodes>', 'state=resume']

A unit test for _resume_nodes has also been added.

This fixes juju run slurmd/<id> node-configured so hook failed: "slurmd-relation-changed" is no longer reported. Edit: Added the relevant log lines for the hook failure from the slurmctld/0 log /var/log/juju/unit-slurmctld-0.log:

/var/log/juju/unit-slurmctld-0.log reported failure
2024-12-06 09:24:49 ERROR unit.slurmctld/0.juju-log server.go:325 slurmd:0: command ['scontrol', 'update nodename=juju-b8248f-1 state=resume'] failed with message invalid keyword: update nodename=juju-b8248f-1 state=resume

2024-12-06 09:24:49 ERROR unit.slurmctld/0.juju-log server.go:325 slurmd:0: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/./src/charm.py", line 427, in <module>
    main.main(SlurmctldCharm)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/main.py", line 551, in main
    manager.run()
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/main.py", line 530, in run
    self._emit()
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/main.py", line 519, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/main.py", line 147, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/src/interface_slurmd.py", line 134, in _on_relation_changed
    self.on.partition_available.emit()
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/./src/charm.py", line 265, in _on_write_slurm_conf
    self._resume_nodes(transitioning_nodes)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/./src/charm.py", line 379, in _resume_nodes
    self._slurmctld.scontrol(update_cmd)
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 891, in scontrol
    return _call("scontrol", *args).stdout
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/juju/agents/unit-slurmctld-0/charm/lib/charms/hpc_libs/v0/slurm_ops.py", line 141, in _call
    raise SlurmOpsError(f"command {cmd} failed. stderr:\n{result.stderr}")
charms.hpc_libs.v0.slurm_ops.SlurmOpsError: command ['scontrol', 'update nodename=juju-b8248f-1 state=resume'] failed. stderr:
invalid keyword: update nodename=juju-b8248f-1 state=resume
Hook Failure Reproducer Steps
$ juju deploy slurmctld --constraints="virt-type=virtual-machine" --channel "edge"
$ juju deploy slurmd --constraints="virt-type=virtual-machine" --channel "edge"
$ juju integrate slurmctld:slurmd slurmd:slurmctld
$ juju run slurmd/0 node-configured
$ juju status
(slurmctld/0* reports hook failed: "slurmd-relation-changed")

Splits argument passed to _slurmctld.scontrol in _resume_nodes.
Fixes erroneous subprocess.run call:

  ['scontrol', 'update nodename=<nodes> state=resume']

Now corrected to:

  ['scontrol', 'update', 'nodename=<nodes>', 'state=resume']

Action `juju run slurmd/<id> node-configured` no longer reports a
hook error.

Adds a unit test for _resume_nodes to check arguments.
Copy link
Member

@NucciTheBoss NucciTheBoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this and adding a complimenting unit tests! Just one comment/question 😃

charms/slurmctld/src/charm.py Outdated Show resolved Hide resolved
@dsloanm dsloanm requested a review from NucciTheBoss December 9, 2024 18:05
Copy link
Member

@NucciTheBoss NucciTheBoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I was able to successfully test your branch on my local device :shipit:

@NucciTheBoss NucciTheBoss merged commit 9e38e3a into charmed-hpc:main Dec 9, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants