scontrol reconfigure fails if slurmd charm is deployed to machine with < 1G RealMemory in slurm.conf #64
Labels: bug
Bug Description
scontrol reconfigure fails with the following error message after adding a compute node to slurm.conf if the machine the slurmd charm is deployed to has less than 1G of memory allocated to it:
scontrol: error: NodeNames=juju-d566c2-1 MemSpecLimit=1024 is invalid, reset to 0
The cause of this error is that the value of RealMemory can never be less than the value of MemSpecLimit. In the slurmd charm, however, the value of RealMemory is dynamic, since it is reported by slurmd -C and interpreted by the charm, while the value of MemSpecLimit is set to the constant "1024":
slurm-charms/charms/slurmd/src/utils/machine.py, Lines 23 to 43 in 8fd9d73
slurm-charms/charms/slurmd/src/charm.py, Line 367 in 8fd9d73
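To make the constraint concrete, here is a minimal sketch of a check that could run before a node entry is written to slurm.conf, together with one possible fallback for small machines. It is not the charm's code: the function names, the strict comparison (equality is treated as invalid, to be conservative), and the "reserve half of RealMemory" fallback are all assumptions made for illustration. Values are in megabytes, as in slurm.conf.

```python
import logging

logger = logging.getLogger(__name__)

# The constant the issue says the charm currently uses, in megabytes.
DEFAULT_MEM_SPEC_LIMIT = 1024


def mem_spec_limit_is_valid(real_memory: int, mem_spec_limit: int) -> bool:
    """Return True if MemSpecLimit fits under RealMemory.

    Assumption: equality is rejected as well, to stay on the safe side.
    """
    return mem_spec_limit < real_memory


def choose_mem_spec_limit(real_memory: int, default: int = DEFAULT_MEM_SPEC_LIMIT) -> int:
    """Pick a MemSpecLimit for a node (hypothetical heuristic).

    Keeps the default when the node has more memory than the default,
    otherwise reserves half of RealMemory and logs a warning.
    """
    if real_memory > default:
        return default

    fallback = real_memory // 2
    logger.warning(
        "RealMemory=%d is not above the default MemSpecLimit=%d; using MemSpecLimit=%d. "
        "This is suitable for test deployments, not for production.",
        real_memory,
        default,
        fallback,
    )
    return fallback
```

With these definitions, a node reporting RealMemory=512 would fail the check against the constant 1024, while choose_mem_spec_limit(512) would yield a value that passes it.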
slurmd -C can report that the machine's available RealMemory is less than 1G, but the slurmd charm has no way of handling this edge case. Since the bad node configuration can be written to the slurm.conf file without any checks at all, the invalid value won't be caught until scontrol reconfigure is run to signal all the daemons to reload their configuration files.
The easiest fix here is to develop a heuristic for determining what MemSpecLimit should be when the machine has < 1G of memory available. This is really only a problem for test and/or exploratory deployments, and isn't an optimal configuration for production-level clusters, so we should maybe also log a warning that the node memory configuration is not optimal for production deployments but is suitable for test/exploratory deployments.
To Reproduce
juju deploy slurmd --base [email protected] --channel "edge" --constraints="virt-type=virtual-machine"
juju deploy slurmctld --base [email protected] --channel "edge" --constraints="virt-type=virtual-machine"
juju integrate slurmd slurmctld
juju run slurmctld/leader resume <slurmd/0-hostname>
See the SlurmOpsError exception in juju debug-log --include slurmctld/0 --level ERROR
Environment
slurmctld revision number: 86
slurmd revision number: 107
Relevant log output
Additional context
No response