# nvidia-smi fails with `Failed to initialize NVML` after some time in Pods using **systemd** cgroups #266
Labels
nvidia-smi
fails with Failed to initialize NVML
after some time in Pods using **systemd** cgroups
#266
## Summary

Customer reported that `nvidia-smi` stops working in Kubernetes pods with the error `Failed to initialize NVML: Unknown Error` after some time.

🟢 Worth noting, applications already running on the GPU remain fully functional and unaffected.

⚙️ Customer is using `nvidia-smi` for metrics, so this affects their metrics collection, which is operationally important for them.

This is a known issue detailed in NVIDIA Container Toolkit Issue #48, and the behavior was reproduced in our environment.
Reproducer
Create a
nvidia-smi-loop.yaml
file with the following pod configuration:Deploy the pod:
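The original pod spec and deploy command were not captured above; here is a minimal sketch of a pod that runs `nvidia-smi` in a loop. The image tag, loop interval, and GPU resource name are assumptions:

```yaml
# nvidia-smi-loop.yaml — minimal sketch; image and resource names are assumptions
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-loop
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi-loop
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["/bin/sh", "-c", "while true; do nvidia-smi; sleep 5; done"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

```sh
kubectl apply -f nvidia-smi-loop.yaml
```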
Trigger a `daemon-reload` after the pod has been running for about 10 seconds, then check the pod logs:
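The exact commands were not captured; a sketch, assuming the pod name from the spec above and that the reload is run on the GPU node hosting the pod:

```sh
# On the GPU node hosting the pod
sudo systemctl daemon-reload

# Follow the pod logs
kubectl logs -f nvidia-smi-loop
```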
## Result
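The captured output is missing here; based on the summary, shortly after the `daemon-reload` the loop's output is expected to change from normal `nvidia-smi` tables to the NVML error:

```
Failed to initialize NVML: Unknown Error
```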
## Configuration Differences

The issue is specific to environments using systemd cgroup management with the NVIDIA container runtime. Observations from different environments:
**K3s-based provider:**

Systemd cgroup is enabled in the `containerd` configuration (`/etc/containerd/config.toml`); it appears `containerd` enables `SystemdCgroup` by default when it is not explicitly set in the `containerd` config. A sketch of the relevant section is below.
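The original config excerpt was not captured; this is a sketch of what the nvidia runtime section typically looks like with systemd cgroups enabled (the binary path and runtime type are assumptions, not values copied from the affected node):

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
    SystemdCgroup = true
```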
**Kubespray-based provider:**
Systemd cgroup is not enabled in the `containerd` configuration: as seen on the kubespray-based provider, the `SystemdCgroup = true` option is absent from the `[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]` configuration (sketch below).
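Again, the original excerpt was not captured; a sketch of the corresponding section without the option (values are assumptions):

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
    # no SystemdCgroup entry here
```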
## Next Steps
**Proposed fix:** Explicitly disable systemd cgroup for the NVIDIA container runtime in K3s-based providers.
Generate a `config.toml.tmpl` with `SystemdCgroup = false` for `nvidia-container-runtime` on all GPU-enabled nodes (sketch below):
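The exact steps were not captured; one possible approach, assuming the default K3s data directory (`/var/lib/rancher`) and using the currently generated config as the template base:

```sh
# Run on each GPU-enabled node (paths assume the default K3s data directory)
cd /var/lib/rancher/k3s/agent/etc/containerd/

# Seed the template from the currently generated config
cp config.toml config.toml.tmpl

# In config.toml.tmpl, under
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
# set:
#   SystemdCgroup = false
vi config.toml.tmpl
```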
Restart the `k3s-agent` systemd service on workers and/or the `k3s` systemd service on control-plane nodes (sketched below):
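A sketch of the restart commands, assuming the standard K3s service names:

```sh
# On worker nodes
sudo systemctl restart k3s-agent

# On control-plane nodes
sudo systemctl restart k3s
```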
Verify `SystemdCgroup` is disabled:

```sh
crictl info | grep -i -C2 nvidia-container-runtime
```
Test the reproducer (the `nvidia-smi-loop.yaml` steps from above).

**One-liner**
Some nodes may use a `/data/k3s` data directory instead of the default `/var/lib/rancher` directory. Verify if it is used:
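A possible check (the config locations and the `data-dir` option name are assumptions about how the node was set up):

```sh
# Check whether k3s was configured with a custom data directory
grep -Hs 'data-dir' /etc/systemd/system/k3s*.service /etc/rancher/k3s/config.yaml
ls -ld /data/k3s 2>/dev/null
```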
## Verification/cleanup

Verify the provider status endpoint:
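The exact query was not captured; a sketch, assuming the provider exposes its status endpoint on port 8443 (replace the hostname with the provider's):

```sh
curl -sk https://provider.example.com:8443/status | jq .
```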
If reported values seem off, bounce the `operator-inventory`:
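A sketch, assuming `operator-inventory` runs as a Deployment in the `akash-services` namespace:

```sh
kubectl -n akash-services rollout restart deployment/operator-inventory
```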
See if there are any failed pods to delete:
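A sketch using the standard field selector for pod phase:

```sh
kubectl get pods -A --field-selector=status.phase=Failed
```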
To delete Failed pods:
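A sketch of the corresponding delete command:

```sh
kubectl delete pods -A --field-selector=status.phase=Failed
```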
## Documentation Update
If this works out, we need to advise K3s-based providers to disable systemd cgroup management in the NVIDIA container runtime.
And update the `server-mgmt` documentation.