nvidia-smi fails with Failed to initialize NVML after some time in Pods using **systemd** cgroups #266

andy108369 opened this issue Nov 25, 2024 · 2 comments


andy108369 commented Nov 25, 2024

Summary

Customer reported that nvidia-smi stops working in Kubernetes pods with the error Failed to initialize NVML: Unknown Error after some time.

🟢 Worth noting: applications already running on the GPU remain fully functional and unaffected.

⚙️ The customer uses nvidia-smi to collect metrics, so this failure breaks their metrics collection, which is operationally important for them.

This is a known issue detailed in NVIDIA Container Toolkit Issue #48, and the behavior was reproduced in our environment.
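
As described in that issue, a systemctl daemon-reload makes systemd re-apply the device cgroup rules it knows about for the container scope; the NVIDIA device nodes injected by the nvidia-container-runtime hook are not part of that list, so access to /dev/nvidia* is revoked and NVML can no longer open the devices. One way to confirm this from inside an affected pod (a sketch, assuming the pod image has a shell and that the device-cgroup revocation described in issue #48 is indeed the cause):

ls -l /dev/nvidia*          # the device nodes are still mounted in the container
head -c1 /dev/nvidiactl     # but opening them is expected to fail with "Operation not permitted"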


Reproducer

  1. Create a nvidia-smi-loop.yaml file with the following pod configuration:

    Make sure to set kubernetes.io/hostname to the desired node name of your cluster.

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-nvidia-smi-loop
    spec:
      restartPolicy: OnFailure
      runtimeClassName: nvidia
      containers:
      - name: cuda
        image: "nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04"
        command: ["/bin/sh", "-c"]
        args: ["while true; do nvidia-smi -L; sleep 5; done"]
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        kubernetes.io/hostname: node10
  2. Deploy the pod:

    kubectl apply -f nvidia-smi-loop.yaml
  3. Trigger a daemon-reload after the pod has been running for about 10 seconds:

    sleep 15
    systemctl daemon-reload
  4. Check pod logs (a one-shot version of steps 2–4 is sketched after this list):

    kubectl logs cuda-nvidia-smi-loop --timestamps
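
A one-shot version of steps 2–4 (a sketch; it assumes the manifest above has been saved as nvidia-smi-loop.yaml and that the GPU node, node10 here, is reachable over SSH, since the daemon-reload has to run on the node hosting the pod):

kubectl apply -f nvidia-smi-loop.yaml
sleep 15
ssh node10 systemctl daemon-reload    # run the reload on the GPU node itself
sleep 15
kubectl logs cuda-nvidia-smi-loop --timestamps | tail -n 5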

Result

2024-11-25T13:33:37.068625936Z GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-9a3643e7-ac3c-850e-3436-5de6cfa48c23)
2024-11-25T13:33:42.128740632Z GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-9a3643e7-ac3c-850e-3436-5de6cfa48c23)
2024-11-25T13:33:52.245576418Z Failed to initialize NVML: Unknown Error
2024-11-25T13:33:57.297379775Z Failed to initialize NVML: Unknown Error

Configuration Differences

The issue is specific to environments using systemd cgroup management with the NVIDIA container runtime. Observations from different environments (a per-node check is sketched after this list):

  1. K3s-based Provider:

    • Systemd cgroup is enabled in the containerd configuration:

      root@node1:~# crictl ps |grep nvid
      ec0f71ea4d12b       159abe21a6880       3 weeks ago         Running             nvidia-device-plugin-ctr   0                   8e1c0567b6d49       nvdp-nvidia-device-plugin-b59hh
      
      root@node1:~# crictl inspect ec0f71ea4d12b | grep -A3 runtimeOptions
          "runtimeOptions": {
            "binary_name": "/usr/bin/nvidia-container-runtime",
            "systemd_cgroup": true
          },
      
    • containerd configuration (/etc/containerd/config.toml):

      # cat /etc/containerd/config.toml
      disabled_plugins = ["cri"]
      
    • It appears K3s's bundled containerd enables SystemdCgroup by default when it is not explicitly configured:

      # crictl info | grep -i -C2 nvidia-container-runtime
                "runtimeRoot": "",
                "options": {
                  "BinaryName": "/usr/bin/nvidia-container-runtime",
                  "SystemdCgroup": true
                },
      
  2. Kubespray-based Provider:

    • Systemd cgroup is not enabled in the containerd configuration:

      root@worker-01:~# crictl inspect 04ac886af1ec7 |grep -A3 runtimeOptions
          "runtimeOptions": {
            "binary_name": "/usr/bin/nvidia-container-runtime"
          },
          "config": {
      
    • As seen on the Kubespray-based provider, the systemdCgroup = true option is absent from the [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] section.

      # cat /etc/containerd/config.toml
      version = 2
      root = "/data/containerd"
      state = "/run/containerd"
      oom_score = 0
      
      [grpc]
        max_recv_message_size = 16777216
        max_send_message_size = 16777216
      
      [debug]
        level = "info"
      
      [metrics]
        address = ""
        grpc_histogram = false
      
      [plugins]
        [plugins."io.containerd.grpc.v1.cri"]
          sandbox_image = "registry.k8s.io/pause:3.9"
          max_container_log_line_size = -1
          enable_unprivileged_ports = false
          enable_unprivileged_icmp = false
          [plugins."io.containerd.grpc.v1.cri".containerd]
            default_runtime_name = "runc"
            snapshotter = "overlayfs"
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                runtime_type = "io.containerd.runc.v2"
                runtime_engine = ""
                runtime_root = ""
                base_runtime_spec = "/etc/containerd/cri-base.json"
      
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                  systemdCgroup = true
                  binaryName = "/usr/local/bin/runc"
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
                runtime_type = "io.containerd.runc.v2"
                runtime_engine = ""
                runtime_root = ""
      
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                  BinaryName = "/usr/bin/nvidia-container-runtime"
          [plugins."io.containerd.grpc.v1.cri".registry]
            [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
              [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
                endpoint = ["https://registry-1.docker.io"]
      
      # crictl info | grep -i -C2 nvidia-container-runtime
                "runtimeRoot": "",
                "options": {
                  "BinaryName": "/usr/bin/nvidia-container-runtime"
                },
                "privileged_without_host_devices": false,
      

Next Steps

Proposed Fix: Explicitly disable systemd cgroup management for the NVIDIA container runtime on K3s-based providers.

  1. Generate config.toml.tmpl with SystemdCgroup = false for nvidia-container-runtime on all GPU-enabled nodes:

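    # The sed below leaves lines that do not contain the nvidia-container-runtime
    # BinaryName untouched; on a match, `n` advances to the following line, where
    # "SystemdCgroup = true" is flipped to false.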
    cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml | \
    sed '/BinaryName = "\/usr\/bin\/nvidia-container-runtime"/!b;n;s/SystemdCgroup = true/SystemdCgroup = false/' \
    > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
  2. Restart k3s-agent (workers) and/or k3s (control-planes) systemd service:

    NOTE: restarting k3s (and with it the embedded containerd) will likely cause other pods running on the affected node to restart or experience disruptions. ⚠️
    If this is a worker node:

    systemctl restart k3s-agent.service

    And if this is a control-plane node:

    systemctl restart k3s.service
  3. Verify SystemdCgroup is disabled (the expected output is sketched after this list):

    crictl info | grep -i -C2 nvidia-container-runtime
  4. Re-run the reproducer (the nvidia-smi-loop.yaml steps above).
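
After the restart, the nvidia runtime options reported by crictl should no longer carry SystemdCgroup: true; roughly (sketched by analogy with the captures above):

          "options": {
            "BinaryName": "/usr/bin/nvidia-container-runtime",
            "SystemdCgroup": false
          },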

One liner

This command performs steps 1–3 above automatically:

  • Default installation (data directory under /var/lib/rancher):

    test -f /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl || { cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml | \
    sed '/BinaryName = "\/usr\/bin\/nvidia-container-runtime"/!b;n;s/SystemdCgroup = true/SystemdCgroup = false/' | tee /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl ; systemctl is-active --quiet k3s.service && systemctl restart k3s.service || (systemctl is-active --quiet k3s-agent.service && systemctl restart k3s-agent.service); sleep 5s; crictl info | grep -i -C2 nvidia-container-runtime; }

  • Custom data-dir (/data/k3s). First verify whether a custom data-dir is in use:

    grep -A1 data-dir /etc/systemd/system/k3s.service /etc/systemd/system/k3s-agent.service 2>/dev/null
    crictl -c /data/k3s/agent/etc/crictl.yaml ps

    Then apply the same change against the custom paths:

    test -f /data/k3s/agent/etc/containerd/config.toml.tmpl || { cat /data/k3s/agent/etc/containerd/config.toml | \
    sed '/BinaryName = "\/usr\/bin\/nvidia-container-runtime"/!b;n;s/SystemdCgroup = true/SystemdCgroup = false/' | tee /data/k3s/agent/etc/containerd/config.toml.tmpl ; systemctl is-active --quiet k3s.service && systemctl restart k3s.service || (systemctl is-active --quiet k3s-agent.service && systemctl restart k3s-agent.service); sleep 5s; crictl -c /data/k3s/agent/etc/crictl.yaml info | grep -i -C2 nvidia-container-runtime; }
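
To revert the workaround later, the generated template can simply be removed and the k3s service restarted, since k3s regenerates config.toml from its built-in defaults when no config.toml.tmpl is present (a sketch, assuming the default /var/lib/rancher data-dir; the restart disrupts pods on the node just like step 2 above):

rm /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
systemctl restart k3s-agent.service    # or: systemctl restart k3s.service on a control-plane node
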
Verification/cleanup

Verify the provider status endpoint:

provider_info2.sh <provider-address>

If reported values seem off, bounce the operator-inventory:

kubectl -n akash-services rollout restart deployment operator-inventory

See if there are any failed pods to delete:

kubectl get pods -A -o wide --sort-by='{.metadata.creationTimestamp}' 
kubectl get pods -A --field-selector status.phase=Failed

To delete the Failed pods:

kubectl delete pods -A --field-selector status.phase=Failed

Documentation Update

If this works out, we need to advise K3s-based providers to disable systemd cgroup management for the NVIDIA container runtime and update the server-mgmt documentation accordingly.


andy108369 commented Nov 27, 2024

We will perform scheduled maintenance on these five K3s-based providers today, November 27th at 17:00 UTC, to address this issue:

  • provider.h100.sdg.val.akash.pub
  • provider.h100.hou.val.akash.pub
  • provider.rtx4090.wyo.eg.akash.pub
  • provider.a100.iah.val.akash.pub
  • provider.cato.akash.pub

During this maintenance, deployments will restart. We've informed the clients so they can ensure their deployments are running correctly once the maintenance is completed.

@andy108369

The maintenance is complete.
