the node does not show gpumem allocatable resources #612

Open
gongysh2004 opened this issue Nov 14, 2024 · 2 comments
Labels
kind/bug Something isn't working

Comments

@gongysh2004

What happened:
After installation, when I run kubectl describe node nodegpu1, it does not show the gpumem-related resources.
What you expected to happen:
'nvidia.com/gpumem' should be shown under the 'Allocatable' section of the node.
How to reproduce it (as minimally and precisely as possible):
Install HAMi according to the installation steps, then run kubectl describe node nodegpu1.
My current node info is:

Capacity:
  cpu:                   160
  ephemeral-storage:     3750157048Ki
  hugepages-1Gi:         0
  hugepages-2Mi:         0
  memory:                2113442580Ki
  nvidia.com/gpu:        80
  pods:                  110
Allocatable:
  cpu:                   160
  ephemeral-storage:     3456144729715
  hugepages-1Gi:         0
  hugepages-2Mi:         0
  memory:                2113340180Ki
  nvidia.com/gpu:        80
  pods:                  110

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version:
root@node7vm-1:~/test# helm ls -A | grep hami
hami            kube-system     2               2024-11-14 15:18:36.886955318 +0800 CST deployed        hami-2.4.0                      2.4.0      
my-hami-webui   kube-system     4               2024-11-14 17:18:24.678439025 +0800 CST deployed        hami-webui-1.0.3                1.0.3   
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
    node's annotation:
                    kubernetes.io/os=linux
Annotations:        csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"node7bm-2"}
                    hami.io/node-handshake: Requesting_2024.11.14 11:25:56
                    hami.io/node-handshake-dcu: Deleted_2024.11.14 07:22:47
                    hami.io/node-nvidia-register:
                      GPU-fc28df76-54d2-c387-e52e-5f0a9495968c,10,49140,100,NVIDIA-NVIDIA L40S,0,true:GPU-b97db201-0442-8531-56d4-367e0c7d6edd,10,49140,100,NVID...
                    kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AMXBF16,cpu-cpuid.AMXINT8,cpu-cpuid.AMXTILE,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BF16,cpu-...
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.94.94/24
@gongysh2004 added the kind/bug (Something isn't working) label on Nov 14, 2024
@archlitchi
Collaborator

Yes, those resources aren't registered with the kubelet; you can see them by visiting {scheduler node ip}:31993/metrics
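For example, a minimal check, assuming the hami-scheduler metrics service is exposed on NodePort 31993 as noted above (replace the placeholder with your scheduler node's IP):

curl -s http://{scheduler node ip}:31993/metrics | grep -i gpu   # filter GPU-related metric lines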

@Nimbus318
Contributor

The design of device plugins is to simplify resource management, ensuring that each instance focuses on managing a single type of device resource. The related interfaces, like Registration and ListAndWatch, enforce this by limiting each device plugin instance to reporting and managing only one resource type.
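As a quick way to see exactly which resources the kubelet has registered as allocatable (a minimal check using the node name from this issue):

kubectl get node nodegpu1 -o jsonpath='{.status.allocatable}'
# lists nvidia.com/gpu but not nvidia.com/gpumem, matching the describe output above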

In our case, to handle GPU resources like cores and memory for scheduling, we register the total allocatable amounts as annotations on the nodes. You can see an annotation like this:

hami.io/node-nvidia-register: GPU-fc28df76-54d2-c387-e52e-5f0a9495968c,10,49140,100,NVIDIA-NVIDIA L40S,0,true:GPU-b97db201-0442-8531-56d4-367e0c7d6edd,10,49140,100,NVID....
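To read this annotation directly (a sketch using the node name from this issue; note the escaped dots in the jsonpath key):

kubectl get node nodegpu1 -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'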

And because we can only report nvidia.com/gpu to the kubelet, the default scheduler produces the '2 Insufficient nvidia.com/gpumem' warning message described in the previous issue #611.
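For reference, a pod consuming HAMi GPU memory declares nvidia.com/gpumem alongside nvidia.com/gpu in its resource limits, and it needs to be handled by the hami-scheduler rather than the default scheduler. A minimal sketch (pod name, image, and memory value are illustrative):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpumem-demo                                # hypothetical pod name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04     # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1                          # whole GPUs requested
        nvidia.com/gpumem: 3000                    # GPU memory in MB (illustrative value)
EOF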
