the node does not show gpumem allocatable resources #612

Open
gongysh2004 opened this issue Nov 14, 2024 · 2 comments
Labels
kind/bug Something isn't working

Comments

@gongysh2004

What happened:
After installation, when I run kubectl describe node nodegpu1, it does not show the gpumem-related resources.
What you expected to happen:
'nvidia.com/gpumem' should be shown under the 'Allocatable' section of the node.
How to reproduce it (as minimally and precisely as possible):
Install HAMi according to the installation steps, then run kubectl describe node nodegpu1.
My current node info is:

Capacity:
  cpu:                   160
  ephemeral-storage:     3750157048Ki
  hugepages-1Gi:         0
  hugepages-2Mi:         0
  memory:                2113442580Ki
  nvidia.com/gpu:        80
  pods:                  110
Allocatable:
  cpu:                   160
  ephemeral-storage:     3456144729715
  hugepages-1Gi:         0
  hugepages-2Mi:         0
  memory:                2113340180Ki
  nvidia.com/gpu:        80
  pods:                  110

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version:
root@node7vm-1:~/test# helm ls -A | grep hami
hami            kube-system     2               2024-11-14 15:18:36.886955318 +0800 CST deployed        hami-2.4.0                      2.4.0      
my-hami-webui   kube-system     4               2024-11-14 17:18:24.678439025 +0800 CST deployed        hami-webui-1.0.3                1.0.3   
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
    node's annotation:
                    kubernetes.io/os=linux
Annotations:        csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"node7bm-2"}
                    hami.io/node-handshake: Requesting_2024.11.14 11:25:56
                    hami.io/node-handshake-dcu: Deleted_2024.11.14 07:22:47
                    hami.io/node-nvidia-register:
                      GPU-fc28df76-54d2-c387-e52e-5f0a9495968c,10,49140,100,NVIDIA-NVIDIA L40S,0,true:GPU-b97db201-0442-8531-56d4-367e0c7d6edd,10,49140,100,NVID...
                    kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AMXBF16,cpu-cpuid.AMXINT8,cpu-cpuid.AMXTILE,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BF16,cpu-...
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.94.94/24
@gongysh2004 added the kind/bug (Something isn't working) label on Nov 14, 2024
@archlitchi
Collaborator

Yes, those resources aren't registered with the kubelet; you can see them by visiting {scheduler node ip}:31993/metrics
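For example, a minimal check, assuming the hami-scheduler metrics service is exposed on NodePort 31993 as noted above (replace the placeholder with your scheduler node's IP):

curl -s http://{scheduler node ip}:31993/metrics | grep -i gpu   # filter GPU-related metric lines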

@Nimbus318
Contributor

The design of device plugins is to simplify resource management, ensuring that each instance focuses on managing a single type of device resource. The related interfaces, like Registration and ListAndWatch, enforce this by limiting each device plugin instance to reporting and managing only one resource type.
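As a quick way to see exactly which resources the kubelet has registered as allocatable (a minimal check using the node name from this issue):

kubectl get node nodegpu1 -o jsonpath='{.status.allocatable}'
# lists nvidia.com/gpu but not nvidia.com/gpumem, matching the describe output above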

In our case, to handle GPU resources like cores and memory for scheduling, we register the total allocatable amounts as annotations on the nodes. You can see an annotation like this:

hami.io/node-nvidia-register: GPU-fc28df76-54d2-c387-e52e-5f0a9495968c,10,49140,100,NVIDIA-NVIDIA L40S,0,true:GPU-b97db201-0442-8531-56d4-367e0c7d6edd,10,49140,100,NVID....
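To read this annotation directly (a sketch using the node name from this issue; note the escaped dots in the jsonpath key):

kubectl get node nodegpu1 -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'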

And because we can only report nvidia.com/gpu to the kubelet, the default scheduler produces the '2 Insufficient nvidia.com/gpumem' warning message described in the previous issue #611.
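For reference, a pod consuming HAMi GPU memory declares nvidia.com/gpumem alongside nvidia.com/gpu in its resource limits, and it needs to be handled by the hami-scheduler rather than the default scheduler. A minimal sketch (pod name, image, and memory value are illustrative):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpumem-demo                                # hypothetical pod name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04     # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1                          # whole GPUs requested
        nvidia.com/gpumem: 3000                    # GPU memory in MB (illustrative value)
EOF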
