[k8s] Local K8s cluster doesn't work with GPU models containing numbers only. #4608

Open
llc1123 opened this issue Jan 26, 2025 · 3 comments

Comments


llc1123 commented Jan 26, 2025

I followed the troubleshooting guides to check GPU support:
https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-troubleshooting.html#checking-gpu-support

  • Step B0 - Is your cluster GPU-enabled? ✅
llc@LLC:~$ kubectl describe nodes
Name:               ai-dev
...
Capacity:
  ...
  nvidia.com/gpu:     1
  ...
  • Step B1 - Can you run a GPU pod? ✅
llc@LLC:~$ kubectl logs skygputest
Sun Jan 26 07:00:53 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
| 30%   32C    P8             13W /  450W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  • Step B2 - Are your nodes labeled correctly? ✅
llc@LLC:~$ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, labels: .metadata.labels}'
...
skypilot.co/accelerator=4090
...
  • Step B3 - Can SkyPilot see your GPUs? ✅
llc@LLC:~$ sky show-gpus --cloud k8s
Kubernetes GPUs (context: default)
GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
4090  1                         1           1

Kubernetes per node accelerator availability
NODE_NAME  GPU_NAME  TOTAL_GPUS  FREE_GPUS
ai-dev     4090      1           1
  • Step B4 - Try launching a dummy GPU task ❌
llc@LLC:~$ sky launch -y -c mygpucluster --cloud k8s --gpus 4090:1 -- "nvidia-smi"
Task from command: nvidia-smi
No resource satisfying Kubernetes({'4090': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'4090': 1}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
Author

llc1123 commented Jan 26, 2025

Fixed by relabeling the node from "4090" to "rtx4090". It seems that SkyPilot does not support GPU model names that consist only of digits.
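
The workaround above can be sketched as a small check. This is a minimal sketch, assuming the node list comes from `kubectl get nodes -o json` and that prefixing `rtx` is an acceptable rename (which the report confirms only for the 4090). The helper names `digit_only_gpu_labels` and `relabel_command` are hypothetical and not part of SkyPilot.

```python
import json
import re

ACCELERATOR_LABEL = "skypilot.co/accelerator"  # label key shown in step B2


def digit_only_gpu_labels(nodes_json: str) -> dict[str, str]:
    """Map node name -> accelerator label value for values made only of digits."""
    nodes = json.loads(nodes_json)["items"]
    flagged = {}
    for node in nodes:
        meta = node["metadata"]
        value = meta.get("labels", {}).get(ACCELERATOR_LABEL)
        if value is not None and re.fullmatch(r"[0-9]+", value):
            flagged[meta["name"]] = value
    return flagged


def relabel_command(node: str, new_value: str) -> str:
    """Build the kubectl command that overwrites the accelerator label."""
    return f"kubectl label nodes {node} {ACCELERATOR_LABEL}={new_value} --overwrite"


# Sample shaped like `kubectl get nodes -o json`, trimmed to the fields used here.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "ai-dev", "labels": {ACCELERATOR_LABEL: "4090"}}},
    ]
})

for node, value in digit_only_gpu_labels(sample).items():
    # "rtx" + value reproduces the rename reported above (4090 -> rtx4090).
    print(relabel_command(node, "rtx" + value))
```

After the printed command is run against the cluster, `sky show-gpus --cloud k8s` should list the GPU under its new name.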

@llc1123 llc1123 closed this as completed Jan 26, 2025
@llc1123 llc1123 reopened this Jan 26, 2025
Author

llc1123 commented Jan 26, 2025

This probably still needs a proper fix, so I'll keep this issue open for follow-ups.

@llc1123 llc1123 changed the title from "Local K8s cluster doesn't work with 4090 GPUs." to "[BUG] Local K8s cluster doesn't work with GPU models containing numbers only." Jan 26, 2025
@romilbhardwaj
Collaborator

Thanks for the report @llc1123. The fact that step B3 worked but B4 failed indicates an issue with our instance selection logic.
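
The thread does not pin down the root cause. Purely as a hypothetical illustration of how listing can succeed while name-based matching fails, suppose a resource name encodes the GPU as `<count><name>` with no separator and is parsed back with a regular expression. The encoding and the `encode`/`decode` helpers below are assumptions for illustration, not taken from SkyPilot's code.

```python
import re

# Hypothetical encoding "<count><name>", e.g. "1rtx4090". Not SkyPilot's real format.
PATTERN = re.compile(r"^(?P<count>\d+)(?P<name>\S+)$")


def encode(count: int, name: str) -> str:
    """Join count and GPU name without a separator."""
    return f"{count}{name}"


def decode(encoded: str) -> tuple[int, str]:
    """Split an encoded string back into (count, name)."""
    match = PATTERN.match(encoded)
    if match is None:
        raise ValueError(f"cannot parse {encoded!r}")
    return int(match.group("count")), match.group("name")


# A name starting with a letter round-trips: the digits stop at "r".
print(decode(encode(1, "rtx4090")))

# A digit-only name does not: the greedy \d+ swallows most of the name.
print(decode(encode(1, "4090")))
```

Under this assumed encoding, the second call yields a count of 1409 and a name of "0", so a request for one "4090" would never match. Any scheme that separates count and name explicitly would avoid the ambiguity.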

@romilbhardwaj romilbhardwaj changed the title from "[BUG] Local K8s cluster doesn't work with GPU models containing numbers only." to "[k8s] Local K8s cluster doesn't work with GPU models containing numbers only." Jan 27, 2025