Problem description
When changing the partitioning mode of a node from MPS to MIG, the nvidia-device-plugin crashes, so any new MIG device created by nos is never exposed to k8s as a resource.
How to reproduce
1. Enable MPS partitioning on a node (kubectl label nodes <node> "nos.nebuly.com/gpu-partitioning=mps")
2. Create a Pod requesting MPS resources (for instance nvidia.com/gpu-10gb)
3. After the requested MPS resources are created and the Pod is scheduled on the node, delete the Pod and change the node's GPU partitioning mode to MIG (kubectl label nodes <node> "nos.nebuly.com/gpu-partitioning=mig")
4. Create a Pod requesting MIG resources (for instance nvidia.com/mig-1g.10gb)
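The reproduction steps above can be sketched as a small script. This is only an illustration under several assumptions: the node name `my-gpu-node`, the Pod names, the `busybox` image, and the helper functions `run`, `set_partitioning`, `pod_manifest` and `apply_pod` are all hypothetical, and the second label command uses the value `mig` on the assumption that this is what switches the node to MIG partitioning. By default the script only prints the commands; set `DRY_RUN=0` to run them against a cluster.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Names below are placeholders, not part of nos.
set -eu

NODE="${NODE:-my-gpu-node}"   # hypothetical node name
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 = partitioning mode (mps or mig).
# --overwrite lets kubectl replace an existing label value.
set_partitioning() {
  run kubectl label nodes "$NODE" "nos.nebuly.com/gpu-partitioning=$1" --overwrite
}

# $1 = Pod name, $2 = extended resource name to request.
pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: busybox
      command: ["sleep", "3600"]
      resources:
        limits:
          $2: 1
EOF
}

# Create a Pod requesting one unit of the given resource.
apply_pod() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ kubectl apply -f - (manifest follows)"
    pod_manifest "$1" "$2"
  else
    pod_manifest "$1" "$2" | kubectl apply -f -
  fi
}

# Step 1: enable MPS partitioning.
set_partitioning mps
# Step 2: request an MPS resource and wait for the Pod to run.
apply_pod mps-test nvidia.com/gpu-10gb
run kubectl wait --for=condition=Ready pod/mps-test --timeout=300s
# Step 3: delete the Pod and switch the node to MIG partitioning.
run kubectl delete pod mps-test
set_partitioning mig
# Step 4: request a MIG resource.
apply_pod mig-test nvidia.com/mig-1g.10gb
```

Running it with `NODE=<node> DRY_RUN=0` should walk through the four steps in order.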
Expected behaviour
After step 4, the MIG resources are created automatically and the Pod is scheduled on the node
Actual behaviour
After step 4, the MIG devices are created on the GPU, but the nvidia-device-plugin Pod crashes with the error "Cannot find configuration named <config-name>", where <config-name> is the name of the configuration set by nos during step 2.
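To confirm which configuration name the plugin is looking for, the error can be pulled out of the crashed container's logs. The helper below is a hypothetical sketch: `missing_configs` is not part of nos or the device plugin, and it assumes the log line contains the message exactly as quoted above, with the name as a single token.

```shell
#!/bin/sh
# Hypothetical helper: reads device-plugin logs on stdin and prints each
# configuration name reported as missing (one per line, de-duplicated).
missing_configs() {
  sed -n 's/.*Cannot find configuration named \([^ ,]*\).*/\1/p' | sort -u
}

# Example usage (namespace and Pod name are placeholders):
#   kubectl logs -n <namespace> <nvidia-device-plugin-pod> --previous | missing_configs
```

The `--previous` flag asks kubectl for the logs of the last terminated container instance, which is usually the one that holds the crash message.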