Handle GPU partitioning mode changes on the same Node (MIG<>MPS) #16

Telemaco019 · 2023-02-19T18:41:10Z

Problem description

When the partitioning mode of a node is changed from MPS to MIG, the nvidia-device-plugin crashes, so any new MIG device created by nos is never exposed to k8s as a resource.

How to reproduce

1. Enable MPS partitioning on a node (kubectl label nodes <node> "nos.nebuly.com/gpu-partitioning=mps").
2. Create a Pod requesting MPS resources (for instance nvidia.com/gpu-10gb).
3. After the requested MPS resources are created and the Pod is scheduled on the node, delete the Pod and change the node's GPU partitioning mode to MIG (kubectl label nodes <node> "nos.nebuly.com/gpu-partitioning=mig" --overwrite).
4. Create a Pod requesting MIG resources (for instance nvidia.com/mig-1g.10gb).
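
The steps above can be sketched as a small shell script, which may make the report easier to replay. This is only a sketch under a few assumptions: the Pod names (mps-test, mig-test), the busybox image, the 300s timeout and the helper function names are all placeholders that do not come from the report, and the KUBECTL variable exists only so the commands can be dry-run (for example with KUBECTL=echo).

```shell
# Reproduction sketch. Override KUBECTL (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Set the nos partitioning mode ("mps" or "mig") on node $1.
# --overwrite is needed because the label already exists when switching modes.
set_partitioning() {
  "$KUBECTL" label nodes "$1" "nos.nebuly.com/gpu-partitioning=$2" --overwrite
}

# Print a minimal Pod manifest named $1 that requests one unit of resource $2.
pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: busybox  # placeholder image, any workload would do
      command: ["sleep", "3600"]
      resources:
        limits:
          $2: 1
EOF
}

# Run the four steps against node $1.
reproduce() {
  set_partitioning "$1" mps                                            # step 1
  pod_manifest mps-test nvidia.com/gpu-10gb | "$KUBECTL" apply -f -    # step 2
  "$KUBECTL" wait --for=condition=Ready pod/mps-test --timeout=300s
  "$KUBECTL" delete pod mps-test                                       # step 3
  set_partitioning "$1" mig
  pod_manifest mig-test nvidia.com/mig-1g.10gb | "$KUBECTL" apply -f - # step 4
}
```

After sourcing the script, calling reproduce <node> should walk through the sequence; if the bug is present, the mig-test Pod is expected to stay Pending because the MIG resources never show up on the node.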

Expected behaviour

After step 4, the MIG resources are created automatically and the Pod is scheduled on the node.

Actual behaviour

After step 4, the MIG devices are created on the GPU, but the nvidia-device-plugin Pod crashes with the error "Cannot find configuration named <config-name>", where <config-name> is the name of the configuration set by nos during step 2.
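
To check which configuration name the plugin is still looking for, something like the following could help. It is a hedged sketch: it assumes that the configuration is selected through the nvidia.com/device-plugin.config node label (the label the NVIDIA device plugin documents for per-node configuration), which this report does not confirm for nos. The function names are made up for illustration, and the namespace and label selector of the device-plugin Pods must be supplied by the caller because they depend on the installation.

```shell
# Diagnostic sketch. Override KUBECTL (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Print the device-plugin configuration name currently set on node $1.
# ASSUMPTION: the name is stored in the nvidia.com/device-plugin.config label.
plugin_config_label() {
  "$KUBECTL" get node "$1" \
    -o jsonpath='{.metadata.labels.nvidia\.com/device-plugin\.config}'
}

# Show the last log lines of the device-plugin Pods.
# $1 = namespace, $2 = label selector (both installation-specific).
plugin_logs() {
  "$KUBECTL" logs -n "$1" -l "$2" --all-containers --tail=50
}
```

If the label still holds the name written while the node was in MPS mode, that would be consistent with the error above, but this remains a guess until confirmed against the nos source.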

Telemaco019 added and removed the bug and enhancement labels on Mar 27, 2023
