MPS server not serving any request after connecting with wrong user ID #19

Open
Telemaco019 opened this issue Feb 28, 2023 · 1 comment
Labels: bug

@Telemaco019 (Member)

Problem description

The MPS server requires all clients to run with the same user ID, which is 1000 by default. If a container requesting MPS resources runs with a different user ID, the MPS server refuses its requests and the container cannot access the GPU. This behaviour is expected.

However, after that happens, any new container running with user ID 1000 and requesting MPS resources hits the same problem: the MPS server no longer serves any request.

How to replicate

  1. Create a Pod requesting MPS resources and running with a user ID different from 1000 (in this case 0):
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  hostIPC: true
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime"
    command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
    resources:
      limits:
         nvidia.com/gpu-2gb: 1
  2. Create a new Pod requesting MPS resources, this time running with user ID 1000:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-2
spec:
  hostIPC: true
  restartPolicy: OnFailure
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
  containers:
  - name: cuda-test
    image: "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime"
    command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
    resources:
      limits:
         nvidia.com/gpu-2gb: 1
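
After applying the two manifests above, the issue can be observed by inspecting the status and output of both Pods. A minimal sketch using kubectl, with the Pod names taken from the manifests (both Pods are assumed to be in the default namespace):

# Check whether the Pods ever complete
kubectl get pod test-pod test-pod-2
# The second Pod should print "True"; when the bug occurs it hangs instead
kubectl logs test-pod-2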

Expected behaviour

The first Pod, which runs as root (user ID 0), should not be able to access the GPU. The second Pod, which runs as user ID 1000, should instead be able to access the requested GPU slice.

Actual behaviour

Both Pods get stuck when requesting GPU access: the MPS server enqueues their requests but never serves them.
These are the logs from the MPS server running in the device-plugin Pod when the Pod running as user ID 1000 tries to connect to the GPU:

nvidia-mps-server [2023-02-28 09:31:00.573 Control    54] Accepting connection...
nvidia-mps-server [2023-02-28 09:31:00.573 Control    54] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list

Temporary solution

Restart the MPS server running on the node by restarting the device-plugin Pod on that node.
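
For example, the device-plugin Pod can be restarted by deleting it so that its DaemonSet recreates it. A rough sketch; the namespace and label selector below are assumptions and must be adapted to the actual device-plugin deployment:

# Find the device-plugin Pod running on the affected node
# (namespace and label are assumptions; adjust them to your deployment)
kubectl get pods -n nebuly-nvidia -l app.kubernetes.io/name=nvidia-device-plugin --field-selector spec.nodeName=<node-name>
# Delete the Pod; the DaemonSet recreates it and the MPS server restarts
kubectl delete pod -n nebuly-nvidia <device-plugin-pod-name>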

@VOLCANO0203

I have encountered the same problem, and restarting the MPS server does indeed solve it. However, in my experience, if there is load on the GPU, restarting the MPS server fails.
