MPS server not serving any request after connecting with wrong user ID #19

Open
Telemaco019 opened this issue Feb 28, 2023 · 1 comment
Labels: bug

@Telemaco019 (Member)

Problem description

The MPS server requires all clients to run with the same user ID, which is 1000 by default. If a container requesting MPS resources runs with a different user ID, the MPS server refuses its requests and the container cannot access the GPU. This behaviour is expected.

However, after that happens, any new container running with user ID 1000 and requesting MPS resources hits the same problem: the MPS server no longer serves any request.

How to replicate

  1. Create a Pod requesting MPS resources and running with a user ID different from 1000 (in this case 0):
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  hostIPC: true
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime"
    command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
    resources:
      limits:
         nvidia.com/gpu-2gb: 1
  2. Create a new Pod requesting MPS resources, this time running with user ID 1000:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-2
spec:
  hostIPC: true
  restartPolicy: OnFailure
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
  containers:
  - name: cuda-test
    image: "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime"
    command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
    resources:
      limits:
         nvidia.com/gpu-2gb: 1
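
After applying the two manifests above, the issue can be observed by inspecting the status and output of both Pods. A minimal sketch using kubectl, with the Pod names taken from the manifests (both Pods are assumed to be in the default namespace):

# Check whether the Pods ever complete
kubectl get pod test-pod test-pod-2
# The second Pod should print "True"; when the bug occurs it hangs instead
kubectl logs test-pod-2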

Expected behaviour

The first Pod, which runs as root (user ID 0), should not be able to access the GPU. The second Pod, which runs as user ID 1000, should instead be able to access the requested GPU slice.

Actual behaviour

Both Pods get stuck when requesting GPU access: the MPS server enqueues their requests but never serves them.
These are the logs from the MPS server running in the device-plugin Pod when the Pod running as user ID 1000 tries to connect to the GPU:

nvidia-mps-server [2023-02-28 09:31:00.573 Control    54] Accepting connection...
nvidia-mps-server [2023-02-28 09:31:00.573 Control    54] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list

Temporary solution

Restart the MPS server running on the node by restarting the device-plugin Pod on that node.
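
For example, the device-plugin Pod can be restarted by deleting it so that its DaemonSet recreates it. A rough sketch; the namespace and label selector below are assumptions and must be adapted to the actual device-plugin deployment:

# Find the device-plugin Pod running on the affected node
# (namespace and label are assumptions; adjust them to your deployment)
kubectl get pods -n nebuly-nvidia -l app.kubernetes.io/name=nvidia-device-plugin --field-selector spec.nodeName=<node-name>
# Delete the Pod; the DaemonSet recreates it and the MPS server restarts
kubectl delete pod -n nebuly-nvidia <device-plugin-pod-name>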

@VOLCANO0203

I have encountered the same problem, and restarting the MPS server does indeed solve it. However, in my experience, if there is load on the GPU, restarting the MPS server fails.
