Problem description
The MPS server requires all clients to run with the same user ID as the server, which is 1000 by default. If a container requesting MPS resources runs with a different user ID, the MPS server refuses the connection and the container cannot access the GPU. This behaviour is expected.
However, once that happens, any new container running with user 1000 and requesting MPS resources runs into the same problem: it can no longer access the GPU either.
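To confirm the user-ID mismatch, it can help to check which user the MPS daemons are running as on the affected node. The sketch below assumes the standard MPS binary names (nvidia-cuda-mps-control and nvidia-cuda-mps-server) and that the device plugin starts them on the node; adjust as needed for your setup.

```shell
# Run on the GPU node (or inside the device-plugin container, if it ships ps).
# Shows the user/UID the MPS control daemon and server run as; clients must
# connect with the same user ID (1000 by default).
ps -o user,uid,pid,cmd -C nvidia-cuda-mps-control,nvidia-cuda-mps-server
```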
How to replicate
Create a Pod requesting MPS resources and running with a user ID different from 1000 (in this case 0). Then create a second Pod requesting MPS resources and running with user ID 1000. A sketch of the two manifests is shown below.
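A minimal sketch of the two Pods, assuming an MPS resource named nvidia.com/gpu-1gb and a CUDA sample image; both the resource name and the image are placeholders, so substitute the MPS resource actually advertised by the device plugin and any CUDA workload image.

```yaml
# First Pod: runs as root (user 0), so the MPS server should reject it.
apiVersion: v1
kind: Pod
metadata:
  name: mps-client-root
spec:
  restartPolicy: Never
  securityContext:
    runAsUser: 0                  # different from the MPS server user (1000)
  containers:
    - name: cuda
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1  # any CUDA workload image
      resources:
        limits:
          nvidia.com/gpu-1gb: 1   # placeholder: use the MPS resource exposed by the device plugin
---
# Second Pod: runs as user 1000, matching the MPS server user, so it should be served.
apiVersion: v1
kind: Pod
metadata:
  name: mps-client-1000
spec:
  restartPolicy: Never
  securityContext:
    runAsUser: 1000               # same user ID as the MPS server
  containers:
    - name: cuda
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu-1gb: 1
```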
Expected behaviour
The first Pod, running as a user other than 1000 (root in this case), should not be able to access the GPU. The second Pod, running as user 1000, should instead be able to access the requested GPU slice.
Actual behaviour
Both Pods get stuck when requesting GPU access, because the MPS server enqueues their requests and never serves them.
These are the logs from the MPS server running in the device-plugin when the Pod running as user 1000 tries to connect to the GPU:
```
nvidia-mps-server [2023-02-28 09:31:00.573 Control 54] Accepting connection...
nvidia-mps-server [2023-02-28 09:31:00.573 Control 54] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
```
Temporary solution
Restart the MPS server running on the node by restarting the device-plugin Pod on that node.
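For example, assuming the device plugin runs as a DaemonSet, deleting its Pod on the affected node forces a new Pod (and with it a fresh MPS server) to be created. The namespace, label selector, and node name below are placeholders for your installation.

```shell
# Delete the device-plugin Pod on the affected node; the DaemonSet recreates it,
# which also restarts the MPS server it hosts.
kubectl delete pod \
  --namespace <device-plugin-namespace> \
  --selector <device-plugin-label> \
  --field-selector spec.nodeName=<node-name>
```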
I have encountered the same problem, and restarting the MPS server does indeed solve it. However, in my experience, restarting the MPS server fails if there is load on the GPU.