pod stuck pending at resource overuse #45
Comments
As an update, I have obtained the logs of the nebuly-nos gpu-agent, which crashes with a CrashLoopBackOff error:
I am assuming the nebuly-nos nvml does not link properly to my .so files, which are under /usr/lib/x86_64-linux-gnu. Is there a way to fix this by specifying my path? Also, when trying with the default plugin (with affinity and a label on my node), I observe 0 allocatable for nvidia.gpu on my node. This is the log of the default plugin:
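On the NVML path question: libnvidia-ml.so.1 is found through the dynamic loader's search path, so one generic Kubernetes-level workaround (a sketch, not an option the chart itself documents) is to mount the host library directory into the agent and point LD_LIBRARY_PATH at it. The DaemonSet, namespace, and container names below are assumptions and need to match the actual install:

```yaml
# Hypothetical strategic-merge patch for the gpu-agent DaemonSet.
# The container name "gpu-agent" is an assumption -- check the real
# names with: kubectl get daemonset -A
spec:
  template:
    spec:
      containers:
        - name: gpu-agent
          env:
            - name: LD_LIBRARY_PATH        # where the loader searches for libnvidia-ml.so.1
              value: /host-driver-libs
          volumeMounts:
            - name: host-driver-libs
              mountPath: /host-driver-libs
              readOnly: true
      volumes:
        - name: host-driver-libs
          hostPath:
            path: /usr/lib/x86_64-linux-gnu   # where the .so files live on the host
            type: Directory
```

If the agent then starts cleanly, the sliced GPU resources should appear under the node's Allocatable (visible with `kubectl describe node`).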
Hi,
I am allocating only 1 GB of the 24 GB of available memory through the GPU operator, as shown in my node's labels. I also have another GPU device plugin (the default one) in my cluster, but I have done the necessary affinity configuration to prevent both from running on the same node (a sketch of what I mean is below). Basically, my pod (the sleep pod shared in the documentation) gets stuck in Pending with the reason being resource overuse and never gets scheduled. The MPS server occupies even less than 1 GB on my GPU and appears to be running according to the output of nvidia-smi.
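The affinity configuration meant here is roughly the sketch below, applied to the default NVIDIA device plugin so it skips the nos-managed node; the nos.nebuly.com/gpu-partitioning label key is an assumption and should be checked against the labels actually present on the node:

```yaml
# Sketch: node anti-affinity on the default NVIDIA device plugin DaemonSet
# so it does not run where nos manages GPU partitioning.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nos.nebuly.com/gpu-partitioning   # assumed label key, verify on the node
                    operator: DoesNotExist
```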
I have followed the steps in the documentation about user mode 1000 and made the necessary gpu-operator configuration changes (MIG mode mixed, etc.).
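For completeness, the pending pod is approximately the sleep example from the docs, reconstructed below; the exact resource name (nvidia.com/gpu-1gb) and the hostIPC setting are assumptions based on the 1 GB MPS partition described above, so they may need adjusting:

```yaml
# Approximate shape of the sleep pod; resource name and image are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: mps-sleep-test
spec:
  hostIPC: true                    # MPS clients share IPC with the MPS server
  securityContext:
    runAsUser: 1000                # same user id configured for the MPS server (user mode 1000)
  containers:
    - name: sleepy
      image: busybox:latest
      command: ["sleep", "86400"]
      resources:
        limits:
          nvidia.com/gpu-1gb: 1    # one 1 GB MPS slice; check the node's capacity for the real name
```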
Any help would be much appreciated.