Bug in reporting GPUs per user #7
Comments
If I understand your question correctly: allocated means how many resources (GPUs in this case) are reserved for or assigned to a job when it's scheduled to run. The resource requirements specified when the job was submitted determine how many resources are allocated. Say a user submits a job requesting 10 GPUs; once the job starts, those 10 GPUs are allocated to that job. But if the job effectively uses only 5 of the 10 allocated GPUs (i.e., at 100% usage), the full 10 GPUs are still reserved for that job, and other jobs can't use them. I've definitely come across that: when I tried blasting all the GPUs in a small cluster as a test, the jobs didn't actually use what was allocated. Utilized is the actual usage of the resources by the job during its execution. So allocated and utilized are rarely 100% in sync. There are some ways to bring those numbers closer together, but it would take some user hygiene and perhaps some pre/post-run logic in Slurm to handle cases where a user tries to allocate more than they use. Some of the reasons they don't match would be things like inefficient code, overestimation of resource needs, etc. Let me know if I've misunderstood your question, because I'm assuming you already know the above.
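To make the allocated vs. utilized distinction concrete, here is a minimal sketch of the 10-GPU example above. The `JobSample` class, its fields, and the job ID are hypothetical, made up for illustration; they are not part of this project or of Slurm.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    """Hypothetical snapshot of one job (not a real Slurm or DCGM structure)."""
    job_id: str
    gpus_allocated: int
    gpu_utilization: list[float]  # one value per allocated GPU, in [0, 1]


def effective_gpus(job: JobSample) -> float:
    """GPU-equivalents actually in use: the sum of per-GPU utilization."""
    return sum(job.gpu_utilization)


def efficiency(job: JobSample) -> float:
    """Fraction of the allocation that is actually being used."""
    if job.gpus_allocated == 0:
        return 0.0
    return effective_gpus(job) / job.gpus_allocated


# The example from the comment: 10 GPUs allocated, 5 busy at 100%, 5 idle.
job = JobSample("hypothetical-1", 10, [1.0] * 5 + [0.0] * 5)
print(job.gpus_allocated, effective_gpus(job), efficiency(job))
```

All 10 GPUs stay reserved for the job even though only half of them are doing work.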
Yeah, I think this is a misunderstanding. The users and accounts reports should equal the sum of how many GPUs have been allocated. This isn't comparing against DCGM; it's only comparing Slurm requests. For there to be 200 allocated GPUs, someone (or several users) must have allocated 200 GPUs, and the user tab should reflect that.
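The invariant described here could be checked with something like the sketch below. The job list, user names, and reported total are made-up placeholder values, and `gpus_per_user` and `check_totals` are hypothetical helpers, not functions from this project.

```python
from collections import defaultdict


def gpus_per_user(jobs):
    """Sum allocated GPUs per user from (user, gpus_allocated) pairs."""
    totals = defaultdict(int)
    for user, gpus in jobs:
        totals[user] += gpus
    return dict(totals)


def check_totals(jobs, reported_total):
    """Return the per-user sum and how far it falls short of the reported total."""
    summed = sum(gpus_per_user(jobs).values())
    return summed, reported_total - summed


# Placeholder data: running jobs at one timestamp, as (user, GPUs allocated).
jobs = [("alice", 4), ("bob", 8), ("alice", 2)]
per_user = gpus_per_user(jobs)
summed, gap = check_totals(jobs, reported_total=16)
print(per_user, summed, gap)
```

A nonzero gap means the per-user report is missing allocations that the cluster-wide total counts, which is the kind of mismatch this issue is about.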
Investigating timestamp:
7-15-2023 15:00:00 GMT
Utilized GPUs per user reports 136
Allocated GPUs reports 233
Why is there a difference?