GPU-related prometheus metrics #8168

Open
mrocklin opened this issue Sep 6, 2023 · 3 comments
Labels
diagnostics, feature (Something is missing)

Comments

@mrocklin
Member

mrocklin commented Sep 6, 2023

If GPUs are present we have some nice Dask dashboards that give us real-time information about things like GPU memory and GPU utilization.

It would be nice to also expose these as Prometheus metrics for offline analysis.

cc @jacobtomlinson @crusaderky @ntabris

@ntabris
Contributor

ntabris commented Sep 6, 2023

Is there a benefit to exposing these via Dask, rather than expecting folks to use https://github.com/NVIDIA/dcgm-exporter if they want GPU metrics? Does Dask have distinct GPU-related metrics? (Genuine question, I'm not sure.)

@mrocklin
Member Author

mrocklin commented Sep 6, 2023

Not particularly distinct. If there is already some standard for this that people can use, that's probably fine. I'd defer to @jacobtomlinson, though.

@jacobtomlinson
Member

The kind of fine-grained memory metrics that @charlesbluca is talking about in #8148 wouldn't be exposed by DCGM, so there probably is value in exposing that in Dask.

hendrikmakait added the diagnostics and feature labels and removed the needs triage label on Oct 5, 2023
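
For readers wondering what "exposing these as Prometheus metrics" might look like, here is a minimal sketch, not taken from the thread or from Dask's code. It only renders per-GPU readings in the Prometheus text exposition format using the standard library. The `GpuSample` type, the `render_prometheus` function, and the `dask_worker_gpu` metric prefix are all hypothetical names chosen for illustration. How the readings are collected (NVML, DCGM, or something else) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GpuSample:
    """One reading for one device. Field names are hypothetical."""

    index: int
    memory_used_bytes: int
    utilization_percent: float


def render_prometheus(
    samples: Iterable[GpuSample], prefix: str = "dask_worker_gpu"
) -> str:
    """Render samples as Prometheus text-format gauges, one series per GPU.

    The metric prefix is an assumption, not an existing Dask metric name.
    """
    samples = list(samples)
    metrics = (
        ("memory_used_bytes", "GPU memory currently in use, in bytes."),
        ("utilization_percent", "GPU utilization as a percentage."),
    )
    lines = []
    for name, help_text in metrics:
        metric = f"{prefix}_{name}"
        # Each metric family gets a HELP and a TYPE line before its samples.
        lines.append(f"# HELP {metric} {help_text}")
        lines.append(f"# TYPE {metric} gauge")
        for sample in samples:
            value = getattr(sample, name)
            # The device index is attached as a label so GPUs stay separate.
            lines.append(f'{metric}{{gpu="{sample.index}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_prometheus([GpuSample(0, 1024, 37.5)]), end="")
```

In a real integration the rendering would more likely be delegated to a Prometheus client library and served from the worker's existing metrics endpoint; the sketch is only meant to show the shape of the data a scraper would see.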