Collect GPU usage metrics with prometheus #5296

yuvipanda · 2024-12-19T03:10:24Z

We use [prometheus node exporter](https://github.com/prometheus/node_exporter), deployed as part of our prometheus chart, to collect metrics about CPU and memory usage.

This PR deploys NVIDIA's [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter), which collects information about GPU usage.

As we work towards more cost and usage monitoring, collecting this information should allow us to help users get more bang for the buck from their GPU use. Metrics only exist from the point the exporters are deployed, so merging this starts the collection process even though the data is not yet directly visible to end users.

Works towards https://2i2c.productboard.com/entity-detail/features/30046512, initially requested as part of https://2i2c.freshdesk.com/a/tickets/2545.
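
As a rough illustration of how the collected data could be read back once Prometheus is scraping the exporter, here is a minimal Python sketch against the Prometheus HTTP query API. It is not part of this PR. The base URL `http://prometheus.example` is a placeholder, and the metric name `DCGM_FI_DEV_GPU_UTIL` and the `Hostname` / `gpu` labels are assumptions about what dcgm-exporter exposes in its default configuration; check the deployed exporter's `/metrics` output before relying on them. The sample payload is made up to show the response shape and contains no real measurements.

```python
import json
from urllib.parse import urlencode


def build_query_url(prometheus_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> dict:
    """Map (Hostname, gpu) label pairs to the sampled float value.

    Expects the JSON body of an /api/v1/query response whose
    resultType is "vector".
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")

    values = {}
    for series in data["result"]:
        labels = series["metric"]
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        # Label names are assumptions about dcgm-exporter's output.
        key = (labels.get("Hostname", ""), labels.get("gpu", ""))
        values[key] = float(raw_value)
    return values


if __name__ == "__main__":
    # Placeholder URL; no network request is made in this sketch.
    print(build_query_url("http://prometheus.example", "DCGM_FI_DEV_GPU_UTIL"))

    # Illustrative payload only, shaped like a Prometheus instant-query response.
    sample = json.loads(
        '{"status": "success", "data": {"resultType": "vector", "result": ['
        '{"metric": {"Hostname": "node-a", "gpu": "0"}, "value": [0, "12.5"]}]}}'
    )
    print(parse_instant_vector(sample))
```

Keeping URL construction and response parsing separate from any HTTP call means the parsing can be checked without access to a live Prometheus server.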



github-actions · 2024-12-19T03:11:32Z

Merging this PR will trigger the following deployment actions.

Support and Staging deployments

| Cloud Provider | Cluster Name | Upgrade Support? | Reason for Support Redeploy | Upgrade Staging? |
| --- | --- | --- | --- | --- |
| gcp | awi-ciroh | Yes | Support helm chart has been modified | No |
| gcp | catalystproject-latam | Yes | Support helm chart has been modified | No |
| aws | jupyter-health | Yes | Support helm chart has been modified | No |
| aws | 2i2c-aws-us | Yes | Support helm chart has been modified | No |
| aws | jupyter-meets-the-earth | Yes | Support helm chart has been modified | No |
| aws | nasa-veda | Yes | Support helm chart has been modified | No |
| aws | gridsst | Yes | Support helm chart has been modified | No |
| aws | nmfs-openscapes | Yes | Support helm chart has been modified | No |
| aws | maap | Yes | Support helm chart has been modified | No |
| kubeconfig | utoronto | Yes | Support helm chart has been modified | No |
| aws | kitware | Yes | Support helm chart has been modified | No |
| gcp | 2i2c-uk | Yes | Support helm chart has been modified | No |
| kubeconfig | queensu | Yes | Support helm chart has been modified | No |
| aws | victor | Yes | Support helm chart has been modified | No |
| aws | catalystproject-africa | Yes | Support helm chart has been modified | No |
| gcp | pangeo-hubs | Yes | Support helm chart has been modified | No |
| aws | earthscope | Yes | Support helm chart has been modified | No |
| gcp | hhmi | Yes | Support helm chart has been modified | No |
| aws | strudel | Yes | Support helm chart has been modified | No |
| aws | ubc-eoas | Yes | Support helm chart has been modified | No |
| gcp | dubois | Yes | Support helm chart has been modified | No |
| gcp | cloudbank | Yes | Support helm chart has been modified | No |
| aws | nasa-cryo | Yes | Support helm chart has been modified | No |
| gcp | 2i2c | Yes | Support helm chart has been modified | No |
| gcp | leap | Yes | Support helm chart has been modified | No |
| aws | smithsonian | Yes | Support helm chart has been modified | No |
| aws | openscapes | Yes | Support helm chart has been modified | No |
| aws | opensci | Yes | Support helm chart has been modified | No |
| aws | projectpythia | Yes | Support helm chart has been modified | No |
| aws | nasa-ghg | Yes | Support helm chart has been modified | No |

Production deployments

No production hub upgrades will be triggered

yuvipanda · 2024-12-19T03:27:15Z

Unfortunately this doesn't work on GCP yet:

  Warning  FailedCreate  9m (x18 over 19m)  daemonset-controller  Error creating: insufficient quota to match these scopes: [{PriorityClass In [system-node-critical system-cluster-critical]}]

yuvipanda · 2024-12-19T03:47:39Z

We can set `.priorityClassName` to get around this on GCP. But we don't yet have a clean way to schedule the exporter only on GPU nodes, and it will just crash and burn on non-GPU nodes.
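
One possible shape for both pieces, sketched in Python as a pod spec fragment rather than as chart values: a non-system priority class plus a node affinity rule that only matches nodes carrying a GPU label. The Kubernetes field names (`priorityClassName`, `affinity.nodeAffinity`, the `Exists` operator) are standard pod spec fields. Everything else is an assumption: the priority class name `gpu-metrics-exporter` is hypothetical and would have to exist in the cluster, `cloud.google.com/gke-accelerator` is assumed to be the label present on GKE GPU nodes (other providers label GPU nodes differently), and whether the dcgm-exporter chart exposes these fields has not been checked here. The `node_matches` helper is a simplified stand-in for the scheduler that handles only the `Exists` operator.

```python
import json


def gpu_only_pod_spec(priority_class_name: str, gpu_label_key: str) -> dict:
    """Build a pod spec fragment that restricts scheduling to labeled GPU nodes."""
    return {
        "priorityClassName": priority_class_name,
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            # "Exists" matches any value of the label, so one
                            # rule covers every accelerator type.
                            "matchExpressions": [
                                {"key": gpu_label_key, "operator": "Exists"}
                            ]
                        }
                    ]
                }
            }
        },
    }


def node_matches(pod_spec: dict, node_labels: dict) -> bool:
    """Simplified check: terms are ORed, expressions within a term are ANDed."""
    required = pod_spec["affinity"]["nodeAffinity"][
        "requiredDuringSchedulingIgnoredDuringExecution"
    ]
    for term in required["nodeSelectorTerms"]:
        results = []
        for expr in term["matchExpressions"]:
            if expr["operator"] != "Exists":
                raise NotImplementedError("this sketch only handles 'Exists'")
            results.append(expr["key"] in node_labels)
        if all(results):
            return True
    return False


if __name__ == "__main__":
    # Both arguments are assumptions; see the paragraph above.
    spec = gpu_only_pod_spec("gpu-metrics-exporter", "cloud.google.com/gke-accelerator")
    print(json.dumps(spec, indent=2))
```

A plain `nodeSelector` requires an exact label value, so it would need one entry per accelerator type; the affinity rule with `Exists` avoids that. GPU nodes that carry taints would additionally need a matching toleration, which this sketch leaves out.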

yuvipanda and others added 2 commits December 18, 2024 19:09

[pre-commit.ci] auto fixes from pre-commit.com hooks

75b28a4

for more information, see https://pre-commit.ci

yuvipanda marked this pull request as draft December 19, 2024 03:27
