Collect GPU usage metrics with prometheus #5296

Draft: wants to merge 2 commits into main

yuvipanda (Member)

We use [prometheus node exporter](https://github.com/prometheus/node_exporter), deployed as part of our prometheus chart, to collect metrics about CPU and memory usage.

This deploys NVIDIA's [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter), which collects information about GPU usage.

As we work towards more cost monitoring and usage monitoring, collecting this information should allow us to help users get more bang for the buck from their GPU use. Since we only collect information after the exporters are deployed, this starts the information collection process even if it's not directly visible to end users.
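
As a rough illustration of how the collected data might be read back later, here is a minimal Python sketch that builds a PromQL query for average GPU utilization and parses a Prometheus instant-query response. It is not part of this PR. It assumes that dcgm-exporter's `DCGM_FI_DEV_GPU_UTIL` metric is being scraped and that it carries a `Hostname` label; the Prometheus base URL, the function names, and the sample payload are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: this utilization gauge is among the metrics dcgm-exporter exposes.
GPU_UTIL_METRIC = "DCGM_FI_DEV_GPU_UTIL"


def gpu_util_query(window: str = "1h", label: str = "Hostname") -> str:
    """Build PromQL for mean GPU utilization per `label` over `window`."""
    return f"avg by ({label}) (avg_over_time({GPU_UTIL_METRIC}[{window}]))"


def instant_query_url(prometheus_base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    base = prometheus_base_url.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: dict, label: str = "Hostname") -> dict:
    """Map each series' `label` value to its sample value as a float."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    return {
        series["metric"].get(label, "unknown"): float(series["value"][1])
        for series in payload["data"]["result"]
    }


if __name__ == "__main__":
    # Hypothetical base URL; no network request is made here.
    print(instant_query_url("http://prometheus.example", gpu_util_query()))

    # Made-up payload in the shape of a Prometheus instant-query response.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"Hostname": "node-a"}, "value": [1734577740, "37.5"]}
            ],
        },
    }
    print(parse_vector(sample))
```

Grouping by a different label (for example a pod or namespace label, if the exporter attaches one) would only require changing the `label` argument.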

Works towards https://2i2c.productboard.com/entity-detail/features/30046512, initially requested as part of https://2i2c.freshdesk.com/a/tickets/2545.

yuvipanda and others added 2 commits December 18, 2024 19:09

github-actions bot commented Dec 19, 2024

Merging this PR will trigger the following deployment actions.

Support and Staging deployments

| Cloud Provider | Cluster Name | Upgrade Support? | Reason for Support Redeploy | Upgrade Staging? | Reason for Staging Redeploy |
| --- | --- | --- | --- | --- | --- |
| gcp | awi-ciroh | Yes | Support helm chart has been modified | No | |
| gcp | catalystproject-latam | Yes | Support helm chart has been modified | No | |
| aws | jupyter-health | Yes | Support helm chart has been modified | No | |
| aws | 2i2c-aws-us | Yes | Support helm chart has been modified | No | |
| aws | jupyter-meets-the-earth | Yes | Support helm chart has been modified | No | |
| aws | nasa-veda | Yes | Support helm chart has been modified | No | |
| aws | gridsst | Yes | Support helm chart has been modified | No | |
| aws | nmfs-openscapes | Yes | Support helm chart has been modified | No | |
| aws | maap | Yes | Support helm chart has been modified | No | |
| kubeconfig | utoronto | Yes | Support helm chart has been modified | No | |
| aws | kitware | Yes | Support helm chart has been modified | No | |
| gcp | 2i2c-uk | Yes | Support helm chart has been modified | No | |
| kubeconfig | queensu | Yes | Support helm chart has been modified | No | |
| aws | victor | Yes | Support helm chart has been modified | No | |
| aws | catalystproject-africa | Yes | Support helm chart has been modified | No | |
| gcp | pangeo-hubs | Yes | Support helm chart has been modified | No | |
| aws | earthscope | Yes | Support helm chart has been modified | No | |
| gcp | hhmi | Yes | Support helm chart has been modified | No | |
| aws | strudel | Yes | Support helm chart has been modified | No | |
| aws | ubc-eoas | Yes | Support helm chart has been modified | No | |
| gcp | dubois | Yes | Support helm chart has been modified | No | |
| gcp | cloudbank | Yes | Support helm chart has been modified | No | |
| aws | nasa-cryo | Yes | Support helm chart has been modified | No | |
| gcp | 2i2c | Yes | Support helm chart has been modified | No | |
| gcp | leap | Yes | Support helm chart has been modified | No | |
| aws | smithsonian | Yes | Support helm chart has been modified | No | |
| aws | openscapes | Yes | Support helm chart has been modified | No | |
| aws | opensci | Yes | Support helm chart has been modified | No | |
| aws | projectpythia | Yes | Support helm chart has been modified | No | |
| aws | nasa-ghg | Yes | Support helm chart has been modified | No | |

Production deployments

No production hub upgrades will be triggered

yuvipanda (Member, Author)

Unfortunately this doesn't work on GCP yet:

  Warning  FailedCreate  9m (x18 over 19m)  daemonset-controller  Error creating: insufficient quota to match these scopes: [{PriorityClass In [system-node-critical system-cluster-critical]}]

yuvipanda marked this pull request as draft December 19, 2024 03:27

yuvipanda (Member, Author)

We can set `.priorityClassName` to get around this on GCP. But we don't yet have a clean way to schedule the exporter only on GPU nodes, and it will just crash and burn on non-GPU nodes.
