Move GPU metrics for request, limit, usage & reserved_capacity into pod store #225

Merged: sky333999 merged 3 commits into aws-cwa-dev from sky333999/gpu-metrics on Aug 14, 2024


sky333999 commented on Aug 13, 2024

Description:
#214 introduced NVIDIA GPU count metrics to track request, limit and usage counts at the pod, node and cluster level. However, that implementation scraped all of these metrics for the entire cluster from the leader node and introduced new log events of Type PodGPU.

This change moves the scraping of these metrics into the pod store, so they are scraped on each agent/collector pod rather than only on the leader. The metrics are now emitted as part of the existing Type Pod and Type Node log events, similar to the cpu & mem metrics.
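As a rough illustration of the per-node approach described above, the sketch below builds Type Pod and Type Node events only for pods scheduled on the local node and attaches GPU fields to those existing events instead of creating a separate event type. It is not the code from this PR: the `Pod` class, `build_events`, the `PodName`/`NodeName` keys and the `*_gpu_request`/`*_gpu_limit` field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Pod:
    """Hypothetical, simplified view of a pod as seen by the agent."""
    name: str
    node: str
    gpu_request: int = 0  # assumed: sum of "nvidia.com/gpu" requests across containers
    gpu_limit: int = 0    # assumed: sum of "nvidia.com/gpu" limits across containers


def build_events(local_node: str, pods: List[Pod]) -> List[Dict]:
    """Build Type Pod and Type Node events for pods on local_node only."""
    events: List[Dict] = []
    node_request = 0
    node_limit = 0
    for pod in pods:
        if pod.node != local_node:
            continue  # pods on other nodes are handled by the agent running there
        event = {"Type": "Pod", "PodName": pod.name, "NodeName": local_node}
        # GPU fields are added only for pods that ask for GPUs, so events
        # for non-GPU workloads keep their existing shape.
        if pod.gpu_request > 0 or pod.gpu_limit > 0:
            event["pod_gpu_request"] = pod.gpu_request
            event["pod_gpu_limit"] = pod.gpu_limit
        node_request += pod.gpu_request
        node_limit += pod.gpu_limit
        events.append(event)

    node_event = {"Type": "Node", "NodeName": local_node}
    if node_request > 0 or node_limit > 0:
        node_event["node_gpu_request"] = node_request
        node_event["node_gpu_limit"] = node_limit
    events.append(node_event)
    return events
```

Because every agent filters on its own node, no single pod has to hold GPU data for the whole cluster, and the number of emitted events stays the same as before the GPU fields were added.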

This change also adjusts the GPU metrics so that they match the cpu metrics more closely:

  • Rename (pod|node)_gpu_total -> (pod|node)_gpu_usage_total
  • Add (pod|node)_gpu_reserved_capacity
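To make the two metric shapes concrete, here is a minimal sketch. The PR only states that the GPU metrics now mirror the cpu metrics, so the definitions below are assumptions: usage total is taken to be the number of GPUs held by running pods, and reserved capacity is taken to be requested GPUs as a percentage of the node's GPU capacity. Both function names are hypothetical.

```python
from typing import List, Optional


def gpu_usage_total(running_pod_gpu_counts: List[int]) -> int:
    """Assumed definition: GPUs held by running pods, summed."""
    return sum(running_pod_gpu_counts)


def gpu_reserved_capacity(requested: int, node_capacity: int) -> Optional[float]:
    """Assumed definition: requested GPUs as a percentage of node GPU capacity.

    Returns None for nodes without GPUs so that no metric is emitted for them.
    """
    if node_capacity <= 0:
        return None
    return requested / node_capacity * 100.0


# Two pods requesting 2 and 4 GPUs on a node with 8 GPUs.
print(gpu_usage_total([2, 4]))           # 6
print(gpu_reserved_capacity(2 + 4, 8))   # 75.0
```

Returning `None` for a node with no GPU capacity is one way to keep non-GPU nodes from publishing a meaningless value; the actual handling in the pod store may differ.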

Testing: Deployed changes to a test cluster running a mix of GPU and non-GPU nodes.

  • Validated that the overall EMF log count remains the same before & after the changes
  • Validated that EMF log events for pods and nodes not related to GPU workloads remain unaffected
  • Validated that EMF log events for GPU pods and nodes now publish the metrics as expected
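The checks above could be partly automated along the following lines. This is only a sketch: it assumes the EMF log events were exported as one JSON object per line, that each event carries a `Type` field, and that GPU metric names contain `_gpu_`. The helper names are hypothetical and are not part of this PR.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_type(lines: Iterable[str]) -> Counter:
    """Count log events per Type value, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line).get("Type", "<missing>")] += 1
    return counts


def has_gpu_metric(event: dict) -> bool:
    """True if any key looks like a GPU metric (assumed naming pattern)."""
    return any("_gpu_" in key for key in event)


before = ['{"Type": "Pod", "pod_cpu_usage_total": 1}', '{"Type": "Node"}']
after = [
    '{"Type": "Pod", "pod_cpu_usage_total": 1, "pod_gpu_usage_total": 2}',
    '{"Type": "Node", "node_gpu_usage_total": 2}',
]

# Same number of events per Type before and after, and no PodGPU events.
print(count_by_type(before) == count_by_type(after))
print("PodGPU" in count_by_type(after))
```

Comparing counts per Type rather than only the grand total also catches the case where one event type grows while another shrinks by the same amount.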

sky333999 force-pushed the sky333999/gpu-metrics branch from 9979856 to 8a39713 on August 13, 2024 20:05
sky333999 force-pushed the sky333999/gpu-metrics branch from 8a39713 to 9c631d4 on August 14, 2024 02:25
sky333999 changed the title from "Update GPU Metrics calculations for request, limit, total" to "Move GPU metrics for request, limit, usage & reserved_capacity into pod store" on August 14, 2024
sky333999 marked this pull request as ready for review on August 14, 2024 02:49
sky333999 requested a review from mxiamxia as a code owner on August 14, 2024 02:49
sky333999 merged commit 5427408 into aws-cwa-dev on August 14, 2024
146 checks passed
sky333999 deleted the sky333999/gpu-metrics branch on August 14, 2024 17:04
3 participants