Move GPU metrics for request, limit, usage & reserved_capacity into pod store #225

sky333999 · 2024-08-13T19:08:49Z

Description:
#214 introduced NVIDIA GPU count metrics that track request, limit and usage counts at the pod, node and cluster level. However, that implementation scraped the metrics for the entire cluster from the leader node and introduced new log events of Type PodGPU.

This change moves the scraping of these metrics into the pod store, so that they are scraped on each agent/collector pod rather than only on the leader. The metrics are now emitted as part of the existing Type Pod and Type Node log events, similar to the cpu and mem metrics.
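As a rough illustration of per-pod collection, the sketch below sums GPU requests and limits across a pod's containers and adds them to an existing pod-level field map. It is not the code from this PR: the `container` type, both helper functions, and the `pod_gpu_request` / `pod_gpu_limit` field names are assumptions made for the example. Only the `nvidia.com/gpu` resource name is the standard Kubernetes name for NVIDIA GPUs.

```go
package main

import "fmt"

// gpuResourceName is the Kubernetes extended resource name for NVIDIA GPUs.
const gpuResourceName = "nvidia.com/gpu"

// container is a hypothetical, trimmed-down stand-in for a container spec.
type container struct {
	requests map[string]int64
	limits   map[string]int64
}

// podGPUCounts sums GPU requests and limits across all containers of a pod.
// Reading a missing key (or a nil map) yields zero, so non-GPU containers
// contribute nothing.
func podGPUCounts(containers []container) (request, limit int64) {
	for _, c := range containers {
		request += c.requests[gpuResourceName]
		limit += c.limits[gpuResourceName]
	}
	return request, limit
}

// addGPUFields adds GPU fields to an existing pod-level field map.
// Zero values are skipped so that non-GPU pods emit unchanged events.
// The field names are assumed for illustration.
func addGPUFields(fields map[string]any, request, limit int64) {
	if request > 0 {
		fields["pod_gpu_request"] = request
	}
	if limit > 0 {
		fields["pod_gpu_limit"] = limit
	}
}

func main() {
	containers := []container{
		{
			requests: map[string]int64{gpuResourceName: 1},
			limits:   map[string]int64{gpuResourceName: 1},
		},
		{}, // sidecar without GPU resources
	}
	fields := map[string]any{"Type": "Pod"}
	request, limit := podGPUCounts(containers)
	addGPUFields(fields, request, limit)
	fmt.Println(fields)
}
```

Skipping zero values is one way to keep events for non-GPU pods unchanged, which matches the validation described under Testing.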

This change also tweaks the metrics to more closely match the cpu metrics:

- Rename (pod|node)_gpu_total -> (pod|node)_gpu_usage_total
- Add (pod|node)_gpu_reserved_capacity
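The PR does not spell out the formula for the new reserved capacity metric. A minimal sketch, assuming it mirrors the cpu reserved capacity metric (requested amount as a percentage of the node's total), could look like this. The function name and the zero-capacity handling are assumptions for the example.

```go
package main

import "fmt"

// reservedCapacity returns requested as a percentage of capacity.
// The second return value is false when capacity is not positive,
// for example on a node without GPUs, so callers can skip the metric.
// The formula is an assumption modelled on cpu reserved capacity.
func reservedCapacity(requested, capacity int64) (float64, bool) {
	if capacity <= 0 {
		return 0, false
	}
	return float64(requested) / float64(capacity) * 100, true
}

func main() {
	// Pods on a node with 8 GPUs request 2 GPUs in total.
	if pct, ok := reservedCapacity(2, 8); ok {
		fmt.Printf("node_gpu_reserved_capacity=%.1f\n", pct)
	}
	// A node without GPUs produces no metric at all.
	if _, ok := reservedCapacity(0, 0); !ok {
		fmt.Println("no GPU capacity, metric skipped")
	}
}
```

Returning a boolean rather than zero keeps "no GPUs on this node" distinct from "GPUs present but none requested".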

Testing: Deployed the changes to a test cluster running a mix of GPU and non-GPU nodes.

- Validated that the overall EMF log count remains the same before and after the changes.
- Validated that EMF log events for pods and nodes not related to GPU workloads remain unaffected.
- Validated that EMF log events for GPU pods and nodes now publish the metrics as expected.

sky333999 force-pushed the sky333999/gpu-metrics branch from 9979856 to 8a39713 Compare August 13, 2024 20:05

Move GPU request, limit, usage and reserved_capacity metrics into pod…

9c631d4

… and node stores

sky333999 force-pushed the sky333999/gpu-metrics branch from 8a39713 to 9c631d4 Compare August 14, 2024 02:25

sky333999 changed the title ~~Update GPU Metrics calculations for request, limit, total~~ Move GPU metrics for request, limit, usage & reserved_capacity into pod store Aug 14, 2024

sky333999 marked this pull request as ready for review August 14, 2024 02:49

sky333999 requested a review from mxiamxia as a code owner August 14, 2024 02:49

sky333999 mentioned this pull request Aug 14, 2024

[containerinsights] Update GPU usage metrics emitted aws/amazon-cloudwatch-agent#1298

Merged

movence reviewed Aug 14, 2024

receiver/awscontainerinsightreceiver/internal/stores/utils_test.go

receiver/awscontainerinsightreceiver/internal/stores/podstore.go (outdated)

sky333999 added 2 commits August 14, 2024 10:55

Update test

bab8231

Minor tweaks

959860c

dricross approved these changes Aug 14, 2024

movence approved these changes Aug 14, 2024

sky333999 merged commit 5427408 into aws-cwa-dev Aug 14, 2024
146 checks passed

sky333999 deleted the sky333999/gpu-metrics branch August 14, 2024 17:04