Add gpu count metrics including limit, request and total counts #214

movence · 2024-05-20T16:15:28Z

**Revision**

Remove the logic to copy resource metric attributes down to data points

Description:
Add NVIDIA GPU count metrics (_limit, _request and _total) at the pod, node and cluster levels. With this change, the leader agent pod collects the GPU count metrics instead of each individual agent pod. This makes it possible to report limit and request metrics for pods that are still in Pending status because no GPU devices are available. The leader agent still gathers limit and request information for pending pods, but the *_total metric only includes GPU counts from running pods.

Testing:
Tested on a cluster with 2 g4dn.12xlarge instances, each with 4 GPU devices. There are 4 RUNNING pods (8 allocated GPU devices in total) and 2 PENDING pods requesting 2 GPU devices each.
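
To make the expected numbers for this test cluster concrete, here is a minimal sketch of the counting rule from the description. It is not the code in this PR: `Pod`, `GPUCounts` and `AggregateGPUCounts` are hypothetical names used only for illustration. It assumes that the 8 allocated devices are split evenly across the 4 running pods (2 each), and that `_total` is the sum of GPU limits over running pods only.

```go
package main

import "fmt"

// PodPhase is a simplified stand-in for the Kubernetes pod phase.
type PodPhase string

const (
	PhaseRunning PodPhase = "Running"
	PhasePending PodPhase = "Pending"
)

// Pod is a hypothetical, trimmed-down pod record; the real agent reads
// these values from the Kubernetes API server.
type Pod struct {
	Name       string
	Phase      PodPhase
	GPURequest int
	GPULimit   int
}

// GPUCounts holds the three cluster-level counters discussed in this PR.
type GPUCounts struct {
	Request int
	Limit   int
	Total   int
}

// AggregateGPUCounts sums request and limit over all pods, including
// pending ones, but counts Total only for running pods.
// Assumption: Total is taken from the limit of each running pod.
func AggregateGPUCounts(pods []Pod) GPUCounts {
	var c GPUCounts
	for _, p := range pods {
		c.Request += p.GPURequest
		c.Limit += p.GPULimit
		if p.Phase == PhaseRunning {
			c.Total += p.GPULimit
		}
	}
	return c
}

func main() {
	// Assumed layout of the test cluster: 4 running pods with 2 GPUs each,
	// plus 2 pending pods requesting 2 GPUs each.
	var pods []Pod
	for i := 0; i < 4; i++ {
		pods = append(pods, Pod{Name: fmt.Sprintf("running-%d", i), Phase: PhaseRunning, GPURequest: 2, GPULimit: 2})
	}
	for i := 0; i < 2; i++ {
		pods = append(pods, Pod{Name: fmt.Sprintf("pending-%d", i), Phase: PhasePending, GPURequest: 2, GPULimit: 2})
	}
	fmt.Printf("%+v\n", AggregateGPUCounts(pods)) // prints {Request:12 Limit:12 Total:8}
}
```

Under these assumptions the cluster-level request and limit come out at 12 while total stays at 8, which matches the 8 physical devices on the two instances.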

internal/aws/containerinsight/utils.go

receiver/awscontainerinsightreceiver/internal/k8sapiserver/k8sapiserver.go

internal/aws/containerinsight/const.go

internal/aws/containerinsight/utils_test.go

movence requested a review from mxiamxia as a code owner May 20, 2024 16:15

movence requested review from mitali-salvi and sky333999 and removed request for mxiamxia May 21, 2024 12:19

movence mentioned this pull request May 21, 2024

nvidia gpu count metrics and bugfix aws/amazon-cloudwatch-agent#1183

Merged

sky333999 reviewed May 23, 2024

mitali-salvi reviewed May 23, 2024

internal/aws/containerinsight/utils.go (two outdated review threads, resolved)

movence added 2 commits May 23, 2024 11:16

add gpu count metrics including limit, request and total counts at pod, node and cluster levels

87b2dcd

revert removed log while getting metric prefix

1ed1342

remove bool flag for ConvertToOTLPMetrics

movence force-pushed the ci-gpu-count branch from c31e7e7 to 1ed1342 Compare May 24, 2024 15:32

remove the extra process to copy attributes

e7eafc4

sky333999 approved these changes May 24, 2024

mitali-salvi approved these changes May 29, 2024

movence merged commit 2728c19 into amazon-contributing:aws-cwa-dev May 29, 2024
111 of 122 checks passed

sky333999 mentioned this pull request Aug 14, 2024

Move GPU metrics for request, limit, usage & reserved_capacity into pod store #225

Merged
