[containerinsights] Update GPU usage metrics emitted #1298

Merged
merged 3 commits into main on Aug 14, 2024

sky333999
Contributor

Description of changes

Updating metric declarations for GPU request, limit, usage & reserved_capacity metrics to more closely resemble CPU metrics. Depends on amazon-contributing/opentelemetry-collector-contrib#225.

  • Renaming (pod|node)_gpu_total to (pod|node)_gpu_usage_total
  • Adding (pod|node)_gpu_reserved_capacity metrics
  • Dropping cluster_gpu_* metrics since those can be derived using the [ClusterName] aggregation on the node_gpu_* metrics.
  • node_gpu_request will exist in the EMF entry but will not be extracted as a metric, similar to node_cpu_request.
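
The reasoning for dropping cluster_gpu_* is that aggregating node-level datapoints over the [ClusterName] dimension set yields the cluster-level figure. A minimal sketch of that idea, assuming made-up sample values and a plain sum as the statistic; the helper name `aggregate_by_cluster` is hypothetical and not part of the agent:

```python
from collections import defaultdict

# Hypothetical node-level datapoints; the values are illustrative only.
node_datapoints = [
    {"ClusterName": "demo", "NodeName": "node-a", "node_gpu_limit": 4, "node_gpu_usage_total": 3},
    {"ClusterName": "demo", "NodeName": "node-b", "node_gpu_limit": 8, "node_gpu_usage_total": 2},
]


def aggregate_by_cluster(datapoints, metric):
    """Sum one node-level metric per ClusterName, mimicking a Sum statistic."""
    totals = defaultdict(int)
    for point in datapoints:
        totals[point["ClusterName"]] += point[metric]
    return dict(totals)


# Cluster-level values derived purely from node-level metrics.
cluster_gpu_limit = aggregate_by_cluster(node_datapoints, "node_gpu_limit")
cluster_gpu_usage = aggregate_by_cluster(node_datapoints, "node_gpu_usage_total")
```

Whether Sum or another statistic is appropriate depends on the metric; a percentage such as node_gpu_reserved_capacity would likely call for an average instead.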

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  • Deployed changes to a cluster running both GPU and non-GPU nodes and validated the EMF log events look as expected.
  • New metrics being extracted from an EMF log of Type Pod for a GPU pod:
{
  "CloudWatchMetrics": [
    {
      "Namespace": "ContainerInsights",
      "Dimensions": [
        [
          "ClusterName",
          "Namespace",
          "PodName"
        ],
        [
          "ClusterName"
        ],
        [
          "ClusterName",
          "FullPodName",
          "Namespace",
          "PodName"
        ]
      ],
      "Metrics": [
        {
          "Name": "pod_gpu_reserved_capacity",
          "Unit": "Percent"
        },
        {
          "Name": "pod_gpu_limit",
          "Unit": "Count"
        },
        {
          "Name": "pod_gpu_usage_total",
          "Unit": "Count"
        },
        {
          "Name": "pod_gpu_request",
          "Unit": "Count"
        }
      ]
    }
  ]
}
  • New metrics being extracted from an EMF log of Type Node for a GPU node:
{
  "CloudWatchMetrics": [
    {
      "Namespace": "ContainerInsights",
      "Dimensions": [
        [
          "ClusterName"
        ],
        [
          "ClusterName",
          "InstanceId",
          "NodeName"
        ]
      ],
      "Metrics": [
        {
          "Name": "node_gpu_reserved_capacity",
          "Unit": "Percent"
        },
        {
          "Name": "node_gpu_limit",
          "Unit": "Count"
        },
        {
          "Name": "node_gpu_usage_total",
          "Unit": "Count"
        }
      ]
    }
  ]
}
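
In an EMF log event, only the fields listed under "Metrics" are extracted, once per dimension set; other fields stay in the log entry as plain properties. The sketch below illustrates this for a Node-type event and shows why node_gpu_request can be present without becoming a metric. The event is trimmed, all values (cluster name, instance ID, numbers) are made up, and `extracted_metrics` is a hypothetical helper, not an agent API:

```python
import json

# Trimmed, hypothetical Node-type EMF event; the values are made up.
emf_event = json.loads("""
{
  "CloudWatchMetrics": [
    {
      "Namespace": "ContainerInsights",
      "Dimensions": [["ClusterName"], ["ClusterName", "InstanceId", "NodeName"]],
      "Metrics": [
        {"Name": "node_gpu_reserved_capacity", "Unit": "Percent"},
        {"Name": "node_gpu_limit", "Unit": "Count"},
        {"Name": "node_gpu_usage_total", "Unit": "Count"}
      ]
    }
  ],
  "ClusterName": "demo",
  "InstanceId": "i-0123456789abcdef0",
  "NodeName": "node-a",
  "node_gpu_limit": 4,
  "node_gpu_request": 2,
  "node_gpu_usage_total": 2,
  "node_gpu_reserved_capacity": 50
}
""")


def extracted_metrics(event):
    """Return (metric name, dimension set) pairs declared in the EMF metadata."""
    pairs = []
    for directive in event["CloudWatchMetrics"]:
        for dims in directive["Dimensions"]:
            for metric in directive["Metrics"]:
                pairs.append((metric["Name"], tuple(dims)))
    return pairs


pairs = extracted_metrics(emf_event)
declared = {name for name, _ in pairs}

# node_gpu_request is a field of the entry but is not declared as a metric.
request_is_field_only = "node_gpu_request" in emf_event and "node_gpu_request" not in declared
```

With three declared metrics and two dimension sets, this event yields six metric and dimension-set combinations, and node_gpu_request remains queryable only through the log entry itself.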

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@sky333999 sky333999 requested a review from a team as a code owner August 14, 2024 13:32
dricross
dricross previously approved these changes Aug 14, 2024
@sky333999 sky333999 merged commit 1f6c19c into main Aug 14, 2024
6 checks passed
@sky333999 sky333999 deleted the sky333999/gpu-metrics branch August 14, 2024 16:07