Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support NVIDIA GPU metrics #1033

Merged
merged 21 commits into from
Mar 1, 2024
Merged

Support NVIDIA GPU metrics #1033

merged 21 commits into from
Mar 1, 2024

Conversation

movence
Copy link
Contributor

@movence movence commented Feb 16, 2024

Revision

  • rev3:
    • prebuild gpu attribute filter lists
    • update test cases
    • update feature toggle variable name
  • rev2:
    • update feature toggle flag to accelerated_compute_metrics
    • use constant variables
    • use slices for label filtering
    • stop adding gpuattribtues processor when it's turned off
  • rev1:
    • revive GPU metric processor to filter out attributes
    • remove emf exporter configs for adding metric units. Units are being added in GPU decorator in CI receiver.

GPU Metric flow

  • awscontainerinsightsreceiver/dcgmscraper (1m polling interval)
  • gpudecorator (metrics consumer): add attributes including kubernetes and PodName using k8s stores
  • metrictransfomer (OTEL processor): duplicates container level metrics to pod/node levels (container_ to pod_ and node_)
  • gpuattributes (processor): filter out kubernetes attribute blob for corresponding resource types by checking metric name prefix
  • awsemfexporter

Description of changes

This change is a new feature to support NVIDIA GPU metrics in k8s clusters. The agent will get NVIDIA GPU metrics by scraping an prometheus endpoint exposed by dcgm-exporter (PR). The changes include:

  • add translation rules for metrictransformer processor
  • add GPU processor and register it to container insights pipeline

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Tested on a test cluster with 1 GPU node (with 4 GPUs) + 1 regular node: only 3 GPUs get workload

  • running pods
amazon-cloudwatch  cloudwatch-agent-52jfj                                           ●  1/1   Running         0 x.x.x.x  ip-x.x.x.x.us-west │.us-west │
│ amazon-cloudwatch  cloudwatch-agent-gr4kr                                           ●  1/1   Running         0 x.x.x.x  ip-x.x.x.x.us-west │.us-west │
│ amazon-cloudwatch  dcgm-exporter-md5p4                                              ●  1/1   Running         0 x.x.x.x  ip-x.x.x.x.us-west │
│ amazon-cloudwatch  gpu-burn-1-5b78d96f7b-w8kgf                                      ●  1/1   Running         0 x.x.x.x  ip-x.x.x.x.us-west │
Screenshot 2024-02-14 at 11 33 19 AM

Requirements

Before commit the code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@movence movence requested a review from a team as a code owner February 16, 2024 17:34
plugins/processors/gpu/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpu/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpu/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpu/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpu/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpu/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpu/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpu/processor.go Outdated Show resolved Hide resolved
translator/translate/otel/exporter/awsemf/kubernetes.go Outdated Show resolved Hide resolved
Copy link
Contributor

@straussb straussb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help if your commit message / PR description / comment in the code included a description of the path of the metrics as they travel through the various components.

awscontainerinsightreceiver (DcgmScraper -> decoratorConsumer) -> metricstransformprocessor -> gpuattributesprocessor -> awsemfexporter.

})
if err := c.Unmarshal(&cfg); err != nil {
return nil, fmt.Errorf("unable to unmarshal into metricstransform config: %w", err)
}

return cfg, nil
}

func isGpuEnabled(conf *confmap.Conf) bool {
return common.GetOrDefaultBool(conf, common.ConfigKey(common.LogsKey, common.MetricsCollectedKey, common.KubernetesKey, common.EnableGpuMetric), true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does EnableGpuMetric get set in the JSON in the first place? (Operator question.)

Copy link
Contributor Author

@movence movence Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually more of an agent config and addon related question, but customers need to add an entry in agent json config like following:

"logs": {
    "metrics_collected": {
      "kubernetes": {
        "enhanced_container_insights": true,
        "accelerated_compute_metrics": false
      }
    }
}

Custom agent config should be supplied in Optional configuration settings of the addon. For helm, the default config should be updated in values.yaml file.

plugins/processors/gpuattributes/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpuattributes/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpuattributes/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpuattributes/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpuattributes/processor.go Outdated Show resolved Hide resolved
internal/containerinsightscommon/const.go Outdated Show resolved Hide resolved
translator/translate/otel/common/common.go Outdated Show resolved Hide resolved
plugins/processors/gpuattributes/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpuattributes/processor.go Outdated Show resolved Hide resolved
plugins/processors/gpuattributes/processor.go Outdated Show resolved Hide resolved
containerinsightscommon.GpuUniqueId,
}

var containerK8sBlobLabels = []string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the extra "kubernetes" blob fields that you want to get rid of?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really filtering out anything at container level, but listing all in the array to keep what's populated by the decorator.

straussb
straussb previously approved these changes Mar 1, 2024
straussb
straussb previously approved these changes Mar 1, 2024
@codecov-commenter
Copy link

codecov-commenter commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 89.30233% with 23 lines in your changes are missing coverage. Please review.

Project coverage is 63.78%. Comparing base (96d4763) to head (cf0292f).
Report is 508 commits behind head on main.

Files Patch % Lines
plugins/processors/gpuattributes/processor.go 75.34% 12 Missing and 6 partials ⚠️
plugins/processors/gpuattributes/factory.go 85.71% 2 Missing and 1 partial ⚠️
plugins/processors/gpuattributes/config.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1033      +/-   ##
==========================================
+ Coverage   57.58%   63.78%   +6.20%     
==========================================
  Files         370      369       -1     
  Lines       17548    19186    +1638     
==========================================
+ Hits        10105    12238    +2133     
+ Misses       6848     6311     -537     
- Partials      595      637      +42     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@okankoAMZ okankoAMZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@movence movence merged commit ba44b4e into aws:main Mar 1, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants