Support NVIDIA GPU metrics #1033

movence · 2024-02-16T17:34:22Z

Revision

rev3:
- prebuild gpu attribute filter lists
- update test cases
- update feature toggle variable name
rev2:
- update feature toggle flag to accelerated_compute_metrics
- use constant variables
- use slices for label filtering
- stop adding gpuattributes processor when it's turned off
rev1:
- revive GPU metric processor to filter out attributes
- remove emf exporter configs for adding metric units; units are now added by the GPU decorator in the CI receiver.

GPU Metric flow

awscontainerinsightsreceiver/dcgmscraper (1m polling interval)
gpudecorator (metrics consumer): add attributes including kubernetes and PodName using k8s stores
metrictransformer (OTEL processor): duplicates container level metrics to pod/node levels (container_ to pod_ and node_)
gpuattributes (processor): filter out kubernetes attribute blob for corresponding resource types by checking metric name prefix
awsemfexporter
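
The prefix handling in the flow above can be sketched as follows. This is a minimal illustration, not the agent's actual code: the function names and the metric name `container_gpu_utilization` are hypothetical, and only the three prefixes come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// Prefixes named in the flow description. The metric name used in main is
// hypothetical and only serves as an illustration.
const (
	containerPrefix = "container_"
	podPrefix       = "pod_"
	nodePrefix      = "node_"
)

// duplicateToLevels returns the original container-level metric name plus
// pod- and node-level copies. Names without the container prefix are
// returned unchanged.
func duplicateToLevels(name string) []string {
	if !strings.HasPrefix(name, containerPrefix) {
		return []string{name}
	}
	suffix := strings.TrimPrefix(name, containerPrefix)
	return []string{name, podPrefix + suffix, nodePrefix + suffix}
}

// resourceLevel infers the resource type from the metric name prefix, which
// is how a later stage could decide which attributes to keep.
func resourceLevel(name string) string {
	switch {
	case strings.HasPrefix(name, containerPrefix):
		return "container"
	case strings.HasPrefix(name, podPrefix):
		return "pod"
	case strings.HasPrefix(name, nodePrefix):
		return "node"
	default:
		return "unknown"
	}
}

func main() {
	for _, n := range duplicateToLevels("container_gpu_utilization") {
		fmt.Printf("%s -> %s\n", n, resourceLevel(n))
	}
}
```

In the real pipeline the duplication is configured as metricstransform rules rather than written by hand; the sketch only shows the naming convention the later filtering step relies on.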

Description of changes

This change adds a new feature that supports NVIDIA GPU metrics in k8s clusters. The agent collects NVIDIA GPU metrics by scraping a Prometheus endpoint exposed by dcgm-exporter (PR). The changes include:

add translation rules for metrictransformer processor
add GPU processor and register it to container insights pipeline

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Tested on a cluster with one GPU node (4 GPUs) and one regular node; only 3 of the GPUs were given a workload.

running pods

amazon-cloudwatch  cloudwatch-agent-52jfj       ●  1/1  Running  0  x.x.x.x  ip-x.x.x.x.us-west
amazon-cloudwatch  cloudwatch-agent-gr4kr       ●  1/1  Running  0  x.x.x.x  ip-x.x.x.x.us-west
amazon-cloudwatch  dcgm-exporter-md5p4          ●  1/1  Running  0  x.x.x.x  ip-x.x.x.x.us-west
amazon-cloudwatch  gpu-burn-1-5b78d96f7b-w8kgf  ●  1/1  Running  0  x.x.x.x  ip-x.x.x.x.us-west

Requirements

Before committing the code, please complete the following steps.

Run make fmt and make fmt-sh
Run make lint

straussb

It would help if your commit message / PR description / comment in the code included a description of the path of the metrics as they travel through the various components.

awscontainerinsightreceiver (DcgmScraper -> decoratorConsumer) -> metricstransformprocessor -> gpuattributesprocessor -> awsemfexporter.

straussb · 2024-02-27T21:56:16Z

translator/translate/otel/processor/metricstransformprocessor/translator.go

 	})
 	if err := c.Unmarshal(&cfg); err != nil {
 		return nil, fmt.Errorf("unable to unmarshal into metricstransform config: %w", err)
 	}

 	return cfg, nil
 }
+
+func isGpuEnabled(conf *confmap.Conf) bool {
+	return common.GetOrDefaultBool(conf, common.ConfigKey(common.LogsKey, common.MetricsCollectedKey, common.KubernetesKey, common.EnableGpuMetric), true)


How does EnableGpuMetric get set in the JSON in the first place? (Operator question.)

It's actually more of an agent config and add-on question, but customers need to add an entry to the agent JSON config like the following:

"logs": { "metrics_collected": { "kubernetes": { "enhanced_container_insights": true, "accelerated_compute_metrics": false } } }

A custom agent config should be supplied in the Optional configuration settings of the add-on. For Helm, the default config should be updated in the values.yaml file.
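
A minimal sketch of how such a toggle could be read with a default, assuming the JSON layout shown above. The helpers `parseConfig`, `lookupBool`, and `gpuMetricsEnabled` are hypothetical stand-ins written for this example; they are not the agent's `common.GetOrDefaultBool`, although the default of `true` mirrors the value passed in the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseConfig decodes a raw JSON agent config into a generic map.
func parseConfig(raw string) (map[string]interface{}, error) {
	var conf map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &conf); err != nil {
		return nil, err
	}
	return conf, nil
}

// lookupBool walks the nested map along path and returns the boolean found
// there, or def when any key is missing or the value is not a boolean.
func lookupBool(conf map[string]interface{}, def bool, path ...string) bool {
	var cur interface{} = conf
	for _, key := range path {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return def
		}
		if cur, ok = m[key]; !ok {
			return def
		}
	}
	if b, ok := cur.(bool); ok {
		return b
	}
	return def
}

// gpuMetricsEnabled is a hypothetical helper: the feature is treated as
// enabled unless the config explicitly sets the flag to false.
func gpuMetricsEnabled(conf map[string]interface{}) bool {
	return lookupBool(conf, true,
		"logs", "metrics_collected", "kubernetes", "accelerated_compute_metrics")
}

func main() {
	raw := `{"logs": {"metrics_collected": {"kubernetes": {
		"enhanced_container_insights": true,
		"accelerated_compute_metrics": false}}}}`
	conf, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("GPU metrics enabled:", gpuMetricsEnabled(conf))
}
```

With a default of `true`, omitting the key leaves GPU metrics on, so the entry is only needed to opt out.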

straussb · 2024-02-29T22:08:43Z

plugins/processors/gpuattributes/processor.go

+	containerinsightscommon.GpuUniqueId,
+}
+
+var containerK8sBlobLabels = []string{


What are the extra "kubernetes" blob fields that you want to get rid of?

Nothing is actually filtered out at the container level; the array lists every field so that everything populated by the decorator is kept.
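
The keep-list idea from this exchange can be sketched roughly as below. It assumes the "kubernetes" attribute is a JSON object serialized as a string; the field names (`host`, `pod_name`, `labels`) and the helpers `buildKeepSet` and `filterBlob` are hypothetical and chosen only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildKeepSet turns a keep list into a set once, so that per-metric
// filtering does not rebuild it (similar in spirit to prebuilding the
// filter lists).
func buildKeepSet(labels []string) map[string]struct{} {
	set := make(map[string]struct{}, len(labels))
	for _, l := range labels {
		set[l] = struct{}{}
	}
	return set
}

// filterBlob drops every top-level field of the JSON blob that is not in
// the keep set and returns the re-serialized blob.
func filterBlob(blob string, keep map[string]struct{}) (string, error) {
	var fields map[string]interface{}
	if err := json.Unmarshal([]byte(blob), &fields); err != nil {
		return "", err
	}
	for k := range fields {
		if _, ok := keep[k]; !ok {
			delete(fields, k)
		}
	}
	out, err := json.Marshal(fields)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// Hypothetical keep list for one resource level.
	keep := buildKeepSet([]string{"host", "pod_name"})
	blob := `{"pod_name":"gpu-burn","labels":{"app":"gpu-burn"},"host":"node-1"}`
	filtered, err := filterBlob(blob, keep)
	if err != nil {
		panic(err)
	}
	fmt.Println(filtered)
}
```

A list that names every populated field, as described for the container level, makes the filter a no-op while keeping the code path identical for all resource types.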

codecov-commenter · 2024-03-01T22:56:56Z

Codecov Report

Attention: Patch coverage is 89.30233%, with 23 lines in your changes missing coverage. Please review.

Project coverage is 63.78%. Comparing base (96d4763) to head (cf0292f).
Report is 508 commits behind head on main.

Files	Patch %	Lines
plugins/processors/gpuattributes/processor.go	75.34%	12 Missing and 6 partials ⚠️
plugins/processors/gpuattributes/factory.go	85.71%	2 Missing and 1 partial ⚠️
plugins/processors/gpuattributes/config.go	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1033      +/-   ##
==========================================
+ Coverage   57.58%   63.78%   +6.20%     
==========================================
  Files         370      369       -1     
  Lines       17548    19186    +1638     
==========================================
+ Hits        10105    12238    +2133     
+ Misses       6848     6311     -537     
- Partials      595      637      +42

okankoAMZ

Looks good to me

movence requested a review from a team as a code owner February 16, 2024 17:34

mitali-salvi reviewed Feb 16, 2024

jefchien reviewed Feb 27, 2024

straussb reviewed Feb 28, 2024

straussb reviewed Feb 29, 2024

straussb reviewed Mar 1, 2024

straussb previously approved these changes Mar 1, 2024

movence dismissed straussb’s stale review via e95d0b8 March 1, 2024 15:40

movence force-pushed the ci-nvidia-gpu branch from 3e47169 to e95d0b8 Compare March 1, 2024 15:47

movence mentioned this pull request Mar 1, 2024

Support NVIDIA GPU metrics #1068

Closed

movence force-pushed the ci-nvidia-gpu branch from e95d0b8 to 4f1bff7 Compare March 1, 2024 15:53

straussb previously approved these changes Mar 1, 2024

movence added 18 commits March 1, 2024 14:28

support nvidia gpu metrics and update test configs

af24536

add gpu processor and update metric declarations for gpu

90ef122

use metric transformer to handle gpu metrics and labels & clean up

91e8182

update metric trasformer rules & remove unused funcs/files

c652722

fix lint

10f9a33

remove gpu processor

1bbf610

Revert "remove gpu processor"

e413c95

This reverts commit 7a6784b24ef6a222f7f79ee1ad4b74909d3ccc2d.

bring gpu processor back to filter attributes

ecc28b6

update test

3f42b76

rename gpu processor package to gpuattributes and address comments

887a9d7

remove start from test

e6d95b8

update feature toggle flag to accelerated_compute_metrics

5a11adb

use constant variables use slices for label filtering stop adding gpuattribtues processor when it's turned off

prebuild gpu attribute filter lists

c47375d

update test cases update feature toggle variable name

fix format

8037d9b

update test

312a86b

prepopulate label filter

6452bcd

format

c70fefc

format

7803297

movence force-pushed the ci-nvidia-gpu branch from 4f1bff7 to 7803297 Compare March 1, 2024 19:38

update otel contrib

1d1fef0

movence dismissed straussb’s stale review via 1d1fef0 March 1, 2024 22:24

movence and others added 2 commits March 1, 2024 17:25

Merge branch 'main' into ci-nvidia-gpu

0e05e52

fix test

cf0292f

lisguo approved these changes Mar 1, 2024

okankoAMZ approved these changes Mar 1, 2024

movence merged commit ba44b4e into aws:main Mar 1, 2024
6 checks passed
