Support NVIDIA GPU metrics #1033
translator/translate/otel/processor/metricstransformprocessor/translator.go
It would help if your commit message / PR description / comment in the code included a description of the path of the metrics as they travel through the various components.
awscontainerinsightreceiver (DcgmScraper -> decoratorConsumer) -> metricstransformprocessor -> gpuattributesprocessor -> awsemfexporter.
})
if err := c.Unmarshal(&cfg); err != nil {
    return nil, fmt.Errorf("unable to unmarshal into metricstransform config: %w", err)
}

return cfg, nil
}

func isGpuEnabled(conf *confmap.Conf) bool {
    return common.GetOrDefaultBool(conf, common.ConfigKey(common.LogsKey, common.MetricsCollectedKey, common.KubernetesKey, common.EnableGpuMetric), true)
How does EnableGpuMetric get set in the JSON in the first place? (Operator question.)
This is really more of an agent config and addon question, but customers need to add an entry to the agent JSON config like the following:
"logs": {
"metrics_collected": {
"kubernetes": {
"enhanced_container_insights": true,
"accelerated_compute_metrics": false
}
}
}
A custom agent config should be supplied in the Optional configuration settings of the addon. For Helm, the default config should be updated in the values.yaml file.
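As a rough illustration of how such a toggle can be read, here is a minimal Go sketch. It assumes the key path is logs → metrics_collected → kubernetes → accelerated_compute_metrics (taken from the JSON example above) and that a missing key means "enabled", matching the `true` default passed to `GetOrDefaultBool` in the diff. The helpers `lookupBool` and `gpuEnabled` are hypothetical and are not part of the agent's code; the sketch uses plain `encoding/json` rather than `confmap`, and needs Go 1.18 or later for `any`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lookupBool (hypothetical helper) walks nested JSON objects along path and
// returns the boolean found there. It returns def when any key is missing or
// the final value is not a boolean.
func lookupBool(conf map[string]any, def bool, path ...string) bool {
	var cur any = conf
	for _, key := range path {
		m, ok := cur.(map[string]any)
		if !ok {
			return def
		}
		cur, ok = m[key]
		if !ok {
			return def
		}
	}
	if b, ok := cur.(bool); ok {
		return b
	}
	return def
}

// gpuEnabled (hypothetical helper) parses a raw agent JSON config and reports
// whether GPU metrics are enabled. The key path is an assumption based on the
// JSON example in this thread; the default is true when the key is absent.
func gpuEnabled(raw string) (bool, error) {
	var conf map[string]any
	if err := json.Unmarshal([]byte(raw), &conf); err != nil {
		return false, fmt.Errorf("parse agent config: %w", err)
	}
	return lookupBool(conf, true,
		"logs", "metrics_collected", "kubernetes", "accelerated_compute_metrics"), nil
}

func main() {
	disabled := `{"logs":{"metrics_collected":{"kubernetes":{"enhanced_container_insights":true,"accelerated_compute_metrics":false}}}}`
	omitted := `{"logs":{"metrics_collected":{"kubernetes":{"enhanced_container_insights":true}}}}`
	for _, raw := range []string{disabled, omitted} {
		on, err := gpuEnabled(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("GPU metrics enabled:", on)
	}
}
```

Under these assumptions, customers who do nothing get GPU metrics, and only an explicit `false` turns them off.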
    containerinsightscommon.GpuUniqueId,
}

var containerK8sBlobLabels = []string{
What are the extra "kubernetes" blob fields that you want to get rid of?
Nothing is actually filtered out at the container level; the array lists every field so that everything populated by the decorator is kept.
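The allow-list approach discussed here can be sketched roughly as follows. This is an illustration only, not the processor's actual code: the label names, the metric names, and the helpers `allowedFor` and `filterBlob` are hypothetical. The only points taken from this PR are that labels are kept via per-resource-type string slices and that the resource type is chosen by the metric name prefix (`container_`, `pod_`, `node_`). The `slices` package requires Go 1.21 or later.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// Hypothetical allow-lists. The real label names live in the PR's processor code.
// The container list names every field, so nothing is dropped at that level.
var (
	containerLabels = []string{"container_name", "pod_name", "host"}
	podLabels       = []string{"pod_name", "host"}
	nodeLabels      = []string{"host"}
)

// allowedFor picks an allow-list based on the metric name prefix.
// It returns nil for metric names with an unrecognized prefix.
func allowedFor(metricName string) []string {
	switch {
	case strings.HasPrefix(metricName, "container_"):
		return containerLabels
	case strings.HasPrefix(metricName, "pod_"):
		return podLabels
	case strings.HasPrefix(metricName, "node_"):
		return nodeLabels
	}
	return nil
}

// filterBlob returns a copy of blob that contains only the allowed keys.
func filterBlob(blob map[string]any, allowed []string) map[string]any {
	out := make(map[string]any, len(allowed))
	for k, v := range blob {
		if slices.Contains(allowed, k) {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Hypothetical "kubernetes" blob as a decorator might populate it.
	blob := map[string]any{"container_name": "trainer", "pod_name": "job-0", "host": "node-a"}
	// Hypothetical metric names, one per resource type.
	for _, name := range []string{"container_gpu_utilization", "pod_gpu_utilization", "node_gpu_utilization"} {
		fmt.Println(name, filterBlob(blob, allowedFor(name)))
	}
}
```

Listing every container-level field explicitly keeps the three resource types symmetric, at the cost of having to update the slice whenever the decorator starts populating a new field.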
This reverts commit 7a6784b24ef6a222f7f79ee1ad4b74909d3ccc2d.
use constant variables
use slices for label filtering
stop adding gpuattributes processor when it's turned off
update test cases
update feature toggle variable name
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1033      +/-   ##
==========================================
+ Coverage   57.58%   63.78%   +6.20%
==========================================
  Files         370      369       -1
  Lines       17548    19186    +1638
==========================================
+ Hits        10105    12238    +2133
+ Misses       6848     6311     -537
- Partials      595      637      +42
Looks good to me
Revision
... accelerated_compute_metrics ...
GPU Metric flow
... kubernetes ... and PodName using k8s stores
... container_ to pod_ and node_)
... kubernetes attribute blob for corresponding resource types by checking metric name prefix
Description of changes
This change adds a new feature: support for NVIDIA GPU metrics in k8s clusters. The agent gets NVIDIA GPU metrics by scraping a Prometheus endpoint exposed by dcgm-exporter (PR). The changes include:
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Tested on a test cluster with 1 GPU node (with 4 GPUs) + 1 regular node: only 3 GPUs get workload
Requirements
Before committing the code, please do the following steps.
make fmt and make fmt-sh
make lint