Add dcgm scraper to collect nvidia GPU metrics #160
Conversation
},
},
RelabelConfigs: []*relabel.Config{
	{
Might be a little late, but why exactly do we need this? Is this used somewhere in the agent in the end?
Some rules need to be removed. Thanks!
@@ -151,6 +151,10 @@ const (
// Special type for pause container
// because containerd does not set container name pause container name to POD like docker does.
TypeInfraContainer = "InfraContainer"
TypeGpuContainer = "ContainerGPU"
TypeGpuPod = "PodGPU"
TypeGpuNode = "NodeGPU"
TypeGpuNode and TypeGpuCluster are not used anywhere; do we need these?
You are right that they are not currently being used. The intention is to use them for GPU count metrics.
// Copyright The OpenTelemetry Authors
// SPDX-License-Identifier: Apache-2.0

package gpu
Can we make this decorator generic? I get that right now it's only for GPU, but we could make it configurable, i.e. it adds units based on the map provided. Also, adding the K8s labels seems re-usable, say in the Neuron monitor.
Strongly agree with this.
}

// value is not needed for label decoration
fields[m.Name()] = 0
nit: Could set it to nil to be more explicit that it's not going to be used. Is this the only field?
logger.Warn("Unsupported metric type", zap.String("metric", m.Name()), zap.String("type", m.Type().String()))
}

// should support metrics with more than 1 datapoints?
if dps.Len() > 1 {
	logger.Warn("Metric with more than 1 datapoint is not supported", zap.String("metric", m.Name()), zap.Int("datapoints", dps.Len()))
How would we ever know if we're hitting these in production? Do we have a guarantee that DcgmScraper is only producing metrics in your supported format? How do you know it won't change? (Not to mention, by putting this method in a common utils package, you're implying that any OTLP metrics producer should be able to call it...)
If we have to do this ConvertToFieldsAndTags, then probably you should just handle all cases. Return an array if you have to?
Long term, it kind of seems like the decorator should become a processor and operate on OTLP metrics directly, so we don't have to do this double-conversion back and forth (OTLP -> fields/tags -> OTLP)... does that sound right?
Makes sense. There is no guarantee that it won't change. I will update to handle all cases.
It was done this way to leverage the existing k8s decorator, but I do agree we can skip those conversions and make it more like a processor that selectively decorates GPU-specific attributes without pulling in all the unnecessary attributes from k8sdecorator. This could be a future improvement.
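To illustrate the "handle all cases" change being agreed on here, a minimal sketch of a per-datapoint conversion. The types below (`datapoint`, `metric`, `FieldsAndTagsPair`) are simplified, hypothetical stand-ins for the real `pmetric` types and the struct discussed in this thread; the point is only that the converter returns one fields/tags pair per datapoint instead of warning and dropping metrics with more than one.

```go
package main

import "fmt"

// Simplified stand-ins for pmetric types; the real code walks
// pmetric.Metric datapoints. Hypothetical for illustration only.
type datapoint struct {
	value float64
	attrs map[string]string
}

type metric struct {
	name string
	dps  []datapoint
}

// FieldsAndTagsPair mirrors the struct mentioned in the review.
type FieldsAndTagsPair struct {
	Fields map[string]any
	Tags   map[string]string
}

// convertToFieldsAndTags returns one pair per datapoint, so metrics with
// more than one datapoint are handled instead of logged and skipped.
func convertToFieldsAndTags(m metric) []FieldsAndTagsPair {
	var converted []FieldsAndTagsPair
	for _, dp := range m.dps {
		tags := make(map[string]string, len(dp.attrs))
		for k, v := range dp.attrs {
			tags[k] = v
		}
		converted = append(converted, FieldsAndTagsPair{
			// value is not needed for label decoration
			Fields: map[string]any{m.name: nil},
			Tags:   tags,
		})
	}
	return converted
}

func main() {
	m := metric{
		name: "DCGM_FI_DEV_GPU_UTIL",
		dps: []datapoint{
			{value: 35, attrs: map[string]string{"gpu": "0"}},
			{value: 70, attrs: map[string]string{"gpu": "1"}},
		},
	}
	pairs := convertToFieldsAndTags(m)
	fmt.Println(len(pairs), pairs[0].Tags["gpu"], pairs[1].Tags["gpu"])
}
```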
Missed making this update?
resourceTags := make(map[string]string)
rms := md.ResourceMetrics()
for i := 0; i < rms.Len(); i++ {
Probably better to move this entire loop into the ConvertToFieldsAndTags() method. It would be the inverse of the existing ConvertToOTLPMetrics() method. Take in pmetric.Metrics and return an array of fields/tags pairs (or wrap in a struct). Then here in decoratorConsumer, iterate through again and call the decorator.
That was my original idea as well, to make it the inverse of ConvertToOTLPMetrics. However, existing CI metrics get populated on the fly, so it makes sense to start from fields/tags, where the tags get applied at the resource level. For GPU, the OTLP metrics are already constructed, and they just need to be decorated at the datapoint level by leveraging the k8sdecorator, which consumes fields/tags. With this approach, we can avoid the last portion of the conversion process: OTLP -> fields/tags -> OTLP.
Oh, interesting point. We are only using K8sDecorator for tags in this case; we don't need the fields at all. Hmm... seems like we could have done some additional refactoring then: separate K8sDecorator into a series of sub-decorators - cadvisor would use all, but dcgm would only use the couple that it needs. I won't push on it now though.
Also worth noting that Cadvisor/K8sDecorator only work this way because they were ported from the original CWAgent. If we were writing from scratch, we would probably operate on OTLP metrics directly.
Ah, that makes sense. I was wondering why the existing decorator wouldn't just start with OTLP metrics.
Breaking up k8sdecorator will be an interesting one. It currently depends on 2 stores (podstore and servicestore) that it uses to decorate. podstore is the main one, with most of the data (such as node and container) available at the pod level. Making these pluggable for different needs would be nice and would reduce unnecessary overhead and processing.
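The sub-decorator idea floated above could be sketched as a small chain. Everything below is hypothetical: the interface, `podDecorator`/`svcDecorator`, and the placeholder lookups stand in for store-backed decorators, just to show how cadvisor could opt into all of them while dcgm opts into only what it needs.

```go
package main

import "fmt"

// Decorator is a hypothetical sub-decorator interface; in the real code
// each implementation would be backed by a store (podstore, servicestore).
type Decorator interface {
	Decorate(tags map[string]string) map[string]string
}

type podDecorator struct{} // would wrap podstore lookups
type svcDecorator struct{} // would wrap servicestore lookups

func (podDecorator) Decorate(tags map[string]string) map[string]string {
	tags["PodName"] = "example-pod" // placeholder for a podstore lookup
	return tags
}

func (svcDecorator) Decorate(tags map[string]string) map[string]string {
	tags["Service"] = "example-svc" // placeholder for a servicestore lookup
	return tags
}

// chain applies only the sub-decorators a given source opts into.
func chain(tags map[string]string, ds ...Decorator) map[string]string {
	for _, d := range ds {
		tags = d.Decorate(tags)
	}
	return tags
}

func main() {
	// dcgm-style caller: only the pod decorator, no service lookup overhead
	gpuTags := chain(map[string]string{"GpuDevice": "nvidia0"}, podDecorator{})
	fmt.Println(gpuTags["PodName"], gpuTags["Service"] == "")
}
```

A cadvisor-style caller would simply pass `podDecorator{}, svcDecorator{}` to the same `chain`.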
}

promFactory := prometheusreceiver.NewFactory()
promReceiver, err := promFactory.CreateMetricsReceiver(opts.Ctx, params, &promConfig, &decoConsumer)
Can you explain what will happen if there is no DCGM exporter running on the node? How often does the scraper retry? Will we get a bunch of errors in the logs?
The scraper will print out an error message every minute if the dcgm exporter is not available. If this becomes a concern, we could probably check the instance type with the EnableGpuMetrics flag to decide whether to register the gpu scraper or not.
Yes, it will be very confusing to see those errors if we enable the scraper on a non-GPU node. Can the operator only set enable_gpu_metrics if it also deploys DCGM exporter on a node? Or can the agent do something, like only start up the Prometheus scraper if it sees a local DCGM exporter?
dcgm exporter is an independent daemonset that is not controlled by the operator. It gets started by a nodeAffinity rule to which we pass all NVIDIA GPU instance types.
I think this is an improvement that we should do as a fast follow. Options I can think of:
- Check the current node type (maybe IMDS?) similar to how dcgm exporter does with nodeAffinity
- Check if the dcgm exporter pod exists on the same node
Sorry you're right, I should say add-on, not operator.
Can you show what the log will look like on a non-GPU node?
I will say this is a blocker, it's going to affect every non-GPU customer. I like your idea of checking for DCGM exporter. Or can we do something quick to swallow the error message? (Although we don't want to lose it for troubleshooting purposes if there are real errors contacting DCGM exporter on a GPU node.)
Actually, it will print a debug log that looks like:
2024-03-01T14:06:52Z D! {"caller":"scrape/scrape.go:1351","msg":"Scrape failed","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics","scrape_pool":"containerInsightsDCGMExporterScraper","target":"http://dcgm-exporter-service.amazon-cloudwatch.svc:9400/metrics","error":"Get \"http://dcgm-exporter-service.amazon-cloudwatch.svc:9400/metrics\": context deadline exceeded"}
I think we can troubleshoot by running the agent in debug mode in case there is an issue. In other cases, the scraper won't clutter the logs.
As the long-term solution, we should not run the scraper at all on non-GPU instances. I will start work on the DCGM pod scanning option for the next phase.
OK that's great, debug line is fine. Agree on your plans for the next phase as well.
@@ -20,6 +20,7 @@ const (

PodStatus = "pod_status"
ContainerStatus = "container_status"
GpuDevice = "GpuDevice"
Very minor: move this up a few lines to the previous block of constants - it's an attribute not a metric name.
Could also rename it AttributeGpuDevice so it's clear without the comment.
I like renaming it to make it more intuitive.
// skip prometheus metadata metrics including "up"
if !strings.HasPrefix(metric.Name(), "DCGM") {
	continue
}
Hm, is there no way to filter these out at the scraper side? Otherwise they propagate down the OTel pipeline. Or does k8sapiserver have the same issue and we filter these out some other way?
I don't think there is a way. prometheus_scraper also gets those meta metrics:
- mock response: https://github.com/amazon-contributing/opentelemetry-collector-contrib/blob/aws-cwa-dev/receiver/awscontainerinsightreceiver/internal/k8sapiserver/prometheus_scraper_test.go#L27
- actual response even after relabeling (the only expected metric is http_connected_total):
===== up
===== scrape_duration_seconds
===== scrape_samples_scraped
===== scrape_samples_post_metric_relabeling
===== scrape_series_added
===== http_connected_total
Interesting, OK, I guess it gets filtered out some other way?
I should look deeper to see exactly at what point these metadata metrics get dropped.
mp.Wg.Wait()
mp.Wg.Wait()
Sorry, I didn't think to have this comment on aditya-purang#1, but can you add an assertion that the mockConsumer actually did get called? Otherwise we could be completely skipping validation and would have no idea. Maybe add a boolean flag to mockConsumer that it sets at the end of ConsumeMetrics().
I see that we're waiting for the scrapes, but then what? It doesn't look like we're verifying anything.
The metric consumer is passed as a parameter to the scraper. ConsumeMetrics has some assertions.
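The "add a boolean flag" suggestion from this thread could look roughly like the following. This is a hypothetical, simplified mock (no pmetric dependency, a plain WaitGroup standing in for the scrape loop); the real mockConsumer would also keep its metric assertions inside ConsumeMetrics.

```go
package main

import (
	"fmt"
	"sync"
)

// mockConsumer is a hypothetical test double: it records that
// ConsumeMetrics ran at all, so the test can fail loudly if the
// scraper never delivered anything (instead of silently passing).
type mockConsumer struct {
	mu     sync.Mutex
	called bool
}

func (mc *mockConsumer) ConsumeMetrics() error {
	// ...assertions on the received metrics would go here...
	mc.mu.Lock()
	mc.called = true
	mc.mu.Unlock()
	return nil
}

func (mc *mockConsumer) wasCalled() bool {
	mc.mu.Lock()
	defer mc.mu.Unlock()
	return mc.called
}

func main() {
	mc := &mockConsumer{}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // simulates the scrape loop delivering metrics
		defer wg.Done()
		_ = mc.ConsumeMetrics()
	}()
	wg.Wait()
	// the assertion the review asks for: verify the consumer ran
	fmt.Println(mc.wasCalled())
}
```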
logger.Warn("Unsupported metric type", zap.String("metric", m.Name()), zap.String("type", m.Type().String()))
}

// should support metrics with more than 1 datapoints?
Does this need to be answered or is this more of a TODO?
var result []*extractors.RawContainerInsightsMetric
for _, m := range metrics {
	// add tags for EKS
	if dc.containerOrchestrator == ci.EKS {
nit: Move this check out of the for-loop or even out of the function.
Shutdown() error
}

func (dc *decorateConsumer) decorateMetrics(metrics []*extractors.RawContainerInsightsMetric) []*extractors.RawContainerInsightsMetric {
Why are we returning a slice that we don't use?
type mockNextConsumer struct {
}

func (mc *mockNextConsumer) Capabilities() consumer.Capabilities {
	return consumer.Capabilities{
		MutatesData: true,
	}
}

func (mc *mockNextConsumer) ConsumeMetrics(_ context.Context, md pmetric.Metrics) error {
	return nil
}
nit: This isn't used.
type mockGpuMetric struct {
	tags   map[string]string
	fields map[string]any
}

func (m *mockGpuMetric) HasField(key string) bool {
	return m.fields[key] != nil
}

func (m *mockGpuMetric) AddField(key string, val any) {
	m.fields[key] = val
}

func (m *mockGpuMetric) GetField(key string) any {
	return m.fields[key]
}

func (m *mockGpuMetric) HasTag(key string) bool {
	return m.tags[key] != ""
}

func (m *mockGpuMetric) AddTag(key, val string) {
	m.tags[key] = val
}

func (m *mockGpuMetric) GetTag(key string) string {
	return m.tags[key]
}

func (m *mockGpuMetric) RemoveTag(key string) {
	delete(m.tags, key)
}
nit: This isn't used.
// copy down resource metrics only when it's missing at datapoint
for rtk, rtv := range resourceTags {
	if _, ok := tags[rtk]; !ok {
		tags[rtk] = rtv
	}
}
I think you mean "copy down resource attributes". Can you explain why we're doing this in the receiver?
Yeah, this doesn't really add any value. Removed.
If you are asking why we are adding attributes in the receiver: it was modeled similarly to CI metric generation. There is no strong reason this should be done at the receiver.
c.Logger.Debug(fmt.Sprintf("metric being merged has conflict in fields, src: %v, dest: %v \n", *src, *c))
c.Logger.Debug("metric being merged has conflict in fields", zap.String("src", src.ContainerName), zap.String("dest", c.ContainerName))
Are there meant to be two of these? Seems like a merge that had a conflict.
not sure how I missed this...
func NewRawContainerInsightsMetricWithData(mType string, fields map[string]any, tags map[string]string, logger *zap.Logger) *RawContainerInsightsMetric {
	metric := NewRawContainerInsightsMetric(mType, logger)
	metric.Fields = fields
	metric.Tags = tags
Is this meant to overwrite the ci.MetricType that was inserted into the Tags in the NewRawContainerInsightsMetric function? If not, then maybe you can just call the AddTag/AddField functions for the passed-in maps.
Yes, the type in the tags parameter takes precedence. The existing CI flow won't use this function in the first place, just FYI.
Then is there a point in passing the mType in? It's just going to get overwritten by the tags.
I see your point. It is just to follow the convention of the existing initializer without data params.
Okay, I think it's just a bit misleading and a bit of unexpected behavior. Can we at least document it?
// NewRawContainerInsightsMetricWithData sets the fields and tags on the metric. Discards the mType.
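To make the precedence behavior discussed here concrete, a simplified, self-contained sketch (the real struct lives in the stores package and takes a zap logger; `metricType` here is a hypothetical stand-in for the ci.MetricType tag key). It shows that a Type key in the passed-in tags overwrites the mType argument.

```go
package main

import "fmt"

// Simplified stand-in for RawContainerInsightsMetric; hypothetical,
// without the zap.Logger the real struct carries.
type RawContainerInsightsMetric struct {
	Fields map[string]any
	Tags   map[string]string
}

// metricType stands in for the ci.MetricType tag key.
const metricType = "Type"

func NewRawContainerInsightsMetric(mType string) *RawContainerInsightsMetric {
	return &RawContainerInsightsMetric{
		Fields: map[string]any{},
		Tags:   map[string]string{metricType: mType},
	}
}

// NewRawContainerInsightsMetricWithData sets the fields and tags on the
// metric. Note: a Type key in tags overwrites the passed-in mType.
func NewRawContainerInsightsMetricWithData(mType string, fields map[string]any, tags map[string]string) *RawContainerInsightsMetric {
	m := NewRawContainerInsightsMetric(mType)
	m.Fields = fields
	m.Tags = tags
	return m
}

func main() {
	m := NewRawContainerInsightsMetricWithData("NodeGPU",
		map[string]any{"gpu_util": 35},
		map[string]string{metricType: "PodGPU"})
	fmt.Println(m.Tags[metricType]) // tags take precedence over mType
}
```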
})
converted = append(converted, FieldsAndTagsPair{
	Fields: map[string]any{
		m.Name(): nil, // metric value not needed for attribute decoration
If we're going to have this limitation (plus only supporting gauge and sum), then I think just move this helper method to the gpu decorator (it's only used from there right now anyway). Having it here in this common place implies that it should work for any OTLP metric.
Commits:
- update dcgm relabeling rules
- use constant variables
- use the same scrape configs in dcgm scraper test
- remove unnecessary attribute decoration for GPU metrics
- add dcgm as source for dim
- support multiple datapoints when converting pmetric into fields/tags
- move RawContainerInsightsMetric struct to stores package
- update feature toggle flag var name to match json key
lgtm
Merged d3bf111 into amazon-contributing:aws-cwa-dev
Revisions:
- Add Attribute prefix
- Move RawContainerInsightsMetric struct to stores package
- Rename gpu_metrics to accelerated_computing_metrics
- RawContainerInsights
- Add dcgm as a new source for Sources attribute

Description:
Testing: Tested on a test cluster with 1 GPU node (with 4 GPUs) + 1 regular node; only 3 GPUs get workload.