
Add dcgm scraper to collect nvidia GPU metrics #160

Merged · 17 commits · Mar 1, 2024

Conversation

movence

@movence movence commented Feb 14, 2024

Revisions

  • rev3:
    • rename k8s attribute vars with Attribute prefix
    • support multiple datapoints when converting pmetric into fields/tags
    • move RawContainerInsightsMetric struct to stores package
    • update feature toggle flag var name to match json key
  • rev2:
    • update feature toggle flag from gpu_metrics to accelerated_computing_metrics
    • consolidate CI metrics structs into single RawContainerInsights
    • use constant variables
    • use the same scrape configs in dcgm scraper test
    • remove unnecessary attribute decoration for GPU metrics
    • add dcgm as a new source for Sources attribute
  • rev1:
    • Add GPU metric consumer which intercepts metrics to decorate with k8s attributes
    • Update k8s decorator to use CIMetric interface rather than cadvisor

Description:

  • Add a new prometheus scraper to scrape NVIDIA GPU metrics from dcgm-exporter under container insights receiver
  • Add a feature flag for GPU data collection
  • Move scraper mock for testing out of k8sapi package to share with other scrapers
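As a rough illustration of the feature-flag gating described above, here is a dependency-free sketch. The `Config` struct, `scrapersFor` helper, and scraper name list are hypothetical; only the `accelerated_computing_metrics` JSON key comes from this PR.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the container insights receiver
// config; the real type lives in awscontainerinsightreceiver.
type Config struct {
	// AcceleratedComputingMetrics mirrors the accelerated_computing_metrics
	// JSON key used as the GPU feature toggle in this PR.
	AcceleratedComputingMetrics bool `json:"accelerated_computing_metrics"`
}

// scrapersFor registers the dcgm scraper only when the feature flag is on.
func scrapersFor(cfg Config) []string {
	scrapers := []string{"cadvisor", "k8sapiserver"}
	if cfg.AcceleratedComputingMetrics {
		scrapers = append(scrapers, "dcgm")
	}
	return scrapers
}

func main() {
	fmt.Println(scrapersFor(Config{AcceleratedComputingMetrics: true}))
}
```

With the flag off, the scraper list is unchanged, so non-GPU users see no behavioral difference.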

Testing:
Tested on a test cluster with 1 GPU node (4 GPUs) + 1 regular node; only 3 of the GPUs run workloads.

  • running pods
amazon-cloudwatch  cloudwatch-agent-52jfj       1/1  Running  0  x.x.x.x  ip-x.x.x.x.us-west
amazon-cloudwatch  cloudwatch-agent-gr4kr       1/1  Running  0  x.x.x.x  ip-x.x.x.x.us-west
amazon-cloudwatch  dcgm-exporter-md5p4          1/1  Running  0  x.x.x.x  ip-x.x.x.x.us-west
amazon-cloudwatch  gpu-burn-1-5b78d96f7b-w8kgf  1/1  Running  0  x.x.x.x  ip-x.x.x.x.us-west

(Screenshot 2024-02-14 at 11:33:19 AM)

@movence movence requested a review from mxiamxia as a code owner February 14, 2024 16:41
Resolved review threads (outdated):

  • exporter/awsemfexporter/config.go
  • exporter/awsemfexporter/metric_translator.go
  • exporter/awsemfexporter/metric_translator_test.go
  • receiver/awscontainerinsightreceiver/receiver.go
jefchien
jefchien previously approved these changes Feb 16, 2024
mitali-salvi
mitali-salvi previously approved these changes Feb 16, 2024
@movence movence dismissed stale reviews from mitali-salvi and jefchien via 1444acd February 26, 2024 20:54
},
},
RelabelConfigs: []*relabel.Config{
{
Might be a little late, but why exactly do we need this? Is it used anywhere in the agent in the end?

movence (Author):

Some rules need to be removed. Thanks!

@@ -151,6 +151,10 @@ const (
// Special type for pause container
// because containerd does not set container name pause container name to POD like docker does.
TypeInfraContainer = "InfraContainer"
TypeGpuContainer = "ContainerGPU"
TypeGpuPod = "PodGPU"
TypeGpuNode = "NodeGPU"

TypeGpuNode and TypeGpuCluster are not used anywhere; do we need them?

movence (Author), Feb 27, 2024:

You are right that they are not currently being used. The intention is to use them for GPU count metrics.

// Copyright The OpenTelemetry Authors
// SPDX-License-Identifier: Apache-2.0

package gpu
sam6134, Feb 27, 2024:

Can we make this decorator generic? I get that right now it's only for GPU, but if we make it configurable (i.e., it adds units based on a provided map), then adding the K8s labels seems reusable, say, in Neuron monitor.

Strongly agree with this.

}

// value is not needed for label decoration
fields[m.Name()] = 0


nit: Could set it to nil to be more explicit that it's not going to be used.

Is this the only field?

Resolved review threads (outdated): internal/aws/containerinsight/utils.go (two threads)
Comment on lines 187 to 192
logger.Warn("Unsupported metric type", zap.String("metric", m.Name()), zap.String("type", m.Type().String()))
}

// should support metrics with more than 1 datapoints?
if dps.Len() > 1 {
logger.Warn("Metric with more than 1 datapoint is not supported", zap.String("metric", m.Name()), zap.Int("datapoints", dps.Len()))


How would we ever know if we're hitting these in production? Do we have a guarantee that DcgmScraper is only producing metrics in your supported format? How do you know it won't change? (Not to mention, by putting this method in a common utils package, you're implying that any OTLP metrics producer should be able to call it...)

If we have to do this ConvertToFieldsAndTags, then probably you should just handle all cases. Return an array if you have to?

Long term, it kind of seems like the decorator should become a processor and operate on OTLP metrics directly, so we don't have to do this double-conversion back and forth (OTLP -> fields/tags -> OTLP)... does that sound right?

movence (Author), Feb 27, 2024:

Makes sense. There is no guarantee that it won't change. I will update to handle all cases.

It was done this way to leverage the existing k8s decorator, but I agree we can skip those conversions and make it more like a processor that selectively decorates GPU-specific attributes without pulling in all the unnecessary attributes from k8sdecorator. This could be a future improvement.
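The follow-up ("support multiple datapoints when converting pmetric into fields/tags" in rev3) can be sketched with plain types. This is a hypothetical, dependency-free stand-in: the real helper iterates `pmetric.Metric` datapoints from the OTel collector, and the `datapoint` struct here is invented for illustration.

```go
package main

import "fmt"

// datapoint is a simplified stand-in for pmetric.NumberDataPoint.
type datapoint struct {
	value      float64
	attributes map[string]string
}

// FieldsAndTagsPair mirrors the pair type discussed in this review.
type FieldsAndTagsPair struct {
	Fields map[string]any
	Tags   map[string]string
}

// convertToFieldsAndTags emits one fields/tags pair per datapoint instead of
// warning on metrics with more than one datapoint.
func convertToFieldsAndTags(name string, dps []datapoint) []FieldsAndTagsPair {
	var converted []FieldsAndTagsPair
	for _, dp := range dps {
		tags := make(map[string]string, len(dp.attributes))
		for k, v := range dp.attributes {
			tags[k] = v
		}
		converted = append(converted, FieldsAndTagsPair{
			// metric value is not needed for attribute decoration
			Fields: map[string]any{name: nil},
			Tags:   tags,
		})
	}
	return converted
}

func main() {
	pairs := convertToFieldsAndTags("DCGM_FI_DEV_GPU_UTIL", []datapoint{
		{value: 83, attributes: map[string]string{"PodName": "gpu-burn-1"}},
		{value: 10, attributes: map[string]string{"PodName": "gpu-burn-2"}},
	})
	fmt.Println(len(pairs))
}
```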

Missed making this update?

Comment on lines +116 to +53
resourceTags := make(map[string]string)
rms := md.ResourceMetrics()
for i := 0; i < rms.Len(); i++ {

Probably better to move this entire loop into the ConvertToFieldsAndTags() method. It would be the inverse of the existing ConvertToOTLPMetrics() method. Take in pmetrics.Metrics and return an array of fields/tags pairs (or wrap in struct).

Then here in decoratorConsumer, iterate through again and call the decorator.

movence (Author), Feb 28, 2024:

That was my original idea as well: make it the inverse of ConvertToOTLPMetrics. However, existing CI metrics are populated on the fly, which makes it natural to start from a fields/tags set where tags get applied at the resource level. For GPU, the OTLP metrics are already constructed, and they just need to be decorated at the datapoint level by leveraging the k8sdecorator, which consumes fields/tags. With this approach, we can avoid the last leg of the conversion process (OTLP → fields/tags → OTLP).

Oh, interesting point, we are only using K8sDecorator for tags in this case, we don't need the fields at all. Hmm... seems like we could have done some additional refactoring then... separate K8sDecorator into a series of sub-decorators - cadvisor would use all but dcgm would only use the couple that it needs. I won't push on it now though.

Also worth noting that Cadvisor/K8sDecorator only work this way because they were ported from the original CWAgent. If we were writing from scratch, we would probably operate on OTLP metrics directly.

movence (Author):
Ah, that makes sense. I was wondering why the existing decorator doesn't just start with OTLP metrics.

Breaking up k8sdecorator will be an interesting one. It currently depends on two stores (podstore and servicestore) for decoration; podstore is the main one, with most of the data (such as node and container) available at the pod level. Making these pluggable for different needs would be nice and would reduce unnecessary overhead and processing.

}

promFactory := prometheusreceiver.NewFactory()
promReceiver, err := promFactory.CreateMetricsReceiver(opts.Ctx, params, &promConfig, &decoConsumer)


Can you explain what will happen if there is no DCGM exporter running on the node? How often does the scraper retry? Will we get a bunch of errors in the logs?

movence (Author):

The scraper will print an error message every minute if the dcgm exporter is not available. If this becomes a concern, we could check the instance type along with the EnableGpuMetrics flag to decide whether to register the GPU scraper or not.


Yes, it will be very confusing to see those errors if we enable the scraper on a non-GPU node. Can the operator only set enable_gpu_metrics if it also deploys DCGM exporter on a node? Or can the agent do something, like only start up the Prometheus scraper if it sees a local DCGM exporter?

movence (Author):

dcgm exporter is an independent daemonset that is not controlled by the operator. It gets scheduled via a nodeAffinity rule to which we pass all NVIDIA GPU instance types.
I think this is an improvement we should do as a fast follow. Options I can think of:

  • Check the current node type (maybe IMDS?) similar to how dcgm exporter does with nodeAffinity
  • Check if dcgm exporter pod exists on the same node
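The first option could be sketched roughly like this. Everything here is an illustrative assumption: the instance-family list is incomplete, `isGPUInstanceType` is a hypothetical helper, and the real agent would fetch the instance type from IMDS rather than take it as an argument.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuInstanceFamilies is an illustrative (not exhaustive) list of NVIDIA GPU
// EC2 instance families, similar to what dcgm exporter's nodeAffinity uses.
var gpuInstanceFamilies = []string{"p2", "p3", "p4", "p5", "g3", "g4", "g5"}

// isGPUInstanceType reports whether the scraper should be registered for the
// given instance type (e.g. "g4dn.xlarge").
func isGPUInstanceType(instanceType string) bool {
	for _, family := range gpuInstanceFamilies {
		if strings.HasPrefix(instanceType, family) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isGPUInstanceType("g4dn.xlarge"), isGPUInstanceType("m5.large"))
}
```

Skipping registration entirely on non-GPU nodes would avoid the repeated scrape-failure logs discussed below.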


Sorry you're right, I should say add-on, not operator.

Can you show what the log will look like on a non-GPU node?

I will say this is a blocker, it's going to affect every non-GPU customer. I like your idea of checking for DCGM exporter. Or can we do something quick to swallow the error message? (Although we don't want to lose it for troubleshooting purposes if there are real errors contacting DCGM exporter on a GPU node.)

movence (Author), Mar 1, 2024:

Actually, it will print a debug log that looks like:

 2024-03-01T14:06:52Z D! {"caller":"scrape/scrape.go:1351","msg":"Scrape failed","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics","scrape_pool":"containerInsightsDCGMExporterScraper","target":"http://dcgm-exporter-service.amazon-cloudwatch.svc:9400/metrics","error":"Get \"http://dcgm-exporter-service.amazon-cloudwatch.svc:9400/metrics\": context deadline exceeded"}

I think we can troubleshoot by running the agent in debug mode if there is an issue. Otherwise, the scraper won't clutter the logs.

As a long-term solution, we should not run the scraper at all on non-GPU instances. I will start working on the DCGM pod scanning option for the next phase.


OK that's great, debug line is fine. Agree on your plans for the next phase as well.

@@ -20,6 +20,7 @@ const (

PodStatus = "pod_status"
ContainerStatus = "container_status"
GpuDevice = "GpuDevice"

Very minor: move this up a few lines to the previous block of constants - it's an attribute not a metric name.


Could also rename it AttributeGpuDevice so it's clear without the comment.

movence (Author):
I like renaming it to make it more intuitive.

Resolved review thread (outdated): receiver/awscontainerinsightreceiver/config.go
Comment on lines +84 to +90
// skip prometheus metadata metrics including "up"
if !strings.HasPrefix(metric.Name(), "DCGM") {
continue
}
straussb, Feb 29, 2024:

Hm, is there no way to filter these out at the scraper side? Otherwise they propagate down the OTel pipeline. Or does k8sapiserver have the same issue and we filter these out some other way?

movence (Author):

I don't think there is a way. prometheus_scraper also gets those meta metrics:

// only expected metric is `http_connected_total`
===== up
===== scrape_duration_seconds
===== scrape_samples_scraped
===== scrape_samples_post_metric_relabeling
===== scrape_series_added
===== http_connected_total
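The consumer-side filter from the diff above, reduced to a dependency-free sketch over metric names (the real code iterates `pmetric.Metrics` from the prometheus receiver; `filterDCGM` is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"strings"
)

// filterDCGM drops prometheus metadata series ("up", "scrape_*", etc.) and
// keeps only series whose names carry the DCGM prefix, matching the
// strings.HasPrefix check in the diff.
func filterDCGM(names []string) []string {
	var kept []string
	for _, name := range names {
		if strings.HasPrefix(name, "DCGM") {
			kept = append(kept, name)
		}
	}
	return kept
}

func main() {
	fmt.Println(filterDCGM([]string{
		"up", "scrape_duration_seconds", "DCGM_FI_DEV_GPU_UTIL",
	}))
}
```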


Interesting, OK, I guess it gets filtered out some other way?

movence (Author):

I should look deeper to see exactly at what point these metadata metrics get dropped.

Comment on lines +253 to +257
mp.Wg.Wait()
mp.Wg.Wait()

Sorry, didn't think to have this comment on aditya-purang#1 , but can you add an assertion that the mockConsumer actually did get called? Otherwise we could be completely skipping validation and would have no idea. Maybe add a boolean flag to mockConsumer that it sets at the end of ConsumeMetrics().

I see that we're waiting for the scrapes, but then what? It doesn't look like we're verifying anything.

movence (Author):

The metric consumer is passed as a parameter to the scraper. ConsumeMetrics has some assertions.
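The reviewer's suggestion of a "called" flag could look like this; `mockConsumer` and its signature are simplified stand-ins for the real mock, which implements the collector's `consumer.Metrics` interface and receives `pmetric.Metrics`.

```go
package main

import "fmt"

// mockConsumer records whether ConsumeMetrics was ever invoked, so the test
// can assert that the scrape actually reached the consumer.
type mockConsumer struct {
	called bool
}

// ConsumeMetrics would hold the test's assertions on the scraped metrics;
// here it only flips the flag.
func (mc *mockConsumer) ConsumeMetrics(metricNames []string) error {
	// ...assertions on the scraped metrics would go here...
	mc.called = true
	return nil
}

func main() {
	mc := &mockConsumer{}
	_ = mc.ConsumeMetrics([]string{"DCGM_FI_DEV_GPU_UTIL"})
	fmt.Println(mc.called)
}
```

The test would then end with something like `assert.True(t, consumer.called)` after `mp.Wg.Wait()`, so a scrape that silently never delivers metrics fails instead of passing vacuously.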

logger.Warn("Unsupported metric type", zap.String("metric", m.Name()), zap.String("type", m.Type().String()))
}

// should support metrics with more than 1 datapoints?

Does this need to be answered or is this more of a TODO?

var result []*extractors.RawContainerInsightsMetric
for _, m := range metrics {
// add tags for EKS
if dc.containerOrchestrator == ci.EKS {

nit: Move this check out of the for-loop or even out of the function.

Shutdown() error
}

func (dc *decorateConsumer) decorateMetrics(metrics []*extractors.RawContainerInsightsMetric) []*extractors.RawContainerInsightsMetric {

Why are we returning a slice that we don't use?

Comment on lines 66 to 45
type mockNextConsumer struct {
}

func (mc *mockNextConsumer) Capabilities() consumer.Capabilities {
return consumer.Capabilities{
MutatesData: true,
}
}

func (mc *mockNextConsumer) ConsumeMetrics(_ context.Context, md pmetric.Metrics) error {
return nil
}

nit: This isn't used.

Comment on lines 22 to 53
type mockGpuMetric struct {
tags map[string]string
fields map[string]any
}

func (m *mockGpuMetric) HasField(key string) bool {
return m.fields[key] != nil
}

func (m *mockGpuMetric) AddField(key string, val any) {
m.fields[key] = val
}

func (m *mockGpuMetric) GetField(key string) any {
return m.fields[key]
}

func (m *mockGpuMetric) HasTag(key string) bool {
return m.tags[key] != ""
}

func (m *mockGpuMetric) AddTag(key, val string) {
m.tags[key] = val
}

func (m *mockGpuMetric) GetTag(key string) string {
return m.tags[key]
}

func (m *mockGpuMetric) RemoveTag(key string) {
delete(m.tags, key)
}

nit: This isn't used.

Comment on lines 66 to 74
// copy down resource metrics only when it's missing at datapoint
for rtk, rtv := range resourceTags {
if _, ok := tags[rtk]; !ok {
tags[rtk] = rtv
}
}

I think you mean "copy down resource attributes". Can you explain why we're doing this in the receiver?

movence (Author):
Yeah, this doesn't really add any value. Removed.

If you are asking why we add attributes in the receiver: it was modeled on the existing CI metric generation. There is no strong reason this has to be done in the receiver.

Comment on lines +261 to +262
c.Logger.Debug(fmt.Sprintf("metric being merged has conflict in fields, src: %v, dest: %v \n", *src, *c))
c.Logger.Debug("metric being merged has conflict in fields", zap.String("src", src.ContainerName), zap.String("dest", c.ContainerName))
Are there meant to be two of these? Seems like a merge that had a conflict.

movence (Author):
Not sure how I missed this...

func NewRawContainerInsightsMetricWithData(mType string, fields map[string]any, tags map[string]string, logger *zap.Logger) *RawContainerInsightsMetric {
metric := NewRawContainerInsightsMetric(mType, logger)
metric.Fields = fields
metric.Tags = tags
Is this meant to overwrite the ci.MetricType that was inserted into the Tags in the NewRawContainerInsightsMetric function? If not, then maybe you can just call the AddTag/AddField functions for the passed in maps.

movence (Author), Mar 1, 2024:

Yes, the type in the tags parameter takes precedence. FYI, the existing CI flow won't use this function in the first place.


Then is there a point in passing the mType in? It's just going to get overwritten by the tags.

movence (Author), Mar 1, 2024:

I see your point. It just follows the convention of the existing initializer that takes no data params.

Okay I think it's just a bit misleading and a bit of unexpected behavior. Can we at least document?

// NewRawContainerInsightsMetricWithData sets the fields and tags on the metric. Discards the mType.
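A trimmed-down sketch of the precedence behavior under discussion, with the suggested doc comment applied. This is a hypothetical simplification of the stores package constructor; the real type and its tag key names may differ.

```go
package main

import "fmt"

// RawContainerInsightsMetric is a simplified stand-in for the stores type.
type RawContainerInsightsMetric struct {
	Fields map[string]any
	Tags   map[string]string
}

// metricTypeTag is an assumed tag key for the CI metric type.
const metricTypeTag = "Type"

// NewRawContainerInsightsMetricWithData sets the fields and tags on the
// metric. A metricTypeTag key inside tags overwrites mType.
func NewRawContainerInsightsMetricWithData(mType string, fields map[string]any, tags map[string]string) *RawContainerInsightsMetric {
	m := &RawContainerInsightsMetric{
		Fields: map[string]any{},
		Tags:   map[string]string{metricTypeTag: mType},
	}
	for k, v := range fields {
		m.Fields[k] = v
	}
	for k, v := range tags {
		// tags take precedence over the mType default set above
		m.Tags[k] = v
	}
	return m
}

func main() {
	m := NewRawContainerInsightsMetricWithData("ContainerGPU",
		map[string]any{"gpu_utilization": nil},
		map[string]string{metricTypeTag: "PodGPU"})
	fmt.Println(m.Tags[metricTypeTag])
}
```

Documenting the overwrite (or dropping the mType parameter entirely) makes the unexpected behavior visible at the call site.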

})
converted = append(converted, FieldsAndTagsPair{
Fields: map[string]any{
m.Name(): nil, // metric value not needed for attribute decoration
If we're going to have this limitation (plus only supporting gauge and sum), then I think just move this helper method to the gpu decorator (it's only used from there right now anyway). Having it here in this common place implies that it should work for any OTLP metric.

jefchien
jefchien previously approved these changes Mar 1, 2024
straussb
straussb previously approved these changes Mar 1, 2024
movence added 9 commits March 1, 2024 13:56
use constant variables
use the same scrape configs in dcgm scraper test
remove unnecessary attribute decoration for GPU metrics
add dcgm as source for dim
support multiple datapoints when converting pmetric into fields/tags
move RawContainerInsightsMetric struct to stores package
update feature toggle flag var name to match json key
@movence movence dismissed stale reviews from straussb and jefchien via 6a2de95 March 1, 2024 19:36
jefchien
jefchien previously approved these changes Mar 1, 2024
Paramadon
Paramadon previously approved these changes Mar 1, 2024
Copy link

@Paramadon Paramadon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

lisguo
lisguo previously approved these changes Mar 1, 2024
@movence movence dismissed stale reviews from lisguo, Paramadon, and jefchien via 852fff7 March 1, 2024 20:14
@movence movence merged commit d3bf111 into amazon-contributing:aws-cwa-dev Mar 1, 2024
43 of 53 checks passed
7 participants