Ci neuron #1

Closed
wants to merge 35 commits into from
Changes from 3 commits
Commits
8920742
add dcgm exporter scraper and move prometheus scraper test mock to mo…
movence Feb 2, 2024
e0f5bcd
update emf exporter to handle GPU metrics with different metric types
movence Feb 13, 2024
51fe859
remove custom logic in emf exporter
movence Feb 14, 2024
cee2244
update gpu flag comment
movence Feb 14, 2024
3debd28
remove comments and test codes
movence Feb 14, 2024
3d9de49
add neuron monitor scraper
sam6134 Feb 15, 2024
9e70069
remove unused codes and rename scraper init funcs
movence Feb 16, 2024
b9a0e03
remove comments
movence Feb 16, 2024
52cd972
add changelog for gpu
movence Feb 16, 2024
eeb90e2
Merge branch 'ci-nvidia-gpu' into ci-neuron
sam6134 Feb 19, 2024
1747a50
Update Scraper for new metrics
sam6134 Feb 20, 2024
4f0e3e1
Make Neuron Scraper extension for simple prometheus scraper
sam6134 Feb 23, 2024
c95f590
Minor fixes
sam6134 Feb 23, 2024
609198b
EnableFlag default to false
sam6134 Feb 23, 2024
1444acd
add gpu metric consumer that uses k8s decorator for attributes
movence Feb 26, 2024
a89378e
Merge branch 'ci-nvidia-gpu' into ci-neuron
sam6134 Feb 27, 2024
d2c417d
testing support
sam6134 Mar 1, 2024
00a12dc
debugging pod resources store
aditya-purang Mar 1, 2024
d3bf111
Add dcgm scraper to collect nvidia GPU metrics (#160)
movence Mar 1, 2024
622200a
[internal/aws/proxy] Fix proxy server unit test (#177)
jefchien Mar 1, 2024
9cb314e
Adding default TLS to dcgmscraper (#178)
okankoAMZ Mar 1, 2024
a821803
add podresource scrapper and metric data printer
aditya-purang Mar 4, 2024
83896ab
refactor logMd
aditya-purang Mar 4, 2024
69969dd
Merge branch 'ci-nvidia-gpu' into ci-neuron
sam6134 Mar 4, 2024
3267653
Merge conflicts
sam6134 Mar 4, 2024
164bd84
More cleanups
sam6134 Mar 4, 2024
19223b1
Remove unused imports
sam6134 Mar 4, 2024
c65ad64
Add decorator to neuron scraper
sam6134 Mar 4, 2024
1f60d15
Merge branch 'ci-neuron' into docker-testing
sam6134 Mar 4, 2024
c6966db
Add decorator to add podResources
sam6134 Mar 5, 2024
05b1c75
Unified the decorator and added podResources decorator
sam6134 Mar 5, 2024
76e05aa
Minor fixes
sam6134 Mar 5, 2024
9e2f849
remove unused file
sam6134 Mar 5, 2024
3168bb2
Making Dcgm implement SimplePrometheusScraper
sam6134 Mar 6, 2024
0c8eac2
Merge branch 'ci-neuron' into ci-neuron
sam6134 Mar 6, 2024
4 changes: 2 additions & 2 deletions receiver/awscontainerinsightreceiver/config.go
@@ -61,7 +61,7 @@ type Config struct {
 	// EnableGpuMetric toggles GPU monitoring where metrics are scraped from vendor specific sources
 	EnableGpuMetric bool `mapstructure:"gpu_metrics"`

-	// EnableNeuronMetric disables Neuron monitoring where metrics are scraped from vendor specific sources
-	// The default value is true meaning Neuron metrics get collected out of the box unless it's disabled
+	// EnableNeuronMetric toggles Neuron monitoring where metrics are scraped from neuron monitor
+	// The default value is false.
 	EnableNeuronMetric bool `mapstructure:"neuron_metrics"`
 }
4 changes: 4 additions & 0 deletions receiver/awscontainerinsightreceiver/factory.go
@@ -42,6 +42,9 @@ const (

 	// Don't enable EKS control plane metrics by default
 	defaultEnableControlPlaneMetrics = false
+
+	// Don't enable Neuron metrics by default
+	defaultEnableNeuronMetrics = false
 )

 // NewFactory creates a factory for AWS container insight receiver
@@ -64,6 +67,7 @@ func createDefaultConfig() component.Config {
 		ClusterName:               defaultClusterName,
 		LeaderLockName:            defaultLeaderLockName,
 		EnableControlPlaneMetrics: defaultEnableControlPlaneMetrics,
+		EnableNeuronMetric:        defaultEnableNeuronMetrics,
 	}
 }

This file was deleted.

@@ -0,0 +1,100 @@
// Copyright The OpenTelemetry Authors
// SPDX-License-Identifier: Apache-2.0

package nueron

import (
	"time"

	"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscontainerinsightreceiver/internal/prometheusscraper"
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/config"
	"github.com/prometheus/prometheus/discovery"
	"github.com/prometheus/prometheus/discovery/kubernetes"
	"github.com/prometheus/prometheus/model/relabel"
)

const (
	caFile             = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
	collectionInterval = 60 * time.Second
	jobName            = "containerInsightsNeuronMonitorScraper"
)

func GetNueronScrapeConfig(opts prometheusscraper.SimplePromethuesScraperOpts) *config.ScrapeConfig {

typo "Nueron" - same on GetNueronMetricRelabelConfigs below and in the package name itself.


	return &config.ScrapeConfig{
		ScrapeInterval: model.Duration(collectionInterval),
		ScrapeTimeout:  model.Duration(collectionInterval),
		JobName:        jobName,
		Scheme:         "http",
		MetricsPath:    "/metrics",
		ServiceDiscoveryConfigs: discovery.Configs{
			&kubernetes.SDConfig{
				Role: kubernetes.RoleService,
				NamespaceDiscovery: kubernetes.NamespaceDiscovery{
					IncludeOwnNamespace: true,
				},
				Selectors: []kubernetes.SelectorConfig{
					{
						Role:  kubernetes.RoleService,
						Label: "k8s-app=neuron-monitor-service",
					},
				},
				AttachMetadata: kubernetes.AttachMetadataConfig{
					Node: true,
				},

Sorry didn't notice this before, but do you need this? What effect does it have?

			},
		},
		RelabelConfigs: []*relabel.Config{
			{
				SourceLabels: model.LabelNames{"__address__"},
				Regex:        relabel.MustNewRegexp("([^:]+)(?::\\d+)?"),
				Replacement:  "${1}:8000",
				TargetLabel:  "__address__",
				Action:       relabel.Replace,
			},
		},
		MetricRelabelConfigs: GetNueronMetricRelabelConfigs(opts),
	}
}
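
The `__address__` relabel rule above can be tried out in isolation. The following is a minimal sketch that uses only Go's standard `regexp` package instead of the Prometheus `relabel` package. `rewriteAddress` is a hypothetical helper and the sample addresses are made up. It assumes that Prometheus anchors relabel regexes at both ends, which the sketch reproduces by hand with `^(?:...)$`.

```go
package main

import (
	"fmt"
	"regexp"
)

// addressRe is the pattern from the diff, anchored by hand to approximate
// how a relabel regex is matched against the whole label value (assumption).
var addressRe = regexp.MustCompile(`^(?:([^:]+)(?::\d+)?)$`)

// rewriteAddress is a hypothetical helper that mimics the "replace" action:
// it keeps the host part and forces the port to 8000. Values that do not
// match the pattern are returned unchanged.
func rewriteAddress(addr string) string {
	idx := addressRe.FindStringSubmatchIndex(addr)
	if idx == nil {
		return addr
	}
	return string(addressRe.ExpandString(nil, "${1}:8000", addr, idx))
}

func main() {
	fmt.Println(rewriteAddress("10.0.0.5:9100"))          // 10.0.0.5:8000
	fmt.Println(rewriteAddress("neuron-monitor-service")) // neuron-monitor-service:8000
	fmt.Println(rewriteAddress("[::1]:9100"))             // [::1]:9100 (no match, unchanged)
}
```

The last call shows a limitation of the pattern as written: a bracketed IPv6 address contains extra colons, so the rule does not match and the address keeps its original port.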

func GetNueronMetricRelabelConfigs(opts prometheusscraper.SimplePromethuesScraperOpts) []*relabel.Config {

	return []*relabel.Config{
		{
			SourceLabels: model.LabelNames{"__name__"},
			Regex:        relabel.MustNewRegexp("neuron.*|system_.*|execution_.*"),
			Action:       relabel.Keep,
		},
		{
			SourceLabels: model.LabelNames{"instance_name"},
			TargetLabel:  "NodeName",
			Regex:        relabel.MustNewRegexp("(.*)"),
			Replacement:  "${1}",
			Action:       relabel.Replace,
		},
		{
			SourceLabels: model.LabelNames{"instance_id"},
			TargetLabel:  "InstanceId",
			Regex:        relabel.MustNewRegexp("(.*)"),
			Replacement:  "${1}",
			Action:       relabel.Replace,
		},
		{
			SourceLabels: model.LabelNames{"neuroncore"},
			TargetLabel:  "DeviceId",
			Regex:        relabel.MustNewRegexp("(.*)"),
			Replacement:  "${1}",
			Action:       relabel.Replace,
		},
		// hacky way to inject static values (clusterName) to label set without additional processor
		// relabel looks up an existing label then creates another label with given key (TargetLabel) and value (static)
		{
			SourceLabels: model.LabelNames{"instance_id"},
			TargetLabel:  "ClusterName",
			Regex:        relabel.MustNewRegexp("(.*)"),
			Replacement:  opts.HostInfoProvider.GetClusterName(),
			Action:       relabel.Replace,
		},
	}
}
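
To see what the metric relabel rules above do to a scraped series, here is a rough, standard-library-only simulation. It is not the Prometheus `relabel` package: `rule`, `anchored`, `neuronRules`, and `apply` are hypothetical names, the metric names and label values are illustrative, and `"my-cluster"` stands in for whatever `GetClusterName()` returns. The sketch assumes full anchoring of the regex, that a missing source label reads as an empty string, and that a replace producing an empty value removes the target label.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical, simplified stand-in for a Prometheus relabel rule.
type rule struct {
	source      string         // label whose value is matched
	target      string         // label to write (ignored for keep rules)
	regex       *regexp.Regexp // fully anchored pattern
	replacement string         // template expanded against the match
	keep        bool           // true for "keep", false for "replace"
}

// anchored compiles p so that it must match the whole label value.
func anchored(p string) *regexp.Regexp {
	return regexp.MustCompile("^(?:" + p + ")$")
}

// neuronRules mirrors the shape of the rules in the diff above.
func neuronRules(clusterName string) []rule {
	all := anchored("(.*)")
	return []rule{
		{source: "__name__", regex: anchored("neuron.*|system_.*|execution_.*"), keep: true},
		{source: "instance_name", target: "NodeName", regex: all, replacement: "${1}"},
		{source: "instance_id", target: "InstanceId", regex: all, replacement: "${1}"},
		{source: "neuroncore", target: "DeviceId", regex: all, replacement: "${1}"},
		// Static injection: the match is ignored, the replacement is a constant.
		{source: "instance_id", target: "ClusterName", regex: all, replacement: clusterName},
	}
}

// apply runs the rules in order on a copy of labels. The second return value
// is false when a keep rule drops the series.
func apply(labels map[string]string, rules []rule) (map[string]string, bool) {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		out[k] = v
	}
	for _, r := range rules {
		val := out[r.source] // a missing label reads as ""
		idx := r.regex.FindStringSubmatchIndex(val)
		if r.keep {
			if idx == nil {
				return nil, false
			}
			continue
		}
		if idx == nil {
			continue
		}
		res := string(r.regex.ExpandString(nil, r.replacement, val, idx))
		if res == "" {
			delete(out, r.target) // empty result means no target label
			continue
		}
		out[r.target] = res
	}
	return out, true
}

func main() {
	rules := neuronRules("my-cluster")

	_, kept := apply(map[string]string{"__name__": "go_goroutines"}, rules)
	fmt.Println(kept) // false

	out, _ := apply(map[string]string{
		"__name__":    "neuroncore_utilization_ratio",
		"instance_id": "i-0abc",
		"neuroncore":  "3",
	}, rules)
	fmt.Println(out["DeviceId"], out["ClusterName"]) // 3 my-cluster
}
```

One consequence of using `(.*)` for the static rule, under these assumptions: the pattern also matches an empty value, so `ClusterName` would be set even on a series that has no `instance_id` label at all.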