-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ci neuron #1
Ci neuron #1
Conversation
update dcgm relabeling rules
receiver/awscontainerinsightreceiver/internal/neuron/neuron_monitor_scraper.go
Outdated
Show resolved
Hide resolved
receiver/awscontainerinsightreceiver/internal/neuron/neuron_monitor_scraper.go
Outdated
Show resolved
Hide resolved
{ | ||
SourceLabels: model.LabelNames{"__address__"}, | ||
Regex: relabel.MustNewRegexp("([^:]+)(?::\\d+)?"), | ||
Replacement: "${1}:8000", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My ignorance, why do you need this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still missing a comment in the new code about why we do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be removed , I think it doesnt affect testing
receiver/awscontainerinsightreceiver/internal/neuron/neuron_monitor_scraper.go
Outdated
Show resolved
Hide resolved
receiver/awscontainerinsightreceiver/internal/neuron/neuron_monitor_scraper_test.go
Outdated
Show resolved
Hide resolved
receiver/awscontainerinsightreceiver/internal/neuron/neuron_monitor_scraper_test.go
Outdated
Show resolved
Hide resolved
receiver/awscontainerinsightreceiver/internal/neuron/neuron_monitor_scraper_test.go
Outdated
Show resolved
Hide resolved
receiver/awscontainerinsightreceiver/internal/neuron/neuron_monitor_scraper_test.go
Outdated
Show resolved
Hide resolved
receiver/awscontainerinsightreceiver/internal/neuron/neuron_monitor_scraper.go
Outdated
Show resolved
Hide resolved
Made changes, if the initial structure for neuron looks fine I will make changes to make dcgm also use the simple class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, the code and tests are so much easier to read now! Please go ahead with converting DCGM exporter as well.
AttachMetadata: kubernetes.AttachMetadataConfig{ | ||
Node: true, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry didn't notice this before, but do you need this? What effect does it have?
{ | ||
SourceLabels: model.LabelNames{"__address__"}, | ||
Regex: relabel.MustNewRegexp("([^:]+)(?::\\d+)?"), | ||
Replacement: "${1}:8000", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still missing a comment in the new code about why we do this.
receiver/awscontainerinsightreceiver/internal/neuron/neuron_monitor_scraper_test.go
Show resolved
Hide resolved
jobName = "containerInsightsNeuronMonitorScraper" | ||
) | ||
|
||
func GetNueronScrapeConfig(opts prometheusscraper.SimplePromethuesScraperOpts) *config.ScrapeConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo "Nueron" - same on GetNueronMetricRelabelConfigs
below and in the package name itself.
receiver/awscontainerinsightreceiver/internal/prometheusscraper/simple_prometheus_scraper.go
Outdated
Show resolved
Hide resolved
scraper.Shutdown() | ||
}) | ||
|
||
// wait for 2 scrapes, one initiated by us, another by the new scraper process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still wondering about this one.
receiver/awscontainerinsightreceiver/internal/neuron/neuron_monitor_scraper_test.go
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can ignore this file, this will be merged in this PR - amazon-contributing#167
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
…emetry#24676) **Description:** The metadata.yml for the SSH check receiver currently documents a resource attribute containing the SSH endpoint but this is not emitted. This PR updates the receiver to include this resource attribute. **Link to tracking Issue:** open-telemetry#24441 **Testing:** Example collector config: ```yaml receivers: sshcheck: endpoint: 13.245.150.131:22 username: ec2-user key_file: /Users/dewald.dejager/.ssh/sandbox.pem collection_interval: 15s known_hosts: /Users/dewald.dejager/.ssh/known_hosts ignore_host_key: false resource_attributes: "ssh.endpoint": enabled: true exporters: logging: verbosity: detailed prometheus: endpoint: 0.0.0.0:8081 resource_to_telemetry_conversion: enabled: true service: pipelines: metrics: receivers: [sshcheck] exporters: [logging, prometheus] ``` The log output looks like this: ``` 2023-07-30T16:52:38.724+0200 info MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 2, "data points": 2} 2023-07-30T16:52:38.724+0200 info ResourceMetrics #0 Resource SchemaURL: Resource attributes: -> ssh.endpoint: Str(13.245.150.131:22) ScopeMetrics #0 ScopeMetrics SchemaURL: InstrumentationScope otelcol/sshcheckreceiver 0.82.0-dev Metric #0 Descriptor: -> Name: sshcheck.duration -> Description: Measures the duration of SSH connection. -> Unit: ms -> DataType: Gauge NumberDataPoints #0 StartTimestamp: 2023-07-30 14:52:22.381672 +0000 UTC Timestamp: 2023-07-30 14:52:38.404003 +0000 UTC Value: 319 Metric #1 Descriptor: -> Name: sshcheck.status -> Description: 1 if the SSH client successfully connected, otherwise 0. -> Unit: 1 -> DataType: Sum -> IsMonotonic: false -> AggregationTemporality: Cumulative NumberDataPoints #0 StartTimestamp: 2023-07-30 14:52:22.381672 +0000 UTC Timestamp: 2023-07-30 14:52:38.404003 +0000 UTC Value: 1 ``` And the Prometheus metrics look like this: ``` # HELP sshcheck_duration Measures the duration of SSH connection. # TYPE sshcheck_duration gauge sshcheck_duration{ssh_endpoint="13.245.150.131:22"} 311 # HELP sshcheck_status 1 if the SSH client successfully connected, otherwise 0. # TYPE sshcheck_status gauge sshcheck_status{ssh_endpoint="13.245.150.131:22"} 1 ```
) **Description:** Adding command line argument `--status-code` to `telemetrygen traces`, which accepts `(Unset,Error,Ok)` (case sensitive) or the enum equivalent of `(0,1,2)`. Running ```shell telemetrygen traces --otlp-insecure --traces 1 --status-code 1 ``` against a minimal local collector yields ```txt 2023-07-29T21:27:57.862+0100 info ResourceSpans #0 Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0 Resource attributes: -> service.name: Str(telemetrygen) ScopeSpans #0 ScopeSpans SchemaURL: InstrumentationScope telemetrygen Span #0 Trace ID : f6dc4be32c78b9999c69d504a79e68c1 Parent ID : 4e2cd6e0e90cf2ea ID : 20835413e32d26a5 Name : okey-dokey Kind : Server Start time : 2023-07-29 20:27:57.861602 +0000 UTC End time : 2023-07-29 20:27:57.861726 +0000 UTC Status code : Error Status message : Attributes: -> net.peer.ip: Str(1.2.3.4) -> peer.service: Str(telemetrygen-client) Span #1 Trace ID : f6dc4be32c78b9999c69d504a79e68c1 Parent ID : ID : 4e2cd6e0e90cf2ea Name : lets-go Kind : Client Start time : 2023-07-29 20:27:57.861584 +0000 UTC End time : 2023-07-29 20:27:57.861726 +0000 UTC Status code : Error Status message : Attributes: -> net.peer.ip: Str(1.2.3.4) -> peer.service: Str(telemetrygen-server) ``` and similarly (the string version) ```shell telemetrygen traces --otlp-insecure --traces 1 --status-code '"Ok"' ``` produces ```txt Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0 Resource attributes: -> service.name: Str(telemetrygen) ScopeSpans #0 ScopeSpans SchemaURL: InstrumentationScope telemetrygen Span #0 Trace ID : dfd830da170acfe567b12f87685d7917 Parent ID : 8e15b390dc6a1ccc ID : 165c300130532072 Name : okey-dokey Kind : Server Start time : 2023-07-29 20:29:16.026965 +0000 UTC End time : 2023-07-29 20:29:16.027089 +0000 UTC Status code : Ok Status message : Attributes: -> net.peer.ip: Str(1.2.3.4) -> peer.service: Str(telemetrygen-client) Span #1 Trace ID : dfd830da170acfe567b12f87685d7917 Parent ID : ID : 8e15b390dc6a1ccc Name : lets-go Kind : Client Start time : 2023-07-29 20:29:16.026956 +0000 UTC End time : 2023-07-29 20:29:16.027089 +0000 UTC Status code : Ok Status message : Attributes: -> net.peer.ip: Str(1.2.3.4) -> peer.service: Str(telemetrygen-server) ``` The default is `Unset` which is the current behaviour. **Link to tracking Issue:** 24286 **Testing:** Added unit tests which covers both valid and invalid inputs. **Documentation:** Command line arguments are self documenting via the usage info in the flag. Co-authored-by: Pablo Baeyens <[email protected]>
Co-authored-by: matianjun1 <[email protected]>
assert.Contains(t, err.Error(), "failed to parse proxy URL") | ||
assert.Contains(t, err.Error(), "invalid control character in URL") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge error will fix
caFile = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt" | ||
caFile = "/etc/amazon-cloudwatch-observability-agent-cert/tls-ca.crt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the changes in this file are coming from 9cb314e - can we merge it to ci-nvidia-gpu
first so they don't show up as diffs in your PR?
GetInstanceID() string | ||
} | ||
|
||
func GetScraperConfig(hostInfoProvider hostInfoProvider) *config.ScrapeConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're moving it here, shouldn't it be removed from internal/gpu/dcgmscraper.go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes will delete it correct, again mad merge
acir.podResourcesStore.AddResourceName("aws.amazon.com/neuroncore") | ||
acir.podResourcesStore.AddResourceName("aws.amazon.com/neuron") | ||
acir.podResourcesStore.AddResourceName("aws.amazon.com/neurondevice") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these share the constants from PodAttributesDecoratorConsumer
?
if acir.dcgmScraper != nil { | ||
acir.dcgmScraper.GetMetrics() //nolint:errcheck | ||
// if acir.dcgmScraper != nil { | ||
// acir.dcgmScraper.GetMetrics() //nolint:errcheck | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert
acir.settings.Logger.Info("We will start the Neuron Scraper") | ||
|
||
if acir.neuronMonitorScraper != nil { | ||
acir.settings.Logger.Info("Neuron Scraper is not NIL") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
} | ||
|
||
acir.settings.Logger.Info("If this happened Neuron is started or not") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
t.Setenv(defaultRegionEnvName, mockRegion) | ||
t.Setenv(regionEnvName, mockRegion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why all the changes in awsxrayreceiver?
CR for initial Reviews on the NeuronScraper
New revision: