Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ci neuron #1

Closed
wants to merge 35 commits into from
Closed

Ci neuron #1

wants to merge 35 commits into from

Conversation

sam6134
Copy link
Collaborator

@sam6134 sam6134 commented Feb 20, 2024

CR for initial Reviews on the NeuronScraper

New revision:

  • Rebased the GPU code, and made the decorator generic
  • Added a new decorator to add Pod-Attributes to the metric
  • Make both the Dcgm scraper and NeuronScraper implement the SimpleScraper

receiver/awscontainerinsightreceiver/config.go Outdated Show resolved Hide resolved
{
SourceLabels: model.LabelNames{"__address__"},
Regex: relabel.MustNewRegexp("([^:]+)(?::\\d+)?"),
Replacement: "${1}:8000",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My ignorance, why do you need this one?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still missing a comment in the new code about why we do this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be removed , I think it doesnt affect testing

receiver/awscontainerinsightreceiver/config.go Outdated Show resolved Hide resolved
@sam6134
Copy link
Collaborator Author

sam6134 commented Feb 23, 2024

Made changes, if the initial structure for neuron looks fine I will make changes to make dcgm also use the simple class

Copy link

@straussb straussb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, the code and tests are so much easier to read now! Please go ahead with converting DCGM exporter as well.

Comment on lines 43 to 45
AttachMetadata: kubernetes.AttachMetadataConfig{
Node: true,
},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry didn't notice this before, but do you need this? What effect does it have?

{
SourceLabels: model.LabelNames{"__address__"},
Regex: relabel.MustNewRegexp("([^:]+)(?::\\d+)?"),
Replacement: "${1}:8000",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still missing a comment in the new code about why we do this.

jobName = "containerInsightsNeuronMonitorScraper"
)

func GetNueronScrapeConfig(opts prometheusscraper.SimplePromethuesScraperOpts) *config.ScrapeConfig {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo "Nueron" - same on GetNueronMetricRelabelConfigs below and in the package name itself.

scraper.Shutdown()
})

// wait for 2 scrapes, one initiated by us, another by the new scraper process

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still wondering about this one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can ignore this file, this will be merged in this PR - amazon-contributing#167

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

aditya-purang pushed a commit that referenced this pull request Mar 6, 2024
…emetry#24676)

**Description:** The metadata.yml for the SSH check receiver currently
documents a resource attribute containing the SSH endpoint but this is
not emitted. This PR updates the receiver to include this resource
attribute.

**Link to tracking Issue:** open-telemetry#24441 

**Testing:**

Example collector config:
```yaml
receivers:
  sshcheck:
    endpoint: 13.245.150.131:22
    username: ec2-user
    key_file: /Users/dewald.dejager/.ssh/sandbox.pem
    collection_interval: 15s
    known_hosts: /Users/dewald.dejager/.ssh/known_hosts
    ignore_host_key: false
    resource_attributes:
      "ssh.endpoint":
        enabled: true

exporters:
  logging:
    verbosity: detailed
  prometheus:
    endpoint: 0.0.0.0:8081
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [sshcheck]
      exporters: [logging, prometheus]
```

The log output looks like this:
```
2023-07-30T16:52:38.724+0200    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 1, "metrics": 2, "data points": 2}
2023-07-30T16:52:38.724+0200    info    ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> ssh.endpoint: Str(13.245.150.131:22)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/sshcheckreceiver 0.82.0-dev
Metric #0
Descriptor:
     -> Name: sshcheck.duration
     -> Description: Measures the duration of SSH connection.
     -> Unit: ms
     -> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 2023-07-30 14:52:22.381672 +0000 UTC
Timestamp: 2023-07-30 14:52:38.404003 +0000 UTC
Value: 319
Metric #1
Descriptor:
     -> Name: sshcheck.status
     -> Description: 1 if the SSH client successfully connected, otherwise 0.
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
StartTimestamp: 2023-07-30 14:52:22.381672 +0000 UTC
Timestamp: 2023-07-30 14:52:38.404003 +0000 UTC
Value: 1
```

And the Prometheus metrics look like this:
```
# HELP sshcheck_duration Measures the duration of SSH connection.
# TYPE sshcheck_duration gauge
sshcheck_duration{ssh_endpoint="13.245.150.131:22"} 311
# HELP sshcheck_status 1 if the SSH client successfully connected, otherwise 0.
# TYPE sshcheck_status gauge
sshcheck_status{ssh_endpoint="13.245.150.131:22"} 1
```
aditya-purang pushed a commit that referenced this pull request Mar 6, 2024
)

**Description:** 

Adding command line argument `--status-code` to `telemetrygen traces`,
which accepts `(Unset,Error,Ok)` (case sensitive) or the enum equivalent
of `(0,1,2)`.

Running 

```shell
telemetrygen traces --otlp-insecure --traces 1 --status-code 1
```

against a minimal local collector yields

```txt
2023-07-29T21:27:57.862+0100	info	ResourceSpans #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0
Resource attributes:
     -> service.name: Str(telemetrygen)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope telemetrygen
Span #0
    Trace ID       : f6dc4be32c78b9999c69d504a79e68c1
    Parent ID      : 4e2cd6e0e90cf2ea
    ID             : 20835413e32d26a5
    Name           : okey-dokey
    Kind           : Server
    Start time     : 2023-07-29 20:27:57.861602 +0000 UTC
    End time       : 2023-07-29 20:27:57.861726 +0000 UTC
    Status code    : Error
    Status message :
Attributes:
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-client)
Span #1
    Trace ID       : f6dc4be32c78b9999c69d504a79e68c1
    Parent ID      :
    ID             : 4e2cd6e0e90cf2ea
    Name           : lets-go
    Kind           : Client
    Start time     : 2023-07-29 20:27:57.861584 +0000 UTC
    End time       : 2023-07-29 20:27:57.861726 +0000 UTC
    Status code    : Error
    Status message :
Attributes:
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-server)
```

and similarly (the string version)

```shell
telemetrygen traces --otlp-insecure --traces 1 --status-code '"Ok"'
```

produces 

```txt
Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0
Resource attributes:
     -> service.name: Str(telemetrygen)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope telemetrygen
Span #0
    Trace ID       : dfd830da170acfe567b12f87685d7917
    Parent ID      : 8e15b390dc6a1ccc
    ID             : 165c300130532072
    Name           : okey-dokey
    Kind           : Server
    Start time     : 2023-07-29 20:29:16.026965 +0000 UTC
    End time       : 2023-07-29 20:29:16.027089 +0000 UTC
    Status code    : Ok
    Status message :
Attributes:
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-client)
Span #1
    Trace ID       : dfd830da170acfe567b12f87685d7917
    Parent ID      :
    ID             : 8e15b390dc6a1ccc
    Name           : lets-go
    Kind           : Client
    Start time     : 2023-07-29 20:29:16.026956 +0000 UTC
    End time       : 2023-07-29 20:29:16.027089 +0000 UTC
    Status code    : Ok
    Status message :
Attributes:
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-server)
```

The default is `Unset` which is the current behaviour.

**Link to tracking Issue:**

24286

**Testing:**

Added unit tests which covers both valid and invalid inputs.

**Documentation:**

Command line arguments are self documenting via the usage info in the
flag.

Co-authored-by: Pablo Baeyens <[email protected]>
aditya-purang pushed a commit that referenced this pull request Mar 6, 2024
assert.Contains(t, err.Error(), "failed to parse proxy URL")
assert.Contains(t, err.Error(), "invalid control character in URL")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge error will fix

Comment on lines -28 to +29
caFile = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
caFile = "/etc/amazon-cloudwatch-observability-agent-cert/tls-ca.crt"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the changes in this file are coming from 9cb314e - can we merge it to ci-nvidia-gpu first so they don't show up as diffs in your PR?

GetInstanceID() string
}

func GetScraperConfig(hostInfoProvider hostInfoProvider) *config.ScrapeConfig {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're moving it here, shouldn't it be removed from internal/gpu/dcgmscraper.go?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes will delete it correct, again mad merge

Comment on lines +232 to +234
acir.podResourcesStore.AddResourceName("aws.amazon.com/neuroncore")
acir.podResourcesStore.AddResourceName("aws.amazon.com/neuron")
acir.podResourcesStore.AddResourceName("aws.amazon.com/neurondevice")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these share the constants from PodAttributesDecoratorConsumer?

Comment on lines -246 to +321
if acir.dcgmScraper != nil {
acir.dcgmScraper.GetMetrics() //nolint:errcheck
// if acir.dcgmScraper != nil {
// acir.dcgmScraper.GetMetrics() //nolint:errcheck
// }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert

acir.settings.Logger.Info("We will start the Neuron Scraper")

if acir.neuronMonitorScraper != nil {
acir.settings.Logger.Info("Neuron Scraper is not NIL")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

}

acir.settings.Logger.Info("If this happened Neuron is started or not")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

t.Setenv(defaultRegionEnvName, mockRegion)
t.Setenv(regionEnvName, mockRegion)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why all the changes in awsxrayreceiver?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants