
feat: Add stackable-telemetry utility crate #758

Merged: 52 commits merged into main on Apr 20, 2024

Conversation

@Techassi (Member) commented Mar 25, 2024

This set of changes introduces a new crate: stackable-telemetry.
So far, this includes:

  • A tracing initialization helper to reduce the boilerplate needed to set up logging/telemetry in operators and webhooks (see the sketch after this list). It can:
    • Enable console logging
    • Enable log exports via OTLP
    • Enable trace exports via OTLP
    • This is incompatible with stackable_operator::logging::initialize_logging() (which would be deprecated if we decide to use this in operators too).
    • The API might change in future PRs; for example, we might allow configuration through environment variables to save the operator from parsing additional arguments with clap.
  • Tracing middleware for axum.
    • Extracts the trace context from HTTP headers that may have been passed along by upstream systems, and adds selected HTTP request data as Span fields. This ensures traces cover disparate but related systems (enabling distributed tracing).
    • Adds OpenTelemetry semantic convention fields (such as the status code and whether there was an error) to Spans before sending the HTTP response.
    • Adds trace context headers to HTTP responses so the calling system can optionally make use of them.
    • In the future, we might need other middleware (e.g. for the HTTP client reqwest).

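For illustration, here is a minimal sketch of the console-logging boilerplate such a helper replaces, using the tracing-subscriber crate directly (the helper's actual API may differ):

use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

fn init_console_logging() {
    // Filter via RUST_LOG and print human-readable events to the console.
    tracing_subscriber::registry()
        .with(EnvFilter::from_default_env())
        .with(tracing_subscriber::fmt::layer())
        .init();
}

The helper is meant to bundle this together with the optional OTLP log and trace exporters.
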
In this set of changes, we update the stackable_webhook implementation to automatically use the axum tracing middleware.
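
As a rough illustration of the wiring (not this crate's actual API), axum tracing middleware is attached to a Router as a tower layer; tower-http's TraceLayer stands in here for the middleware added in this PR:

use axum::{routing::get, Router};
use tower_http::trace::TraceLayer;

fn router() -> Router {
    Router::new()
        .route("/", get(|| async { "Hello" }))
        // The stackable-telemetry middleware is attached the same way, but it also
        // extracts the incoming trace context and adds context headers to responses.
        .layer(TraceLayer::new_for_http())
}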

We intend to add metrics support in a future PR.

Tracked by stackabletech/issues#531

Screenshots

Search for Traces by Service and Span Name

[screenshot]

Looking at a Trace with its related Spans and Attributes

[screenshot]

Looking at Trace Events within each Span

[screenshot]

@Techassi Techassi self-assigned this Mar 25, 2024
@Techassi Techassi marked this pull request as draft March 25, 2024 15:05
Resolved review threads: crates/stackable-otel/src/layer.rs (2)
@Techassi Techassi changed the title feat: Add OpenTelemetry utility crate feat: Add stackable-telemetry utility crate Apr 4, 2024
@maltesander (Member) left a comment

First batch, will have a closer look at the code and test in the afternoon.
I did not comment on many of the FIXMEs and TODOs, as they will be addressed in another PR, as discussed.

Resolved review threads:
  crates/stackable-operator/src/commons/telemetry.rs (2)
  crates/stackable-webhook/src/lib.rs
  crates/stackable-webhook/src/tls.rs
  crates/stackable-telemetry/src/tracing.rs (2)
  crates/stackable-telemetry/src/instrumentation/axum/mod.rs
@maltesander (Member) left a comment

Running the demo etc. worked fine for me.

Some findings when running the demo:

Some events are dropped: I made 8 requests to the webhook (each answered with "Hello"), but only 3 showed up in Grafana. The dropping recurs constantly:

2024-04-18T14:35:38.268Z    info    memorylimiter/memorylimiter.go:222    Memory usage is above soft limit. Forcing a GC.    {"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 53}
2024-04-18T14:35:38.275Z    info    memorylimiter/memorylimiter.go:192    Memory usage after GC.    {"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 35}
2024-04-18T14:35:47.411Z    error    exporterhelper/queue_sender.go:101    Exporting failed. Dropping data.    {"kind": "exporter", "data_type": "traces", "name": "otlp/tempo", "error": "not retryable error: Permanent error: rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (5000000) exceeded while adding 834977 bytes to trace 03175d1b3375c763983f5a0637d17125 for tenant single-tenant", "dropped_items": 701}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
    go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
    go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:57
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
    go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43

The sidecar metrics from the dummy webhook, however, state that nothing was dropped:

[stackable@dummy-webhook-564d7f7868-d6pls /]$ curl localhost:8888/metrics | grep memory
# HELP otelcol_process_memory_rss Total physical memory (resident set size)
# TYPE otelcol_process_memory_rss gauge
otelcol_process_memory_rss{service_instance_id="27269d98-5fce-4829-8777-9cf0af6ff254",service_name="otelcol-contrib",service_version="0.97.0"} 1.60968704e+08
# HELP otelcol_process_runtime_total_sys_memory_bytes Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys')
# TYPE otelcol_process_runtime_total_sys_memory_bytes gauge
otelcol_process_runtime_total_sys_memory_bytes{service_instance_id="27269d98-5fce-4829-8777-9cf0af6ff254",service_name="otelcol-contrib",service_version="0.97.0"} 7.4929416e+07
otelcol_processor_accepted_spans{processor="memory_limiter",service_instance_id="27269d98-5fce-4829-8777-9cf0af6ff254",service_name="otelcol-contrib",service_version="0.97.0"} 127253
otelcol_processor_dropped_spans{processor="memory_limiter",service_instance_id="27269d98-5fce-4829-8777-9cf0af6ff254",service_name="otelcol-contrib",service_version="0.97.0"} 0
otelcol_processor_refused_spans{processor="memory_limiter",service_instance_id="27269d98-5fce-4829-8777-9cf0af6ff254",service_name="otelcol-contrib",service_version="0.97.0"} 0

Discussed with Nick; this is more of an opentelemetry-collector issue, so there's nothing to be done here. Leaving it here for future reference.

@adwk67 (Member) left a comment

Lots of helpful comments in the code - thank you!

Resolved review threads:
  crates/stackable-telemetry/src/instrumentation/axum/mod.rs
  crates/stackable-telemetry/src/tracing.rs (4)
Techassi and others added 4 commits April 19, 2024 09:35
This batch of changes fixes various typos.

Co-authored-by: Andrew Kenworthy <[email protected]>
Co-authored-by: Malte Sander <[email protected]>
Remove the OpenTelemetry-related CRD structs for now, because we currently don't fully understand all the requirements. The current work will be saved in a separate branch for future reference.
@stackabletech stackabletech deleted a comment from github-actions bot Apr 19, 2024
…rk when called from the application code, not in a library
@maltesander (Member) left a comment

Thanks! LGTM!

@adwk67 (Member) left a comment

Yes, same from me: outstanding issues are a non-blocking work-in-progress so feel free to merge this. Thanks!

@NickLarsenNZ (Member) commented Apr 19, 2024

Thanks @adwk67, @maltesander.

Regarding the issue with dropped traces:

2024-04-18T14:35:47.411Z    error    exporterhelper/queue_sender.go:101    Exporting failed. Dropping data.    {"kind": "exporter", "data_type": "traces", "name": "otlp/tempo", "error": "not retryable error: Permanent error: rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (5000000) exceeded while adding 834977 bytes to trace 03175d1b3375c763983f5a0637d17125 for tenant single-tenant", "dropped_items": 701}

I have investigated with a simple axum web server, and it seems to be a tracing loop: a request comes in and generates traces, those traces get exported via OTLP, which generates more traces, which then get exported via OTLP, and so on.

I am able to stop the loop by changing the LevelFilter for h2 (http2), and have asked on the CNCF #otel-rust channel to see if anyone else has experienced it.

# The variable name will be configurable in a future PR
RUST_LOG=trace,h2=off cargo run --release

We could also enforce it by hard-coding a directive.
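
For reference, hard-coding such a directive could look roughly like this with tracing-subscriber's EnvFilter (a sketch only; the final variable names and API are still open):

use tracing_subscriber::filter::{Directive, EnvFilter};

fn telemetry_filter() -> EnvFilter {
    // Respect RUST_LOG if set, otherwise default to `trace`,
    // then force `h2` off to break the export tracing loop.
    let h2_off: Directive = "h2=off".parse().expect("static directive is valid");
    EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("trace"))
        .add_directive(h2_off)
}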

Because we are not using this crate yet, I'm ok for this to be merged and fixed in a future PR.

@NickLarsenNZ NickLarsenNZ added this pull request to the merge queue Apr 20, 2024
Merged via the queue into main with commit 0e4076b Apr 20, 2024
17 checks passed
@NickLarsenNZ NickLarsenNZ deleted the feat/otel-crate branch April 20, 2024 20:27