feat: Add stackable-telemetry utility crate #758
Conversation
…configured subscribers.
This reverts commit 9a30a02.
First batch, will have a closer look at the code and test in the afternoon.
I did not comment on many FIXMEs and TODOs as they will follow in another PR as discussed.
Co-authored-by: Malte Sander <[email protected]>
Running the demo etc. worked fine for me.
Some findings from running the demo:
Some events are being dropped: I made 8 requests to the webhook (each answered with "Hello"), but only 3 showed up in Grafana.
The dropping recurs constantly:
2024-04-18T14:35:38.268Z info memorylimiter/memorylimiter.go:222 Memory usage is above soft limit. Forcing a GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 53}
2024-04-18T14:35:38.275Z info memorylimiter/memorylimiter.go:192 Memory usage after GC. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 35}
2024-04-18T14:35:47.411Z error exporterhelper/queue_sender.go:101 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "traces", "name": "otlp/tempo", "error": "not retryable error: Permanent error: rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (5000000) exceeded while adding 834977 bytes to trace 03175d1b3375c763983f5a0637d17125 for tenant single-tenant", "dropped_items": 701}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:57
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
The sidecar metrics from the dummy webhook somehow state that nothing was dropped:
[stackable@dummy-webhook-564d7f7868-d6pls /]$ curl localhost:8888/metrics | grep memory
# HELP otelcol_process_memory_rss Total physical memory (resident set size)
# TYPE otelcol_process_memory_rss gauge
otelcol_process_memory_rss{service_instance_id="27269d98-5fce-4829-8777-9cf0af6ff254",service_name="otelcol-contrib",service_version="0.97.0"} 1.60968704e+08
# HELP otelcol_process_runtime_total_sys_memory_bytes Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys')
# TYPE otelcol_process_runtime_total_sys_memory_bytes gauge
otelcol_process_runtime_total_sys_memory_bytes{service_instance_id="27269d98-5fce-4829-8777-9cf0af6ff254",service_name="otelcol-contrib",service_version="0.97.0"} 7.4929416e+07
otelcol_processor_accepted_spans{processor="memory_limiter",service_instance_id="27269d98-5fce-4829-8777-9cf0af6ff254",service_name="otelcol-contrib",service_version="0.97.0"} 127253
otelcol_processor_dropped_spans{processor="memory_limiter",service_instance_id="27269d98-5fce-4829-8777-9cf0af6ff254",service_name="otelcol-contrib",service_version="0.97.0"} 0
otelcol_processor_refused_spans{processor="memory_limiter",service_instance_id="27269d98-5fce-4829-8777-9cf0af6ff254",service_name="otelcol-contrib",service_version="0.97.0"} 0
Discussed with Nick; that's more of an opentelemetry-collector issue, so there's nothing to be done here. Leaving it here for future reference.
Lots of helpful comments in the code - thank you!
This batch of changes fixes various typos. Co-authored-by: Andrew Kenworthy <[email protected]> Co-authored-by: Malte Sander <[email protected]>
Remove the OpenTelemetry-related CRD structs for now, because we currently don't fully understand all the requirements. The current work will be saved in a separate branch for future reference.
67062be to 536f58b
…rk when called from the application code, not in a library
Thanks! LGTM!
Yes, same from me: outstanding issues are a non-blocking work-in-progress so feel free to merge this. Thanks!
Thanks @adwk67, @maltesander. Regarding the issue with dropped traces:
I have investigated with a simple axum web server, and it seems to be a tracing loop: a request comes in and generates traces, the traces get sent via OTLP, which causes more traces, which then get sent via OTLP, and so on. I am able to stop the loop by changing the LevelFilter for the h2 crate:
# The variable name will be configurable in a future PR
RUST_LOG=trace,h2=off cargo run --release
We could also enforce it by hard-coding a directive. Because we are not using this crate yet, I'm ok for this to be merged and fixed in a future PR.
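For reference, hard-coding such a directive could look roughly like the sketch below. This is not code from this PR; it assumes tracing-subscriber with the env-filter feature, and the RUST_LOG variable name is just the current placeholder mentioned above:

```rust
// Sketch only: force-disable `h2` spans on top of the user-supplied filter,
// which should break the OTLP export feedback loop described above.
use tracing_subscriber::EnvFilter;

fn build_env_filter() -> EnvFilter {
    // Variable name is a placeholder; the PR notes it will be configurable later.
    EnvFilter::try_from_env("RUST_LOG")
        .unwrap_or_else(|_| EnvFilter::new("info"))
        // Hard-coded directive: never emit spans/events from the `h2` crate.
        .add_directive("h2=off".parse().expect("valid filter directive"))
}
```

The idea is that whatever filter the user supplies, h2 spans stay off, so the HTTP/2 traffic generated by the exporter itself can no longer feed new spans back into the pipeline.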
This set of changes introduces a new crate: stackable-telemetry.
So far, this includes:
- Subscriber setup meant to replace stackable_operator::logging::initialize_logging() (which would be deprecated if we decide to use this in operators too).
- Integration with clap.
- OTLP export (reqwest).
In this set of changes, we update the stackable_webhook implementation to automatically use the axum tracing middleware. We intend to add metrics support in a future PR.
Tracked by stackabletech/issues#531
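For anyone unfamiliar with what such an init helper bundles, here is a rough, generic sketch of the equivalent manual setup. This is not the stackable-telemetry API; it uses tracing-subscriber, tracing-opentelemetry, and the 0.1x-era opentelemetry-otlp pipeline builder, and the endpoint is an assumed local OTLP/HTTP collector:

```rust
use opentelemetry_otlp::WithExportConfig;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // OTLP trace exporter over HTTP; the endpoint here is an assumption.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .http()
                .with_endpoint("http://localhost:4318"),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    // Console output plus OTLP export on one registry -- roughly the wiring a
    // single telemetry init call would replace compared to initialize_logging().
    tracing_subscriber::registry()
        .with(EnvFilter::from_default_env())
        .with(tracing_subscriber::fmt::layer())
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();

    tracing::info!("telemetry initialized");
    Ok(())
}
```

Crate versions need to line up (opentelemetry, opentelemetry_sdk, opentelemetry-otlp, and tracing-opentelemetry from the same release train); the point of the new crate is to hide exactly this kind of wiring behind one call.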
Screenshots
Search for Traces by Service and Span Name
Looking at a Trace with its related Spans and Attributes
Looking at Trace Events within each Span