Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instrument driver with OpenTelemetry tracing #147

Open
wants to merge 10 commits into
base: scylla-3.x
Choose a base branch
from

Conversation

wprzytula
Copy link

In order to increase driver's observability, it has been instrumented with OTel tracing. Requests, speculative executions and particular retries are covered with appropriate spans, which are filled with informative tags. The tracing incurs no overhead by default (when off), and bearable overhead when on. The PR contains (not many yet) tests and a Zipkin-based example.

@avelanarius
Copy link

Some of the tests are failing. Those should be fixed by: zpp-2021-telemetry/java-driver#10 - please look if this PR could be simplified (the reason it wasn't merged was that it was a bit too complex to understand and not documented well enough).

@avelanarius
Copy link

avelanarius commented Jul 26, 2022

Minor nit: formatting of commit messages: there shouldn't be a . at the end and line length of the body of the commit message shouldn't be longer than 72 characters (title no longer than 50 characters).

@wprzytula wprzytula force-pushed the scylla-3.x-otel branch 5 times, most recently from 6134b17 to f838d27 Compare July 29, 2022 07:59
In order to be generic and to still stick to Java 6 requirement for
driver-core, abstract classes: TracingInfo and TracingInfoFactory are
introduced. Further implementations of tracing are expected
to be placed apart from driver-core and hence keep Java 8 (or higher)
requirements away.
@wprzytula wprzytula force-pushed the scylla-3.x-otel branch 3 times, most recently from 5b6e60b to 15e2d53 Compare August 4, 2022 12:29
@wprzytula wprzytula requested a review from avelanarius August 4, 2022 12:29
@wprzytula wprzytula force-pushed the scylla-3.x-otel branch 11 times, most recently from 4cbfa72 to e210be5 Compare August 9, 2022 06:05
@wprzytula
Copy link
Author

After long adventures with failing tests, the PR is finally ready for last review.

@wprzytula
Copy link
Author

@avelanarius the requested fixes regading boundValues and partitionKey have been added.

As some gathered data can be heavy or sensitive or we might want
to not have them in the trace for some reason, verbosity level layer
has been added. Currently, statement is both possibly long
and containing some sensitive data, so it is included only on level
FULL (and not NORMAL).
While building a Cluster object, one can pass a TracingInfo class
implementation and it will be used for tracing in all sessions
with this cluster from now on.
Every request (in RequestHandler) is covered by one "request" span,
and every (speculative) execution has its own "speculative_execution"
span. Each retry attempt is covered by "attempt" span, effectively
yielding a tree of form:
"request" -1--many-> "speculative_execution" -1--many-> "attempt"
After a request is completed, its status code (OK or ERROR) is added
to trace data, as well as exception information (if one occured).
All data that constitute added tags are already know by the driver,
i.e. they do not require fetching any additional metadata from
the cluster.
@wprzytula
Copy link
Author

Renamed PrecisionLevel to VerbosityLevel.

These tests verify that the span tree structure is valid and that spans
are filled with tags of proper content, by providing a mock
implementation of TracingInfo.
Added TracingInfo and TracingInfoFactory implementations in separate
driver-opentelemetry module.
Added OpenTelemetry and Zipkin connection configuration classes,
along with docker-compose file, example run script and related changes
in README.
@wprzytula wprzytula force-pushed the scylla-3.x-otel branch 2 times, most recently from 53cdcaa to cc23ceb Compare September 26, 2022 07:52
The test verifies that span tree structure and status code are valid.
Speculative executions are run parallel to the main thread, so some
of them can finish only after query result has been returned.
Thus, in order to collect span data from entire request, we decided
to wait until all speculative executions end. The main thread uses
conditional variable `allEnded` to wait for them and lock is used
for concurrent mutation of activeSpans.
@avelanarius avelanarius force-pushed the scylla-3.x branch 2 times, most recently from 0d1c3e9 to f5fc83d Compare November 24, 2022 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants