Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/mch rebased otel #25

Merged
merged 5 commits into from
Dec 2, 2024
Merged

Conversation

MiquelZuehlke
Copy link
Collaborator

Minimum instrumentation for tracking the full request lifecycle

Copy link

@samweyermann samweyermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't gone through worker.py yet, but here are some initial comments

@@ -57,6 +57,7 @@ class Request:
"user_request",
"content_length",
"content_type",
"otel_trace_ctx",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible we will want other otel fields in the future? If so, I recommend calling this otel_trace and storing the context as a field within to avoid polluting the request with many otel_trace_* slots or the expensive refactor in the future to change this slot.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a valid point, although the goal is to propagate only the context, keeping the implementation minimally intrusive in the codebase, I agree that makes sense to have a more general naming for this field

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Based on the docs I think just the context is reasonable.

polytope_server/common/observability/otel.py Show resolved Hide resolved
# Adding trace_id and span_id to the request data for logging
request.otel_trace_ctx = {
'carrier': carrier # Store the injected carrier
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7-15 are also repeated in update_trace_context. Extract these

@contextmanager
def create_new_span_producer(span_name, request_id=None, parent_context=None, kind=trace.SpanKind.SERVER):
with create_new_span(span_name, request_id, parent_context, kind) as span:
span.set_attribute("role", "producer")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to do this as four separate methods rather than passing the role as an argument? The only one that is odd is create_new_span_internal since there is only one valid mapping from kind to role.

In the case of passing the role, it could be passed as an enum value similarly to kind so that we limit what the options are to these three methods

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you are right, during testing, I frequently had to switch between Consumer, Producer, and Server span kinds. To simplify this and avoid modifying each service individually, I created dedicated functions for each type. This way, it's clear that, for example, a Producer span is being forced to a Server kind due to AWS requirements. Following that logic, it felt inconsistent to leave Internal spans handled differently, so I aligned it with the same approach.

polytope_server/common/observability/otel.py Show resolved Hide resolved
return extracted_context

@contextmanager
def create_new_span_internal(span_name, request_id=None, parent_context=None, kind=trace.SpanKind.INTERNAL):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If keeping this as four methods, make it clear that _internal is referring to the span and not that this is a way of marking this as an internal method (or perhaps someone more experience with Python would understand this immediately?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great question. From my perspective, when discussing OpenTelemetry and spans, it should be clear that we're referring to the type of the span, as that's its primary characteristic.

Given the context of OpenTelemetry, I believe it would be intuitive for someone familiar with the framework to understand that _internal refers to the span type rather than indicating a private or internal method.

}

current_span = trace.get_current_span()
current_span.set_attribute("polytope.request.id", request.id)
Copy link

@samweyermann samweyermann Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we set this attribute in add_ and restore_? It seems like this should be done only in create_new_span as otherwise it seems that it will be unreliably set.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are passing the context between services, therefore, each service starts a new span where we restore the context (so we keep track of the propagation) but the attributes are missing and need to be updated again

else:
request.user_message += "Success"
# Creating new internal span for the datasource dispatch (calculate roundtrip time)
with create_new_span_internal("Datatsource_{}".format(ds.get_type())) as span_ds:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not add the request_id here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context has already been restored in this point. Since the context is tied to a specific request, restoring the context also sets the request ID into the parent span during the restoration process.

polytope_server/worker/worker.py Show resolved Hide resolved
polytope_server/worker/worker.py Show resolved Hide resolved
@MiquelZuehlke MiquelZuehlke merged commit 2f15090 into develop-mch Dec 2, 2024
@MiquelZuehlke MiquelZuehlke deleted the feature/mch-rebased-otel branch December 2, 2024 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants