
[processor/tailsampling] decision_wait time and the lifespan of a trace #36291

Open
AliArfan opened this issue Nov 11, 2024 · 6 comments
Labels
discussion needed Community discussion needed enhancement New feature or request processor/tailsampling Tail sampling processor

Comments

@AliArfan

Component(s)

processor/tailsampling

Describe the issue you're reporting

Hi,

I am fairly new to the tail sampling processor, but I would like to ask if there is a solution to my use case. After reading the documentation and looking at examples online, my only viable option seems to be increasing the decision_wait time.

Problem Statement

We have a gRPC collector that processes each message from Cisco devices, and we have instrumented it with OpenTelemetry to gain insight into the application's health. However, we noticed that we produce about 20 GB of trace data per month. We would therefore like to use tail sampling to reduce the sampled data, keeping only a probabilistic sample plus traces with status_code: ERROR.

The problem arises when decision_wait elapses before our application's retry and backoff mechanism has finished. For example, if publishing a message to RabbitMQ fails, we retry with an increasing backoff interval, so the decision_wait configured in the tail sampling processor is too short to include all the retry spans.
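
For reference, a minimal sketch of the collector config for the setup described above, following the tail sampling processor's documented policy types. The percentage and wait values are illustrative assumptions, not recommendations:

```yaml
processors:
  tail_sampling:
    # How long to wait after the first span of a trace arrives
    # before the sampling policies are evaluated.
    decision_wait: 30s
    policies:
      # Keep every trace containing at least one span with status ERROR.
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep a small probabilistic sample of everything else.
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

With this config, a retry span carrying an error that arrives after the 30s window has closed will miss the policy evaluation, which is exactly the problem described above.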

Is there a way to sample all the error spans on retry, even after the decision_wait time has been reached?

It would be nice if there were a trace_start and trace_end we could use to only process traces that are complete.

Thank you!

@AliArfan AliArfan added the needs triage New item requiring triage label Nov 11, 2024
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Nov 11, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bacherfl
Contributor

Hi @AliArfan - Looking at the docs, there is a decision_cache option that remembers sampling decisions for a given trace ID beyond the decision_wait duration - is that something you could use for this purpose?

@AliArfan
Author

Hi @bacherfl

Thank you for your quick response!

I just took a look at the docs, and this is what I found about the decision_cache:

decision_cache (default = sampled_cache_size: 0): Configures amount of trace IDs to be kept in an LRU cache, persisting the "keep" decisions for traces that may have already been released from memory. By default, the size is 0 and the cache is inactive. If using, configure this as much higher than num_traces so decisions for trace IDs are kept longer than the span data for the trace.

Per my understanding, this saves the trace decision after the trace has been released from memory. My problem is that for some traces the decision is made too early for our application's edge cases (before we receive an error). I would rather not increase decision_wait, as that would slow down processing overall. So I was looking for something that lets us process traces after their lifetime has ended on the application side, for example a policy that samples on trace_complete.

If I could use the decision_cache to alter a decision after receiving later spans with the same trace_id, that would be great! For example, say we decided not to sample a trace, but then receive an error span with the same trace_id after the collector has already made the decision. If we could fetch the trace from the cache, alter the decision, and export it, we would get our desired behavior.
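
For context, a sketch of how the quoted decision_cache option is wired in alongside the related settings. The sizes here are illustrative assumptions; per the docs, the cache should be configured much larger than num_traces:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    # Number of traces whose span data is held in memory.
    num_traces: 50000
    # Remember "keep" decisions well beyond the in-memory span data,
    # so late-arriving spans of an already-sampled trace are also kept.
    decision_cache:
      sampled_cache_size: 500000
```

Note that, as described in the quoted docs, this cache only persists the "keep" decisions: it forwards late spans of traces that were already sampled, but it cannot revisit a "drop" decision when an error span arrives later.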

@bacherfl
Contributor

Thank you for clarifying @AliArfan! I see. In that case the decision_cache only helps if the trace previously reached an error state and was already sampled. If the error state is only reached after the decision_wait window, then, as I understand it, increasing decision_wait is the workaround for now.
Regarding the introduction of a trace_complete policy, I will defer to the code owners of this processor for their opinion on whether this could be done. FYI @jpkrohling

@bacherfl bacherfl added enhancement New feature or request discussion needed Community discussion needed and removed needs triage New item requiring triage labels Nov 12, 2024
@AliArfan
Author

Thank you @bacherfl! Now I know that increasing decision_wait is our only option for now. Looking forward to the response from the devs :)

@jpkrohling
Member

@bacherfl is completely right here. One of the problems with span-based traces is that there's no "trace" per se: a trace is only a collection of spans that happen to share the same trace ID. Therefore, it's impossible for us to determine when a trace has completed, especially in async use cases.

At the moment, what I can recommend is indeed increasing the decision wait property.
