
[processor/tailsampling] decision_wait time and the lifespan of a trace #36291

Open
AliArfan opened this issue Nov 11, 2024 · 6 comments
Labels
discussion needed Community discussion needed enhancement New feature or request processor/tailsampling Tail sampling processor

Comments

@AliArfan

Component(s)

processor/tailsampling

Describe the issue you're reporting

Hi,

I am fairly new to the tail sampling processor, but I would like to ask if there is a solution to my use case. After reading the documentation and looking at examples online, my only viable option seems to be increasing the decision_wait time.

Problem Statement

We have a gRPC collector that processes each message from Cisco devices, and we have instrumented it with OpenTelemetry to gain insight into the application's health. However, we noticed that we produce about 20 GB of trace data per month. We would therefore like to use tail sampling to reduce the sampled data, keeping only a probabilistic sample plus traces with status_code: ERROR.

The problem arises when decision_wait elapses before our application's retry and backoff mechanism has finished. For example, if publishing a message to RabbitMQ fails, we retry with an increasing backoff interval, so the decision_wait configured in the tail sampling processor is too short to include all the retry spans.
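
For reference, a minimal sketch of the collector config for the setup described above, following the tail sampling processor's documented policy types. The percentage and wait values are illustrative assumptions, not recommendations:

```yaml
processors:
  tail_sampling:
    # How long to wait after the first span of a trace arrives
    # before the sampling policies are evaluated.
    decision_wait: 30s
    policies:
      # Keep every trace containing at least one span with status ERROR.
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep a small probabilistic sample of everything else.
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

With this config, a retry span carrying an error that arrives after the 30s window has closed will miss the policy evaluation, which is exactly the problem described above.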

Is there a way to sample all the error spans on retry, even after the decision_wait time has been reached?

It would be nice if there were a trace_start and trace_end we could use to only process traces that are complete.

Thank you!

@AliArfan AliArfan added the needs triage New item requiring triage label Nov 11, 2024
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Nov 11, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bacherfl
Contributor

Hi @AliArfan - Looking at the docs, there is a decision_cache option that remembers sampling decisions for a given trace ID beyond the decision_wait duration - is that something you could use for this purpose?

@AliArfan
Author

Hi @bacherfl

Thank you for your quick response!

I just took a look at the docs, and this is what I found about the decision_cache:

decision_cache (default = sampled_cache_size: 0): Configures amount of trace IDs to be kept in an LRU cache, persisting the "keep" decisions for traces that may have already been released from memory. By default, the size is 0 and the cache is inactive. If using, configure this as much higher than num_traces so decisions for trace IDs are kept longer than the span data for the trace.

Per my understanding, this saves the trace decision after the trace has been released from memory. My problem is that for some traces the decision is made too early for our application's edge cases (before we receive an error). I would rather not increase decision_wait, as that would slow down processing overall. So I was looking for something that lets us process traces after their lifetime has ended on the application side, for example a policy that samples on trace_complete.

If I could use the decision_cache to alter a decision after receiving later spans with the same trace_id, that would be great! For example, say we decided not to sample a trace, but then receive an error span with the same trace_id after the collector has already made the decision. If we could fetch the trace from the cache, alter the decision, and export it, we would get our desired behavior.
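
For context, a sketch of how the quoted decision_cache option is wired in alongside the related settings. The sizes here are illustrative assumptions; per the docs, the cache should be configured much larger than num_traces:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    # Number of traces whose span data is held in memory.
    num_traces: 50000
    # Remember "keep" decisions well beyond the in-memory span data,
    # so late-arriving spans of an already-sampled trace are also kept.
    decision_cache:
      sampled_cache_size: 500000
```

Note that, as described in the quoted docs, this cache only persists the "keep" decisions: it forwards late spans of traces that were already sampled, but it cannot revisit a "drop" decision when an error span arrives later.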

@bacherfl
Contributor

Thank you for clarifying @AliArfan! I see. In that case the decision_cache only helps if the trace previously reached an error state and was already sampled. If the error state is only reached after the decision_wait window, then, as I understand it, increasing decision_wait is the workaround for now.
Regarding the introduction of a trace_complete policy, I will defer to the code owners of this processor for their opinion on whether this could be done. FYI @jpkrohling

@bacherfl bacherfl added enhancement New feature or request discussion needed Community discussion needed and removed needs triage New item requiring triage labels Nov 12, 2024
@AliArfan
Author

Thank you @bacherfl! Now I know that increasing decision_wait is our only option for now. Looking forward to the response from the devs :)

@jpkrohling
Member

@bacherfl is completely right here. One of the problems with span-based traces is that there's no "trace" per se: a trace is only a collection of spans that happen to share the same trace ID. Therefore, it's impossible for us to determine when a trace has completed, especially in async use cases.

At the moment, what I can recommend is indeed increasing the decision wait property.
