From 4f4e3fa119235b3027d20c0ad3382955bd4a273c Mon Sep 17 00:00:00 2001
From: Ruben van Staden
Date: Tue, 22 Apr 2025 17:36:19 -0400
Subject: [PATCH 1/5] add TBS policy example to explain service order

---
 .../observability/apm/transaction-sampling.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/solutions/observability/apm/transaction-sampling.md b/solutions/observability/apm/transaction-sampling.md
index 68ded94b5..1a80124b3 100644
--- a/solutions/observability/apm/transaction-sampling.md
+++ b/solutions/observability/apm/transaction-sampling.md
@@ -272,7 +272,7 @@ Trace events are matched to policies in the order specified. Each policy list mu
 Note that from version `9.0.0` APM Server has an unlimited storage limit, but will stop writing when the disk where the database resides reaches 80% usage. Due to how the limit is calculated and enforced, the actual disk space may still grow slightly over this disk usage based limit, or any configured storage limit.
 ::::
 
-### Example configuration [_example_configuration]
+### Example configuration A [_example_configuration_a]
 
 This example defines three tail-based sampling polices:
 
@@ -290,6 +290,20 @@ This example defines three tail-based sampling polices:
 2. Samples 1% of traces in `production` with the trace name `"GET /not_important_route"`
 3. Default policy to sample all remaining traces at 10%, e.g. traces in a different environment, like `dev`, or traces with any other name
 
+### Example configuration B [_example_configuration_b]
+
+For a trace that originates in Service A and ends in Service B without error, what would the sampling be?
+
+```yaml
+- sample_rate: 0.3
+  service.name: B
+- sample_rate: 0.5
+  service.name: A
+- sample_rate: 1.0 # Always set a default
+```
+
+In the example, only 50% of traces will be sampled. The service that start the trace (Service A) has precedence over child services (Service B). The order of services does not matter, what matters, is in what service the trace event start. Service A, is were the trace starts, and therefore will always have precedence over "child" services that only create spans (Service B). If we start at Service B instead, pass on the context to Service A, which then adds a child span, then, the policy of `service.name: B` will take precedence over that of `service.name: A`. This is because we are working on the *trace level* rather than the *service level*.
+
 ### Configuration reference [_configuration_reference]
 
 #### Top-level tail-based sampling settings [_top_level_tail_based_sampling_settings]
From edd9143d81087ad53c32ff05574f5aa9ad59e1b6 Mon Sep 17 00:00:00 2001
From: Ruben van Staden
Date: Tue, 22 Apr 2025 17:49:53 -0400
Subject: [PATCH 2/5] add TBS policy example to explain trace order

---
 .../observability/apm/transaction-sampling.md | 29 +++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/solutions/observability/apm/transaction-sampling.md b/solutions/observability/apm/transaction-sampling.md
index 1a80124b3..af7e15a59 100644
--- a/solutions/observability/apm/transaction-sampling.md
+++ b/solutions/observability/apm/transaction-sampling.md
@@ -304,6 +304,35 @@ For a trace that originates in Service A and ends in Service B without error, wh
 
 In the example, only 50% of traces will be sampled. The service that start the trace (Service A) has precedence over child services (Service B). The order of services does not matter, what matters, is in what service the trace event start. Service A, is were the trace starts, and therefore will always have precedence over "child" services that only create spans (Service B). If we start at Service B instead, pass on the context to Service A, which then adds a child span, then, the policy of `service.name: B` will take precedence over that of `service.name: A`. This is because we are working on the *trace level* rather than the *service level*.
 
+### Example configuration C [_example_configuration_c]
+
+For a trace that originates in A and has an error in B, what would the sampling be?
+
+```yaml
+# Example A
+- sample_rate: 0.2
+  service.name: A
+- sample_rate: 0.5
+  trace.outcome: failure
+- sample_rate: 1.0 # Always set a default
+
+# Example B
+- sample_rate: 0.2
+  trace.outcome: failure
+- sample_rate: 0.5
+  service.name: alice
+- sample_rate: 1.0
+```
+
+- In Example A, we are stating that we want a 20% sample rate for trace events originating from Service A, but for all other failed traces we want a sample rate of 50%.
+- However, in Example B, we want all failed traces to sample at 20%, including Service A.
+
+The order matters for `trace` policies relative to `service` policies. This has to do with how to define *specificity* in a distributed system. A *trace*, is an abstract concept that "spans" over a range of distributed services. This is by definition, since we want to be able to "trace" an event across multiple service. So then, when we define a policy on the trace level (such as `trace.outcome: failure`), we are implicitly defining this policy for a range of services.
+
+If you want to always capture all failed traces, you should define it at the top of your policy list with a value of 1.0. And then define more specific policies for specific services to capture edge cases. If an error happens in a child (Service B), ensure to propagate this error back up to the parent (Service A), which then makes the decision as to whether you want to trace as a whole to fail or not. This logic has to happen on the application layer.
+
+A child failing doesn't imply a distributed trace should fail. It is possible that the child call is just a nice-to-have and there are backup plans when that fails. For example, an application can fail to call a cache, but it can still read/write to a database directly. The trace shouldn't fail just because the cache isn't available.
+
 ### Configuration reference [_configuration_reference]
 
 #### Top-level tail-based sampling settings [_top_level_tail_based_sampling_settings]
From 78f4ffbdb51d32c729411a9a79b9e471051cdd99 Mon Sep 17 00:00:00 2001
From: Ruben van Staden
Date: Tue, 22 Apr 2025 18:36:46 -0400
Subject: [PATCH 3/5] simplify and improve language

---
 .../observability/apm/transaction-sampling.md | 27 ++++++++++---------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/solutions/observability/apm/transaction-sampling.md b/solutions/observability/apm/transaction-sampling.md
index af7e15a59..5ae6aaba2 100644
--- a/solutions/observability/apm/transaction-sampling.md
+++ b/solutions/observability/apm/transaction-sampling.md
@@ -292,31 +292,34 @@ This example defines three tail-based sampling polices:
 
 ### Example configuration B [_example_configuration_b]
 
-For a trace that originates in Service A and ends in Service B without error, what would the sampling be?
+When a trace originates in Service A and then calls Service B (without errors), the sampling rate is determined by the service where the trace starts:
 
 ```yaml
 - sample_rate: 0.3
   service.name: B
 - sample_rate: 0.5
   service.name: A
-- sample_rate: 1.0 # Always set a default
+- sample_rate: 1.0 # Fallback: always set a default
 ```
 
-In the example, only 50% of traces will be sampled. The service that start the trace (Service A) has precedence over child services (Service B). The order of services does not matter, what matters, is in what service the trace event start. Service A, is were the trace starts, and therefore will always have precedence over "child" services that only create spans (Service B). If we start at Service B instead, pass on the context to Service A, which then adds a child span, then, the policy of `service.name: B` will take precedence over that of `service.name: A`. This is because we are working on the *trace level* rather than the *service level*.
+- Because Service A is the root of the trace, its policy (0.5) takes precedence over Service B's policy (0.3).
+- If instead the trace began in Service B (and then passed to Service A), the policy for Service B would apply.
+
+> **Key point**: Tail‑based sampling rules are evaluated at the *trace level* based on where the trace was initiated, not on downstream spans (*service level*).
 
 ### Example configuration C [_example_configuration_c]
 
-For a trace that originates in A and has an error in B, what would the sampling be?
+When you need to combine service‑specific policies with outcomes (e.g. failures), policy order defines specificity:
 
 ```yaml
-# Example A
+# Example A: prioritize service origin, then failures
 - sample_rate: 0.2
   service.name: A
 - sample_rate: 0.5
   trace.outcome: failure
-- sample_rate: 1.0 # Always set a default
+- sample_rate: 1.0 # Default
 
-# Example B
+# Example B: prioritize failures, then a specific service
 - sample_rate: 0.2
   trace.outcome: failure
 - sample_rate: 0.5
@@ -324,14 +327,12 @@ For a trace that originates in A and has an error in B, what would the sampling
 - sample_rate: 1.0
 ```
 
-- In Example A, we are stating that we want a 20% sample rate for trace events originating from Service A, but for all other failed traces we want a sample rate of 50%.
-- However, in Example B, we want all failed traces to sample at 20%, including Service A.
-
-The order matters for `trace` policies relative to `service` policies. This has to do with how to define *specificity* in a distributed system. A *trace*, is an abstract concept that "spans" over a range of distributed services. This is by definition, since we want to be able to "trace" an event across multiple service. So then, when we define a policy on the trace level (such as `trace.outcome: failure`), we are implicitly defining this policy for a range of services.
+- In Example A, traces from Service A are sampled at 20%, and all other failed traces (regardless of service) are sampled at 50%.
+- In Example B, every failed trace is sampled at 20%, including those originating from Service A.
 
-If you want to always capture all failed traces, you should define it at the top of your policy list with a value of 1.0. And then define more specific policies for specific services to capture edge cases. If an error happens in a child (Service B), ensure to propagate this error back up to the parent (Service A), which then makes the decision as to whether you want to trace as a whole to fail or not. This logic has to happen on the application layer.
+Policies targeting the trace (e.g. `trace.outcome: failure`) apply across all services and should appear before more specific, service‑level rules if you want them to take precedence.
 
-A child failing doesn't imply a distributed trace should fail. It is possible that the child call is just a nice-to-have and there are backup plans when that fails. For example, an application can fail to call a cache, but it can still read/write to a database directly. The trace shouldn't fail just because the cache isn't available.
+> **Key point**: Define failure policy at the top to ensure capturing all failed traces, then define more specific policies for specific services to capture edge cases.
 
 ### Configuration reference [_configuration_reference]
 
From 7fa4942dcace5da500c8c4b05f0dc5cf11a29bfb Mon Sep 17 00:00:00 2001
From: Ruben van Staden
Date: Mon, 28 Apr 2025 16:50:36 -0400
Subject: [PATCH 4/5] improve language around explaining how policies are evaluated

---
 .../observability/apm/transaction-sampling.md | 20 ++++++++-----------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/solutions/observability/apm/transaction-sampling.md b/solutions/observability/apm/transaction-sampling.md
index 5ae6aaba2..b1468985d 100644
--- a/solutions/observability/apm/transaction-sampling.md
+++ b/solutions/observability/apm/transaction-sampling.md
@@ -272,7 +272,7 @@ Trace events are matched to policies in the order specified. Each policy list mu
 Note that from version `9.0.0` APM Server has an unlimited storage limit, but will stop writing when the disk where the database resides reaches 80% usage. Due to how the limit is calculated and enforced, the actual disk space may still grow slightly over this disk usage based limit, or any configured storage limit.
 ::::
 
-### Example configuration A [_example_configuration_a]
+### Example configuration 1 [_example_configuration_1]
 
 This example defines three tail-based sampling polices:
 
@@ -290,9 +290,9 @@ This example defines three tail-based sampling polices:
 2. Samples 1% of traces in `production` with the trace name `"GET /not_important_route"`
 3. Default policy to sample all remaining traces at 10%, e.g. traces in a different environment, like `dev`, or traces with any other name
 
-### Example configuration B [_example_configuration_b]
+### Example configuration 2 [_example_configuration_2]
 
-When a trace originates in Service A and then calls Service B (without errors), the sampling rate is determined by the service where the trace starts:
+When a trace originates in Service A and then calls Service B, the sampling rate is determined by the service where the trace starts:
 
 ```yaml
 - sample_rate: 0.3
@@ -302,14 +302,14 @@ When a trace originates in Service A and then calls Service B (without errors),
 - sample_rate: 1.0 # Fallback: always set a default
 ```
 
-- Because Service A is the root of the trace, its policy (0.5) takes precedence over Service B's policy (0.3).
+- Because Service A is the root of the trace, its policy (0.5) is applied while Service B's policy (0.3) is ignored.
 - If instead the trace began in Service B (and then passed to Service A), the policy for Service B would apply.
 
-> **Key point**: Tail‑based sampling rules are evaluated at the *trace level* based on where the trace was initiated, not on downstream spans (*service level*).
+> **Key point**: Tail‑based sampling rules are evaluated at the *trace level* based on which service initiated the distributed trace, not the service of the transaction or span.
 
-### Example configuration C [_example_configuration_c]
+### Example configuration 3 [_example_configuration_3]
 
-When you need to combine service‑specific policies with outcomes (e.g. failures), policy order defines specificity:
+Policies are evaluated **in order** and applies the first one whose match conditions are all met. That means, in practice, order policies from most specific (narrow matchers) to most general, ending with a catch-all (fallback).
 
 ```yaml
 # Example A: prioritize service origin, then failures
@@ -317,7 +317,7 @@ When you need to combine service‑specific policies with outcomes (e.g. failure
   service.name: A
 - sample_rate: 0.5
   trace.outcome: failure
-- sample_rate: 1.0 # Default
+- sample_rate: 1.0 # catch-all
 
 # Example B: prioritize failures, then a specific service
 - sample_rate: 0.2
@@ -330,10 +330,6 @@ When you need to combine service‑specific policies with outcomes (e.g. failure
 - In Example A, traces from Service A are sampled at 20%, and all other failed traces (regardless of service) are sampled at 50%.
 - In Example B, every failed trace is sampled at 20%, including those originating from Service A.
 
-Policies targeting the trace (e.g. `trace.outcome: failure`) apply across all services and should appear before more specific, service‑level rules if you want them to take precedence.
-
-> **Key point**: Define failure policy at the top to ensure capturing all failed traces, then define more specific policies for specific services to capture edge cases.
-
 ### Configuration reference [_configuration_reference]
 
 #### Top-level tail-based sampling settings [_top_level_tail_based_sampling_settings]
From 0292fcf701c480c791a396f6f7f5c79e64262b55 Mon Sep 17 00:00:00 2001
From: Ruben van Staden
Date: Tue, 29 Apr 2025 17:29:34 -0400
Subject: [PATCH 5/5] leverage admonition, decrease fallback sampling rate, and improve language

---
 .../observability/apm/transaction-sampling.md | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/solutions/observability/apm/transaction-sampling.md b/solutions/observability/apm/transaction-sampling.md
index b1468985d..428d1bc6e 100644
--- a/solutions/observability/apm/transaction-sampling.md
+++ b/solutions/observability/apm/transaction-sampling.md
@@ -299,17 +299,19 @@ When a trace originates in Service A and then calls Service B, the sampling rate
   service.name: B
 - sample_rate: 0.5
   service.name: A
-- sample_rate: 1.0 # Fallback: always set a default
+- sample_rate: 0.1 # Fallback: always set a default
 ```
 
 - Because Service A is the root of the trace, its policy (0.5) is applied while Service B's policy (0.3) is ignored.
 - If instead the trace began in Service B (and then passed to Service A), the policy for Service B would apply.
 
-> **Key point**: Tail‑based sampling rules are evaluated at the *trace level* based on which service initiated the distributed trace, not the service of the transaction or span.
+:::{note}
+Tail‑based sampling rules are evaluated at the *trace level* based on which service initiated the distributed trace, not the service of the transaction or span.
+:::
 
 ### Example configuration 3 [_example_configuration_3]
 
-Policies are evaluated **in order** and applies the first one whose match conditions are all met. That means, in practice, order policies from most specific (narrow matchers) to most general, ending with a catch-all (fallback).
+Policies are evaluated **in order** and the first one that meets all match conditions is applied. That means, in practice, order policies from most specific (narrow matchers) to most general, ending with a catch-all (fallback).
 
 ```yaml
 # Example A: prioritize service origin, then failures
@@ -317,14 +319,16 @@ Policies are evaluated **in order** and applies the first one whose match condit
   service.name: A
 - sample_rate: 0.5
   trace.outcome: failure
-- sample_rate: 1.0 # catch-all
+- sample_rate: 0.1 # catch-all
+```
 
+```yaml
 # Example B: prioritize failures, then a specific service
 - sample_rate: 0.2
   trace.outcome: failure
 - sample_rate: 0.5
-  service.name: alice
-- sample_rate: 1.0
+  service.name: A
+- sample_rate: 0.1
 ```
 
 - In Example A, traces from Service A are sampled at 20%, and all other failed traces (regardless of service) are sampled at 50%.
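Review note: the first-match rule described in the final revision of Example configuration 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not APM Server's implementation: `Policy`, `select_policy`, `EXAMPLE_A`, and `EXAMPLE_B` are hypothetical names, a trace is modelled as a plain dict, and `service.name` in that dict is assumed to be the service that started the trace, as the note in Example configuration 2 describes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policy:
    """Hypothetical stand-in for one entry of the policy list."""
    sample_rate: float
    match: dict = field(default_factory=dict)  # empty dict = catch-all


def select_policy(policies: list, trace: dict) -> Optional[Policy]:
    """Return the first policy whose match conditions are all met."""
    for policy in policies:
        if all(trace.get(key) == value for key, value in policy.match.items()):
            return policy
    return None  # only reachable when the list has no catch-all


# Example A from the final patch: service first, then failures, then catch-all.
EXAMPLE_A = [
    Policy(0.2, {"service.name": "A"}),
    Policy(0.5, {"trace.outcome": "failure"}),
    Policy(0.1),
]

# Example B from the final patch: failures first, then service, then catch-all.
EXAMPLE_B = [
    Policy(0.2, {"trace.outcome": "failure"}),
    Policy(0.5, {"service.name": "A"}),
    Policy(0.1),
]

if __name__ == "__main__":
    # "service.name" is assumed to be the root service of each trace.
    traces = [
        {"service.name": "A", "trace.outcome": "success"},
        {"service.name": "C", "trace.outcome": "failure"},
        {"service.name": "C", "trace.outcome": "success"},
    ]
    for trace in traces:
        rate_a = select_policy(EXAMPLE_A, trace).sample_rate
        rate_b = select_policy(EXAMPLE_B, trace).sample_rate
        print(trace, "Example A:", rate_a, "Example B:", rate_b)
```

In this sketch, swapping the first two policies changes the rate for a successful trace rooted in Service A and for a failed trace rooted elsewhere, which is the ordering effect the two bullet points under the YAML describe. A failed trace rooted in Service A gets 0.2 under both lists, though from a different policy in each.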