diff --git a/docs/en/observability/apm/images/dt-sampling-continuation-strategy-restart.png b/docs/en/observability/apm/images/dt-sampling-continuation-strategy-restart.png
new file mode 100644
index 0000000000..e68e43caf7
Binary files /dev/null and b/docs/en/observability/apm/images/dt-sampling-continuation-strategy-restart.png differ
diff --git a/docs/en/observability/apm/images/dt-sampling-continuation-strategy-restart_external.png b/docs/en/observability/apm/images/dt-sampling-continuation-strategy-restart_external.png
new file mode 100644
index 0000000000..68b19dfc1b
Binary files /dev/null and b/docs/en/observability/apm/images/dt-sampling-continuation-strategy-restart_external.png differ
diff --git a/docs/en/observability/apm/images/dt-sampling-example-1.png b/docs/en/observability/apm/images/dt-sampling-example-1.png
index a3def0c7bf..07ff3fc0d5 100644
Binary files a/docs/en/observability/apm/images/dt-sampling-example-1.png and b/docs/en/observability/apm/images/dt-sampling-example-1.png differ
diff --git a/docs/en/observability/apm/images/dt-sampling-example-2.png b/docs/en/observability/apm/images/dt-sampling-example-2.png
index d7f87bcd89..b6fd4d3419 100644
Binary files a/docs/en/observability/apm/images/dt-sampling-example-2.png and b/docs/en/observability/apm/images/dt-sampling-example-2.png differ
diff --git a/docs/en/observability/apm/images/dt-sampling-example-3.png b/docs/en/observability/apm/images/dt-sampling-example-3.png
index a0045705a0..bd795b3783 100644
Binary files a/docs/en/observability/apm/images/dt-sampling-example-3.png and b/docs/en/observability/apm/images/dt-sampling-example-3.png differ
diff --git a/docs/en/observability/apm/sampling.asciidoc b/docs/en/observability/apm/sampling.asciidoc
index 0f43627743..33172f4dc9 100644
--- a/docs/en/observability/apm/sampling.asciidoc
+++ b/docs/en/observability/apm/sampling.asciidoc
@@ -29,23 +29,69 @@ data might be discarded purely due to chance.

 See <> to get started.
-**Distributed tracing with head-based sampling**
+[float]
+[[distributed-tracing-examples]]
+===== Distributed tracing

 In a distributed trace, the sampling decision is still made when the trace is initiated.
 Each subsequent service respects the initial service's sampling decision, regardless of its configured sample rate;
 the result is a sampling percentage that matches the initiating service.

-In this example, `Service A` initiates four transactions and has sample rate of `.5` (`50%`).
-The sample rates of `Service B` and `Service C` are ignored.
+In the example in _Figure 1_, `Service A` initiates four transactions and has a sample rate of `.5` (`50%`).
+The upstream sampling decision is respected, so even if the sample rate is defined and is a different
+value in `Service B` and `Service C`, the sample rate will be `.5` (`50%`) for all services.

+.Upstream sampling decision is respected
 image::./images/dt-sampling-example-1.png[Distributed tracing and head based sampling example one]

-In this example, `Service A` initiates four transactions and has a sample rate of `1` (`100%`).
-Again, the sample rates of `Service B` and `Service C` are ignored.
+In the example in _Figure 2_, `Service A` initiates four transactions and has a sample rate of `1` (`100%`).
+Again, the upstream sampling decision is respected, so the sample rate for all services will
+be `1` (`100%`).

+.Upstream sampling decision is respected
 image::./images/dt-sampling-example-2.png[Distributed tracing and head based sampling example two]

-**OpenTelemetry with head-based sampling**
+[float]
+===== Trace continuation strategies with distributed tracing
+
+In addition to setting the sample rate, you can also specify which _trace continuation strategy_ to use.
+There are three trace continuation strategies: `continue`, `restart`, and `restart_external`.
+
+The *`continue`* trace continuation strategy is the default and will behave similarly to the examples in
+the <>.
+
+Use the *`restart_external`* trace continuation strategy on an Elastic-monitored service to start
+a new trace if the previous service did not have a `traceparent` header with `es` vendor data.
+This can be helpful if a transaction includes an Elastic-monitored service that is receiving requests
+from an unmonitored service.
+
+In the example in _Figure 3_, `Service A` is an Elastic-monitored service that initiates four transactions
+with a sample rate of `.25` (`25%`). Because `Service B` is unmonitored, the traces started in
+`Service A` will end there. `Service C` is an Elastic-monitored service that initiates four transactions
+that start new traces with a new sample rate of `.5` (`50%`). Because `Service D` is also an
+Elastic-monitored service, the upstream sampling decision defined in `Service C` is respected.
+The end result will be three sampled traces.
+
+.Using the `restart_external` trace continuation strategy
+image::./images/dt-sampling-continuation-strategy-restart_external.png[Distributed tracing and head based sampling with restart_external continuation strategy]
+
+Use the *`restart`* trace continuation strategy on an Elastic-monitored service to start
+a new trace regardless of whether the previous service had a `traceparent` header.
+This can be helpful if an Elastic-monitored service is publicly exposed, and you do not
+want tracing data to possibly be spoofed by user requests.
+
+In the example in _Figure 4_, `Service A` and `Service B` are Elastic-monitored services that use the
+default trace continuation strategy. `Service A` has a sample rate of `.25` (`25%`), and that
+sampling decision is respected in `Service B`. `Service C` is an Elastic-monitored service that
+uses the `restart` trace continuation strategy and has a sample rate of `1` (`100%`).
+Because it uses `restart`, the upstream sample rate is _not_ respected in `Service C` and all four
+traces will be sampled as new traces in `Service C`. The end result will be five sampled traces.
+
+.Using the `restart` trace continuation strategy
+image::./images/dt-sampling-continuation-strategy-restart.png[Distributed tracing and head based sampling with restart continuation strategy]
+
+[float]
+===== OpenTelemetry

 Head-based sampling is implemented directly in the APM agents and SDKs.
 The sample rate must be propagated between services and the managed intake service in order to produce accurate metrics.
@@ -54,13 +100,16 @@ OpenTelemetry offers multiple samplers. However, most samplers do not propagate
 This results in inaccurate span-based metrics, like APM throughput, latency, and error metrics.

 For accurate span-based metrics when using head-based sampling with OpenTelemetry, you must use
-a [consistent probability sampler](https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling/).
+a https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling/[consistent probability sampler].
 These samplers propagate the sample rate between services and the managed intake service, resulting in accurate metrics.

-NOTE: OpenTelemetry does not offer consistent probability samplers in all languages.
+[NOTE]
+====
+OpenTelemetry does not offer consistent probability samplers in all languages.
 OpenTelemetry users should consider using tail-based sampling instead.
-+
+
 Refer to the documentation of your favorite OpenTelemetry agent or SDK for more information on the availability of consistent probability samplers.
+====

 [float]
 [[apm-tail-based-sampling]]
@@ -99,7 +148,7 @@ and will work with traces sent by either Elastic APM agents or OpenTelemetry SDK
 Due to <> when using https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor[tailsamplingprocessor], we recommend using APM Server tail-based sampling instead.
 [float]
-=== Sampled data and visualizations
+==== Sampled data and visualizations

 A sampled trace retains all data associated with it.
 A non-sampled trace drops all <> and <> data^1^.
@@ -125,7 +174,7 @@ The {kib} apps that utilize RUM data depend on transaction events,
 so non-sampled RUM traces retain transaction data -- only span data is dropped.

 [float]
-=== Sample rates
+==== Sample rates

 What's the best sampling rate? Unfortunately, there isn't one.
 Sampling is dependent on your data, the throughput of your application, data retention policies, and other factors.