Add doc for diagnosing backpressure from Elasticsearch #4097

Merged
merged 9 commits into from
Aug 6, 2024
43 changes: 43 additions & 0 deletions docs/en/observability/apm/apm-performance-diagnostic.asciidoc
@@ -0,0 +1,43 @@
[[apm-performance-diagnostic]]
=== APM Server performance diagnostic

[[apm-es-backpressure]]
[float]
==== Diagnosing backpressure from {es}

When {es} is under excessive load or indexing pressure, APM Server can experience downstream backpressure when indexing new documents into {es}.
Most commonly, backpressure from {es} manifests as higher indexing latency and/or rejected requests, which in turn can lead APM Server to deny incoming requests.
As a result, APM agents connected to the affected APM Server will suffer from throttling and/or request timeouts when shipping APM events.

To quickly identify possible issues, look for error log lines similar to the following in the APM Server logs:

[source,json]
----
...
{"log.level":"error","@timestamp":"2024-07-27T23:46:28.529Z","log.origin":{"function":"github.com/elastic/go-docappender/v2.(*Appender).flush","file.name":"[email protected]/appender.go","file.line":370},"message":"bulk indexing request failed","service.name":"apm-server","error":{"message":"flush failed (429): [429 Too Many Requests]"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-07-27T23:55:38.612Z","log.origin":{"function":"github.com/elastic/go-docappender/v2.(*Appender).flush","file.name":"[email protected]/appender.go","file.line":370},"message":"bulk indexing request failed","service.name":"apm-server","error":{"message":"flush failed (503): [503 Service Unavailable]"},"ecs.version":"1.6.0"}
...
----
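
As a rough illustration of how such log lines could be tallied, the following Python sketch counts `bulk indexing request failed` entries by HTTP status code. It assumes the JSON log layout shown above (one JSON object per line, with the status code embedded in `error.message`); the function name `count_bulk_failures` and the abbreviated sample lines are hypothetical and not part of APM Server.

```python
import json
import re

# Matches the status code in messages such as
# "flush failed (429): [429 Too Many Requests]".
FLUSH_FAILED = re.compile(r"flush failed \((\d{3})\)")


def count_bulk_failures(lines):
    """Count "bulk indexing request failed" log entries by HTTP status code."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank or non-JSON lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed entries
        if entry.get("message") != "bulk indexing request failed":
            continue
        error = entry.get("error")
        text = error.get("message", "") if isinstance(error, dict) else ""
        match = FLUSH_FAILED.search(text)
        if match:
            status = match.group(1)
            counts[status] = counts.get(status, 0) + 1
    return counts


# Abbreviated sample lines (assumption: only the fields used here are kept).
sample = [
    '{"log.level":"error","message":"bulk indexing request failed",'
    '"error":{"message":"flush failed (429): [429 Too Many Requests]"}}',
    '{"log.level":"error","message":"bulk indexing request failed",'
    '"error":{"message":"flush failed (503): [503 Service Unavailable]"}}',
    '{"log.level":"info","message":"unrelated entry"}',
]
print(count_bulk_failures(sample))
```

A steadily growing count of 429 or 503 responses over a short time window is the pattern the log lines above point to.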

To gain better insight into APM Server health and performance, consider enabling the monitoring feature by following the steps in <<apm-monitor-apm-self-install,Monitor a Fleet-managed APM Server>>.
When enabled, APM Server additionally reports a set of vital metrics to help you identify any performance degradation.

Pay careful attention to the following metric fields:

* `beats_stats.metrics.libbeat.output.events.active` that represents the number of buffered pending documents waiting for indexing;
(_if this value is increasing rapidly it indicates {es} backpressure_)
* `beats_stats.metrics.libbeat.output.events.acked` that represents the number of indexing operations that have completed successfully;
* `beats_stats.metrics.libbeat.output.events.failed` that represents the number of indexing operations that failed, including all types of failures;
(_if this value is increasing rapidly it indicates {es} backpressure_)
* `beats_stats.metrics.libbeat.output.events.toomany` that represents the number of indexing operations that failed due to {es} responding with 429 Too Many Requests;
(_if this value is increasing rapidly it indicates {es} backpressure_)
* `beats_stats.output.elasticsearch.bulk_requests.available` that represents the number of bulk indexers available for making bulk index requests;
(_if this value is equal to 0 it indicates {es} backpressure_)
* `beats_stats.output.elasticsearch.bulk_requests.completed` that represents the number of already completed bulk requests;
* `beats_stats.metrics.output.elasticsearch.indexers.active` that represents the number of active bulk indexers that are concurrently processing batches.
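
The checks in the list above can be sketched as a comparison of two metric snapshots. This is a minimal Python sketch under stated assumptions: each snapshot is a flat dictionary keyed by the field names listed above, the helper `backpressure_signals` is hypothetical, and the `growth_threshold` value standing in for "increasing rapidly" is an arbitrary illustration rather than a recommended limit.

```python
# Field names are copied from the list above.
ACTIVE = "beats_stats.metrics.libbeat.output.events.active"
FAILED = "beats_stats.metrics.libbeat.output.events.failed"
TOOMANY = "beats_stats.metrics.libbeat.output.events.toomany"
AVAILABLE = "beats_stats.output.elasticsearch.bulk_requests.available"


def backpressure_signals(previous, current, growth_threshold=1000):
    """Compare two metric snapshots and list possible backpressure signals.

    growth_threshold is an assumed cutoff for "increasing rapidly";
    tune it to the sampling interval and the expected throughput.
    """
    signals = []
    for field in (ACTIVE, FAILED, TOOMANY):
        delta = current.get(field, 0) - previous.get(field, 0)
        if delta >= growth_threshold:
            signals.append(f"{field} grew by {delta}")
    # No bulk indexers available for making bulk index requests.
    if current.get(AVAILABLE) == 0:
        signals.append(f"{AVAILABLE} is 0")
    return signals


# Hypothetical snapshots taken one sampling interval apart.
previous = {ACTIVE: 100, FAILED: 0, TOOMANY: 0, AVAILABLE: 4}
current = {ACTIVE: 5000, FAILED: 10, TOOMANY: 1200, AVAILABLE: 0}
for signal in backpressure_signals(previous, current):
    print(signal)
```

How the snapshots are collected (for example, by querying the monitoring data) is left out on purpose, since it depends on the monitoring setup.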

See https://www.elastic.co/guide/en/beats/metricbeat/current/exported-fields-beat.html[{metricbeat} documentation] for the full list of exported metric fields.

One likely cause of excessive indexing pressure or rejected requests is an undersized {es} cluster. To mitigate this, follow the guidance in {ref}/rejected-requests.html[Rejected requests].
If scaling {es} resources up is not an option, you can try to work around the issue by adjusting the `flush_bytes`, `flush_interval`, `max_retries`, and `timeout` settings described in <<apm-elasticsearch-output,Configure the Elasticsearch output>> to reduce APM Server indexing pressure.
However, keep in mind that increasing the number of buffered documents and/or reducing retries may lead to a higher rate of dropped APM events.
5 changes: 4 additions & 1 deletion docs/en/observability/apm/troubleshoot-apm.asciidoc
@@ -9,6 +9,7 @@ and processing and performance guidance.
* <<apm-common-response-codes>>
* <<apm-processing-and-performance>>
* <<apm-enable-apm-server-debugging>>
* <<apm-performance-diagnostic>>

For additional help with other APM components, see the links below.

@@ -54,4 +55,6 @@ include::apm-response-codes.asciidoc[]

include::processing-performance.asciidoc[]

include::{observability-docs-root}/docs/en/observability/apm/debugging.asciidoc[]
include::{observability-docs-root}/docs/en/observability/apm/debugging.asciidoc[]

include::apm-performance-diagnostic.asciidoc[]