profiling: introduce self-managed beta for 8.12 #3379

Merged: 11 commits, Jan 24, 2024
4 changes: 4 additions & 0 deletions docs/en/observability/index.asciidoc
@@ -166,6 +166,10 @@ include::profiling-upgrade.asciidoc[leveloffset=+2]

include::profiling-troubleshooting.asciidoc[leveloffset=+2]

include::profiling-self-managed.asciidoc[leveloffset=+2]
include::profiling-self-managed-ops.asciidoc[leveloffset=+3]
include::profiling-self-managed-troubleshooting.asciidoc[leveloffset=+3]

// Alerting
include::create-alerts.asciidoc[leveloffset=+1]

228 changes: 228 additions & 0 deletions docs/en/observability/profiling-self-managed-ops.asciidoc
@@ -0,0 +1,228 @@
[[profiling-self-managed-ops]]
= Operate the Universal Profiling backend

++++
<titleabbrev>Operate the backend</titleabbrev>
++++

This page outlines operating the backend when running Universal Profiling on a self-managed version of the {stack}. Here you'll find information on:

* <<profiling-self-managed-ops-sizing-guidance, Resource sizing>>
* <<profiling-self-managed-ops-configuration, Configuring your collector and symbolizer>>
* <<profiling-self-managed-ops-monitoring, Monitoring your collector and symbolizer>>
* <<profiling-scaling-backend-resources, Scaling your resources>>
* <<profiling-upgrade-backend-bin, Upgrading backend binaries>>

[discrete]
[[profiling-self-managed-ops-sizing-guidance]]
== Resource guide

The resources needed to ingest and query Universal Profiling data vary based on the total number of CPU cores you're profiling.
The number of cores comes from the sum of all _virtual_ cores as recorded in `/proc/cpuinfo`, adding up all the machines you'll deploy the host-agent to.

Ingestion and query resource demand is almost directly proportional to the amount of data the host-agents generate.
Calculate the data generated by the host-agents using the number of CPU samples collected, the number of executables processed, and the executables' debug metadata size. While the number of CPU samples collected is predictable, the number of executables processed and the size of their debug metadata are not.

The following table provides recommended resources for ingesting and querying Universal Profiling data based on your number of CPU cores:

|====
| # of CPU cores | Elasticsearch total memory | Elasticsearch total storage (60 days retention) | Profiling Backend | Kibana memory

| 1–100 | 4GB–8GB | 250GB | 1 Collector 2GB, 1 Symbolizer 2GB | 2GB
| 100–1000 | 8GB–32GB | 250GB–2TB | 1 Collector 4GB, 1 Symbolizer 4GB | 2GB
| 1000–10,000 | 32GB–128GB | 2TB–8TB | 2 Collector 4GB, 1 Symbolizer 8GB | 4GB
| 10,000–50,000 | 128GB–512GB | 8TB–16TB | 3+ Collector 4GB, 1 Symbolizer 8GB | 8GB
|====

NOTE: This table is derived from benchmarks performed on Universal Profiling with ingestion of up to 15,000 CPU cores.
The profiled machines had a near-constant load of 75% CPU utilization.
The deployment used 3 Elasticsearch nodes with 64 GB memory each, 8 vCPU, and 1.5 TB NVMe disk drives.

Because resource demand is nearly proportional to the amount of data the host-agents generate, you can calculate the necessary resources for use cases beyond those in the table by comparing your actual number of cores profiled with the number of cores in the table.
When calculating, factor in the following:

* The average load of the machines being profiled: The average load directly impacts the amount of CPU samples collected. For example, on a system that is mostly idle, not all CPUs will be scheduling tasks during the sampling intervals.
* The rate of change of the executables being profiled—for example, how often you deploy new versions of your software: The rate of change impacts the amount of debug metadata stored in Elasticsearch as a result of symbolization; the more different executables the host-agent collects, the more debug data will be stored in Elasticsearch. Note that two different builds of the same application still result in two different executables, as the host-agent will treat each ELF file independently.
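
As a rough illustration of sizing between the values in the table, the following sketch linearly interpolates the Elasticsearch storage figure within the matching table row. The function name, the linear interpolation, and the conversion of 1TB to 1000GB are assumptions made for this example. The sketch does not account for average load or for the rate of change of executables, so treat the result as a starting point only.

[source,python]
----
# Rows copied from the sizing table above:
# (min cores, max cores, min storage in GB, max storage in GB).
# Assumption for this sketch: 1TB is treated as 1000GB.
STORAGE_TIERS = [
    (1, 100, 250, 250),
    (100, 1_000, 250, 2_000),
    (1_000, 10_000, 2_000, 8_000),
    (10_000, 50_000, 8_000, 16_000),
]


def estimate_storage_gb(cores: int) -> float:
    """Linearly interpolate storage (60 days retention) within a table row."""
    for min_cores, max_cores, min_gb, max_gb in STORAGE_TIERS:
        if min_cores <= cores <= max_cores:
            fraction = (cores - min_cores) / (max_cores - min_cores)
            return min_gb + fraction * (max_gb - min_gb)
    raise ValueError(f"{cores} cores is outside the range covered by the table")


if __name__ == "__main__":
    # 5,500 cores sits halfway through the 1000-10,000 row.
    print(estimate_storage_gb(5_500))
----

After computing an estimate, adjust it by hand for the two factors listed above: lower average load reduces the number of samples, and frequent deployments increase the amount of debug metadata.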

Storage considerations: the Elasticsearch disks' bandwidth and latency will affect the latency of ingesting and querying the profiling data.
Allocate data to hot nodes for best performance and user experience.
If storage becomes a concern, tune the data retention by customizing the Universal Profiling <<profiling-ilm-custom-policy, index lifecycle management policy>>.

[discrete]
[[profiling-self-managed-ops-configuration]]
== Configure the collector and symbolizer

You can configure the collector and symbolizer using the YAML file and CLI flags, with the CLI flags taking precedence over the YAML file.
The configuration files are created during the installation process, as seen in <<profiling-self-managed-running-linux-configfile, Create configuration files section>>.
Comments in the configuration files explain the purpose of each configuration option.

Restart the backend binaries after modifying the configuration files for changes to take effect.

[discrete]
[[profiling-self-managed-ops-configuration-cli-overrides]]
=== Use CLI flags to override configuration file values

When building configuration options for each of the backend binaries, you can use CLI flags to override the values in the YAML configuration file.
The overrides **must** contain the full path to the configuration option and must be in a key=value format. For example, `-E application.field.key=value`, where `application` is the name of the binary.

For example, to enable TLS in the HTTP server of the collector, you can pass the `-E pf-elastic-collector.ssl.enabled=true` flag.
This will override the `ssl.enabled` option found in the YAML configuration file.
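
If you generate the command line from automation, a small helper can assemble the overrides in the required format. The following is a minimal sketch: the helper name `build_overrides` is hypothetical, and it only builds the `-E` arguments described above, not the rest of the command line.

[source,python]
----
def build_overrides(binary: str, options: dict[str, object]) -> list[str]:
    """Build `-E <binary>.<full.path>=<value>` arguments from a mapping.

    `binary` is the application name used as the key prefix, for example
    `pf-elastic-collector`. Keys in `options` must be the full path to the
    configuration option without that prefix.
    """
    args: list[str] = []
    for key, value in options.items():
        # YAML booleans are lowercase, so render Python booleans the same way.
        if isinstance(value, bool):
            value = str(value).lower()
        args += ["-E", f"{binary}.{key}={value}"]
    return args


if __name__ == "__main__":
    print(build_overrides("pf-elastic-collector", {"ssl.enabled": True}))
----

The returned list can be appended to the arguments you already pass to the binary, for example through `subprocess.run`.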

[discrete]
[[profiling-self-managed-ops-monitoring]]
== Monitoring

Monitor the collector and symbolizer through <<profiling-self-managed-ops-monitoring-logs>> and <<profiling-self-managed-ops-monitoring-metrics>> to ensure the services are running and healthy.
Without both services running, profiling data will not be ingested and symbolized,
and querying Kibana won't return data.

[discrete]
[[profiling-self-managed-ops-monitoring-logs]]
=== Logs

The collector and symbolizer always log to standard output.
You can turn on debug logs by setting the `verbose` configuration option to `true` in the YAML configuration file.

Avoid using debug logs in production, as they can be very verbose and impact backend performance.
Only enable debug logs when troubleshooting a failed deployment or when instructed to do so by support.

Logs are formatted as "key=value" pairs, and {es} and {kib} can automatically parse them into fields.
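
For quick inspection outside of {es}, the same "key=value" structure can be split into fields with a few lines of code. The following is a minimal sketch that assumes values containing spaces are wrapped in double quotes. The sample log line and the `parse_log_line` name are made up for this example and may not match the exact output of the backend.

[source,python]
----
import shlex


def parse_log_line(line: str) -> dict[str, str]:
    """Split a key=value log line into a dictionary of fields.

    shlex handles quoted values such as msg="some text". A line with an
    unbalanced quote raises ValueError, so callers may want to catch that.
    """
    fields: dict[str, str] = {}
    for token in shlex.split(line):
        key, separator, value = token.partition("=")
        if separator:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


if __name__ == "__main__":
    # Hypothetical log line, used only to show the parsing.
    sample = 'time="2024-01-24T10:00:00Z" log.level=error msg="failed to connect"'
    print(parse_log_line(sample))
----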

A log collector, such as Filebeat, can collect and send logs to {es} for indexing and analysis.
Depending on how it's installed, a Filebeat input of type `journald` (for OS packages), `log` (for binaries), or `container` can be used to process the logs.
Refer to the {filebeat-ref}/configuring-howto-filebeat.html[Filebeat documentation] for more information.

[discrete]
[[profiling-self-managed-ops-monitoring-metrics]]
=== Metrics

Metrics are not exposed by default. Enable metrics in the `metrics` section in the YAML configuration files.
The collector and symbolizer can expose metrics in both JSON and Prometheus formats.

Metrics in JSON format can be exposed through an HTTP server or a Unix domain socket.
Prometheus metrics can only be exposed through an HTTP server.
Customize where the metrics are exposed using the `metrics.prometheus_host` and `metrics.expvar_host` configuration options.

You can use Metricbeat to scrape metrics.
Consume the JSON directly through the `http` module.
Consume the Prometheus endpoint using the `prometheus` module.
When using an HTTP server for either format, the URI to scrape metrics from is `/metrics`.

For example, the following collector configuration would expose metrics in Prometheus format on port 9090 and in JSON format on port 9191.
You can then scrape them by connecting to `http://127.0.0.1:9090/metrics` and `http://127.0.0.1:9191/metrics` respectively.

[source,yaml]
----
pf-elastic-collector:
  metrics:
    prometheus_host: ":9090"
    expvar_host: ":9191"
----
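
To spot-check the Prometheus endpoint without Metricbeat, you can fetch `/metrics` and read a few values directly. The following is a minimal sketch, assuming the collector exposes Prometheus metrics at `http://127.0.0.1:9090/metrics` as configured above. The function names are hypothetical, and the parser only covers the common `name{labels} value` line shape of the Prometheus text format.

[source,python]
----
from urllib.request import urlopen


def parse_prometheus_text(text: str) -> dict[str, float]:
    """Map each series (name plus optional labels) to its sample value."""
    samples: dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:]
        else:
            series, _, rest = line.partition(" ")
        # The first token after the series is the value; an optional
        # timestamp may follow and is ignored here.
        samples[series] = float(rest.split()[0])
    return samples


def scrape(url: str = "http://127.0.0.1:9090/metrics", timeout: float = 5.0) -> dict[str, float]:
    """Fetch the metrics endpoint and parse the response body."""
    with urlopen(url, timeout=timeout) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))


if __name__ == "__main__":
    metrics = scrape()
    print(metrics.get("go_goroutines"))
----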

Optionally, you can also expose the `expvar` format over a Unix domain socket, by setting the `expvar_socket` configuration option to a valid path.
For example, the following collector configuration would expose metrics in Prometheus format on port 9090 and in JSON format over a Unix domain socket at `/tmp/collector.sock`.

[source,yaml]
----
pf-elastic-collector:
  metrics:
    prometheus_host: ":9090"
    expvar_socket: "/tmp/collector.sock"
----

The following sections show the most relevant metrics exposed by the backend binaries.
Include these metrics in your monitoring dashboards to detect backend issues.

[profiling-backend-common-runtime-metrics]
*Common runtime metrics*

* `process_cpu_seconds_total`: track the amount of CPU time used by the process.
* `process_resident_memory_bytes`: track the amount of RAM used by the process.
* `go_memstats_heap_sys_bytes`: track the amount of heap memory.
* `go_memstats_stack_sys_bytes`: track the amount of stack memory.
* `go_threads`: number of OS threads created by the runtime.
* `go_goroutines`: number of active goroutines.

[profiling-backend-collector-metrics]
*Collector metrics*

* `collection_agent.indexing.bulk_indexer_failure_count`: number of times the bulk indexer failed to ingest data in Elasticsearch.
* `collection_agent.indexing.document_count.*`: counter that represents the number of documents ingested in Elasticsearch for each index; can be used to calculate the rate of ingestion for each index.
* `grpc_server_handling_seconds`: histogram of the time spent by the gRPC server to handle requests.
* `grpc_server_msg_received_total`: count of messages received by the gRPC server; can be used to calculate the rate of ingestion for each RPC.
* `grpc_server_handled_total`: count of messages processed by the gRPC server; can be used to calculate the availability of the gRPC server for each RPC.
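
The counters above only become rates or availability figures once you compare two scrapes. The following sketch shows the arithmetic. Both function names are hypothetical, and splitting `grpc_server_handled_total` into successful and total requests assumes the metric carries a status label such as `grpc_code`, which you should confirm against your own metrics output.

[source,python]
----
def counter_rate(previous: float, current: float, interval_seconds: float) -> float:
    """Per-second rate between two samples of a monotonically increasing counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current - previous
    if delta < 0:
        # The counter went backwards, most likely because the process
        # restarted. Count only what accumulated since the reset.
        delta = current
    return delta / interval_seconds


def availability(handled_ok: float, handled_total: float) -> float:
    """Fraction of handled requests that completed successfully."""
    if handled_total == 0:
        return 1.0  # no traffic in the window, so nothing failed
    return handled_ok / handled_total


if __name__ == "__main__":
    # Two scrapes of a document counter taken 60 seconds apart.
    print(counter_rate(previous=1_000, current=1_600, interval_seconds=60))
    print(availability(handled_ok=990, handled_total=1_000))
----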

[profiling-backend-symbolizer-metrics]
*Symbolizer metrics*

* `symbols_app.indexing.bulk_indexer_failure_count`: number of times the bulk indexer failed to ingest data in Elasticsearch.
* `symbols_app.indexing.document_count.*`: counter that represents the number of documents ingested in Elasticsearch for each index; can be used to calculate the rate of ingestion for each index.
* `symbols_app.user_client.document_count.update.*`: counter that represents the number of existing documents that were updated in Elasticsearch for each index; when the rate increases, it can impact Elasticsearch performance.

[profiling-backend-health checks]
*Health checks*

The backend binaries expose two health check endpoints, `/live` and `/ready`, that you can use to monitor the health of the application.
The endpoints return a `200 OK` HTTP status code when the checks are successful.

The health check endpoints are hosted in the same HTTP server that accepts the incoming profiling data.
The address of this server is configured through the application's `host` configuration option.

For example, if the collector is configured with the default value `host: 0.0.0.0:8260`, you can check the health of the application by running `curl -i localhost:8260/live` and `curl -i localhost:8260/ready`.
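
The same checks can be scripted, for example as part of a deployment pipeline. The following is a minimal sketch that assumes the endpoints are served over plain HTTP at the address you pass in. The function name `check_health` is hypothetical.

[source,python]
----
from urllib.error import URLError
from urllib.request import urlopen


def check_health(base_url: str, timeout: float = 2.0) -> dict[str, bool]:
    """Return whether /live and /ready each answer with HTTP 200."""
    results: dict[str, bool] = {}
    for path in ("/live", "/ready"):
        try:
            with urlopen(base_url + path, timeout=timeout) as response:
                results[path] = response.status == 200
        except (URLError, OSError):
            # Connection errors, timeouts, and non-2xx responses
            # all count as an unhealthy check.
            results[path] = False
    return results


if __name__ == "__main__":
    print(check_health("http://localhost:8260"))
----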

[discrete]
[[profiling-scaling-backend-resources]]
== Scale resources

In the <<profiling-self-managed-ops-sizing-guidance, resource guidance table>>, no options use more than one replica for the symbolizer.
We do not recommend scaling the number of symbolizer replicas because of the technical limitations of the current implementation.
We recommend scaling the symbolizer vertically, by increasing the memory and CPU cores it uses to process data.

You can increase the number of collector replicas at will, keeping their vertical sizing smaller, if this is more convenient for your deployment use case.
The collector has a linear increase in memory usage and CPU threads with the number of host-agents that it serves.
Keep in mind that since the host-agent/collector communication happens via gRPC, there may be long-lived TCP sessions that are bound to a single collector replica.
When scaling out the number of replicas, depending on the load balancer that you have in place fronting the collector's endpoint, you may want to shut down the older replicas after adding new replicas.
This ensures that the load is evenly distributed across all replicas.

[discrete]
[[profiling-upgrade-backend-bin]]
== Upgrade the backend binaries

Upgrade the backend binaries whenever you upgrade the rest of the Elastic stack.
While we try to keep backward compatibility between two consecutive minor versions, we may introduce changes to the data format that require the applications to be at the same version as Elasticsearch and Kibana.

The upgrade process steps vary depending on the installation method used.

[discrete]
[[profiling-backend-upgrade-ece]]
=== ECE

When using ECE, the upgrade process is managed by the platform itself.
You don't need to perform any action to upgrade the backend binaries.

[discrete]
[[profiling-backend-upgrade-k8s]]
=== Kubernetes

Upgrade the Helm release using the `helm upgrade` command.
You may reuse existing values or provide the full values YAML file on each upgrade.

[discrete]
[[profiling-backend-upgrade-os]]
=== OS packages

Upgrade the package version using the OS package manager.
Not all package managers will call into `systemd` to restart the service, so you may need to restart the service manually or through any other automation in place.

[discrete]
[[profiling-backend-upgrade-binaries]]
=== Binaries

Download the corresponding binary version using the command seen in the <<profiling-self-managed-running-linux-binary, Binary>> section of the setup guide.
Replace the old binary with the new one and restart the services.

[discrete]
[[profiling-backend-upgrade-containers]]
=== Containers

Pull the new container image, and replace the existing image with the new image.
@@ -0,0 +1,92 @@
[[profiling-self-managed-troubleshooting]]
= Troubleshoot the Universal Profiling backend

++++
<titleabbrev>Troubleshoot the backend</titleabbrev>
++++

Refer to the following sections to troubleshoot any issues you encounter when setting up or operating the Universal Profiling backend services.

[discrete]
== Application behavior

[discrete]
=== Missing stack traces in the UI

If there is a sudden drop in the number of stack traces in the UI, or if the UI is displaying none at all,
the collector service may be having trouble and thus data ingestion is impacted.

The status of the collector can be inferred from the health checks and the metrics exposed.

The most notable causes of an impaired collector are:

* The collector cannot connect to the Elasticsearch cluster. The connection or authentication details may be wrong.
* The collector is not starting up properly. It may be crashing on startup, or it may be stuck in a loop (when deployed through an orchestration system). Check the logs for any errors.

[discrete]
=== Missing symbols

One of the most useful features of Universal Profiling is the ability to display the source code file and line number
of the stack trace frames.
This is only possible if the symbols are processed correctly in the backend services.

When the symbols are missing, the UI will display the stack trace frames of native applications as hexadecimal addresses, in the form `0x1234abcd`.
If this is happening for most of the native frames, including public OS package files, this is a sign that the debug symbols are not being processed correctly.

It is possible to verify that the symbolizer is working correctly by using the health check endpoint and the metrics exposed.

The most notable causes of an impaired symbolizer are:

. The symbolizer cannot connect to the debug symbol endpoint. This is an internet-exposed endpoint, so a firewall may be blocking it.
. The symbolizer is not starting up properly. It may be crashing on startup, possibly due to misconfiguration, or it may be stuck in a loop (when deployed through an orchestration system). Check the logs for any errors.

[discrete]
== General troubleshooting

[discrete]
=== Capacity planning

When deploying Universal Profiling host-agents on a new set of machines, it is possible that the backend services will
not be able to handle the load. This is especially true if the number of host-agents is large, or if the host-agents are
deployed on machines with a large number of cores.

The traffic pattern of the host-agents is prone to bursts on startup, and the backend services may not be able to handle the burst of traffic
coming from a large number of host-agents at the same time.

Even if the capacity of the backend services was planned based on the number of host-agents as suggested in <<profiling-self-managed-ops-sizing-guidance, Sizing guidance>>,
we recommend deploying host-agents in batches. For example, deploy 20% of the fleet at a time.
After deploying a batch of host-agents, pause for at least 30 to 60 seconds before deploying the next batch to allow the backend services to stabilize.
When the host-agent starts fresh on a new machine, it scans all the existing processes
and sends the executable's metadata to the backend services. This can cause a burst of traffic that can overwhelm the backend services.
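
The batching recommendation above can be sketched as follows. The `deploy` callback stands in for whatever mechanism you use to install the host-agent on a machine, and all names here are hypothetical. The sketch only shows how to split a fleet into 20% batches and pause between them.

[source,python]
----
import math
import time
from typing import Callable, Iterator, Sequence


def batches(hosts: Sequence[str], batch_percent: int = 20) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `hosts`, each about `batch_percent` of the fleet."""
    if not 0 < batch_percent <= 100:
        raise ValueError("batch_percent must be between 1 and 100")
    size = max(1, math.ceil(len(hosts) * batch_percent / 100))
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    pause_seconds: float = 60.0,
    batch_percent: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Deploy batch by batch, pausing so the backend can stabilize."""
    pending = list(batches(hosts, batch_percent))
    for index, batch in enumerate(pending):
        for host in batch:
            deploy(host)
        if index < len(pending) - 1:  # no pause needed after the last batch
            sleep(pause_seconds)


if __name__ == "__main__":
    fleet = [f"host-{number}" for number in range(10)]
    rollout(fleet, deploy=lambda host: print("deploying to", host), pause_seconds=1)
----

Passing `sleep` as a parameter keeps the pause easy to replace in tests or dry runs.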

[discrete]
=== Inspecting the metrics

Once metrics are exposed by the backend services, it is possible to inspect them to understand the behavior of the services.
Refer to <<profiling-self-managed-ops-monitoring-metrics, Metrics>> for instructions on how to expose metrics.

We don't yet provide pre-built Kibana dashboards to monitor the services, but we have compiled a list of the most useful metrics to monitor.
A prominent peak in goroutines or memory usage is a sign that the service is under stress and may be having trouble.
If you have access to Linux kernel telemetry for the hosts running the backend services, the most important metrics to monitor are CPU throttling and network usage.
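
One simple way to turn "a prominent peak" into an automated check is to compare the latest sample against a baseline of recent samples. The following sketch uses the median of recent values as the baseline and a threshold factor of 3. Both choices are arbitrary assumptions for illustration, not recommended values.

[source,python]
----
from statistics import median
from typing import Sequence


def is_spike(history: Sequence[float], latest: float, factor: float = 3.0) -> bool:
    """Flag `latest` when it exceeds `factor` times the median of `history`.

    `history` holds earlier samples of one metric, for example
    `go_goroutines` or `process_resident_memory_bytes`.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return latest > factor * median(history)


if __name__ == "__main__":
    recent_goroutines = [100, 110, 105, 98]
    print(is_spike(recent_goroutines, latest=450))
    print(is_spike(recent_goroutines, latest=120))
----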

[discrete]
=== Reading debug logs

The backend services can be configured to log at debug level, which can be useful for troubleshooting issues.
To do so, there's a `verbose` config entry in each YAML configuration file, which can be set to `true` to enable debug logging.
The same configuration option can be set through the CLI flags, as detailed in <<profiling-self-managed-ops-configuration-cli-overrides, Use CLI flags to override configuration file values>>.

When running the backend services in verbose mode, the logs will be helpful to troubleshoot issues.

IMPORTANT: Debug logs create an output that is unsuited for long-running production deployments.
The verbose mode should only be enabled on a single replica at a time, and only for a short period of time,
as it reduces performance and increases the CPU usage of the service.

When verbose mode is enabled, there will be fine-grained information logged about the operations of the service.
In the case of collectors, most of the log entries will come from the component responsible for ingesting data in Elasticsearch.
For symbolizers, most of the logs will be related to the processing of native frames, initially detected by the collector.

If you are troubleshooting startup issues for either service, logs are the most useful source of information.
On startup, each service will log if it is able to parse configurations and to start serving the incoming requests.
Errors will be logged using the `log.level=error` field: they can be used to spot misconfigurations or other issues that prevent the service from starting up.
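
When scanning a large log capture, it can help to keep only the error entries. The following is a minimal sketch that matches the `log.level=error` field as a standalone token. The sample lines and the `error_lines` name are made up for this example.

[source,python]
----
import re
from typing import Iterable

# Match log.level=error as its own field, with or without quotes around the value.
ERROR_FIELD = re.compile(r'(?:^|\s)log\.level="?error"?(?=\s|$)')


def error_lines(lines: Iterable[str]) -> list[str]:
    """Return only the log lines whose log.level field is `error`."""
    return [line.rstrip("\n") for line in lines if ERROR_FIELD.search(line)]


if __name__ == "__main__":
    # Hypothetical log lines, used only to show the filtering.
    captured = [
        'time="2024-01-24T10:00:00Z" log.level=info msg="listening"',
        'time="2024-01-24T10:00:01Z" log.level=error msg="failed to parse configuration"',
    ]
    for line in error_lines(captured):
        print(line)
----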