From e8152d85ef2ad4005cc9089de3045e74d741688b Mon Sep 17 00:00:00 2001
From: sweexordious <chamirachid1@gmail.com>
Date: Wed, 3 Jan 2024 18:08:39 +0100
Subject: [PATCH] docs: metrics documentation

---
 docs/orchestrator.md | 22 ++++++++++++++++++++++
 docs/relayer.md      | 21 +++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/docs/orchestrator.md b/docs/orchestrator.md
index e6986c58..4672b5a3 100644
--- a/docs/orchestrator.md
+++ b/docs/orchestrator.md
@@ -255,6 +255,28 @@ If the validator still has access to the previously running orchestrator, it wou
 
 Running a second orchestrator in the same machine would require using different P2P listening ports, i.e. changing the `listen-addr` value in the `<orchestrator_home>/config/config.toml` file and using different ports between the two instances.
 
+### Telemetry
+
+The orchestrator exposes metrics that describe its runtime and give more information on its health. The supported metrics are:
+
+- `orchestrator_processed_nonces_counter`: The total number of nonces that the orchestrator has processed. Under normal conditions, this number is incremented by 1 every hour, i.e. every 400 blocks, which is the current data commitment window. This metric can be used to check the health of the orchestrator by verifying that it is consistently signing nonces: if the counter hasn't been incremented for more than an hour, the orchestrator might be failing.
+- `orchestrator_failed_nonces_counter`: The number of nonces that the orchestrator tried to process but failed. These nonces might be re-enqueued to be reprocessed later. If the orchestrator then manages to process them correctly, `orchestrator_processed_nonces_counter` is incremented.
+- `orchestrator_failed_nonces_counter`: The number of nonces that the orchestrator failed to process but that were re-enqueued.
+- `orchestrator_processing_time`: The time it takes for a nonce to be processed, or to fail, after it was picked up by the orchestrator processor.
+
+To enable these metrics, set `metrics` to `true` in the orchestrator configuration file:
+
+```toml
+# Enables OTLP metrics with HTTP exporter.
+metrics = "true"
+```
+
+Then, set the endpoint of the otel collector to connect to. By default, the orchestrator targets the `"localhost:4318"` endpoint. Both settings can also be set using the command line flags.
+
+The orchestrator also provides the LibP2P native metrics. These are enabled when the above parameter is set to `true` and are served, by default, at `"localhost:30001/metrics"`. This address can be changed using the orchestrator config file or the command line flags.
+
+An example configuration is provided in the `e2e/telemetry` folder along with the corresponding docker-compose file.
+
 #### Systemd service
 
 If you want to start the orchestrator as a `systemd` service, you could use the following:
diff --git a/docs/relayer.md b/docs/relayer.md
index f71df695..44361bb1 100644
--- a/docs/relayer.md
+++ b/docs/relayer.md
@@ -121,3 +121,24 @@ To start the relayer using the default home directory, run the following:
 > **_NOTE:_** The above command assumes that the necessary configuration is specified in the `<relayer_home>/config/config.toml` file.
 
 Then, you will be prompted to enter your EVM key passphrase for the EVM address passed using the `--evm.account` flag, so that the relayer can use it to send transactions to the target Blobstream smart contract. Make sure that it's funded.
+
+### Telemetry
+
+The relayer exposes metrics that describe its runtime and give more information on its health. The supported metrics are:
+
+- `relayer_processed_nonces_counter`: The total number of nonces that the relayer has processed. Under normal conditions, this number is incremented by 1 every hour, i.e. every 400 blocks, which is the current data commitment window. This metric can be used to check the health of the relayer by verifying that it is consistently relaying nonces: if the counter hasn't been incremented for more than an hour, the relayer might be failing.
+- `relayer_number_of_failures`: The number of times the relayer failed to relay a nonce.
+- `relayer_processing_time`: The time it takes for a nonce to be processed, or to fail, after it was picked up by the relayer.
+
+To enable these metrics, set `metrics` to `true` in the relayer configuration file:
+
+```toml
+# Enables OTLP metrics with HTTP exporter.
+metrics = "true"
+```
+
+Then, set the endpoint of the otel collector to connect to. By default, the relayer targets the `"localhost:4318"` endpoint. Both settings can also be set using the command line flags.
+
+The relayer also provides the LibP2P native metrics. These are enabled when the above parameter is set to `true` and are served, by default, at `"localhost:30001/metrics"`. This address can be changed using the relayer config file or the command line flags.
+
+An example configuration is provided in the `e2e/telemetry` folder along with the corresponding docker-compose file.