docs: add notes about metrics + error logging #747

Merged
merged 2 commits into from
Jun 14, 2024
96 changes: 96 additions & 0 deletions docs/relay/MetricsLogging.md
@@ -0,0 +1,96 @@
# Metrics And Logging

> :warning: NOTE: this document serves as a starting point for debugging and does not provide an exhaustive/definitive answer

The relay exports metrics and chain-specific errors. This document identifies common metrics/logs and potential reasons for behavior.

## Error Logging

[`failed to enqeue tx for simulation`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L129)

* indicates that the RPCs are responding too slowly

[`original signature does not match retry signature`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L301)

* this could indicate a race condition within the relayer code (please alert developers for investigation)

[`failed to find transaction within confirm timeout`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L372)

* indicates network congestion or poor RPC performance (tx dropped)

[`simulate: unrecognized error`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L494)

* There is usually an additional output within the result parameter of the error:
  * `InsufficientFundsForRent`: sender balance too low
  * `AccountNotFound`: sender or used account does not exist (if previously existed, could have been garbage collected)
  * Additional errors + reasons can be found here: https://github.com/solana-labs/solana/blob/master/sdk/src/transaction/error.rs
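
As a rough illustration of the lookup described above, the sketch below scans a log line for the two error names listed here and returns the likely cause. The helper name and the sample log line are hypothetical; the real log output may be formatted differently, and the table only covers the errors named in this document.

```python
# Error names and reasons taken from the list above; extend as needed.
KNOWN_SIMULATION_ERRORS = {
    "InsufficientFundsForRent": "sender balance too low",
    "AccountNotFound": (
        "sender or used account does not exist "
        "(if it previously existed, it could have been garbage collected)"
    ),
}


def explain_simulation_error(log_line):
    """Return (error name, likely cause) pairs for known errors found in a log line.

    Hypothetical helper: it does a plain substring match and makes no
    assumption about the structure of the log line.
    """
    return [
        (name, reason)
        for name, reason in KNOWN_SIMULATION_ERRORS.items()
        if name in log_line
    ]


# Assumed example line, for illustration only.
sample = "simulate: unrecognized error result={Err: InsufficientFundsForRent}"
matches = explain_simulation_error(sample)
```

An empty result means the error is not in the table, in which case the linked `error.rs` file is the next place to look.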

[`failed to enqeue tx`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L528)

* indicates a slow RPC that cannot keep up with the incoming stream of transactions

[`error in ReadAnswer: stale answer data, polling is likely experiencing errors`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/transmissions_cache.go#L110C21-L110C98)

* indicates RPC issues (most likely down)

[`error in ReadState: stale state data, polling is likely experiencing errors`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/state_cache.go#L114C21-L114C96)

* indicates RPC issues (most likely down)

## Metrics

[`solana_balance`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/monitor/prom.go#L14)

* provides the SOL balance for keys in the keystore
* a low SOL balance will cause the CL node to stop transmitting

[`solana_cache_last_update_unix`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/monitor/prom.go#L18)

* tracks last update to cached data (unix timestamp)
* updates should occur at the configured rate (default: 1s); slower updates can indicate RPC latency issues
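
A minimal sketch of how the staleness check above could be evaluated from a scraped value. It assumes the metric is a unix timestamp in seconds; the function names and the `tolerance` multiplier are illustrative and not part of the relay.

```python
import time


def cache_staleness_seconds(last_update_unix, now=None):
    """Seconds elapsed since the cache was last updated (never negative)."""
    if now is None:
        now = time.time()
    return max(0.0, now - last_update_unix)


def is_cache_stale(last_update_unix, now=None, poll_interval_s=1.0, tolerance=5.0):
    """True when the cache is older than `tolerance` polling intervals.

    poll_interval_s mirrors the default 1s rate mentioned above;
    tolerance is an assumed safety margin to avoid flagging small jitter.
    """
    allowed = poll_interval_s * tolerance
    return cache_staleness_seconds(last_update_unix, now) > allowed
```

With the defaults, a cache last updated 10 seconds ago is flagged and one updated 2 seconds ago is not.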

[`solana_client_latency_ms`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/monitor/prom.go#L23)

* tracks duration of each RPC request, separated via label + URLs
* spikes in latency can indicate RPC issues

[`solana_txm_tx_success`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L10)

* total of TXs that are confirmed and successfully executed on chain
* this value should consistently increase. If it does not, this could indicate RPC latency or funding issues.
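
The "should consistently increase" check can be sketched as follows, assuming the metric behaves like a monotonically increasing counter that restarts from zero when the node restarts. The function names are hypothetical, and `samples` stands for values scraped at regular intervals, oldest first.

```python
def counter_increase(samples):
    """Total increase across consecutive samples.

    A drop between two samples is treated as a counter reset, so the
    new value itself is counted as the increase for that step.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


def is_stalled(samples):
    """True when at least two samples exist and the counter did not move."""
    return len(samples) >= 2 and counter_increase(samples) == 0
```

A stalled result over a window in which transmissions were expected is a prompt to look at RPC latency and `solana_balance`.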

[`solana_txm_tx_pending`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L16)

* current TXs that are inflight (not confirmed success or error)
* this value should stay mostly constant - spikes could indicate lagging performance due to slow RPCs.
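
One simple way to flag a spike in a gauge like this is to compare the latest value against the median of recent samples. This is only a sketch; the `factor` and `min_baseline` values are assumptions to tune per deployment, not recommended thresholds.

```python
from statistics import median


def is_spike(history, latest, factor=3.0, min_baseline=1.0):
    """True when `latest` exceeds `factor` times the recent median.

    min_baseline keeps a near-zero history from turning every small
    value into a spike.
    """
    if not history:
        return False
    baseline = max(median(history), min_baseline)
    return latest > baseline * factor
```

The median is used instead of the mean so that a single earlier spike in `history` does not raise the baseline.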

[`solana_txm_tx_error`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L22)

* sum of TXs that have errored for any reason
* depending on the network configuration, this value should either be constant or increase

[`solana_txm_tx_error_revert`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L26)

* total of TXs that have been confirmed but error with a revert
* depending on the network configuration, this value should either be constant or increase

[`solana_txm_tx_error_reject`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L30)

* total of TXs that have been immediately rejected by the RPC
* value should be near zero; TXs should not be immediately rejected by the RPC. This could indicate faulty RPC or

[`solana_txm_tx_error_drop`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L34)

* total of TXs that have been broadcast to the network but were not confirmed within the configured timeout
* an increasing value can indicate RPC latency issues or network congestion

[`solana_txm_tx_error_sim_revert`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L38)

* total of TXs that reverted during simulation
* value should be low and should not increase rapidly; if it does, it may indicate misconfiguration on the CL node or onchain

[`solana_txm_tx_error_sim_other`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L38)

* total of TXs that failed during simulation with an unrecognized error
* value should be low and should not increase rapidly; if it does, look through the logs for the unrecognized error and diagnose further from there
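
When several error metrics are rising at once, ranking them by their share of recent errors can suggest where to look first. The sketch below assumes `deltas` holds the increase of each error sub-metric over the same time window; the helper name and the sample numbers are made up for illustration.

```python
def error_breakdown(deltas):
    """Share of each error category, largest first.

    Returns an empty dict when no errors occurred in the window.
    """
    total = sum(deltas.values())
    if total == 0:
        return {}
    ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return {name: count / total for name, count in ranked}


# Illustrative numbers only, not measurements.
example = error_breakdown({
    "solana_txm_tx_error_reject": 1,
    "solana_txm_tx_error_drop": 6,
    "solana_txm_tx_error_sim_revert": 3,
})
```

In this made-up window, dropped transactions dominate, which per the notes above points toward RPC latency or network congestion.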
