Feature description
A lightweight Prometheus exporter for the CNCLI "Blocks" data.
I propose the following metrics, always in relation to the current active epoch for monitoring purposes:
Next_leader_time_UTC returns the UTC time of the next leader slot
Next_next_leader_time_UTC returns the UTC time of the leader slot after next
*_total refers to the total number at the current time. *_max_consec refers to the max consecutive occurrence of said block state.
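The two metric families above can be sketched as follows. This is only an illustration under stated assumptions: the function name `block_state_metrics`, the status strings, and the sample sequence are hypothetical and not taken from CNCLI or CNTools.

```python
from itertools import groupby


def block_state_metrics(statuses, state):
    """Return (total, max_consec) for one block state.

    `statuses` is assumed to be ordered by slot, oldest first.
    """
    # *_total: how often the state occurs in the sequence at all.
    total = sum(1 for s in statuses if s == state)
    # *_max_consec: length of the longest uninterrupted run of the state.
    max_consec = max(
        (sum(1 for _ in run) for key, run in groupby(statuses) if key == state),
        default=0,
    )
    return total, max_consec


# Hypothetical block sequence for the current epoch (illustration only).
sequence = ["adopted", "missed", "missed", "missed", "adopted", "ghosted", "missed"]
print(block_state_metrics(sequence, "missed"))  # (4, 3)
```

Using `groupby` keeps the run detection in one pass over the ordered sequence, so the same helper works for any state (missed, ghosted, invalid, and so on).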
Example
Example block sequence:
Results in:
Rationale
Errors or bugs can occur unexpectedly in the pool infrastructure or in the Cardano node itself, and these can lead to a number of consecutive lost blocks. Currently, a single missed block is usually a false classification and is really a ghosted block, and a single ghosted block is usually the result of a race condition.
However, if any of these error states occurs several times in direct succession, something is clearly off and ought to trigger an alert / action based on Prometheus / Grafana rules. In the example, "cntools_cncli_blocks_metrics_missed_max_consec: 3" would be such a case that warrants an alert / action, as would cntools_cncli_blocks_metrics_invalid_total != 0. Without a Prometheus exporter it might take a while to notice the issue, especially if it triggers no other alerts, and more blocks would be lost than necessary.
Possible implementation approaches
At first glance, this seems to be a feasible architecture with SQL scraping:
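As a rough sketch of what the SQL scraping step could look like: the snippet below counts block states for one epoch in a SQLite database and renders them in the Prometheus text exposition format. The table name `blocklog`, its columns (`slot`, `epoch`, `status`), the list of states, and the sample rows are assumptions made for illustration; the real CNCLI / CNTools schema should be checked before relying on any of them.

```python
import sqlite3

# Assumed set of block states; adjust to whatever the real database stores.
STATES = ("adopted", "confirmed", "missed", "ghosted", "stolen", "invalid")


def render_metrics(conn, epoch):
    """Render per-state block totals for `epoch` as Prometheus text."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM blocklog WHERE epoch = ? GROUP BY status",
        (epoch,),
    ).fetchall()
    counts = dict(rows)
    lines = []
    for state in STATES:
        name = f"cntools_cncli_blocks_metrics_{state}_total"
        lines.append(f"# TYPE {name} gauge")
        # States with no rows in this epoch are exported as 0.
        lines.append(f"{name} {counts.get(state, 0)}")
    return "\n".join(lines) + "\n"


def demo_connection():
    """Build an in-memory database with hypothetical rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blocklog (slot INTEGER, epoch INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO blocklog VALUES (?, ?, ?)",
        [(100, 340, "adopted"), (200, 340, "missed"),
         (300, 340, "invalid"), (50, 339, "missed")],
    )
    return conn


print(render_metrics(demo_connection(), 340))
```

The rendered text could then be served over HTTP for Prometheus to scrape, or written to a file for a textfile-style collector; which of the two fits better depends on the existing pool setup.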
Another approach could be to use Bash pushing:
Considered alternatives
None
Version:
cardano-node 1.34.1 - linux-x86_64 - ghc-8.10
git rev 73f9a746362695dc2cb63ba757fbcabb81733d23