Feature Request: Lightweight CNCLI Blocks Prometheus Exporter #1439

Open
Straightpool opened this issue Jun 25, 2022 · 0 comments
Straightpool commented Jun 25, 2022

Feature description
A lightweight Prometheus exporter for the CNCLI "Blocks" data.
I propose the following metrics, always in relation to the current active epoch for monitoring purposes:

  • cntools_cncli_blocks_metrics_next_leader_time_utc
  • cntools_cncli_blocks_metrics_next_next_leader_time_utc
  • cntools_cncli_blocks_metrics_ideal
  • cntools_cncli_blocks_metrics_luck
  • cntools_cncli_blocks_metrics_adopted_total
  • cntools_cncli_blocks_metrics_confirmed_total
  • cntools_cncli_blocks_metrics_missed_total
  • cntools_cncli_blocks_metrics_ghosted_total
  • cntools_cncli_blocks_metrics_stolen_total
  • cntools_cncli_blocks_metrics_invalid_total
  • cntools_cncli_blocks_metrics_adopted_max_consec
  • cntools_cncli_blocks_metrics_confirmed_max_consec
  • cntools_cncli_blocks_metrics_missed_max_consec
  • cntools_cncli_blocks_metrics_ghosted_max_consec
  • cntools_cncli_blocks_metrics_stolen_max_consec
  • cntools_cncli_blocks_metrics_invalid_max_consec

cntools_cncli_blocks_metrics_next_leader_time_utc returns the UTC time of the next leader slot.
cntools_cncli_blocks_metrics_next_next_leader_time_utc returns the UTC time of the leader slot after that.

*_total is the number of blocks in that state so far in the current epoch. *_max_consec is the longest run of consecutive blocks in that state.

Example
Example block sequence:

  1. confirmed
  2. confirmed
  3. stolen
  4. confirmed
  5. ghosted
  6. confirmed
  7. confirmed
  8. confirmed
  9. missed
  10. confirmed
  11. confirmed
  12. confirmed
  13. confirmed
  14. missed
  15. missed
  16. missed
  17. invalid

Results in:

  • cntools_cncli_blocks_metrics_adopted_total: 0
  • cntools_cncli_blocks_metrics_confirmed_total: 10
  • cntools_cncli_blocks_metrics_missed_total: 4
  • cntools_cncli_blocks_metrics_ghosted_total: 1
  • cntools_cncli_blocks_metrics_stolen_total: 1
  • cntools_cncli_blocks_metrics_invalid_total: 1
  • cntools_cncli_blocks_metrics_adopted_max_consec: 0
  • cntools_cncli_blocks_metrics_confirmed_max_consec: 4
  • cntools_cncli_blocks_metrics_missed_max_consec: 3
  • cntools_cncli_blocks_metrics_ghosted_max_consec: 1
  • cntools_cncli_blocks_metrics_stolen_max_consec: 1
  • cntools_cncli_blocks_metrics_invalid_max_consec: 1
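
The example above can be reproduced with a short Python sketch. It only illustrates how the *_total and *_max_consec values relate to a sequence of block states; the function name `summarize` and the list-of-strings input are assumptions made for illustration, not part of CNTools or CNCLI.

```python
from itertools import groupby

# Block states named in the proposed metrics.
STATES = ("adopted", "confirmed", "missed", "ghosted", "stolen", "invalid")
PREFIX = "cntools_cncli_blocks_metrics_"


def summarize(statuses):
    """Return per-state totals and longest consecutive runs (hypothetical helper)."""
    metrics = {}
    for state in STATES:
        metrics[f"{PREFIX}{state}_total"] = 0
        metrics[f"{PREFIX}{state}_max_consec"] = 0
    # groupby yields one group per run of identical adjacent states.
    for state, run in groupby(statuses):
        if state not in STATES:
            continue
        length = sum(1 for _ in run)
        metrics[f"{PREFIX}{state}_total"] += length
        key = f"{PREFIX}{state}_max_consec"
        metrics[key] = max(metrics[key], length)
    return metrics


# The 17-block sequence from the example, in order.
EXAMPLE = (
    ["confirmed"] * 2 + ["stolen"] + ["confirmed"] + ["ghosted"]
    + ["confirmed"] * 3 + ["missed"] + ["confirmed"] * 4
    + ["missed"] * 3 + ["invalid"]
)

if __name__ == "__main__":
    for name, value in summarize(EXAMPLE).items():
        print(f"{name}: {value}")
```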

Rationale
Errors or bugs can appear without warning in the pool infrastructure or in the Cardano node itself and lead to several consecutive lost blocks. A single missed block is currently usually a misclassified ghosted block, and a single ghosted block is usually the result of a race condition.
However, if any of these error states occurs several times in direct succession, something is clearly wrong and ought to trigger an alert or action based on Prometheus / Grafana rules. In the example, "cntools_cncli_blocks_metrics_missed_max_consec: 3" would warrant an alert or action, as would cntools_cncli_blocks_metrics_invalid_total != 0. Without a Prometheus exporter it might take a while to notice the issue, especially if it triggers no other alerts, and more blocks are lost than necessary.
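
The alert conditions described above could be checked roughly as follows. This is a sketch only: in practice the rules would live in Prometheus or Grafana, and the function name `alert_reasons` and the threshold of 2 consecutive blocks are assumptions chosen to match "multiples in direct succession".

```python
PREFIX = "cntools_cncli_blocks_metrics_"
ERROR_STATES = ("missed", "ghosted", "stolen", "invalid")


def alert_reasons(metrics, max_consec_threshold=2):
    """Return human-readable reasons to alert (hypothetical helper)."""
    reasons = []
    # Any error state repeated in direct succession is suspicious.
    for state in ERROR_STATES:
        run = metrics.get(f"{PREFIX}{state}_max_consec", 0)
        if run >= max_consec_threshold:
            reasons.append(f"{state} blocks in a row: {run}")
    # A single invalid block is already worth an alert.
    invalid = metrics.get(f"{PREFIX}invalid_total", 0)
    if invalid != 0:
        reasons.append(f"invalid blocks this epoch: {invalid}")
    return reasons


# Values taken from the example results in this issue.
example_metrics = {
    f"{PREFIX}missed_max_consec": 3,
    f"{PREFIX}ghosted_max_consec": 1,
    f"{PREFIX}stolen_max_consec": 1,
    f"{PREFIX}invalid_max_consec": 1,
    f"{PREFIX}invalid_total": 1,
}

if __name__ == "__main__":
    for reason in alert_reasons(example_metrics):
        print("ALERT:", reason)
```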

Possible implementation approaches
From first looks this seems to be a feasible architecture with SQL scraping:
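
A minimal sketch of the SQL scraping idea, assuming the block data sits in a SQLite database with a table named `blocklog` that has `slot`, `epoch` and `status` columns. The table name, column names and helper functions here are assumptions for illustration and should be checked against the actual CNCLI block log schema. The demo uses an in-memory database so it runs without a node.

```python
import sqlite3


def fetch_statuses(conn, epoch):
    """Return block states for one epoch, ordered by slot (assumed schema)."""
    rows = conn.execute(
        "SELECT status FROM blocklog WHERE epoch = ? ORDER BY slot", (epoch,)
    )
    return [row[0] for row in rows]


def count_by_status(conn, epoch):
    """Return {status: count} for one epoch (assumed schema)."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM blocklog WHERE epoch = ? GROUP BY status",
        (epoch,),
    )
    return dict(rows.fetchall())


def demo_connection():
    """Build a throwaway in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blocklog (slot INTEGER, epoch INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO blocklog (slot, epoch, status) VALUES (?, ?, ?)",
        [
            (300, 1, "missed"),
            (100, 1, "confirmed"),
            (200, 1, "confirmed"),
            (400, 2, "ghosted"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(fetch_statuses(conn, 1))
    print(count_by_status(conn, 1))
```

The ordered list of states is what a *_max_consec calculation needs, while the grouped counts map directly onto the *_total metrics.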

Another approach could be to use Bash pushing:
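
One possible reading of the push approach, sketched in Python rather than Bash for consistency with the other examples: a periodic job renders the metrics in the Prometheus text exposition format and writes them to a `.prom` file that node_exporter's textfile collector picks up. The choice of the textfile collector, the gauge type and the function names are assumptions, not something this issue specifies.

```python
import os
import tempfile


def render(metrics):
    """Render {name: value} as Prometheus text exposition format."""
    lines = []
    for name in sorted(metrics):
        # Gauges are used because the values reset with each epoch.
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {metrics[name]}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write via a temp file and rename so a scrape never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as 0600; make it readable for the collector.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    sample = {
        "cntools_cncli_blocks_metrics_confirmed_total": 10,
        "cntools_cncli_blocks_metrics_missed_max_consec": 3,
    }
    # Hypothetical output file; a real setup would point at the directory
    # given to node_exporter via --collector.textfile.directory.
    target = os.path.join(tempfile.gettempdir(), "cncli_blocks.prom")
    write_atomically(target, render(sample))
    print(open(target).read(), end="")
```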

Considered alternatives
None

Version:

  • OS: Ubuntu 20.04 LTS
  • Product version: CNTools 9.1.0
  • Cardano Node version:
    cardano-node 1.34.1 - linux-x86_64 - ghc-8.10
    git rev 73f9a746362695dc2cb63ba757fbcabb81733d23
  • Network you're connecting to: Mainnet