Adding a new alert #236

Closed
mwtzzz opened this issue Jun 28, 2024 · 1 comment

mwtzzz commented Jun 28, 2024

What problem does your feature solve?

A more accurate alert for whether soroban-rpc is up or down, one that keeps working even if the prometheus or elastalert services are themselves down.

What would you like to see?

Since the RPC endpoint is not exposed publicly, we can't use Runscope, but we do have a deadmanswitch account, which would be perfect for this. There are a couple of ways we could do it:

  1. The most accurate is to have the soroban-rpc pod ping deadmanswitch at a regular interval (once a minute, say). Either add a new pod/container whose only purpose is to check the rpc endpoint and ping deadmanswitch, or modify the rpc container itself to make those pings.
  2. Less accurate is to put a cron job on a host somewhere; the cron job checks the rpc endpoint and makes the pings. The problem with this approach is that you'll get a false alarm if the cron host dies.

In both cases, a simple script like the following will be sufficient:

if curl -fsS rpc/health | grep -qw healthy; then curl -fsS https://nosnch.in/xxxxx; fi

The way it works: as long as the script keeps checking in with deadmanswitch on schedule, no alert is fired. If the script fails to check in, an alert is fired.
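For illustration, here is a slightly more defensive sketch of that check-in logic, split into two small functions. It rests on some assumptions that are not in the issue: the health response contains the word "healthy" only when the service is up, `grep -w` is available (GNU, BSD and BusyBox grep all have it), and a 10-second timeout is acceptable. The function names `is_healthy` and `check_in` are hypothetical, and the two URLs are passed in as arguments rather than guessed.

```shell
#!/bin/sh
# Sketch only: names, timeout, and the "healthy" match are assumptions.

# is_healthy BODY
# Succeeds only if BODY contains "healthy" as a whole word,
# so a body saying "unhealthy" does not count as a pass.
is_healthy() {
    printf '%s' "$1" | grep -qw healthy
}

# check_in HEALTH_URL SNITCH_URL
# Fetches the health endpoint; pings the snitch URL only when the
# fetch succeeds and the body looks healthy. Any failure means no
# ping is sent, which is what eventually triggers the alert.
check_in() {
    body=$(curl -fsS --max-time 10 "$1") || return 1
    is_healthy "$body" || return 1
    curl -fsS --max-time 10 "$2" >/dev/null
}
```

With the cron approach, a script that ends with `check_in "$HEALTH_URL" "$SNITCH_URL"` could be scheduled once a minute; with the sidecar approach, the same call could sit in a loop with `sleep 60` between iterations. The check-in interval configured on the deadmanswitch side would need to be longer than the ping interval so a single slow request does not raise an alert.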

What alternatives are there?

None

@mollykarcher
Contributor

@mwtzzz we discussed this and we'd like to go with the second option:

> Less accurate is to put a cron job on a host somewhere; the cron job checks the rpc endpoint and makes the pings. The problem with this approach is that you'll get a false alarm if the cron host dies.

@github-project-automation github-project-automation bot moved this from Backlog to Done in Platform Scrum Jul 9, 2024