services/horizon: add metrics for ingestion failures and alert on them #5256

Closed
mollykarcher opened this issue Mar 21, 2024 · 0 comments · Fixed by #5302

Comments

@mollykarcher
Contributor

What problem does your feature solve?

There have been two incidents in the past year where ingestion in Horizon halted (for different reasons).

During resolution, many general health-metric alerts were firing, but it was not clear that the cause was an ingestion halt until we looked at the error logs.

What would you like to see?

An explicit metric that tracks ingestion failures. A single failure should trigger a critical alert, so that the responding engineer can see the root cause immediately.

@mollykarcher mollykarcher added this to the platform sprint 45 milestone Mar 21, 2024
@mollykarcher mollykarcher moved this from Backlog to Current Sprint in Platform Scrum Mar 27, 2024
@sreuland sreuland moved this from To Do to In Progress in Platform Scrum Apr 30, 2024
@sreuland sreuland moved this from In Progress to Needs Review in Platform Scrum May 7, 2024
sreuland added a commit to sreuland/go that referenced this issue May 9, 2024
… new error counting metrics, per review feedback
@github-project-automation github-project-automation bot moved this from Needs Review to Done in Platform Scrum May 10, 2024