feat(alerting): Add up alert rule with anti-flap and absence detection #147

Merged
merged 6 commits into from
Oct 23, 2024

@simskij (Member) commented Jul 9, 2024

resolves #143

Issue

Grafana Agent does not currently provide an alert that detects the host being down.

Solution

Add an up-based alert rule, and make it also detect the absence of metrics.
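
For illustration, a rule pair along these lines might look like the sketch below. This is not the content of the merged host_health.rules: the group name, alert names, severity labels, and annotation text are assumptions. The `for: 5m` clause is what provides the anti-flap behaviour, and the 5-minute value mirrors the wait in the testing instructions.

```yaml
# Hypothetical sketch of an up alert with absence detection.
# All names, labels, and annotation texts here are illustrative assumptions.
groups:
  - name: HostHealth
    rules:
      # Fires when a scrape target reports as down.
      # "for: 5m" requires the condition to hold for 5 minutes,
      # so a single failed scrape does not trigger the alert.
      - alert: HostDown
        expr: up < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down."

      # Fires when no "up" series exists at all, which covers the case
      # where the metrics stop arriving rather than reporting 0.
      - alert: HostMetricsMissing
        expr: absent(up)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Metrics are not being received from the host."
```

Splitting the check into two rules keeps the "target reports down" and "no data at all" cases distinguishable when the alert fires.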

Context

Testing Instructions

  1. Deploy COS
  2. Deploy a machine (juju deploy ubuntu)
  3. Deploy Grafana Agent and relate it to ubuntu
  4. Relate the agent to COS
  5. Verify that the alert rule has made it over to COS
  6. Turn off the host
  7. Wait 5 minutes
  8. Verify that the alert is firing

Upgrade Notes

@simskij requested a review from a team as a code owner July 9, 2024 11:05

@mmkay (Contributor) left a comment

One typo requires fixing; the other comment is a general question, so feel free to resolve it if you feel 5 minutes is enough.

src/prometheus_alert_rules/host_health.rules (outdated, resolved)
src/prometheus_alert_rules/host_health.rules (resolved)
src/prometheus_alert_rules/host_health.rules (outdated, resolved)
@mmkay self-requested a review July 10, 2024 15:30

@sed-i (Contributor) commented Jul 10, 2024

Need to remember to do the same in the -k8s charm after merge.

@gabrielcocenza (Member) commented

Hi 👋
Are there any updates regarding this PR? Are those new rules going to be implemented?

@Deezzir commented Sep 27, 2024

Hello,

I just encountered a case where those alerts would be helpful, so +1 from me.

I was testing DCGM-exporter metrics. I saw that Prometheus picked up the metrics, but there was no data. After some digging, I discovered that the issue was with one faulty metric produced by DCGM, which caused everything to be discarded and scrape_series_added to go to 0.

The alerts would save me some time :)

@sed-i merged commit 6da23c0 into main Oct 23, 2024
13 checks passed
@sed-i deleted the feat/up-alert branch October 23, 2024 05:47

Linked issue: grafana-agent is not generating alerts when node goes down
9 participants