Recently, blackbox_exporter scraping of ICMP probes to switches was failing for nearly two weeks without anyone being aware of it. Beyond the switch metrics simply being unavailable for any other purpose, this caused several spurious alerts in production that should not have fired. We already have alerts for a good number of scrape jobs being down:
The question is whether we want to try to maintain separate alerts for every possible job, or whether it would be better to have a single alert for when any scrape target is down for too long. So far I've come up with this, which seems pretty close:
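For illustration, a generic rule of that shape might look something like the following. This is only a sketch, not the exact rule from this issue; the alert name, `for` duration, severity, and annotations are placeholders:

```yaml
groups:
  - name: generic-scrape-alerts
    rules:
      - alert: ScrapeTargetDown
        # Fires for any target of any job that has been unreachable for a
        # sustained period. The 2h "for" duration is a placeholder.
        expr: up == 0
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: 'Scrape target {{ $labels.instance }} of job {{ $labels.job }} is down'
          description: 'Prometheus has not been able to scrape {{ $labels.instance }} ({{ $labels.job }}) for more than 2 hours.'
```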
@stephen-soltesz: Do you have any opinion on this? Adding an alert like the latter one, while leaving all the other existing ones in place, would probably cause duplicate alerts, unless we inhibited them via configuration.
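As a rough sketch of that inhibition idea, assuming the job-specific alerts share a naming pattern and using Alertmanager's matcher-based `inhibit_rules` syntax (alert names below are placeholders, not the actual ones in our config):

```yaml
# alertmanager.yml (fragment)
inhibit_rules:
  # If a job-specific "...DownOrMissing" alert is already firing for an
  # instance, suppress the generic ScrapeTargetDown alert for the same
  # job/instance pair so we don't alert twice on one outage.
  - source_matchers:
      - 'alertname =~ ".*DownOrMissing"'
    target_matchers:
      - 'alertname = "ScrapeTargetDown"'
    equal: ['job', 'instance']
```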
I think this could work. I'm thinking of this as a stopgap, something to rely on if more specific alerts don't exist or are failing to fire for some other reason. So a long `for` period on the alert could do that. Is that how you're thinking of it?
@stephen-soltesz: I was sort of thinking of removing all the existing `up{} == 0` ("SoAndSoDownOrMissing") alerts and changing those to just "SoAndSoMissing" alerts, then adding a single `up{} == 0` alert to catch any scrape job that is down for too long. I'm afraid that adding a generic alert with a longer `for` condition on top of the existing ones would eventually lead to duplicate alerting. Maybe we should chat about this in a VC.
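Roughly, the split could look like this; the job and metric names are illustrative only, not the actual ones from our rules:

```yaml
groups:
  - name: per-job-missing-alerts
    rules:
      # Per-job "missing" alert: fires when the metric itself disappears,
      # e.g. because the scrape config or exporter went away entirely.
      # Metric and job names here are placeholders.
      - alert: SwitchIcmpMetricsMissing
        expr: absent(probe_success{job="blackbox-icmp-switches"})
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'No ICMP probe results for switches have been scraped recently'
      # Target-down cases would then be left to the single generic
      # ScrapeTargetDown (up == 0) alert sketched above.
```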
We should talk -- I'm worried about the generic alert being too noisy (too many false positives). Your point about the fallback case leading to duplicate alerts makes sense too.