Recently, blackbox_exporter scraping of ICMP probes to switches was failing for nearly two weeks without anyone being aware of it. Beyond the switch metrics simply being unavailable for any other purpose, this caused several spurious alerts in production that should not have fired. We already have alerts for a good number of scrape jobs being down:
The question is whether we want to try to maintain separate alerts for every possible job, or whether it would be better to have a single alert for when any scrape target is down for too long. So far I've come up with this, which seems pretty close:
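For illustration, a generic rule of that shape might look something like the following. This is only a sketch, not the exact rule from this issue; the alert name, `for` duration, severity, and annotations are placeholders:

```yaml
groups:
  - name: generic-scrape-alerts
    rules:
      - alert: ScrapeTargetDown
        # Fires for any target of any job that has been unreachable for a
        # sustained period. The 2h "for" duration is a placeholder.
        expr: up == 0
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: 'Scrape target {{ $labels.instance }} of job {{ $labels.job }} is down'
          description: 'Prometheus has not been able to scrape {{ $labels.instance }} ({{ $labels.job }}) for more than 2 hours.'
```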
@stephen-soltesz: Do you have any opinion on this? Adding an alert like the latter one, while leaving all the other existing ones in place, would probably cause duplicate alerts, unless we inhibited them via configuration.
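As a rough sketch of that inhibition idea, assuming the job-specific alerts share a naming pattern and using Alertmanager's matcher-based `inhibit_rules` syntax (alert names below are placeholders, not the actual ones in our config):

```yaml
# alertmanager.yml (fragment)
inhibit_rules:
  # If a job-specific "...DownOrMissing" alert is already firing for an
  # instance, suppress the generic ScrapeTargetDown alert for the same
  # job/instance pair so we don't alert twice on one outage.
  - source_matchers:
      - 'alertname =~ ".*DownOrMissing"'
    target_matchers:
      - 'alertname = "ScrapeTargetDown"'
    equal: ['job', 'instance']
```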
I think this could work. I'm thinking of this as a stopgap, something to rely on if more specific alerts don't exist or are failing to fire for some other reason. So a long `for` period on the alert could do that. Is that how you're thinking of it?
@stephen-soltesz: I was sort of thinking of removing all the existing `up{} == 0` ("SoAndSoDownOrMissing") alerts and changing those to just "SoAndSoMissing" alerts, then adding a single `up{} == 0` alert to catch any scrape job that is down for too long. I'm afraid that adding a generic alert with a longer `for` condition on top of the existing ones would eventually lead to duplicate alerting. Maybe we should chat about this in a VC.
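Roughly, the split could look like this; the job and metric names are illustrative only, not the actual ones from our rules:

```yaml
groups:
  - name: per-job-missing-alerts
    rules:
      # Per-job "missing" alert: fires when the metric itself disappears,
      # e.g. because the scrape config or exporter went away entirely.
      # Metric and job names here are placeholders.
      - alert: SwitchIcmpMetricsMissing
        expr: absent(probe_success{job="blackbox-icmp-switches"})
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'No ICMP probe results for switches have been scraped recently'
      # Target-down cases would then be left to the single generic
      # ScrapeTargetDown (up == 0) alert sketched above.
```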
We should talk -- I'm worried about the generic alert being too noisy (too many false positives). Your point about the fallback case leading to duplicate alerts makes sense too.