Not all/enough Prometheus scrape targets have alerts #788

Open
nkinkade opened this issue Feb 12, 2021 · 4 comments

nkinkade commented Feb 12, 2021

Recently, blackbox_exporter scraping of ICMP probes to switches was failing for nearly two weeks without anyone being aware of it. Beyond the metric simply being missing for switches, the gap caused several spurious alerts in production that should not have fired. We already have alerts for a good number of scrape jobs being down:

$ grep -o 'up{.*} == 0' config/federation/prometheus/alerts.yml
up{job="federation-targets"} == 0
up{container="downloader"} == 0
up{deployment="script-exporter"} == 0
up{job="blackbox-exporter-ipv4"} == 0
up{job="blackbox-exporter-ipv6"} == 0
up{job="nginx-proxied-services", service="gmx"} == 0
up{job="eb-node-exporter"} == 0
up{container="etl-gardener", deployment!="etl-gardener-universal", instance=~".*:9090"} == 0
up{cluster="data-processing", container="etl-gardener",instance=~".*:9090"} == 0
up{cluster="data-processing", container="etl-parser",instance=~".*:9090"} == 0
up{service="annotator"} == 0
up{job="epoxy-boot-api"} == 0
up{job="platform-cluster"} == 0
up{job="kubernetes-nodes", cluster="platform-cluster"} == 0
up{cluster="platform-cluster", job="token-server"} == 0
up{job="bmc-targets"} == 0
up{job="switch-monitoring-targets"} == 0

The question is whether we want to try to have separate alerts for every possible job, or whether it would be better to have a single alert for when any single scrape target is down for too long. So far I've come up with this, which seems pretty close:

up{job=~".+", cluster!~"data-processing.*"} == 0
    unless on(machine) gmx_machine_maintenance == 1
    unless on(site) gmx_site_maintenance == 1
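
For reference, something like the following rule shape in alerts.yml is what I have in mind; the alert name, for duration, and labels are placeholders to illustrate, not a final proposal:

- alert: AnyScrapeTargetDown
  expr: |
    up{job=~".+", cluster!~"data-processing.*"} == 0
      unless on(machine) gmx_machine_maintenance == 1
      unless on(site) gmx_site_maintenance == 1
  for: 30m
  labels:
    severity: ticket
  annotations:
    summary: 'Scrape target {{ $labels.instance }} of job {{ $labels.job }} has been down for 30 minutes.'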

@stephen-soltesz: Do you have any opinion on this? Adding an alert like the latter one, while leaving all the existing ones in place, would probably cause duplicate alerts unless we inhibited them via configuration.
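
(For illustration of the inhibition idea only: an Alertmanager inhibit rule along these lines could suppress the generic alert whenever one of the specific alerts is already firing for the same instance/job. The alert names are hypothetical, and the matcher syntax assumes Alertmanager >= 0.22.)

inhibit_rules:
  - source_matchers:
      - 'alertname =~ ".+DownOrMissing"'
    target_matchers:
      - 'alertname = "AnyScrapeTargetDown"'
    equal: ['instance', 'job']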

nkinkade self-assigned this Feb 12, 2021
@stephen-soltesz (Contributor) commented:

I think this could work. I'm thinking of this as a stopgap: something to rely on if more specific alerts don't exist or are failing to fire for some other reason. A long alert for duration could do that. Is that how you're thinking of it?

@nkinkade (Contributor, Author) commented:

@stephen-soltesz: I was sort of thinking of removing all the existing up{} == 0 ("SoAndSoDownOrMissing") alerts and changing them to just "SoAndSoMissing" alerts, then adding a single up{} == 0 alert to catch any scrape job that is down for too long. I'm afraid that adding a generic alert with a longer for condition on top of the existing ones would eventually lead to duplicate alerting. Maybe we should chat about this in a VC.
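
(As a purely hypothetical sketch of what one of the converted "SoAndSoMissing" alerts could look like, using absent() so it fires only when the series disappears entirely rather than when the target reports down; the name and for duration are illustrative:)

- alert: SwitchMonitoringTargetsMissing
  expr: absent(up{job="switch-monitoring-targets"})
  for: 1h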

@nkinkade (Contributor, Author) commented:

@stephen-soltesz: Ping on the above comment.

@stephen-soltesz (Contributor) commented:

We should talk -- I'm worried about the generic alert being too noisy (too many false positives). Your point about the fallback case leading to duplicate alerts makes sense too.
