Add promql/gauge_staleness check #966

Open
wbollock opened this issue Apr 29, 2024 · 5 comments

@wbollock

We frequently leave review nits asking users to use query functions when alerting on gauges:

https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0/

I believe this would be a nice addition to Pint and Prometheus rule linting. Based on the counter check, it seems like we could also detect when a metric is a gauge and isn't covered by a query function:

func (c CounterCheck) Check(ctx context.Context, _ discovery.Path, rule parser.Rule, _ []discovery.Entry) (problems []Problem) {
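
Something along these lines is roughly what I'm picturing. This is only a standalone sketch, not pint's actual internals beyond the Check signature above; the isGauge stub, gaugesOutsideOverTime, and mysql_up are made-up placeholders, and the metric type would really have to come from metadata the same way the counter check resolves it:

package main

import (
    "fmt"
    "strings"

    promparser "github.com/prometheus/prometheus/promql/parser"
)

// isGauge is a stand-in for a real metadata lookup; the counter check
// resolves metric types from Prometheus metadata, and this could too.
func isGauge(metric string) bool {
    return metric == "mysql_up" // placeholder
}

// gaugesOutsideOverTime parses an alert expression and returns the names of
// gauge selectors that are not wrapped in any *_over_time function call.
func gaugesOutsideOverTime(expr string) ([]string, error) {
    node, err := promparser.ParseExpr(expr)
    if err != nil {
        return nil, err
    }
    var unprotected []string
    promparser.Inspect(node, func(n promparser.Node, path []promparser.Node) error {
        vs, ok := n.(*promparser.VectorSelector)
        if !ok || !isGauge(vs.Name) {
            return nil
        }
        for _, ancestor := range path {
            if call, ok := ancestor.(*promparser.Call); ok && strings.HasSuffix(call.Func.Name, "_over_time") {
                return nil // covered by an *_over_time function
            }
        }
        unprotected = append(unprotected, vs.Name)
        return nil
    })
    return unprotected, nil
}

func main() {
    for _, q := range []string{
        `mysql_up == 0`,
        `max_over_time(mysql_up[5m]) == 0`,
    } {
        names, err := gaugesOutsideOverTime(q)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s -> gauges without an _over_time function: %v\n", q, names)
    }
}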

I'm happy to contribute this back to Pint, but I want to discuss with the maintainer whether this would be acceptable.

cc: @prymitive

@prymitive
Collaborator

I don't think I follow. Can you give a more concrete example?
By "we frequently leave review nits asking users to use query functions when alerting on gauges", do you mean that when someone uses up{job="jobname"} < 1 you tell them to use avg_over_time(up{job="jobname"}[5m]) < 0.9 instead?

@wbollock
Author

wbollock commented Jun 4, 2024

Yes, that's a perfect example, exactly the scenario I have in mind. Let me know if this seems like a valid check 😄

@prymitive
Collaborator

But there is nothing wrong with up{job="jobname"} < 1; it's a perfectly valid alert query.
If someone wants to write an alert for a single scrape failure, that's the way to do it.
avg_over_time(up{job="jobname"}[5m]) < 0.9 is something one can do instead, if they want to alert on a number of failures rather than a single one, but it's not like one is valid and the other isn't. It all depends on what the user is trying to accomplish, which pint doesn't know; only the user does.
I don't think that adding checks that just suggest an alternative way of doing something, which might or might not be what you want, is valuable. It just teaches pint users that when they see a warning or error they need to either ignore it or disable that check.

@wbollock
Author

wbollock commented Jun 4, 2024

Both queries are valid, but in my experience the author of up{job="jobname"} < 1 is never intentionally using it to alert on staleness in addition to the scrape job being down. It's a common pitfall of the staleness handling changes outlined in the article: https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0/.

A bare up is prone to the staleness handling changes and is simply weaker than using max_over_time to alert on any single scrape failure. It's not just an alternate way of doing things, but an improvement with no downsides at all:

"This way gaps in our data resulting from failed scrapes and flaky time series will no longer result in false positives or negatives in our alerting."

For example, up{job="jobname"} < 1 vs max_over_time(up{job="jobname"}[5m]) < 1. I envision the check as asking for any query function to cover the gauge, not specifically avg_over_time.

@wbh1
Contributor

wbh1 commented Jun 10, 2024

up isn't the best example since it's a special metric that is always present and doesn't need an _over_time function to be resilient.

A better example might be the exporter pattern of {service}_up metrics, like mysql_up from mysqld_exporter. I've seen multiple cases across exporters (and even applications natively exposing metrics, like Ceph) where there are silent failures: up stays at 1 but the target stops returning all of its time series, or returns them intermittently.

In such a case, an alert on up < 1 wouldn't fire, and neither would one on ceph_osd_up < 1 (at least not with a non-zero for duration, since the for timer would be reset frequently and the alert would either flap a lot or never fire). Adding an _over_time function such as last_over_time would make the alert less fragile.

I don't even think that this is specific to gauges. An alert for a Summary metric could be fragile in the same way: prometheus_engine_query_duration_seconds{quantile="0.99"} > 1.

Histograms may be different since they should probably fall under the same existing promql/counter check 🤔.

Regardless, I think it would be useful to have a check similar to promql/fragile (or an extension of it?) to detect when an alert has a non-zero for duration AND isn't using a range vector selector. Detecting a range vector selector rather than an _over_time function should be more flexible and easier to implement, I think.
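
For illustration, a minimal standalone version of that heuristic could look something like the following. It's just a sketch against the upstream promql/parser package, not pint's internals; usesRangeSelector, the example expression, and the for duration are made up for this comment:

package main

import (
    "fmt"
    "time"

    promparser "github.com/prometheus/prometheus/promql/parser"
)

// usesRangeSelector reports whether an expression contains a range vector
// selector (e.g. foo[5m]) or a subquery.
func usesRangeSelector(expr string) (bool, error) {
    node, err := promparser.ParseExpr(expr)
    if err != nil {
        return false, err
    }
    found := false
    promparser.Inspect(node, func(n promparser.Node, _ []promparser.Node) error {
        switch n.(type) {
        case *promparser.MatrixSelector, *promparser.SubqueryExpr:
            found = true
        }
        return nil
    })
    return found, nil
}

func main() {
    expr := `mysql_up == 0`
    forDuration := 15 * time.Minute // the alert's `for` field

    hasRange, err := usesRangeSelector(expr)
    if err != nil {
        panic(err)
    }
    if forDuration > 0 && !hasRange {
        fmt.Printf("warning: %q has for=%s but no range vector selector; a single failed scrape resets the for timer\n",
            expr, forDuration)
    }
}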

Excerpt from Prometheus: Up & Running (2nd edition): [screenshot attachment]
