Add promql/gauge_staleness check #966

Open
wbollock opened this issue Apr 29, 2024 · 5 comments

@wbollock

We frequently leave review nits asking users to use query functions when alerting on gauges:

https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0/

I believe this would be a nice addition to Pint and Prometheus rule linting. Based on the counter check, it seems like we could also detect when a metric is a gauge and isn't covered by a query function:

func (c CounterCheck) Check(ctx context.Context, _ discovery.Path, rule parser.Rule, _ []discovery.Entry) (problems []Problem) {
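
Something along these lines is roughly what I'm picturing. This is only a standalone sketch, not pint's actual internals beyond the Check signature above; the isGauge stub, gaugesOutsideOverTime, and mysql_up are made-up placeholders, and the metric type would really have to come from metadata the same way the counter check resolves it:

package main

import (
    "fmt"
    "strings"

    promparser "github.com/prometheus/prometheus/promql/parser"
)

// isGauge is a stand-in for a real metadata lookup; the counter check
// resolves metric types from Prometheus metadata, and this could too.
func isGauge(metric string) bool {
    return metric == "mysql_up" // placeholder
}

// gaugesOutsideOverTime parses an alert expression and returns the names of
// gauge selectors that are not wrapped in any *_over_time function call.
func gaugesOutsideOverTime(expr string) ([]string, error) {
    node, err := promparser.ParseExpr(expr)
    if err != nil {
        return nil, err
    }
    var unprotected []string
    promparser.Inspect(node, func(n promparser.Node, path []promparser.Node) error {
        vs, ok := n.(*promparser.VectorSelector)
        if !ok || !isGauge(vs.Name) {
            return nil
        }
        for _, ancestor := range path {
            if call, ok := ancestor.(*promparser.Call); ok && strings.HasSuffix(call.Func.Name, "_over_time") {
                return nil // covered by an *_over_time function
            }
        }
        unprotected = append(unprotected, vs.Name)
        return nil
    })
    return unprotected, nil
}

func main() {
    for _, q := range []string{
        `mysql_up == 0`,
        `max_over_time(mysql_up[5m]) == 0`,
    } {
        names, err := gaugesOutsideOverTime(q)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s -> gauges without an _over_time function: %v\n", q, names)
    }
}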

I'm happy to contribute this back to Pint, but I want to discuss with the maintainer whether this would be acceptable.

cc: @prymitive

@prymitive
Collaborator

I don't think I follow. Can you give a more concrete example?
By "we frequently leave review nits asking users to use query functions when alerting on gauges", do you mean that when someone uses up{job="jobname"} < 1 you tell them to use avg_over_time(up{job="jobname"}[5m]) < 0.9 instead?

@wbollock
Author

wbollock commented Jun 4, 2024

Yes, that's a perfect example, exactly the scenario I have in mind. Let me know if this seems like a valid check 😄

@prymitive
Collaborator

But there is nothing wrong with up{job="jobname"} < 1; it's a perfectly valid alert query.
If someone wants to write an alert for a single scrape failure, that's the way to do it.
avg_over_time(up{job="jobname"}[5m]) < 0.9 is something one can do instead, if they want to alert on a number of failures rather than a single one, but it's not like one is valid and the other isn't. It all depends on what the user is trying to accomplish, which pint doesn't know; only the user does.
I don't think that adding checks that just suggest an alternative way of doing something, which might or might not be what you want, is valuable. It just teaches pint users that when they see a warning or error they need to either ignore it or disable that check.

@wbollock
Author

wbollock commented Jun 4, 2024

Both queries are valid, but in my experience the author of up{job="jobname"} < 1 is never intentionally using it to alert on staleness in addition to the scrape job being down. It's a common pitfall of the staleness handling changes outlined in the article: https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0/.

A bare up is prone to the staleness handling changes and is simply weaker than using max_over_time to alert on any single scrape failure. It's not just an alternate way of doing things, but an improvement with no downsides at all:

"This way gaps in our data resulting from failed scrapes and flaky time series will no longer result in false positives or negatives in our alerting."

For example, up{job="jobname"} < 1 vs max_over_time(up{job="jobname"}[5m]) < 1. I envision the check as asking for any query function to cover the gauge, not specifically avg_over_time.

@wbh1
Contributor

wbh1 commented Jun 10, 2024

up isn't the best example since it's a special metric that is always present and doesn't need an _over_time function to be resilient.

A better example might be the exporter pattern of {service}_up metrics, like mysql_up from mysqld_exporter. I've seen multiple cases across exporters (and even applications natively exposing metrics, like Ceph) where there are silent failures: up stays at 1 but the target stops returning all of its time series, or returns them intermittently.

In such a case, an alert on up < 1 wouldn't fire, and neither would one on ceph_osd_up < 1 (at least not with a non-zero for duration, since the for timer would be reset frequently and the alert would either flap a lot or never fire). Adding an _over_time function such as last_over_time would make the alert less fragile.

I don't even think that this is specific to gauges. An alert for a Summary metric could be fragile in the same way: prometheus_engine_query_duration_seconds{quantile="0.99"} > 1.

Histograms may be different since they should probably fall under the same existing promql/counter check 🤔.

Regardless, I think it would be useful to have a check similar to promql/fragile (or an extension of it?) to detect when an alert has a non-zero for duration AND isn't using a range vector selector. Detecting a range vector selector rather than an _over_time function should be more flexible and easier to implement, I think.
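
For illustration, a minimal standalone version of that heuristic could look something like the following. It's just a sketch against the upstream promql/parser package, not pint's internals; usesRangeSelector, the example expression, and the for duration are made up for this comment:

package main

import (
    "fmt"
    "time"

    promparser "github.com/prometheus/prometheus/promql/parser"
)

// usesRangeSelector reports whether an expression contains a range vector
// selector (e.g. foo[5m]) or a subquery.
func usesRangeSelector(expr string) (bool, error) {
    node, err := promparser.ParseExpr(expr)
    if err != nil {
        return false, err
    }
    found := false
    promparser.Inspect(node, func(n promparser.Node, _ []promparser.Node) error {
        switch n.(type) {
        case *promparser.MatrixSelector, *promparser.SubqueryExpr:
            found = true
        }
        return nil
    })
    return found, nil
}

func main() {
    expr := `mysql_up == 0`
    forDuration := 15 * time.Minute // the alert's `for` field

    hasRange, err := usesRangeSelector(expr)
    if err != nil {
        panic(err)
    }
    if forDuration > 0 && !hasRange {
        fmt.Printf("warning: %q has for=%s but no range vector selector; a single failed scrape resets the for timer\n",
            expr, forDuration)
    }
}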

Excerpt from Prometheus: Up & Running (2nd edition): [screenshot attachment]
