Prometheus Output Format Oddity #867
Cc: @petemounce
I think the original intention for
The
This is a pretty good description of the data which I'd expect to be returned from the /healthz endpoint.

For context, I use

The problem with using a counter which increments every time the endpoint is hit is that the AWS healthchecks artificially inflate the number stupidly fast. This makes the metrics less useful even when rate is applied. Having something in the main /metrics endpoint tracking the total number of test runs would actually be handy! However, the
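For illustration only, here is a minimal sketch (not goss's actual implementation) of what a per-run gauge endpoint could look like using the Go Prometheus client library. The metric name, the outcome label values, and the runChecks() helper are all hypothetical; goss's real metric names and wiring differ.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// runChecks is a hypothetical stand-in for executing the goss test suite once.
// It returns a count per outcome label (label values are illustrative only).
func runChecks() map[string]int {
	return map[string]int{"pass": 1, "fail": 1}
}

// Gauge variant: the value reflects only the most recent run.
var testOutcomes = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "goss_tests_outcomes",
		Help: "Outcomes of the checks from the most recent run only.",
	},
	[]string{"outcome"},
)

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(testOutcomes)

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		results := runChecks()
		// Reset before setting, so a scrape only ever sees this run's results.
		testOutcomes.Reset()
		for outcome, n := range results {
			testOutcomes.WithLabelValues(outcome).Set(float64(n))
		}
		promhttp.HandlerFor(reg, promhttp.HandlerOpts{}).ServeHTTP(w, r)
	})
	http.ListenAndServe(":8080", nil)
}
```

Because the gauge vector is reset at the start of each request, repeated hits from a load balancer health check cannot accumulate; every response describes just that run.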
Thanks for elaborating.
The counter docs:
This is what goss' The intention is that
Is there a semantic difference for you between "AWS health checks hit /healthz and tests run" and "prometheus scraper hits /healthz and tests run"?

I think the gauges approach suffers when particular tests fail intermittently, while the counters approach does not. With gauges, an alert as you describe will only catch an intermittently-failing test by chance, if a scrape happens to coincide with a failure. With counters, the alert looks at the complete history of the outcomes and is not subject to that "leak".

With gauges, one can increase the scrape rate to lower the chance of missing an intermittent failure (though that would only be prompted by luck), at the cost of increased load and storage on prometheus because it scrapes more often. This is unnecessary with the counters approach.

WDYT?
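For contrast with the gauge sketch above, here is a hedged sketch of the counter-based shape being argued for, again with the Go client library. The metric name matches one mentioned later in this issue, but the label values and handler wiring are assumptions, not goss's actual code.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Counter variant: cumulative totals across all runs, never reset between
// requests. Metric and label names are illustrative, not goss's real schema.
var testOutcomesTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "goss_tests_outcomes_total",
		Help: "Cumulative count of check outcomes across all runs.",
	},
	[]string{"outcome"},
)

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(testOutcomesTotal)

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Hypothetical per-run results (outcome label -> count). Every hit,
		// whether from Prometheus or a load-balancer health check, adds to
		// the running totals.
		for outcome, n := range map[string]int{"pass": 1, "fail": 1} {
			testOutcomesTotal.WithLabelValues(outcome).Add(float64(n))
		}
		promhttp.HandlerFor(reg, promhttp.HandlerOpts{}).ServeHTTP(w, r)
	})
	http.ListenAndServe(":8080", nil)
}
```

With this shape, an alert built on something like increase() over the failure series (assuming an outcome label exists) accounts for every run recorded between scrapes, not only runs that happen to coincide with a scrape, which is the trade-off described here; the flip side is that the totals also grow with every external health-check hit.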
Is that rate over time different from (roughly)

Aside: I appreciate that the lack of documentation isn't helpful here. I intend to address that after #856 merges.
Let's bring this back to basics. The problem arises with the expectation of what the Prometheus output should be for the /healthz endpoint. However, every other format available for the healthz endpoint reflects the checks run when the endpoint was queried, and JUST that particular run. Using the example goss.yaml from above:

documentation:
json:
junit:
nagios:
rspecish:
structured:
The ONLY output format for
Any conclusion on this? I haven't chimed in so far since I've mostly stayed out of the Prometheus implementation and haven't had time to go down the Prometheus rabbit hole or its best practices. So I'm mostly depending on community consensus and contributors' time to maintain it.
@aelsabbahy I was awaiting merge of #856 so I could more clearly document the intended use-cases that are supported. I see that's merged now, so I'll plan to write some content in June once back from traveling.
Thank you for the update, have a great trip!
Describe the bug
Looking at using the Prometheus output for health checks, it looks like the incorrect data types are being used for some of the metrics. Rather than using a gauge data type to give the results from this run, it's instead a counter with the cumulative results from multiple runs.
For a /healthz endpoint, it makes more sense to output the data as a gauge with just the results for the requested run.
How To Reproduce
Using any example set of goss tests, run goss s -f prometheus. Then run curl http://localhost:8080/healthz. For testing I've been using a simple goss file which will return a pass and a fail.

Expected Behavior
When using the prometheus format healthz endpoint, all the HELP lines indicate that the results returned will be for this run. In practice, they're all counters rather than gauges, which means they increment between runs.
Actual Behavior
With the simple goss.yaml above, just doing two file checks:
First run:
Second Run:
Note that goss_tests_outcomes_total and goss_tests_run_outcomes_total are counting across runs rather than just returning a gauge for the results from this run.

Environment: