-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Align Ruby client with Prometheus best practices, prepare for multi-process support #94
Comments
YES PLEASE! |
Having implemented https://github.com/discourse/prometheus_exporter one of the weakest point the official library has is the Summary implementation, it just uses estimates which make all results pretty nonsensical. Other weaknesses are lack of multi process support which is enormous, lack of standard instrumentation people can easily lean on and lack of standalone robust web server implementation short of mounting rack. I highly recommend any work on improving the default library has a look at the prom exporter implementation for ideas, there is a lot there and the thing is very battle tested. |
@SamSaffron Thank you for your comments! These are all very good points.
Yeah, we're actually proposing "downgrading" the Summary to the level which Python offers, which is just having Sum and Count, without quantiles, which is the minimum the Best Practices require. We're doing this for practical reasons, since we want to be able to only store Floats in our stores, and the
This is the main thing we're aiming at improving with this work. We're currently leaning more towards a "shared memory" implementation (an example of which can be seen in Gitlab's implementation, or in the Python client), but it'd be quite easy for someone to have a store that pushes things to your Collector process.
We're also planning on looking at these, if we have time, but they're a bit more long-term plans, for now, we're focusing on solving the multi-process issue |
Update: We've updated PR #95 with actual data stores, including one that works for pre-fork servers 🎉 |
All of this is done now that we've merged PR #95 |
As it currently stands, the Prometheus Ruby Client has a few issues that make it hard to adopt in mainstream Ruby projects, particularly in Web applications:
We'd like to contribute to the effort of improving the client in these directions, and we're proposing we could make a number of changes (this issue will be promptly followed by a PR that implements several of these).
Objectives
We have reached out to @grobie recently and he suggested that releasing a new major version was the way to go in order to work around these issues.
There are several proposals in this RFC for improvements to the existing Prometheus Client for Ruby.
These proposals are largely independent of each other, so we can pick for each one whether we think it’s an improvement or not. They are also ordered from most to least important. Only the first one is an absolute must, since it paves the way for adding multi-process support.
1. Centralizing and Abstracting storage of Metric values
In the current client, each Metric object has an internal
@values
hash to store the metric values. The value of this hash is, for Gauges and Counters, afloat
, and the key of this hash is itself a hash of labels and their values. Thus, for one given metric there are multiple values at any given time, one for each combination of values of their labels.For Histograms,
@values
doesn’t hold afloat
. Instead it holds aHistogram::Value
object, which holds oneinteger
for each bucket, plus the total number of observations, and the sum of all observations. Summaries do a similar thing.We're proposing that, instead of each metric holding their own counter internally, we should have a centralized store that holds all the information. Metric objects update this store for every observation, and it gets read in its entirety by a formatter when the metrics are scraped.
Why
Having this central storage allows us to abstract how that data is stored internally. For simpler cases, we can simply use a large hash, similar to the current implementation. But other situations (like pre-fork servers) require more involved implementations, to be able to share memory between different processes and report coherent total numbers. Abstracting the storage away allows the rest of the client to be agnostic about this complexity, and allows for multiple different “backends” that can be swapped based on the needs of each particular user.
What this looks like:
Prometheus would have a global
config
object that allows users to set which Store they want:As a default, a simple storage system that provides the same functionality as the current client is provided. Other backends may be provided with the gem, and users can also make their own. Note that we set the data store to an instantiated object, not a class. This is because that object may need store-specific parameters when being instantiated (file paths, connection strings, etc)
These swappable stores have the following interface:
For example, the default implementation of this interface would be something like this: (like all the code in this doc, this is simplified to explain the general idea, it is not final code):
A more complex store may look like this: (note, this is based on a fantasy primitive that magically shares memory between processes, it’s just to illustrate how extra internal labels / aggregators work):
The way you’d use these stores and aggregators would be something like:
For all other metrics, you’d just get
sum
by default which is probably what you want.Stores are ultimately only used by the Metrics and the Formatters. The user never touches them.
The way Metrics work is similar to this:
Storage for Histograms and Summaries
In the current client, Histograms use a special value object to hold the number of observations for each bucket, plus a total and a sum. Our stores don’t allow this, since they’re a simple Hash that stores
floats
and nothing else.To work around this, Histograms add special, reserved labels when interacting with the store. These are the same labels that’ll be exposed when exporting the metrics, so there isn’t a huge impedance problem with doing this. The main difference is that Histograms need to override the
get
andvalues
methods ofMetric
to recompose these individual bucket values into a coherent HashExample usage:
For Summaries, we'd apply a similar solution.
2. Remove Quantile calculation from
Summary
We would change Summaries to expose only
sum
andcount
instead, with no quantiles / percentiles.Why
This is a bit of a contentious proposal in that it's not something we're doing because it'll make the client better, but because the way Summaries work is not very compatible with the idea of "Stores", which we'd need for pre-fork servers.
The
quantile
gem doesn't play well with this, since we'd need to store instances ofQuantile::Estimator
, which is a complex data structure, and tricky to share between Ruby processes.Moreover, individual Summaries on different processes cannot be aggregated, so all processes would actually have to share one instance of this class, which makes it extremely tricky, particularly to do performantly.
Even though this is a loss of functionality, this puts the Ruby client on par with other client libraries, like the Python one, which also only offers sum and count without quantiles.
Also, this is actually more compliant with the Client Library best practices:
The original client didn't comply with the last 2 rules, where this one would, just like the Python client.
And quantiles, while seemingly the point of summaries, are encouraged but not required.
We're not ruling out the idea of adding quantiles back, either they'd work only "single-process", or we may find a better way of dealing with the multiple processes.
3. Better declaration and validation of Labels
Why?
The current client enforces that labels for a metric don’t change once the metric is observed once. However, there is no way to declare what the labels should be when creating the metric, as the best practices demand. There’s also no facility to access a labeled dimension via a
labels
method like the best practices, allowing things likemetric.labels(role: admin).inc()
What does this look like?
We propose changing the signature of a Metric’s
initialize
method to:labels
is an array of strings listing all the labels that are both allowed and required by this metric.preset_labels
are the same as the currentbase_labels
parameter. They allow specifying “default” values for labels. The difference is that, in this proposed interface, a label that has a pre-set value would be specified in both thelabels
andpreset_labels
params.LabelSetValidator
basically changes to comparepreset_labels.merge(labels).keys == labels
, instead of storing the previously validated ones, and comparing against those. The rest of the validations remain basically the same.with_labels
method that behaves in the way that the best practices suggest forlabels()
. (with_labels
is a more idiomatic name in Ruby). Given a metric, callingwith_labels
on it will return a new Metric object with those labels added topreset_labels
.with_labels(role: :admin)
, caching that, and then callinginc()
on it multiple times will be slightly faster than callinginc({ role: :admin }, 1)
multiple times, as we can skip the label validation.NOTE: This code doesn’t quite work faster when caching the result of
with_labels
, for simplicity, but it's easy to make that change.Questions:
nil
as a label value?nil
as a value for a label.to_s
inside theMetric
4. More control over how Label Errors are notified
Why
Currently, the client raises an exception when anything is wrong with labels. While any such problem should be caught in development, we wouldn’t want to 500 on a request because of some unexpected situation with labels.
Ideally, we would raise in dev / test, but notify our exception manager in production.
What does this look like?
We propose adding a
label_error_strategy
config option, which defaults toRaise
, but that can be change by the user to whatever they need.Something like:
The Prometheus Client would only provide
Raise
as a strategy. We might also want to provide some for Sentry / Rollbar / Bugsnag / etc, but just allowing to swap these around should be enough.Note that the Client makes no attempt at figuring out it’s in production / dev, and deciding anything based on that. This is left for the user.
An example of using this would be:
5. Using
kwargs
throughout the codeWhy
We believe using keyword arguments will make the Client nicer to use, and more clear. Also, this is more idiomatic in modern Ruby.
Examples:
counter.increment(by: 2)
vs
counter.increment({}, 2)
vs
Registry.counter(:requests_total, ['code', 'app'], { app: 'MyApp' })
The main point against this is that Ruby < 2.0 doesn’t support them fully, but those versions have been EOL for over 3 years now, so we shouldn't need to continue to support them.
7. Add less critical things recommended by the best practices
Counter#count_exceptions
Something like:
This should be used like:
Gauge#set_to_current_time
Not much explanation needed for this one
Gauge#track_in_progress
Something like:
Gauge#time
Something like:
Summary#time and Histogram#time
/s/set()/observe()/
in the code aboveHistogram.linear_buckets and Histogram.exponential_buckets
Class methods on
Histogram
that return an array with bucket upper limits, for users to pass to the Histogram constructorStandard Automatic Process Metrics
https://prometheus.io/docs/instrumenting/writing_clientlibs/#standard-and-runtime-collectors
The text was updated successfully, but these errors were encountered: