Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src/collector: Introduce Collector abstraction #82

Merged
merged 14 commits into from
Dec 29, 2022

Conversation

mxinden
Copy link
Member

@mxinden mxinden commented Aug 29, 2022

The Collector abstraction allows users to provide additional metrics
and their description on each scrape.

See also:

I am opening this up as a "Draft" for early feedback.

Copy link
Member Author

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do folks think of this design? Would this enable your use-cases?

//CC @gagbo @dovreshef @vladvasiliu @phyber

Alternative implementation to #82.

src/collector.rs Outdated
Comment on lines 7 to 28
pub trait Collector<M: Clone> {
fn collect<'a>(&'a self) -> Box<dyn Iterator<Item = (Cow<Descriptor>, Cow<M>)> + 'a>;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the trait one would need to implement to provide a custom collector, e.g. a process collector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this trait is generic over M, which I think is the type of the metric? How does the Clone bound work with collectors/registries of type Box<dyn EncodeMetric> which aren't Clone? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the Clone bound work with collectors/registries of type Box<dyn EncodeMetric> which aren't Clone? thinking

It doesn't. For now this is just a proposal. We could require Clone on EnocdeMetric, though I think the much better idea is to no longer be generic over the metric type. See #82 (comment).

Let me know what you think @sd2k!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah great, that makes sense. I agree that switching to an enum instead of generics/trait objects would simplify things a lot! (Especially since adding a Clone bound to EncodeMetric would mean it wasn't object safe, so I don't think we could use trait objects anyway).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Especially since adding a Clone bound to EncodeMetric would mean it wasn't object safe, so I don't think we could use trait objects anyway).

Good point. We would have to drop the Cow<M> in favor of M, but then again, let's try to get rid of M.

src/registry.rs Outdated
Comment on lines 180 to 186
/// struct MyCollector {}
///
/// impl Collector<Counter> for MyCollector {
/// fn collect<'a>(&'a self) -> Box<dyn Iterator<Item = (Cow<Descriptor>, Cow<Counter>)> + 'a> {
/// let c = Counter::default();
/// let descriptor = Descriptor::new(
/// "my_counter",
/// "This is my counter",
/// None,
/// None,
/// vec![],
/// );
/// Box::new(std::iter::once((Cow::Owned(descriptor), Cow::Owned(c))))
/// }
/// }
///
/// let my_collector = Box::new(MyCollector{});
///
/// let mut registry: Registry<Counter> = Registry::default();
///
/// registry.register_collector(my_collector);
Copy link
Member Author

@mxinden mxinden Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is an example on how to implement and register a custom collector.

src/registry.rs Outdated
@@ -57,26 +59,37 @@ use std::borrow::Cow;
/// # "# EOF\n";
/// # assert_eq!(expected, String::from_utf8(buffer).unwrap());
/// ```
#[derive(Debug)]
pub struct Registry<M = Box<dyn crate::encoding::text::SendSyncEncodeMetric>> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the Collector mechanism I don't think there is need for being abstract over M any longer. Using a concrete type (e.g. enum over all metrics in prometheus_client::metrics) will drastically simplify both the library and its usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #100 proposing the above.

@mxinden mxinden mentioned this pull request Aug 29, 2022
@mxinden
Copy link
Member Author

mxinden commented Aug 29, 2022

Also //CC @sd2k

@mxinden
Copy link
Member Author

mxinden commented Aug 29, 2022

And //CC @08d2

@dovreshef
Copy link

Looks like it will work for the process collector use case! Thanks.

@vladvasiliu
Copy link

Am I understanding correctly that the user of the lib is expected to reimplement the Collector trait such that calling collect() will do whatever is necessary to produce the metrics, possibly querying some external system?

If so, I think there should be a way to provide an async collector; this will probably require implementing some alternative AsyncRegistry with matching AsyncCollector trait.

@08d2
Copy link

08d2 commented Aug 29, 2022

Am I understanding correctly that the user of the lib is expected to reimplement the Collector trait such that calling collect() will do whatever is necessary to produce the metrics, possibly querying some external system?

If so, I think there should be a way to provide an async collector; this will probably require implementing some alternative AsyncRegistry with matching AsyncCollector trait.

For context, collect() is called when Prometheus does an HTTP GET of the metrics endpoint for an instance of a job.

That call is subject to a few layers of (typically sub-second) timeouts and failsafes, and is expected to return more or less immediately. An implementation of collect() which blocked in any way that would benefit from async would almost certainly be a design error.

If the information returned by collect() needed to come from an external source, then it's I guess a de facto requirement that the retrieval is decoupled from the collecting.

@vladvasiliu
Copy link

That's what I was thinking, although this example seems to be reading stuff (memory info) from an external system (though, in this case, it's probably unlikely that it should take long).

I have to think a bit more about this; it seems to me that once the collector is added to the registry, there's no easy way to only update it such that it doesn't share state between possibly concurrent calls to the /metrics endpoint, all the while updating said state outside collect().

@08d2
Copy link

08d2 commented Aug 29, 2022

although this example seems to be reading stuff (memory info) from an external system

callLongGetter? Yeah, it's not obvious to me what that does, exactly... but I wouldn't take too much inspiration from the Java client, which hasn't been updated in a long while, and was never particularly idiomatic. Stick with the Go client, or maybe look at the more active exporters hosted in the Prometheus org.

there's no easy way to only update it such that it doesn't share state between possibly concurrent calls to the /metrics endpoint, all the while updating said state outside collect().

The nature of the problem demands shared state, for sure. (Or a "clever" workaround to avoid it, which, well.) But does shared state imply async, either necessarily or even ideally? I don't think so, but I could be wrong.

@vladvasiliu
Copy link

callLongGetter? Yeah, it's not obvious to me what that does, exactly... but I wouldn't take too much inspiration from the Java client, which hasn't been updated in a long while, and was never particularly idiomatic. Stick with the Go client, or maybe look at the more active exporters hosted in the Prometheus org.

I was thinking more about collectMemoryMetricsLinux. The reason I gave this example was because it's the one mentioned on the docs.

The nature of the problem demands shared state, for sure. (Or a "clever" workaround to avoid it, which, well.) But does shared state imply async, either necessarily or even ideally? I don't think so, but I could be wrong.

I wouldn't say shared state implies async. I think the issue was the way I was thinking about somehow getting the external (non-shared) state as a collector of the shared state, which would have then required collect() to cover async collection.

This could probably be implemented the other way around, with the shared state added as a sub-collector to the non-shared one before returning. Which, if collect() is guaranteed to return quickly, could be called from an async context.

@vladvasiliu
Copy link

I've been looking a bit more over this, and I think that, basically, a Registry is a Collector: it holds a bunch of metrics, right? That's how the Python client is implemented.

I think that if the lib provided an implementation of Collector for Registry, this would solve the majority of use cases.

@mxinden
Copy link
Member Author

mxinden commented Aug 31, 2022

Am I understanding correctly that the user of the lib is expected to reimplement the Collector trait such that calling collect() will do whatever is necessary to produce the metrics, possibly querying some external system?

Correct.


If so, I think there should be a way to provide an async collector; this will probably require implementing some alternative AsyncRegistry with matching AsyncCollector trait.

Can you be more concrete? Can you describe a use-case where one would need an async collector?


That call is subject to a few layers of (typically sub-second) timeouts and failsafes, and is expected to return more or less immediately.

Prometheus does not expect the scrape call to return immediately. E.g. from the Prometheus scheduling section:

The default scrape timeout for Prometheus is 10 seconds.


If the information returned by collect() needed to come from an external source, then it's I guess a de facto requirement that the retrieval is decoupled from the collecting.

By default the collection should not be decoupled from the collection. Citing the Prometheus scheduling section:

Metrics should only be pulled from the application when Prometheus scrapes them, exporters should not perform scrapes based on their own timers. That is, all scrapes should be synchronous.


I think that if the lib provided an implementation of Collector for Registry, this would solve the majority of use cases.

Registry could implement Collector, though I have refrained from that thus far to reduce complexity, i.e. not to mix concepts. I expect a user to have a single Registry but many `Collectors.

I don't follow how Registry implementing Collector would solve an issue here. Would you mind expanding on that?

@mxinden
Copy link
Member Author

mxinden commented Aug 31, 2022

Also note that the encode function takes an immutable reference to a Registry, thus two Prometheus servers (e.g. for fault tolerance) may scrape the same endpoint concurrently.

@vladvasiliu
Copy link

Can you be more concrete? Can you describe a use-case where one would need an async collector?

My use case is exporting metrics that are gathered via multiple HTTP calls. If my HTTP client is async, collect() should be able to handle that.

Prometheus does not expect the scrape call to return immediately. E.g. from the Prometheus scheduling section:

Well, the question is whether collect() itself should return immediately. The other operations could be done outside it, while staying inside the global scrape.

I don't follow how Registry implementing Collector would solve an issue here. Would you mind expanding on that?

See the other point: this is a confusion about whether collect() should be guaranteed to be quick or not and whether a collector should be different from a registry.

If it's not, and random long operations can be done inside collect(), then I can see how a Collector can be different from a Registry, and thus providing an default implementation for Registry doesn't help, since every collector is expected to be different and do its own thing.

However, in the second situation, when the actual gathering of the metrics should be done outside of the collect() method, then basically this becomes generating two separate registries, and combining them before calling collect(). The idea wouldn't be to do the scraping outside of the call from prometheus, but only outside the call to collect().

@vladvasiliu
Copy link

Looking at the Go implementation, specifically the example collector, it's clear that the collect() method is not expected to return "immediately".

This means that there's no particular reason to provide an implementation of Collector for Registry, but it can be useful to provide an alternative, async Collector, in case the actual job of collection is done asynchronously.

@08d2
Copy link

08d2 commented Sep 2, 2022

That example describes how a collector might be "bolted on" to a legacy system. It's not indicative of best practice. But the point is largely moot, I think.

@kwiesmueller
Copy link

My current examples why a collector would need to support async are:

  • collecting metrics from devices via e.g. Bluetooth
  • collecting container metrics comparable to cadvisor where the list of metrics is basically generated from a current state that is retrieved from the filesystem, containerd APIs etc.

While both approaches could also refresh the data in the background, having it directly connected to the scrape and being able to fetch fresh data on a scrape provides more reliable resolution than scraping previously scraped data.

Now, the data refresh could actually be handled in the http handler, but still it seems like an unnecessary detour.

@mxinden
Copy link
Member Author

mxinden commented Sep 4, 2022

As a first iteration, I suggest we support synchronous Collector::collect only.

  1. For those that need to call async methods in their Collector::collect call, I suggest you spawn the encode call in a new OS thread (or in a blocking-aware runtime task) and call any async methods via block_on in your Collector::collect implementation.
  2. For those that need concurrency within your Collector::collect implementation (e.g. reaching out to two remote machines concurrently) I suggest you spawn two or more OS threads within your Collector::collect implementation and join the threads again.
  3. For those that need the Collector::collect calls of two distinct Collector implementations to run concurrently, you would be out of luck for now.

In case folks see a large performance hit, e.g. due to thread spawns (even though that should be < 10 microseconds), or (1) and (2) add too much complexity in your Collector::collect implementation or folks need (3) I suggest we design an async Collector::collect as a next step.

@mxinden
Copy link
Member Author

mxinden commented Sep 4, 2022

@kwiesmueller Thanks for sharing. Let me know in case the threading approach is good enough for you for now, i.e. can collect from the many datasources you are targeting within the Prometheus scrape timeout.

If not, as mentioned above, let's design an async Collector::collect as a next step.

@mxinden
Copy link
Member Author

mxinden commented Oct 2, 2022

I rebased this pull request onto #100. Everything compiles, tests succeed.

To me the above validates that we should move forward with #100. That said, I do think the signatures in this pull request need more work (e.g. the workaround with MaybeOwned).

Adopt encoding style similar to serde. `EncodeXXX` for a type that can be
encoded (see serde's `Serialize`) and `XXXEncoder` for a supported encoding i.e.
text and protobuf (see serde's `Serializer`).

- Compatible with Rust's additive features. Enabling the `protobuf` feature does
not change type signatures.
- `EncodeMetric` is trait object safe, and thus `Registry` can use dynamic dispatch.
- Implement a single encoding trait `EncodeMetric` per metric type instead of
one implementation per metric AND per encoding format.
- Leverage `std::fmt::Write` instead of `std::io::Write`. The OpenMetrics text
format has to be unicode compliant. The OpenMetrics Protobuf format requires
labels to be valid strings.

Signed-off-by: Max Inden <[email protected]>
@hexfusion
Copy link

Very interested in this work are you looking for help or what would help move this forward.

@mxinden
Copy link
Member Author

mxinden commented Dec 6, 2022

Very interested in this work

Thanks for the interest. Sorry for the silence.

I rebased this pull request on #105. I want to try out the new API with one of my projects (e.g. https://github.com/mxinden/kademlia-exporter/) before merging #105 and this pull request.

are you looking for help or what would help move this forward.

I would appreciate feedback on #105 and this pull request. You would be of great help by testing this pull request (it is on top of #105) within your application in need of the Collector abstraction @hexfusion.

@hexfusion
Copy link

I have some cycles today will take a look, thanks!

Iterators returned by a Collector don't have to cross thread boundaries, nor doe
their references.
@mxinden
Copy link
Member Author

mxinden commented Dec 18, 2022

Tested this patch with rust-libp2p and kademlia-exporter. While the Collector trait is still quite verbose (with Cow and MaybeOwned), it does function as expected and allows ad-hoc cheap metric generation.

mxinden/kademlia-exporter#209

@@ -228,41 +258,58 @@ impl Registry {
.expect("sub_registries not to be empty.")
}

/// [`Iterator`] over all metrics registered with the [`Registry`].
pub fn iter(&self) -> RegistryIterator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming RegistryIterator and making this pub(crate) is the only breaking change as far as I can tell. Might be worth undoing.

Comment on lines +330 to +331
impl<S: EncodeLabelSet, M: EncodeMetric + TypedMetric, T: Iterator<Item = (S, M)>> EncodeMetric
for RefCell<T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very hard to discover and doesn't feel Rust-y to me. Why do we need to depend on RefCell here?

I think it would be more idiomatic to do:

  • impl EncodeMetric for Vec<T> where T: EncodeMetric { }
  • impl EncodeMetric for (S, M) where S: EncodeLabelSet, M: EncodeMetric { }

Then, have the user provide a Vec instead of being generic over something that can be iterated. That should get rid of the RefCell and iterator dependency and would compose better with other usecases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An Iterator::next takes &mut self in order to be able to e.g. track its position within a Vec. Unfortunately EncodeMetric::encode takes &self.

In this pull request RefCell is used for interior mutability. It allows calling an Iterator::next taking &mut self from an EncodeMetric::encode providing &self only.

An alternative approach would be to change EncodeMetric::encode to take &mut self. I deemed the change and its implications not worth it, given that Collector is an abstraction for advanced users only.

All that said, I agree that the usage of RefCell is neither simple nor intuitive. @thomaseizinger do you see other alternatives to the one above?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants