Ability to expose how loaded waitress is #182
No, waitress at this point in time doesn't make any of that sort of information available.
As a Prometheus and Waitress user this would be great! Obviously waitress should not depend on
Waitress will currently log when there are more requests in the queue than there are threads to process said requests. It is logged under
It might not be a terrible idea to allow people to interrogate for this info via a signal handler and stdout/stderr and let people build on that.
It would be a great idea. I would also suggest adding tracebacks for all threads.
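To make the signal-handler idea concrete, here is a minimal sketch, not existing waitress functionality: a SIGUSR1 handler that dumps every thread's traceback and the dispatcher queue depth to stderr. It assumes `server` is the instance returned by `waitress.server.create_server()`, and `task_dispatcher.queue` is an internal attribute that may change between releases.

```python
import faulthandler
import signal
import sys

def install_debug_handler(server):
    """Dump thread tracebacks and queue depth to stderr on SIGUSR1."""
    def dump_state(signum, frame):
        # Stack traces for all live threads.
        faulthandler.dump_traceback(file=sys.stderr)
        # Depth of waitress's internal task queue (not a public API).
        print(
            f"waitress task queue length: {len(server.task_dispatcher.queue)}",
            file=sys.stderr,
        )
    signal.signal(signal.SIGUSR1, dump_state)
```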
How about making
Is it safe to check the queue like this?

```python
d = Gauge('waitress_task_queue_length', 'Waitress task queue length')
d.set_function(lambda: len(server.task_dispatcher.queue))
```

How to get the number of connections? The
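For context, a fuller version of that sketch might look like the following. It is only a sketch: it leans on the internal `task_dispatcher.queue` attribute and serves the metrics from a separate `prometheus_client` HTTP server on its own port.

```python
from prometheus_client import Gauge, start_http_server
from waitress.server import create_server

app = ...  # your WSGI application

# Build the server by hand so we keep a reference to it.
server = create_server(app, host="0.0.0.0", port=8080)

queue_gauge = Gauge("waitress_task_queue_length", "Waitress task queue length")
# set_function makes the gauge read the value lazily at scrape time.
queue_gauge.set_function(lambda: len(server.task_dispatcher.queue))

# Serve /metrics on a separate port so scrapes keep working even when
# the main listener is saturated.
start_http_server(9100)

server.run()
```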
Using Prometheus to monitor the queue is not a viable solution, because the

Maybe using statsd is the better solution.
It doesn't seem like there is consensus on the technology to use to report metrics. One way to tackle the problem could be to not pick any technology, but instead provide a set of signals users could use. By signals I mean something like blinker signals, not Unix signals.

Waitress would only need to define a set of signals like "new connection", "started task", "finished task"... and users would handle these signals if they want to derive metrics useful to them. Popular integrations like statsd or Prometheus could be created in a contrib part of waitress, or even in a separate repository, because signals would act as an API between core waitress and the integration.

This is the approach used by Flask, Django, Celery and many more to provide developers a way to integrate with and extend the frameworks. For a very down-to-earth example of how this could be used to provide metrics, see how I used signals in a project of mine to create an integration with Datadog.
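As an illustration of that idea, here is a rough sketch of what blinker-based signals could look like. The signal names are hypothetical; waitress does not define any of these today.

```python
from blinker import signal

# Hypothetical signals that waitress could emit at the right places.
connection_opened = signal("waitress.connection_opened")
connection_closed = signal("waitress.connection_closed")
task_queued = signal("waitress.task_queued")
task_finished = signal("waitress.task_finished")

# User-side subscriber deriving a metric from the events.
active_connections = 0

@connection_opened.connect
def _on_open(sender, **extra):
    global active_connections
    active_connections += 1

@connection_closed.connect
def _on_close(sender, **extra):
    global active_connections
    active_connections -= 1
```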
Here is a simple monkey patch solution I have used: https://gist.github.com/pauls-username/56744f5bd43b531a348bb59c36541cd7

There are probably more metrics that you could add using this method. The key is getting the server instance. Maybe adding some sort of a hook to get the server instance (without using monkey patching) would be a good official solution?
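The gist itself is linked above; in broad strokes the trick is something like the following sketch, which wraps `waitress.server.create_server` so monitoring code can get hold of the server instance. Depending on how your code imports and calls waitress, the patch may need to target a different reference (for example `waitress.create_server`), so treat this as an assumption to verify rather than an official hook.

```python
import waitress.server

_servers = []
_original_create_server = waitress.server.create_server

def _capturing_create_server(*args, **kwargs):
    # Delegate to the real factory, but remember the instance so that
    # monitoring code can read e.g. server.task_dispatcher.queue later.
    server = _original_create_server(*args, **kwargs)
    _servers.append(server)
    return server

waitress.server.create_server = _capturing_create_server
```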
We really should have an optional export of open metrics for Waitress. That said, the question of whether to serve the

I like the monkeypatch code sample a lot, but — seconding what the previous commenter said — the code should also start a separate HTTP server using the Prometheus routines to serve

Without metrics, Waitress is unobservable, and that does not bode well for production.
Practically, this is not a problem. Monitoring is not for the point in time where the server doesn't serve anymore; by then, you know that already. That's the reason you are looking at the monitoring data in the first place. Monitoring data is an indicator of what may be the reason and how fast you got there.

Yes, the

In most implementations it doesn't matter if it's in the same HTTP server or a different one. They consume the same resources. The overhead of a different process is probably higher. When serving it in the same process, Python's GIL may play a role.

So my personal assessment is: metrics consuming a small part of the resources are better than having no metrics at all.
Agreed on the assessment — note that serving on a different port allows metrics to still be collected if the main server is overloaded in the specific sense that the main port has an enormous queue.
I don't quite understand why this is the case. Can't Prometheus expose a format like that? I'm thinking of https://prometheus.io/docs/tutorials/understanding_metric_types/#gauge
FYI you can create and serve Prometheus metrics without depending on any library. They're simple lines of text your program serves on a port.
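A minimal sketch of that point, using only the standard library: the Prometheus text exposition format is just lines of `name value` served over HTTP. `get_queue_length()` is a placeholder for however you actually read the value.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def get_queue_length():
    return 0  # placeholder: read the real value from your server instance

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = (
            "# HELP waitress_task_queue_length Tasks waiting for a worker thread\n"
            "# TYPE waitress_task_queue_length gauge\n"
            f"waitress_task_queue_length {get_queue_length()}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# Serve the metrics on a side port, in a daemon thread.
threading.Thread(
    target=HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever,
    daemon=True,
).start()
```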
It is viable; you just have to serve them on a different port, possibly with a priority queue, or maybe on a separate thread entirely. For the record, this is standard practice at Google: you don't serve your production traffic on the same port as your metrics traffic, because otherwise you don't get any metrics when you are swamped in production. Different buses for different purposes.
That's push, and therefore it's pretty hard to get right (there are many corner cases with push-based solutions, especially on dynamic service stacks). You miss metrics sometimes, sometimes you don't know that you are supposed to be getting metrics, et cetera. With systems like Prometheus, if the service isn't giving you metrics, you will get a metric that will fire an alert telling you that the service is down, absolutely no exceptions.
FYI I am using Prometheus today to monitor Plone (waitress-based) servers, and this works quite well for obtaining Zope metrics (of course, until the server is swamped, but then I also have Varnish and Nginx metrics to tell me the other aspects of the story).
waitress is interested in exposing a hook feature that folks could tie into Prometheus or whatever library/system they want, but the issue has been (in order of priority, highest to lowest):
I don't have a particular opinion on issues 2 and 3, but on issue 1 I think we need at least:
if not more, and in histogram form.

On the Prometheus side this is usually accomplished by directly embedding calls to the respective metrics accumulators in opportune places within the code. With Python being a dynamic language, it should be possible to conditionally import and set up e.g. the Prometheus library if it imports successfully. I understand that waitress the project does not want to gain this dependency, though. But the number of hooks that would need to be added for implementers of waitress-based services can be daunting as well, and it's not necessarily going to lead to more code clarity. Hooks may just be too low-level.

Perhaps there is a way to implement this all using Zope-style adapters / interfaces, have waitress call the right methods of such code at the right times, and then the implementer can register the right adapter class (Prometheus / statsd / collectd), which in turn concretely implements listening on a socket and serving
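To sketch what that could mean in practice (the names here are hypothetical, not an existing waitress API): waitress would call a small, neutral interface at the right times, and the implementer registers whichever concrete adapter they want.

```python
class MetricsAdapter:
    """Neutral interface waitress could call; all methods are optional no-ops."""
    def connection_opened(self): ...
    def connection_closed(self): ...
    def task_queued(self, queue_length): ...
    def task_finished(self, duration_seconds): ...

class PrometheusAdapter(MetricsAdapter):
    def __init__(self, port=9100):
        # Imported lazily so waitress itself never needs the dependency.
        from prometheus_client import Counter, Histogram, start_http_server
        self.tasks = Counter("waitress_tasks_total", "Tasks processed")
        self.duration = Histogram(
            "waitress_task_duration_seconds", "Task duration in seconds"
        )
        start_http_server(port)  # serves /metrics on its own socket

    def task_finished(self, duration_seconds):
        self.tasks.inc()
        self.duration.observe(duration_seconds)
```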
I really think hooks/signals is a great way to solve the problem at hand.
But there are dozens of monitoring libraries and products. I get that Prometheus is popular right now, but so are Datadog and statsd. Obviously waitress cannot support all of that in the long run. So it is better to expose low-level hooks and let others use them to create extensions for each monitoring product.
I don't think it is that hard of a task; as @mmerickel said, most of the HTTP metrics should really be computed by WSGI middleware. Waitress should really focus on the low-level stuff that is not accessible anywhere else.
With hooks, waitress doesn't really have to compute any metrics. Instead it lets the subscriber know that something happened (e.g. new connection in queue, connection out of queue, ...). The subscriber is then responsible for computing the metrics it wants from that information.
I don't want to beat a dead horse, but signals, à la blinker, solve this problem quite elegantly by decoupling publisher from subscriber. Moreover, it is a common pattern used by tons of high-profile libraries (Flask, Django, Celery...) that most Python developers are already familiar with.
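Building on the hypothetical signals sketched earlier, an integration for another backend would just be a different subscriber. For example, a statsd subscriber (using the `statsd` package; the signal names are still hypothetical) could look like this:

```python
from blinker import signal
from statsd import StatsClient

# Same hypothetical signals as before; waitress does not define these today.
task_queued = signal("waitress.task_queued")
task_finished = signal("waitress.task_finished")

statsd = StatsClient("localhost", 8125, prefix="waitress")

@task_queued.connect
def _on_queued(sender, **extra):
    statsd.incr("tasks.queued")

@task_finished.connect
def _on_finished(sender, duration_seconds=0.0, **extra):
    statsd.timing("tasks.duration", duration_seconds * 1000)  # statsd expects ms
```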
I get that. For signals, there are two solutions:
I would favor that approach too.
Maintainers, if you are willing to entertain the idea of solving this problem via signals, I can get started on a PR to show what it would look like.
@NicolasLM I'd love it if you got into this. The conversation above is correct that it impacts quite a lot of the codebase... if you wanted to propose the general API first, that would be great for getting buy-in before you have to get too deep into things. My primary concerns are:
Basically all of those points are related to the API, what's global vs per-server, and limiting the global part. :-) Maybe it's as simple as defining an interface and then passing an adapter into

I'm not sure what prior art exists here in other WSGI servers; I think aiohttp had some when I looked a while ago.
It just occurred to me that I would also like to have the connection count metric (as well as the queue count), simply because that lets me do clean blue/green deployment — I won't take down a waitress-based backend (which has already been removed from the frontend proxy) until the connection count drops to 1.
If the goal is to keep waitress free from external dependencies, then the approach in romank0#1 can be used. That implementation is far from generic; it is mostly there to demonstrate the approach and to make it possible to discuss options for how to add the ability to monitor without coupling to a specific provider. It defines an interface that the waitress user can implement to collect metrics if they need to.

If it is OK to introduce dependencies, I would use the OpenTelemetry API: waitress would depend on the opentelemetry metrics API, and users could plug in their favourite collectors.
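To show what the OpenTelemetry option might look like, here is a rough sketch using only the opentelemetry-api package (without a configured SDK these calls are no-ops). The `server` reference and the `task_dispatcher.queue` attribute are assumptions, as in the earlier sketches.

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

server = ...  # the instance returned by waitress.server.create_server()

meter = metrics.get_meter("waitress")

def _observe_queue(options: CallbackOptions):
    # task_dispatcher.queue is an internal attribute, not a public API.
    yield Observation(len(server.task_dispatcher.queue))

queue_gauge = meter.create_observable_gauge(
    "waitress.task_queue_length",
    callbacks=[_observe_queue],
    description="Tasks waiting for a worker thread",
)
```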
I would vote for this as a preferable solution.
For anyone who might need it, I repurposed this monkeypatch solution to use a simple class that exposes the metrics to me wherever I need them: https://github.com/Haidra-Org/AI-Horde/blob/1e2629a73ddbfa873a61b5ae849b19662b4c591d/server.py#L25
Is it possible to grab Waitress metrics (how many channels currently active, how many threads active, how many TCP connections open, for example)?
Is this currently possible? If not, I think it'd be a good feature :-)
(My plan is to shove them in prometheus_client, which can compute averages, monitor for high watermark breaches and trigger Kubernetes pod scaling, etc.)