Ability to expose how loaded waitress is #182

Open
robertlagrant opened this issue Jan 15, 2018 · 26 comments

Comments

@robertlagrant

robertlagrant commented Jan 15, 2018

Is it possible to grab Waitress metrics (how many channels currently active, how many threads active, how many TCP connections open, for example)?

Is this currently possible? If not, I think it'd be a good feature :-)

(My plan is to shove them in prometheus_client, which can compute averages, monitor for high watermark breaches and trigger Kubernetes pod scaling, etc.)

@digitalresistor
Member

No, waitress at this point in time doesn't make any of that sort of information available.

@NicolasLM

As a Prometheus and Waitress user this would be great!

Obviously waitress should not depend on prometheus_client, so I believe that the way to integrate metrics into waitress without being too invasive is by using signals.

@digitalresistor
Member

Waitress will currently log when there are more requests in the queue than there are threads to process them. This is logged under waitress.queue. Additional information, Prometheus support, or things of that nature would require someone to actively investigate how to implement them in waitress and what the performance impact would be.
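
For anyone who wants to see those warnings today, a standard logging configuration is enough; a minimal sketch:

import logging

# Minimal sketch: waitress emits its queue-depth warnings on the
# "waitress.queue" logger, so a basic logging configuration surfaces them.
logging.basicConfig(level=logging.WARNING)
logging.getLogger("waitress.queue").setLevel(logging.WARNING)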

@mcdonc
Member

mcdonc commented Apr 23, 2019

It might not be a terrible idea to allow people to interrogate for this info via a signal handler and stdout/stderr and let people build on that.

@runyaga

runyaga commented Dec 19, 2019

It would be a great idea. I would also suggest adding a traceback dump for all threads.
Back in the Old World you could send signal USR1 to ZServer and it would dump a traceback for every thread to stdout; adding the other metrics would be a nice bonus.
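
The standard library can already do something close to that; a minimal sketch (not a waitress feature):

import faulthandler
import signal

# Minimal sketch: dump a traceback for every thread to stderr when the
# process receives SIGUSR1, much like the old ZServer behaviour above.
faulthandler.register(signal.SIGUSR1, all_threads=True)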

@nsballmann

How about making prometheus_client an optional dependency and serving the (configurable) /metrics endpoint (or maybe on a different port) if the package is available?

@robsonpeixoto

Is it safe to check len(server.task_dispatcher.queue) to monitor the queue length?
If so, is it possible to add a metric like this:

from prometheus_client import Gauge

d = Gauge('waitress_task_queue_length', 'Waitress task queue length')
d.set_function(lambda: len(server.task_dispatcher.queue))

How can I get the number of connections? The Gauge.set_function method can be very useful for monitoring waitress.

@robsonpeixoto

Using Prometheus to monitor the queue is not a viable solution, because the /metrics request would have to wait for all the tasks already in the queue to be processed before it is handled itself.

Maybe statsd is the better solution.

https://statsd.readthedocs.io/en/v3.3/types.html#gauges
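
For example, pushing a gauge with the statsd client would look roughly like this (the metric name is made up, and server is assumed to be the waitress server instance as in the snippet above):

import statsd

# Rough sketch of the push-based alternative: send the queue length to a
# statsd daemon as a gauge. `server` is assumed to be the waitress server
# instance; the metric name is made up.
client = statsd.StatsClient("localhost", 8125)
client.gauge("waitress.task_queue_length", len(server.task_dispatcher.queue))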

@NicolasLM

NicolasLM commented Nov 24, 2021

It doesn't seem like there is consensus on the technology to use to report metrics. One way to tackle the problem could be to not pick any technology but instead provide a set of signals users could use. By signals I mean something like blinker signals, not unix signals.

Waitress would only need to define a set of signals like "new connection", "started task", "finished task"... and users would handle these signals if they want to derive metrics useful to them. Popular integrations like statsd or Prometheus could be created in a contrib part of waitress, or even in a separate repository, because signals would act as an API between core waitress and the integration.

This is the approach used by Flask, Django, Celery and many more to provide developers a way to integrate and extend the frameworks.

For a very down-to-earth example of how this could be used to provide metrics, see how I used signals in a project of mine to create an integration with Datadog.
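
For concreteness, a rough sketch of what this could look like with blinker (the signal names are made up; none of this is an existing waitress API):

from blinker import signal

# Hypothetical signals that waitress could define (names are made up).
task_queued = signal("waitress.task_queued")
task_finished = signal("waitress.task_finished")

# A subscriber derives whatever metric it cares about, e.g. in-flight tasks.
in_flight = 0

def on_task_queued(sender, **extra):
    global in_flight
    in_flight += 1

def on_task_finished(sender, **extra):
    global in_flight
    in_flight -= 1

task_queued.connect(on_task_queued)
task_finished.connect(on_task_finished)

# Inside waitress, emitting an event would be a one-liner, e.g.:
# task_queued.send(server, queue_depth=len(queue))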

@pauls-username

Here is a simple monkey patch solution I have used. https://gist.github.com/pauls-username/56744f5bd43b531a348bb59c36541cd7

There are probably more metrics that you could add using this method. The key is getting the server instance. Maybe adding some sort of a hook to get the server instance (without using monkey patching) would be a good official solution?

@Rudd-O

Rudd-O commented Dec 28, 2021

We really should have an optional export of open metrics for Waitress.

That said, the question of whether to serve the /metrics handler on the same HTTP port, using the same Waitress threads, is tricky. If your Waitress is overloaded (exactly the moment when you need metrics the most), your Prometheus instance won't be able to scrape anything, because the scrape request shares the same inbound queue as everything else. That's not good.

I like the monkeypatch code sample a lot, but, seconding what the previous commenter said, the code should also start a separate HTTP server using the Prometheus routines to serve /metrics on a different port.

Without metrics, Waitress is unobservable, and that does not bode well for production.
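
prometheus_client can already do the separate-port part; a minimal sketch (the port number is an arbitrary choice):

from prometheus_client import start_http_server

# Minimal sketch: prometheus_client serves /metrics on its own port, in
# its own thread, so scrapes don't share waitress's request queue.
start_http_server(9090)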

@nsballmann

In practice, this is not a problem. Monitoring is not for the point in time when the server no longer serves at all; by then you already know, and that's the reason you are looking at the monitoring data in the first place. Monitoring data is an indicator of what the cause may be and of how quickly you got there.

Yes, the /metrics handler will consume resources (CPU, memory, queue entries) while serving. That just means the server becomes overloaded slightly earlier. The trade-off is that by spending those resources you gain insight into which resources are consumed most and need to be added first (if that is possible).

In most implementations it doesn't matter whether it runs in the same HTTP server or a different one; they consume the same resources either way. The overhead of a separate process is probably higher, while serving it in the same process means Python's GIL may come into play.

So my personal assessment is: metrics consuming a small share of the resources is better than having no metrics at all.

@Rudd-O

Rudd-O commented Jan 31, 2022

Agreed on the assessment — note that serving on a different port allows metrics to still be collected if the main server is overloaded in the specific sense that the main port has an enormous queue.

@robertlagrant
Author

Using Prometheus to monitor the queue is not a viable solution, because the /metrics request would have to wait for all the tasks already in the queue to be processed before it is handled itself.

Maybe statsd is the better solution.

https://statsd.readthedocs.io/en/v3.3/types.html#gauges

I don't quite understand why this is the case. Can't Prometheus expose a format like that? I'm thinking of https://prometheus.io/docs/tutorials/understanding_metric_types/#gauge

@Rudd-O

Rudd-O commented Oct 28, 2022

As a Prometheus and Waitress user this would be great!

Obviously waitress should not depend on prometheus_client, so I believe that the way to integrate metrics into waitress without being too invasive is by using signals.

FYI you can create and serve Prometheus metrics without depending on any library. They're simple lines of text your program serves on a port.
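
For illustration, here is a minimal hand-rolled /metrics endpoint using only the standard library (the metric name and value are made up):

from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal sketch: serve Prometheus exposition-format text without any
# client library. The metric name and hard-coded value are illustrative.
class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = (
            "# HELP waitress_task_queue_length Waitress task queue length\n"
            "# TYPE waitress_task_queue_length gauge\n"
            "waitress_task_queue_length 3\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 9090), MetricsHandler).serve_forever()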

Using Prometheus to monitor the queue is not a viable solution, because the /metrics request would have to wait for all the tasks already in the queue to be processed before it is handled itself.

It is viable, you just have to serve them on a different port, and possibly with a priority queue, or maybe on a separate thread entirely.

For the record, this is standard practice at Google. You don't serve your production traffic on the same port as your metrics traffic, because otherwise you don't get any metrics when you are swamped in production. Different buses for different purposes.

statsd

That's push, and therefore it's pretty hard to get right (there are many corner cases with push-based solutions, especially on dynamic service stacks). You sometimes miss metrics, and sometimes you don't even know that you are supposed to be getting metrics, et cetera. With systems like Prometheus, if the service isn't giving you metrics, you get a metric that fires an alert telling you that the service is down, with absolutely no exceptions.

@Rudd-O

Rudd-O commented Oct 28, 2022

FYI, I am using Prometheus today to monitor Plone (waitress-based) servers, and this works quite well for obtaining Zope metrics (until the server is swamped, of course, but then I also have Varnish and NginX metrics to tell me the other aspects of the story).

@mmerickel
Member

waitress is interested in exposing a hook feature that folks could tie into Prometheus or whatever library/system they want, but the issues have been (in order of priority, highest to lowest):

  • what exact metrics to compute and expose: this one is really critical, because quite a lot of metrics should just be computed in middleware. The only metrics that really belong in waitress would be around the connections and the queue, but I think we need a concrete proposal of what waitress would be responsible for computing
  • how to allow apps to register the hooks (an API for serve() and CLI args)
  • avoid adding dependencies to waitress

@Rudd-O

Rudd-O commented Nov 1, 2022

I don't have a particular opinion on issues 2 and 3, but on issue 1 I think we need at least:

  1. Bytes sent / received by sockets.
  2. Requests per second by method.
  3. Listen queue depth count.
  4. Active in-flight request count.

if not more, and in histogram form.

On the Prometheus side this is usually accomplished by directly embedding calls to the respective metrics accumulators in opportune places within the code. With Python being a dynamic language, it should be possible to conditionally import and set up e.g. the Prometheus library if it successfully imported.
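
For illustration, embedding calls to prometheus_client accumulators looks roughly like this (the metric names and values are made up):

from prometheus_client import Counter, Histogram

# Rough illustration of embedding metric accumulators directly in code;
# the metric names and observed values here are made up.
REQUESTS = Counter("waitress_requests_total", "Requests served", ["method"])
QUEUE_TIME = Histogram("waitress_queue_seconds", "Time a task spent queued")

REQUESTS.labels(method="GET").inc()
QUEUE_TIME.observe(0.012)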

I understand that waitress the project does not want to gain this dependency, though. But the number of hooks that would need to be added for implementers of waitress-based services could be daunting as well, and it's not necessarily going to lead to more code clarity. Hooks may just be too low-level. Perhaps there is a way to implement all of this using Zope-style adapters / interfaces: have waitress call the right methods of such code at the right times, and the implementer can then register the right adapter class (Prometheus / statsd / collectd), which in turn concretely implements listening on a socket and serving /metrics (or pushing data somewhere) and maintaining the counter / gauge data.

@NicolasLM

NicolasLM commented Nov 1, 2022

I really think hooks/signals is a great way to solve the problem at hand.

With Python being a dynamic language, it should be possible to conditionally import and set up e.g. the Prometheus library if it successfully imported.

But there are dozens of monitoring libraries and products. I get that Prometheus is popular right now, but so are Datadog and statsd. Obviously waitress cannot support all of them in the long run. So it is better to expose low-level hooks and let others use them to create extensions for each monitoring product.

But the number of hooks that would need to be added for implementers of waitress-based services could be daunting as well, and it's not necessarily going to lead to more code clarity.

I don't think it is that hard of a task; as @mmerickel said, most of the HTTP metrics should really be computed by WSGI middlewares. Waitress should really focus on the low-level stuff that is not accessible anywhere else.

What exact metrics to compute and expose: this one is really critical, because quite a lot of metrics should just be computed in middleware. The only metrics that really belong in waitress would be around the connections and the queue, but I think we need a concrete proposal of what waitress would be responsible for computing

With hooks, waitress doesn't really have to compute any metrics. Instead it lets the subscriber know that something happened (e.g. new connection in queue, connection out of queue...). The subscriber is then responsible for computing the metrics it wants from that information.

How to allow apps to register the hooks (an API for serve() and CLI args)

I don't want to beat a dead horse, but signals, a la blinker, solve this problem quite elegantly by decoupling the publisher from the subscriber. Moreover, it is a common pattern used by tons of high-profile libraries (Flask, Django, Celery...) that most Python developers are already familiar with.

Avoid adding dependencies to waitress

I get that; for signals there are two solutions:

  • Conditional import of the signal library; this is the strategy used by Flask. It requires pip installing blinker to use some extensions. Waitress could do the same: when the signal library is not installed, signals are simply not sent (a rough sketch of this follows below).
  • Vendoring the signal library; this is the strategy used by Django and Celery. Signal libraries are really just a few hundred lines of code, and some exist under permissive licenses.
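
A rough sketch of the conditional-import strategy, assuming blinker as the signal library (none of this is existing waitress code):

try:
    from blinker import Namespace
    _signals = Namespace()
    task_queued = _signals.signal("waitress.task_queued")
except ImportError:
    # blinker is not installed: fall back to a no-op signal so waitress
    # gains no hard dependency and callers need no special casing.
    class _NoOpSignal:
        def send(self, *args, **kwargs):
            pass

        def connect(self, receiver, **kwargs):
            return receiver

    task_queued = _NoOpSignal()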

@Rudd-O

Rudd-O commented Nov 1, 2022

I would favor that approach too.

@NicolasLM

Maintainers, if you are willing to entertain the idea of solving this problem via signals, I can get started on a PR to show what it would look like.

@mmerickel
Member

@NicolasLM I'd love it if you got into this. The conversation above is correct that it impacts quite a lot of the codebase... if you wanted to propose the general API first it would be great for buy-in before you have to get too deep into things.

My primary concerns are:

  • running multiple wsgi servers on different ports, and how to disambiguate events between them
  • what parts need to be global versus per-server
  • how to configure it from serve() invocation
  • does it have any features when launched from the CLI? I would lean toward not, and if you want metrics you should wrap the serve() api and start up your prometheus endpoint next to it.

Basically all of those points are related to the API and what's global vs per-server and limiting the global part. :-)

Maybe it's as simple as defining an interface and then passing an adapter into serve() such as serve(listen='*', monitor=MyCoolPrometheusMonitor(prom_server)).
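
To make that concrete, a rough sketch of what such an interface and adapter could look like (every name here, including a monitor= argument to serve(), is hypothetical):

from prometheus_client import Gauge, start_http_server

# Hypothetical sketch only: neither this interface nor a monitor=
# argument to serve() exists in waitress today.
class Monitor:
    def connection_opened(self): ...
    def connection_closed(self): ...
    def task_queued(self, queue_depth): ...
    def task_finished(self, duration): ...

class MyCoolPrometheusMonitor(Monitor):
    def __init__(self, port=9090):
        start_http_server(port)  # /metrics on its own port and thread
        self._queue_depth = Gauge("waitress_task_queue_depth", "Tasks queued")

    def task_queued(self, queue_depth):
        self._queue_depth.set(queue_depth)

# Hypothetical usage:
# serve(app, listen="*:8080", monitor=MyCoolPrometheusMonitor())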

I'm not sure what prior art exists here in other wsgi servers, I think aiohttp has some when I looked a while ago.

@Rudd-O

Rudd-O commented Nov 5, 2022

It just occurred to me that I would also like to have the connection count metric (as well as the queue count), simply because that lets me do clean blue/green deployment: I won't take down a waitress-based backend (which has already been removed from the frontend proxy) until the connection count drops to 1.

@romank0

romank0 commented Feb 23, 2023

If the goal is to keep waitress free from external dependencies, then the approach in romank0#1 can be used. That implementation is far from generic; it mostly demonstrates the approach, so we can discuss options for adding the possibility to monitor without coupling to a specific provider. It defines an interface that the waitress user can implement to collect metrics if they need to.

If it is ok to introduce dependencies, I would use the OpenTelemetry API: waitress would depend on the OpenTelemetry metrics API and users could plug in their favourite collectors.
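
A minimal sketch of depending only on the OpenTelemetry metrics API (the instrument name is made up; users would configure an SDK/exporter of their choice):

from opentelemetry import metrics

# Minimal sketch: use only the OpenTelemetry metrics API. Without an SDK
# configured these calls are no-ops; users plug in their own exporter.
meter = metrics.get_meter("waitress")
queue_depth = meter.create_up_down_counter(
    "waitress.task_queue_depth",
    description="Tasks waiting in the queue",
)

queue_depth.add(1)   # a task was queued
queue_depth.add(-1)  # a task was taken off the queue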

@Rudd-O

Rudd-O commented Feb 23, 2023

If it is ok to introduce dependencies, I would use the OpenTelemetry API: waitress would depend on the OpenTelemetry metrics API and users could plug in their favourite collectors.

I would vote for this as a preferable solution.

@db0

db0 commented Jun 23, 2024
