Showing 4 changed files with 75 additions and 6 deletions.

@@ -0,0 +1,59 @@
Frequently Asked Questions (FAQ)
================================


Is my "model_fn" called at each invocation?
-------------------------------------------

No.

The :func:`model_fn` function is called during the very first invocation only.
Once the model has been loaded, it is retained in memory for as long as the service runs.

To speed up the very first invocation, it is possible to trigger the :func:`model_fn` hook in advance.
To do this, simply call :func:`inference_server.warmup`.

For example, when using Gunicorn, this could be done from a post-fork Gunicorn hook::

    import inference_server

    def post_fork(server, worker):
        worker.log.info("Warming up worker...")
        inference_server.warmup()


Does **inference-server** support async/ASGI webservers?
---------------------------------------------------------

No.

**inference-server** is a WSGI application to be used by synchronous webservers.

For most ML models, that is the correct choice because model inference is typically CPU-bound.
A multi-process WSGI server therefore works well, with the number of workers equal to the number of CPU cores available.

For more details see :ref:`deployment:Configuring Gunicorn workers`.
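
As a minimal sketch, a ``gunicorn.conf.py`` along those lines might derive the worker count from the core count (one worker per core is a starting point, not a hard rule)::

    import multiprocessing

    # One synchronous worker per CPU core for CPU-bound model inference
    workers = multiprocessing.cpu_count()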


My model is leaking memory, how do I address that?
--------------------------------------------------

If the memory leak is outside your control, one approach would be to periodically restart the webserver workers.

For example, when using Gunicorn, it is possible to specify a maximum number of HTTP requests (`max_requests`) after which a given worker should be restarted.
Gunicorn additionally allows a random offset (`max_requests_jitter`) to be added such that worker restarts are staggered.

For more details see `Gunicorn settings documentation <https://docs.gunicorn.org/en/stable/settings.html#max-requests>`_.
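
A brief sketch of those two settings in ``gunicorn.conf.py`` (the numbers are illustrative placeholders to be tuned for your workload)::

    # Recycle each worker after roughly 1000 requests to bound the leak
    max_requests = 1000
    # Stagger recycling by up to 100 extra requests so workers
    # do not all restart at the same time
    max_requests_jitter = 100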


How do I invoke my model using a data stream from my favourite message queue system?
------------------------------------------------------------------------------------

By design, **inference-server** is an HTTP web server and uses a simple request-response model.

This is so it can be deployed in most environments, including not only AWS SageMaker but also a local Dockerized service.
The web server can likewise be accessed from a range of environments, including AWS itself as well as other providers in a multi-cloud setup.

Depending on the messaging/queueing system and cloud environment, you have various options to integrate a model deployed with **inference-server** with a message stream.

For example, in AWS, you could deploy a Lambda function which consumes messages from AWS SQS and sends each message as an HTTP request to AWS SageMaker.
Equally, the Lambda function could write the SageMaker response to another SQS queue.
Of course, instead of a Lambda function you could use any other compute platform to deploy similar logic, including an EKS pod or an ECS task.
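
As an illustrative sketch only (the endpoint name, queue URL and JSON payload convention are hypothetical assumptions), such a Lambda handler might look like::

    import boto3

    sagemaker = boto3.client("sagemaker-runtime")
    sqs = boto3.client("sqs")

    # Hypothetical names for illustration only
    ENDPOINT_NAME = "my-inference-server-endpoint"
    RESULT_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/results"

    def handler(event, context):
        # Lambda delivers a batch of SQS messages in event["Records"]
        for record in event["Records"]:
            response = sagemaker.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                ContentType="application/json",
                Body=record["body"],
            )
            # Forward the model's response to a results queue
            sqs.send_message(
                QueueUrl=RESULT_QUEUE_URL,
                MessageBody=response["Body"].read().decode("utf-8"),
            )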

@@ -1,8 +1,8 @@
-API Documentation
-=================
+API reference documentation
+===========================
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
 
    inference_server
    inference_server_testing