FAQ docs (#20)
faph authored Apr 12, 2024
2 parents 977571f + 0c6cfe9 commit cd0e2ec
Showing 4 changed files with 75 additions and 6 deletions.
59 changes: 59 additions & 0 deletions docs/faq.rst
@@ -0,0 +1,59 @@
Frequently Asked Questions (FAQ)
================================


Is my "model_fn" called at each invocation?
-------------------------------------------

No.

The :func:`model_fn` function is called during the very first invocation only.
Once the model has been loaded, it is retained in memory for as long as the service runs.

To speed up the very first invocation, it is possible to trigger the ``model_fn`` hook in advance.
To do this, simply call :func:`inference_server.warmup`.

For example, when using Gunicorn, this could be done from a post-fork Gunicorn hook::

    def post_fork(server, worker):
        worker.log.info("Warming up worker...")
        inference_server.warmup()


Does **inference-server** support async/ASGI webservers?
--------------------------------------------------------

No.

**inference-server** is a WSGI application to be used by synchronous webservers.

For most ML models this is the right choice, as model inference is typically CPU-bound.
A multi-process WSGI server with the number of workers equal to the number of available CPU cores is therefore a good fit.
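
For illustration, a minimal ``gunicorn.conf.py`` along these lines could pin one worker per core (the application factory reference and settings shown here are assumptions, not the project's prescribed configuration)::

    import multiprocessing

    # Assumed sketch: one synchronous worker per available CPU core
    workers = multiprocessing.cpu_count()

    # Load the WSGI application via its factory function
    wsgi_app = "inference_server:create_app()"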

For more details see :ref:`deployment:Configuring Gunicorn workers`.


My model is leaking memory, how do I address that?
--------------------------------------------------

If the memory leak is outside your control, one approach would be to periodically restart the webserver workers.

For example, when using Gunicorn, it is possible to specify a maximum number of HTTP requests (``max_requests``) after which a given worker should be restarted.
Gunicorn additionally allows a random offset (``max_requests_jitter``) to be added such that worker restarts are staggered.
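
As a sketch only (the exact numbers below are assumptions to be tuned for your workload), a ``gunicorn.conf.py`` could contain::

    # Recycle each worker after roughly 1000 requests...
    max_requests = 1000

    # ...with up to 100 requests of random jitter so that not all workers restart at once
    max_requests_jitter = 100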

For more details see `Gunicorn settings documentation <https://docs.gunicorn.org/en/stable/settings.html#max-requests>`_.


How do I invoke my model using a data stream from my favourite message queue system?
------------------------------------------------------------------------------------

By design, **inference-server** is an HTTP web server and uses a simple request-response model.

This allows it to be deployed in most environments: not only on AWS SageMaker but also as a local Dockerized service.
The web server can likewise be accessed from a range of environments, whether from within AWS itself or from other providers in a multi-cloud setup.

Depending on the messaging/queueing system and cloud environment, you have various options to integrate a model deployed with **inference-server** with a message stream.

For example, in AWS, you could deploy a Lambda function which consumes messages from AWS SQS and sends each message as an HTTP request to the AWS SageMaker endpoint.
Equally, the Lambda function could write the SageMaker response to another SQS queue.
Of course, instead of a Lambda function you could use any other compute platform to deploy similar logic, including an EKS pod or an ECS task.
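
As an illustrative sketch only (the endpoint name, content type and queue wiring are assumptions), such a Lambda function could look like::

    import boto3

    sagemaker_runtime = boto3.client("sagemaker-runtime")

    def handler(event, context):
        """Forward each SQS message body to a (hypothetical) SageMaker endpoint."""
        for record in event["Records"]:
            response = sagemaker_runtime.invoke_endpoint(
                EndpointName="my-model-endpoint",  # hypothetical endpoint name
                ContentType="application/json",
                Body=record["body"],
            )
            # For example, log the prediction or forward it to another queue
            print(response["Body"].read())
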
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -4,14 +4,15 @@ inference-server
Deploy your AI/ML model to Amazon SageMaker for Real-Time Inference and Batch Transform using your own Docker container image.

.. toctree::
   :maxdepth: 2
   :maxdepth: 1
   :caption: Contents

   introduction
   hooks
   batch_transform
   deployment
   testing
   faq
   modules


6 changes: 3 additions & 3 deletions docs/modules.rst
@@ -1,8 +1,8 @@
API Documentation
=================
API reference documentation
===========================

.. toctree::
   :maxdepth: 2
   :maxdepth: 1

   inference_server
   inference_server_testing
13 changes: 11 additions & 2 deletions src/inference_server/__init__.py
@@ -42,6 +42,7 @@
"MIMEAccept", # Exporting for plugin developers' convenience
"create_app",
"plugin_hook",
"warmup",
)

#: Library version, e.g. 1.0.0, taken from Git tags
@@ -70,12 +71,20 @@ class BatchStrategy(enum.Enum):


def create_app() -> "WSGIApplication":
    """Initialize and return the WSGI application"""
    """
    Initialize and return the WSGI application

    This is the WSGI application factory function that needs to be passed to a WSGI-compatible web server.
    """
    return _app


def warmup() -> None:
    """Initialize any additional resources upfront"""
    """
    Initialize any additional resources upfront

    This will call the ``model_fn`` plugin hook.
    """
    _model()


