Showing 4 changed files with 75 additions and 6 deletions.

@@ -0,0 +1,59 @@
Frequently Asked Questions (FAQ)
================================


Is my "model_fn" called at each invocation?
-------------------------------------------

No.

The :func:`model_fn` function is called during the very first invocation only.
Once the model has been loaded, it is retained in memory for as long as the service runs.

To speed up the very first invocation, it is possible to trigger the :func:`model_fn` hook in advance.
To do this, simply call :func:`inference_server.warmup`.

For example, when using Gunicorn, this could be done from a post-fork Gunicorn hook::

    import inference_server

    def post_fork(server, worker):
        worker.log.info("Warming up worker...")
        inference_server.warmup()


Does **inference-server** support async/ASGI webservers?
---------------------------------------------------------

No.

**inference-server** is a WSGI application to be used by synchronous webservers.

For most ML models, that is the correct choice because model inference is typically CPU-bound.
A multi-process WSGI server therefore works well, with the number of workers equal to the number of CPU cores available.

For more details see :ref:`deployment:Configuring Gunicorn workers`.
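
As a minimal sketch, a ``gunicorn.conf.py`` along those lines might derive the worker count from the core count (one worker per core is a starting point, not a hard rule)::

    import multiprocessing

    # One synchronous worker per CPU core for CPU-bound model inference
    workers = multiprocessing.cpu_count()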


My model is leaking memory, how do I address that?
--------------------------------------------------

If the memory leak is outside your control, one approach would be to periodically restart the webserver workers.

For example, when using Gunicorn, it is possible to specify a maximum number of HTTP requests (`max_requests`) after which a given worker should be restarted.
Gunicorn additionally allows a random offset (`max_requests_jitter`) to be added such that worker restarts are staggered.

For more details see `Gunicorn settings documentation <https://docs.gunicorn.org/en/stable/settings.html#max-requests>`_.
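
A brief sketch of those two settings in ``gunicorn.conf.py`` (the numbers are illustrative placeholders to be tuned for your workload)::

    # Recycle each worker after roughly 1000 requests to bound the leak
    max_requests = 1000
    # Stagger recycling by up to 100 extra requests so workers
    # do not all restart at the same time
    max_requests_jitter = 100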


How do I invoke my model using a data stream from my favourite message queue system?
------------------------------------------------------------------------------------

By design, **inference-server** is an HTTP web server and uses a simple request-response model.

This is so it can be deployed in most environments, including not only AWS SageMaker but also a local Dockerized service.
The web server can likewise be accessed from a range of environments, including AWS itself as well as other providers in a multi-cloud setup.

Depending on the messaging/queueing system and cloud environment, you have various options to integrate a model deployed with **inference-server** with a message stream.

For example, in AWS, you could deploy a Lambda function which consumes messages from AWS SQS and sends each message as an HTTP request to AWS SageMaker.
Equally, the Lambda function could write the SageMaker response to another SQS queue.
Of course, instead of a Lambda function you could use any other compute platform to deploy similar logic, including an EKS pod or an ECS task.
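
As an illustrative sketch only (the endpoint name, queue URL and JSON payload convention are hypothetical assumptions), such a Lambda handler might look like::

    import boto3

    sagemaker = boto3.client("sagemaker-runtime")
    sqs = boto3.client("sqs")

    # Hypothetical names for illustration only
    ENDPOINT_NAME = "my-inference-server-endpoint"
    RESULT_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/results"

    def handler(event, context):
        # Lambda delivers a batch of SQS messages in event["Records"]
        for record in event["Records"]:
            response = sagemaker.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                ContentType="application/json",
                Body=record["body"],
            )
            # Forward the model's response to a results queue
            sqs.send_message(
                QueueUrl=RESULT_QUEUE_URL,
                MessageBody=response["Body"].read().decode("utf-8"),
            )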

@@ -1,8 +1,8 @@
-API Documentation
-=================
+API reference documentation
+===========================
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
 
    inference_server
    inference_server_testing