KFServing provides a simple Kubernetes CRD for deploying single or multiple trained models onto model servers such as TFServing, TorchServe, ONNXRuntime, and Triton Inference Server. In addition, KFServer is the Python model server implemented in KFServing itself with the prediction v1 protocol, while MLServer implements the prediction v2 protocol with both REST and gRPC. These model servers provide out-of-the-box model serving, but you can also build your own model server for more complex use cases. KFServing provides basic API primitives that let you easily build a custom model server, or you can use other tools such as BentoML to build your custom model serving image.
After models are deployed onto model servers with KFServing, you get all of the following serverless features:
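As a rough illustration of what the CRD looks like, the sketch below builds an InferenceService manifest as a plain Python dict and prints it as JSON (which `kubectl apply -f` accepts). It assumes the `serving.kubeflow.org/v1beta1` API version. The helper `build_inference_service`, the service name, and the `gs://example-bucket/...` storage URI are hypothetical placeholders, not values from this document.

```python
import json


def build_inference_service(name: str, framework: str, storage_uri: str) -> dict:
    """Return a minimal InferenceService manifest as a plain dict.

    Assumes the v1beta1 schema, where the predictor is keyed by framework
    name (for example "sklearn" or "xgboost").
    """
    return {
        "apiVersion": "serving.kubeflow.org/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                framework: {"storageUri": storage_uri},
            }
        },
    }


# Hypothetical name and bucket; replace with your own model location.
manifest = build_inference_service(
    "sklearn-iris", "sklearn", "gs://example-bucket/models/sklearn/iris"
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl` should create the service, provided KFServing is installed in the cluster.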
- Scale to and from Zero
- Request based Autoscaling on CPU/GPU
- Revision Management
- Optimized Container
- Batching
- Request/Response logging
- Scalable Multi Model Serving
- Traffic management
- Security with AuthN/AuthZ
- Distributed Tracing
- Out-of-the-box metrics
- Ingress/Egress control
Out-of-the-box Predictor | Exported model | Prediction Protocol | HTTP | gRPC | Versions | Examples |
---|---|---|---|---|---|---|
Triton Inference Server | TensorFlow, TorchScript, ONNX, TensorRT | v2 | ✔️ | ✔️ | Compatibility Matrix | Triton Examples |
TFServing | TensorFlow SavedModel | v1 | ✔️ | ✔️ | TFServing Versions | TensorFlow Examples |
TorchServe | Eager Model/TorchScript | v1 | ✔️ | ✔️ | 0.3.0 | TorchServe Examples |
TorchServe Native | Eager Model/TorchScript | native | ✔️ | ✔️ | 0.3.0 | TorchServe Examples |
ONNXRuntime | Exported ONNX Model | v1 | ✔️ | ✔️ | Compatibility | ONNX Style Model |
SKLearn MLServer | Pickled Model | v2 | ✔️ | ✔️ | 0.23.1 | SKLearn Iris V2 |
XGBoost MLServer | Saved Model | v2 | ✔️ | ✔️ | 1.1.1 | XGBoost Iris V2 |
SKLearn KFServer | Pickled Model | v1 | ✔️ | -- | 0.20.3 | SKLearn Iris |
XGBoost KFServer | Saved Model | v1 | ✔️ | -- | 0.82 | XGBoost Iris |
PyTorch KFServer | Eager Model | v1 | ✔️ | -- | 1.3.1 | PyTorch Cifar10 |
PMML KFServer | PMML | v1 | ✔️ | -- | PMML4.4.1 | SKLearn PMML |
LightGBM KFServer | Saved LightGBM Model | v1 | ✔️ | -- | 2.3.1 | LightGBM Iris |
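The "Prediction Protocol" column determines the request shape a client has to send. The sketch below contrasts the two REST payloads: v1 posts an `instances` list to `/v1/models/<name>:predict`, while v2 posts named, typed tensors to `/v2/models/<name>/infer`. The helper names, the model name, and the tensor name `input-0` are hypothetical. The code only builds the path and body and sends nothing.

```python
from typing import Any, Dict, List, Tuple


def v1_predict_request(model_name: str, instances: List[Any]) -> Tuple[str, Dict]:
    """Build the path and JSON body for a v1 protocol predict call."""
    return f"/v1/models/{model_name}:predict", {"instances": instances}


def v2_infer_request(
    model_name: str,
    input_name: str,
    rows: List[List[float]],
    datatype: str = "FP32",
) -> Tuple[str, Dict]:
    """Build the path and JSON body for a v2 protocol infer call.

    The tensor data is flattened in row-major order and described by `shape`.
    """
    shape = [len(rows), len(rows[0]) if rows else 0]
    flat = [value for row in rows for value in row]
    body = {
        "inputs": [
            {"name": input_name, "shape": shape, "datatype": datatype, "data": flat}
        ]
    }
    return f"/v2/models/{model_name}/infer", body


rows = [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]
print(v1_predict_request("sklearn-iris", rows))
print(v2_infer_request("sklearn-iris", "input-0", rows))
```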
Custom Predictor | Examples |
---|---|
Deploy model on custom KFServer | Custom KFServer |
Deploy model on BentoML | SKLearn Iris with BentoML |
Deploy model on custom HTTP Server | Prebuilt Model Server |
Deploy model on custom gRPC Server | Prebuilt gRPC Server |
In addition to deploying an InferenceService with an HTTP/gRPC endpoint, you can also deploy an InferenceService with Knative Event Sources such as Kafka. You can find an example here that shows how to build an async inference pipeline.
The KFServing transformer lets users define pre- and post-processing steps around the prediction and explanation workflow. It runs as a separate microservice and can work with any type of pre-packaged model server. It can also scale differently from the predictor, for example when the transformer is CPU bound while the predictor needs to run on GPU.
Features | Examples |
---|---|
Deploy Transformer with KFServer | Image Transformer with PyTorch KFServer |
Deploy Transformer with Triton Server | BERT Model with tokenizer |
Deploy Transformer with TorchServe | Image classifier |
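The pre/post-processing flow of a transformer can be sketched without any KFServing dependency. The standalone class below is hypothetical and is not the KFServing API. It only mirrors the idea of a `preprocess` hook that runs before the predictor and a `postprocess` hook that runs after it. The in-process `fake_predictor` stands in for what would be a network call to the predictor service.

```python
from typing import Callable, Dict, List

Predictor = Callable[[Dict], Dict]


class ScalingTransformer:
    """Hypothetical transformer: scales pixel values, then maps class ids to labels."""

    def __init__(self, predictor: Predictor, labels: List[str]):
        self.predictor = predictor
        self.labels = labels

    def preprocess(self, request: Dict) -> Dict:
        # Scale raw 0..255 pixel values into the 0..1 range the model expects.
        scaled = [[p / 255.0 for p in inst] for inst in request["instances"]]
        return {"instances": scaled}

    def postprocess(self, response: Dict) -> Dict:
        # Replace numeric class ids with human-readable labels.
        return {"predictions": [self.labels[i] for i in response["predictions"]]}

    def __call__(self, request: Dict) -> Dict:
        return self.postprocess(self.predictor(self.preprocess(request)))


def fake_predictor(request: Dict) -> Dict:
    """Stand-in model: class 1 if the mean input is above 0.5, else class 0."""
    preds = [1 if sum(x) / len(x) > 0.5 else 0 for x in request["instances"]]
    return {"predictions": preds}


transformer = ScalingTransformer(fake_predictor, labels=["dark", "bright"])
print(transformer({"instances": [[255, 255], [0, 0]]}))
```

Because the transformer and predictor only exchange request and response dicts, either side can be replaced or scaled independently, which is the property the separate-microservice design relies on.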
Model explainability answers the question "Why did my model make this prediction?" for a given instance. KFServing integrates with Alibi Explainer, which implements a black-box algorithm: it generates many similar-looking instances for a given instance and sends them to the model server to produce an explanation.
KFServing also integrates with the AI Explainability 360 (AIX360) toolkit, an LF AI Foundation incubation project. AIX360 is an open-source library that supports the interpretability and explainability of datasets and machine learning models. Its Python package includes a comprehensive set of algorithms that cover different dimensions of explanations, along with proxy explainability metrics. In addition to native algorithms, AIX360 also provides algorithms from LIME and SHAP.
Features | Examples |
---|---|
Deploy Alibi Image Explainer | Imagenet Explainer |
Deploy Alibi Income Explainer | Income Explainer |
Deploy Alibi Text Explainer | Alibi Text Explainer |
Deploy AIX360 Image Explainer | AIX360 Image Explainer |
Multi Model Serving allows deploying TrainedModels at scale without being bounded by
Kubernetes compute resources (CPU/GPU/Memory) or service/pod limits, while also reducing the TCO.
See Multi Model Serving for more details.
Multi Model Serving is supported for Triton, SKLearn/XGBoost, and custom KFServer.
Features | Examples |
---|---|
Deploy multiple models with Triton Inference Server | Multi Model Triton InferenceService |
Deploy multiple models with SKLearn/XGBoost KFServer | Multi Model SKLearn InferenceService |
To trust and reliably act on model predictions, it is crucial to monitor the distribution of incoming requests with different types of detectors. KFServing integrates Alibi Detect with the following components:
- The drift detector checks when the distribution of incoming requests diverges from a reference distribution, such as that of the training data.
- The outlier detector flags single instances that do not follow the training distribution.
Features | Examples |
---|---|
Deploy Alibi Outlier Detection | Cifar outlier detector |
Deploy Alibi Drift Detection | Cifar drift detector |
Feature | Examples |
---|---|
Deploy Model on S3 | Mnist model on S3 |
Deploy Model on PVC | Models on PVC |
Deploy Model on Azure | Models on Azure |
Deploy Model with HTTP/HTTPS | Models with HTTP/HTTPS URL |
KFServing's main serverless capability is letting you run inference workloads without manually scaling your service once it is deployed. KFServing leverages Knative's autoscaler, which also works on GPU because it scales on request volume rather than on GPU/CPU metrics, which can be hard to reason about.
Autoscale inference workload on CPU/GPU
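To make the request-based scaling idea concrete, here is a deliberately simplified sketch of the arithmetic: the number of replicas follows from in-flight requests divided by a per-replica concurrency target. It ignores the averaging windows and burst handling of the real Knative autoscaler. The function names and the `max_replicas` cap are hypothetical. The `autoscaling.knative.dev/target` annotation is the Knative setting for the per-replica target.

```python
import math
from typing import Dict


def desired_replicas(
    in_flight_requests: float, target_concurrency: int, max_replicas: int = 10
) -> int:
    """Simplified request-based scaling: replicas = ceil(in-flight / target).

    Returns 0 when there is no traffic, which models scale to zero.
    """
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    if in_flight_requests <= 0:
        return 0
    return min(max_replicas, math.ceil(in_flight_requests / target_concurrency))


def autoscaling_annotations(target_concurrency: int) -> Dict[str, str]:
    """Metadata annotations that set the per-replica concurrency target."""
    return {"autoscaling.knative.dev/target": str(target_concurrency)}


for load in (0, 4, 12, 1000):
    print(load, "in-flight ->", desired_replicas(load, target_concurrency=5), "replicas")
print(autoscaling_annotations(5))
```

Because the input is request volume, the same calculation applies whether the replicas run on CPU or GPU.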
Canary deployment enables safe rollouts by splitting traffic between different versions of a model, so a new version can be validated on a fraction of requests before it takes all traffic.
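A canary rollout can be expressed as a small change to an existing manifest. The sketch below assumes the v1beta1 schema, in which `canaryTrafficPercent` on the predictor sends that share of traffic to the newly applied revision. Other API versions structure this differently. The helper `with_canary` and the storage URIs are hypothetical.

```python
import copy
from typing import Dict


def with_canary(manifest: Dict, framework: str, new_storage_uri: str, percent: int) -> Dict:
    """Return a copy of `manifest` that points at a new model version and
    routes `percent` of traffic to it (v1beta1 layout assumed)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    updated = copy.deepcopy(manifest)
    predictor = updated["spec"]["predictor"]
    predictor[framework]["storageUri"] = new_storage_uri
    predictor["canaryTrafficPercent"] = percent
    return updated


base = {
    "apiVersion": "serving.kubeflow.org/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris"},
    "spec": {"predictor": {"sklearn": {"storageUri": "gs://example-bucket/iris/v1"}}},
}
canary = with_canary(base, "sklearn", "gs://example-bucket/iris/v2", percent=10)
print(canary["spec"]["predictor"])
```

Raising the percentage in steps and re-applying the manifest gradually shifts traffic to the new version.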
InferenceService with Kubeflow Pipeline
Batching individual inference requests can be important because most ML/DL frameworks are optimized for batch requests. When a service receives a heavy load of requests, it's advantageous to batch them. This makes fuller use of the CPU/GPU compute resources, but you need to run enough tests to find the optimal batch size and analyze the traffic patterns before enabling batch inference. KFServing injects a batcher sidecar, so it works with any model server deployed on KFServing. You can read more in this example.
KFServing supports logging your inference requests and responses by injecting a sidecar alongside your model server.
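The core idea behind a batcher can be sketched in a few lines: merge the instances of several pending requests, call the model once per batch, and hand each caller back only its own predictions. This is an illustrative sketch, not the KFServing batcher sidecar. It ignores the latency deadline a real batcher would also enforce, and `predict_batched` and `sum_model` are hypothetical names.

```python
from typing import Any, Callable, Dict, List


def predict_batched(
    requests: List[Dict[str, List[Any]]],
    predict_fn: Callable[[List[Any]], List[Any]],
    max_batch_size: int,
) -> List[Dict[str, List[Any]]]:
    """Run pending requests through `predict_fn` in batches of at most
    `max_batch_size` instances, then split the results back per request."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")

    # Flatten all instances and remember which request each one came from.
    flat: List[Any] = []
    owners: List[int] = []
    for idx, req in enumerate(requests):
        for inst in req["instances"]:
            flat.append(inst)
            owners.append(idx)

    # One model call per batch instead of one per request.
    predictions: List[Any] = []
    for start in range(0, len(flat), max_batch_size):
        predictions.extend(predict_fn(flat[start:start + max_batch_size]))

    # Route each prediction back to the request that owns it.
    responses: List[Dict[str, List[Any]]] = [{"predictions": []} for _ in requests]
    for owner, pred in zip(owners, predictions):
        responses[owner]["predictions"].append(pred)
    return responses


def sum_model(batch: List[List[int]]) -> List[int]:
    """Stand-in model that returns the sum of each instance."""
    return [sum(x) for x in batch]


pending = [{"instances": [[1, 2]]}, {"instances": [[3, 4], [5, 6]]}, {"instances": [[7, 8]]}]
print(predict_batched(pending, sum_model, max_batch_size=3))
```

With a batch size of 3, the four instances above need two model calls instead of three, and the saving grows with load. The best batch size still has to be measured against real traffic.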
Feature | Examples |
---|---|
Deploy Logger with a Logger Service | Message Dumper Service |
Deploy Async Logger | Message Dumper Using Knative Eventing |
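As a hedged sketch of how the logger is switched on, the snippet below adds a `logger` block to a predictor spec. It assumes the v1beta1 schema, in which the logger takes a sink `url` and a `mode` of `all`, `request`, or `response`. The helper `with_logger` and the sink address `http://message-dumper.default/` are hypothetical.

```python
import copy
from typing import Dict

ALLOWED_MODES = ("all", "request", "response")


def with_logger(manifest: Dict, sink_url: str, mode: str = "all") -> Dict:
    """Return a copy of `manifest` whose predictor forwards payloads to `sink_url`."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}")
    updated = copy.deepcopy(manifest)
    updated["spec"]["predictor"]["logger"] = {"mode": mode, "url": sink_url}
    return updated


service = {
    "apiVersion": "serving.kubeflow.org/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris"},
    "spec": {"predictor": {"sklearn": {"storageUri": "gs://example-bucket/iris/v1"}}},
}
logged = with_logger(service, "http://message-dumper.default/")
print(logged["spec"]["predictor"]["logger"])
```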