KFServing Features and Examples

Deploy InferenceService with Predictor

KFServing provides a simple Kubernetes CRD for deploying one or more trained models onto model servers such as TFServing, TorchServe, ONNXRuntime, and Triton Inference Server. In addition, KFServer is the Python model server implemented in KFServing itself and speaks the prediction v1 protocol, while MLServer implements the prediction v2 protocol over both REST and gRPC. These model servers provide out-of-the-box model serving, but you can also build your own model server for more complex use cases. KFServing provides basic API primitives that make it easy to build a custom model server, and you can also use other tools such as BentoML to build your custom model server image.
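As a rough illustration of what the CRD looks like, the sketch below builds an InferenceService manifest as a plain Python dict. The model name and storage URI are hypothetical placeholders, and the field layout assumes the v1beta1 API. Treat it as a minimal sketch and check it against the CRD installed in your cluster.

```python
import json


def make_inference_service(name, framework, storage_uri, namespace="default"):
    """Build a minimal InferenceService manifest as a plain dict.

    `framework` is the predictor key (for example "sklearn" or "tensorflow").
    """
    return {
        "apiVersion": "serving.kubeflow.org/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                # One out-of-the-box predictor pointing at the exported model.
                framework: {"storageUri": storage_uri},
            }
        },
    }


# Hypothetical name and bucket, used only for illustration.
manifest = make_inference_service(
    "sklearn-iris", "sklearn", "gs://example-bucket/models/sklearn/iris"
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.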

After models are deployed onto model servers with KFServing, you get all of the following serverless features:

Scale to and from Zero
Request based Autoscaling on CPU/GPU
Revision Management
Optimized Container
Batching
Request/Response logging
Scalable Multi Model Serving
Traffic management
Security with AuthN/AuthZ
Distributed Tracing
Out-of-the-box metrics
Ingress/Egress control

Out-of-the-box Predictor	Exported model	Prediction Protocol	HTTP	gRPC	Versions	Examples
Triton Inference Server	TensorFlow,TorchScript,ONNX,TensorRT	v2	✔️	✔️	Compatibility Matrix	Triton Examples
TFServing	TensorFlow SavedModel	v1	✔️	✔️	TFServing Versions	TensorFlow Examples
TorchServe	Eager Model/TorchScript	v1	✔️	✔️	0.3.0	TorchServe Examples
TorchServe Native	Eager Model/TorchScript	native	✔️	✔️	0.3.0	TorchServe Examples
ONNXRuntime	Exported ONNX Model	v1	✔️	✔️	Compatibility	ONNX Style Model
SKLearn MLServer	Pickled Model	v2	✔️	✔️	0.23.1	SKLearn Iris V2
XGBoost MLServer	Saved Model	v2	✔️	✔️	1.1.1	XGBoost Iris V2
SKLearn KFServer	Pickled Model	v1	✔️	--	0.20.3	SKLearn Iris
XGBoost KFServer	Saved Model	v1	✔️	--	0.82	XGBoost Iris
PyTorch KFServer	Eager Model	v1	✔️	--	1.3.1	PyTorch Cifar10
PMML KFServer	PMML	v1	✔️	--	PMML4.4.1	SKLearn PMML
LightGBM KFServer	Saved LightGBM Model	v1	✔️	--	2.3.1	LightGBM Iris
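The "Prediction Protocol" column above distinguishes v1 from v2. The sketch below shows how the REST request path and body typically differ between the two. The input tensor name `input-0` and the model name are hypothetical, and the shapes assume a simple 2-D numeric input.

```python
def v1_request(model_name, instances):
    """Prediction v1 protocol: instances are posted to the :predict verb."""
    path = f"/v1/models/{model_name}:predict"
    body = {"instances": instances}
    return path, body


def v2_request(model_name, rows, input_name="input-0", datatype="FP32"):
    """Prediction v2 protocol: named tensors with explicit shape and datatype."""
    path = f"/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,  # hypothetical tensor name
                "shape": [len(rows), len(rows[0])],
                "datatype": datatype,
                # Tensor contents flattened in row-major order.
                "data": [value for row in rows for value in row],
            }
        ]
    }
    return path, body


rows = [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]
v1_path, v1_body = v1_request("sklearn-iris", rows)
v2_path, v2_body = v2_request("sklearn-iris", rows)
print(v1_path, v1_body)
print(v2_path, v2_body)
```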

Custom Predictor	Examples
Deploy model on custom KFServer	Custom KFServer
Deploy model on BentoML	SKLearn Iris with BentoML
Deploy model on custom HTTP Server	Prebuilt Model Server
Deploy model on custom gRPC Server	Prebuilt gRPC Server

In addition to deploying an InferenceService with an HTTP/gRPC endpoint, you can also deploy an InferenceService with Knative Event Sources such as Kafka. You can find an example here that shows how to build an async inference pipeline.

Deploy InferenceService with Transformer

The KFServing transformer lets users define pre- and post-processing steps around the prediction and explanation workflow. It runs as a separate microservice and works with any type of pre-packaged model server. It can also scale independently of the predictor, for example when the transformer is CPU bound while the predictor requires a GPU.
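The pre/post-processing flow can be sketched in plain Python. This does not use the KFServing SDK: `stub_predictor` stands in for the network call to the predictor, and the labels and scaling rule are hypothetical, chosen only to make the data flow visible.

```python
LABELS = ["dark", "bright"]  # hypothetical class labels


def preprocess(request):
    """Scale raw pixel values from 0..255 to 0..1 before prediction."""
    return {
        "instances": [[p / 255.0 for p in inst] for inst in request["instances"]]
    }


def stub_predictor(request):
    """Stand-in for the predictor call; returns two scores per instance."""
    predictions = []
    for inst in request["instances"]:
        mean = sum(inst) / len(inst)
        predictions.append([1.0 - mean, mean])
    return {"predictions": predictions}


def postprocess(response):
    """Map each score vector to the label with the highest score."""
    return {
        "predictions": [
            LABELS[scores.index(max(scores))] for scores in response["predictions"]
        ]
    }


def handle(request):
    # Transformer flow: preprocess -> predictor -> postprocess.
    return postprocess(stub_predictor(preprocess(request)))


result = handle({"instances": [[0, 10, 20], [250, 240, 255]]})
print(result)
```

In a real deployment the predictor call crosses a service boundary, which is what allows the transformer and predictor to be scaled separately.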

Features	Examples
Deploy Transformer with KFServer	Image Transformer with PyTorch KFServer
Deploy Transformer with Triton Server	BERT Model with tokenizer
Deploy Transformer with TorchServe	Image classifier

Deploy InferenceService with Explainer

Model explainability answers the question "Why did my model make this prediction?" for a given instance. KFServing integrates with Alibi Explainer, which implements a black-box algorithm: it generates many similar-looking instances for a given instance and sends them to the model server to produce an explanation.
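To make the black-box idea concrete, the sketch below perturbs one feature at a time, queries a model, and records how often the prediction changes. This is a simplified perturbation-based sensitivity check, not the algorithm Alibi implements. The toy model, feature names, and value ranges are all hypothetical.

```python
def toy_model(instance):
    """Hypothetical black-box model: approves when income exceeds 50."""
    return "approved" if instance["income"] > 50 else "rejected"


def flip_rates(instance, feature_ranges, predict, steps=21):
    """For each feature, sweep it over an even grid and measure how often
    the prediction differs from the prediction on the original instance."""
    baseline = predict(instance)
    rates = {}
    for feature, (low, high) in feature_ranges.items():
        flips = 0
        for i in range(steps):
            perturbed = dict(instance)
            perturbed[feature] = low + (high - low) * i / (steps - 1)
            if predict(perturbed) != baseline:
                flips += 1
        rates[feature] = flips / steps
    return rates


instance = {"income": 80, "age": 30}
ranges = {"income": (0, 100), "age": (18, 90)}
rates = flip_rates(instance, ranges, toy_model)
print(rates)
```

A feature with a high flip rate matters more to this particular prediction. Here only `income` can change the outcome, because the toy model ignores `age`.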

Additionally, KFServing integrates with the AI Explainability 360 (AIX360) toolkit, an LF AI Foundation incubation project. AIX360 is an open-source library that supports the interpretability and explainability of datasets and machine learning models. The AI Explainability 360 Python package includes a comprehensive set of algorithms that cover different dimensions of explanations, along with proxy explainability metrics. In addition to native algorithms, AIX360 also provides algorithms from LIME and SHAP.

Features	Examples
Deploy Alibi Image Explainer	Imagenet Explainer
Deploy Alibi Income Explainer	Income Explainer
Deploy Alibi Text Explainer	Alibi Text Explainer
Deploy AIX360 Image Explainer	AIX360 Image Explainer

Deploy InferenceService with Multiple Models(Alpha)

Multi Model Serving allows deploying TrainedModels at scale without being bound by Kubernetes compute resources (CPU/GPU/memory) or service/pod limits, which reduces the TCO. See Multi Model Serving for more details. Multi Model Serving is supported for Triton, SKLearn/XGBoost, and custom KFServer.
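As a hedged sketch, a TrainedModel resource that attaches a model to an existing InferenceService might be built as below. The field layout reflects the alpha `v1alpha1` API as best understood and should be verified against the CRD in your cluster. All names, the storage URI, and the memory value are hypothetical.

```python
def make_trained_model(name, inference_service, framework, storage_uri, memory):
    """Build a TrainedModel manifest that loads one model into a shared
    InferenceService (assumed v1alpha1 field layout)."""
    return {
        "apiVersion": "serving.kubeflow.org/v1alpha1",
        "kind": "TrainedModel",
        "metadata": {"name": name},
        "spec": {
            # The parent InferenceService whose model server hosts this model.
            "inferenceService": inference_service,
            "model": {
                "framework": framework,
                "storageUri": storage_uri,
                "memory": memory,
            },
        },
    }


# Two hypothetical models sharing one hypothetical InferenceService.
models = [
    make_trained_model(
        f"model-{i}", "sklearn-mms", "sklearn",
        f"gs://example-bucket/models/model-{i}", "256Mi",
    )
    for i in range(2)
]
print([m["metadata"]["name"] for m in models])
```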

Features	Examples
Deploy multiple models with Triton Inference Server	Multi Model Triton InferenceService
Deploy multiple models with SKLearn/XGBoost KFServer	Multi Model SKLearn InferenceService

Deploy InferenceService with Outlier/Drift Detector

To trust and reliably act on model predictions, it is crucial to monitor the distribution of incoming requests with different types of detectors. KFServing integrates Alibi Detect with the following components:

Drift detector checks whether the distribution of incoming requests is diverging from a reference distribution, such as that of the training data.
Outlier detector flags single instances that do not follow the training distribution.
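The difference between the two detectors can be illustrated with a simple z-score sketch on a single numeric feature. This is not how Alibi Detect works internally; the reference data and thresholds are hypothetical and only show that an outlier check looks at one instance while a drift check looks at a batch.

```python
import statistics

# Hypothetical reference (training) values for one feature.
reference = [4.9, 5.0, 5.1, 5.2, 4.8, 5.0, 5.1, 4.9]
ref_mean = statistics.mean(reference)
ref_std = statistics.stdev(reference)


def is_outlier(value, threshold=3.0):
    """Flag a single instance that lies far from the reference distribution."""
    return abs(value - ref_mean) / ref_std > threshold


def has_drifted(batch, threshold=3.0):
    """Flag a batch whose mean has shifted away from the reference mean,
    measured in units of the standard error of the batch mean."""
    standard_error = ref_std / len(batch) ** 0.5
    return abs(statistics.mean(batch) - ref_mean) / standard_error > threshold


print(is_outlier(5.1), is_outlier(7.0))
print(has_drifted([5.0, 4.9, 5.1, 5.0]), has_drifted([5.4, 5.5, 5.3, 5.6]))
```

Note that every value in the drifted batch could individually pass a looser outlier check, which is why the two detectors are deployed separately.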

Features	Examples
Deploy Alibi Outlier Detection	Cifar outlier detector
Deploy Alibi Drift Detection	Cifar drift detector

Deploy InferenceService with Cloud/PVC storage

Feature	Examples
Deploy Model on S3	Mnist model on S3
Deploy Model on PVC	Models on PVC
Deploy Model on Azure	Models on Azure
Deploy Model with HTTP/HTTPS	Models with HTTP/HTTPS URL

Autoscaling

KFServing's main serverless capability is letting you run inference workloads without manually scaling your service once it is deployed. KFServing leverages Knative's autoscaler, which also works on GPU because it scales on request volume rather than on GPU/CPU metrics, which can be hard to reason about.
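Request-based scaling can be approximated by dividing the number of in-flight requests by a per-replica concurrency target. The sketch below is a deliberate simplification: the real Knative autoscaler averages metrics over time windows and has additional modes. The parameter names and defaults here are hypothetical.

```python
import math


def desired_replicas(in_flight_requests, target_concurrency,
                     min_replicas=0, max_replicas=10):
    """Estimate the replica count needed so that each replica handles
    roughly `target_concurrency` concurrent requests."""
    wanted = math.ceil(in_flight_requests / target_concurrency)
    # Clamp to the configured bounds; min_replicas=0 allows scale to zero.
    return max(min_replicas, min(max_replicas, wanted))


for load in (0, 3, 12, 500):
    print(load, desired_replicas(load, target_concurrency=5))
```

Because the calculation never looks at CPU or GPU utilization, the same rule applies unchanged to GPU-backed predictors.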

Autoscale inference workload on CPU/GPU

InferenceService on GPU nodes

Canary Rollout

Canary deployment enables safe rollouts by splitting traffic between different versions of a release.
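As a minimal sketch of a traffic split, the helpers below build a predictor spec that routes a percentage of traffic to the new revision and compute the expected request split. The `canaryTrafficPercent` field assumes the v1beta1 API; the storage URI is a hypothetical placeholder.

```python
def canary_predictor(framework, storage_uri, canary_percent):
    """Predictor spec that sends `canary_percent` of traffic to the new
    model revision and the rest to the previously rolled-out revision."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {
        "canaryTrafficPercent": canary_percent,
        framework: {"storageUri": storage_uri},
    }


def expected_split(total_requests, canary_percent):
    """Expected number of requests per revision for a given split."""
    canary = total_requests * canary_percent // 100
    return {"canary": canary, "previous": total_requests - canary}


predictor = canary_predictor(
    "tensorflow", "gs://example-bucket/models/flowers-v2", 10  # hypothetical URI
)
split = expected_split(1000, 10)
print(predictor)
print(split)
```

Raising the percentage step by step and finally setting it to 100 promotes the canary to receive all traffic.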

v1alpha2 canary rollout

v1beta1 canary rollout

Kubeflow Pipeline Integration

InferenceService with Kubeflow Pipeline

Request Batching(Alpha)

Batching individual inference requests can be important because most ML/DL frameworks are optimized for batch requests. When a service receives a heavy load of requests, it is advantageous to batch them, which makes fuller use of the CPU/GPU compute resources. Before enabling batch inference, however, you should test carefully to find the optimal batch size and analyze your traffic patterns. KFServing injects a batcher sidecar, so batching works with any model server deployed on KFServing. You can read more in this example.
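The core of a batcher can be sketched as grouping pending requests into batches, calling the model once per batch, and returning each prediction to the request that asked for it. This sketch only covers the size limit. A real batcher would also flush on a latency timer, and the function and parameter names here are hypothetical.

```python
def batch_predict(pending, predict_fn, max_batch_size):
    """Group (request_id, instance) pairs into batches, call the model once
    per batch, and map each prediction back to its request id."""
    results = {}
    for start in range(0, len(pending), max_batch_size):
        chunk = pending[start:start + max_batch_size]
        instances = [instance for _, instance in chunk]
        predictions = predict_fn(instances)  # one model call per batch
        for (request_id, _), prediction in zip(chunk, predictions):
            results[request_id] = prediction
    return results


calls = []


def stub_model(instances):
    """Stand-in model that doubles each input and records every call."""
    calls.append(len(instances))
    return [2 * x for x in instances]


pending = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
results = batch_predict(pending, stub_model, max_batch_size=2)
print(results, calls)
```

Five requests with a batch size of two produce three model calls instead of five, which is where the throughput gain comes from.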

Request/Response Logger

KFServing supports logging your inference requests and responses by injecting a sidecar alongside your model server.
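A predictor spec with logging enabled might look like the sketch below. The `logger` block with `mode` and `url` assumes the v1beta1 API, and the sink URL and storage URI are hypothetical. Verify the accepted modes against your installed version.

```python
LOGGER_MODES = ("all", "request", "response")  # assumed set of valid modes


def predictor_with_logger(framework, storage_uri, sink_url, mode="all"):
    """Predictor spec that forwards request and/or response payloads
    to a logging sink through the injected logger sidecar."""
    if mode not in LOGGER_MODES:
        raise ValueError(f"mode must be one of {LOGGER_MODES}")
    return {
        "logger": {"mode": mode, "url": sink_url},
        framework: {"storageUri": storage_uri},
    }


logged = predictor_with_logger(
    "sklearn",
    "gs://example-bucket/models/sklearn/iris",  # hypothetical model location
    "http://message-dumper.default/",           # hypothetical sink service
)
print(logged)
```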

Feature	Examples
Deploy Logger with a Logger Service	Message Dumper Service
Deploy Async Logger	Message Dumper Using Knative Eventing

Deploy InferenceService behind an Authentication Proxy with Kubeflow

InferenceService on Kubeflow with Istio-Dex

InferenceService behind GCP Identity Aware Proxy (IAP)
