- Kubernetes cluster - A Kubernetes cluster is required. You will need `cluster-admin` authority in order to complete all of the prescribed steps.
- Kubectl and Kustomize - The installation will occur via the terminal using kubectl and kustomize.
- etcd - ModelMesh Serving requires an etcd server in order to coordinate internal state, which can be either dedicated or shared. More on this later.
- Model storage - The model files have to be stored in a compatible form of remote storage or on a Kubernetes Persistent Volume. For more information about supported storage options, take a look at our storage setup page.
We provide an install script with a `--quickstart` option to quickly run ModelMesh Serving with a provisioned etcd server. This may be useful for experimentation or development, but it should not be used in production.
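As an illustration, a quickstart install from a clone of the modelmesh-serving repository might look like the sketch below. The script path and the `modelmesh-serving` namespace name are assumptions based on the project's install script; adjust them to your checkout and environment.

```shell
# Create the namespace that will host the ModelMesh Serving controller
kubectl create namespace modelmesh-serving

# Install with a bundled etcd and MinIO for experimentation (not for production)
./scripts/install.sh --namespace modelmesh-serving --quickstart
```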
ModelMesh Serving can be used in either cluster scope or namespace scope mode.

- Cluster scope mode - Its components can exist in multiple user namespaces, which are controlled by one instance of the ModelMesh Serving controller in the control plane namespace. Only one ModelMesh Serving instance can be installed within a Kubernetes cluster. A namespace must carry the label `modelmesh-enabled: "true"` to be enabled as a user namespace for ModelMesh Serving.
- Namespace scope mode - All of its components must exist within a single namespace, and only one instance of ModelMesh Serving can be installed per namespace. Multiple ModelMesh Serving instances can be installed in separate namespaces within the cluster.

The default configuration is cluster scope mode. Use the `--namespace-scope-mode` option of the install script for namespace scope, as shown below.
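For example, in cluster scope mode a user namespace can be enabled with a standard label command, and a namespace-scoped install can be requested via the script flag. The script path and namespace names here are placeholders; substitute your own.

```shell
# Cluster scope mode: enable an existing user namespace for ModelMesh Serving
kubectl label namespace my-models-namespace modelmesh-enabled=true --overwrite

# Namespace scope mode: install everything into a single namespace
./scripts/install.sh --namespace my-models-namespace --namespace-scope-mode
```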
|        | Type             | Pod                        | Count | Default CPU request/limit per-pod | Default mem request/limit per-pod       |
| ------ | ---------------- | -------------------------- | ----- | --------------------------------- | --------------------------------------- |
| 1      | Controller       | modelmesh controller pod   | 1     | 50m / 1                           | 96Mi / 512Mi                            |
| 2      | Object Storage   | MinIO pod (optional)       | 1     | 200m / 200m                       | 256Mi / 256Mi                           |
| 3      | Metastore        | etcd pod                   | 1     | 200m / 200m                       | 512Mi / 512Mi                           |
| 4      | Built-in Runtime | Nvidia Triton runtime pods | 0 (*) | 850m / 10 or 900m / 11 (**)       | 1568Mi / 1984Mi or 1664Mi / 2496Mi (**) |
| 5      | Built-in Runtime | MLServer runtime pods      | 0 (*) | 850m / 10 or 900m / 11 (**)       | 1568Mi / 1984Mi or 1664Mi / 2496Mi (**) |
| 6      | Built-in Runtime | OVMS runtime pods          | 0 (*) | 850m / 10 or 900m / 11 (**)       | 1568Mi / 1984Mi or 1664Mi / 2496Mi (**) |
| Totals |                  |                            | 3     | 450m / 1.4                        | 864Mi / 1.25Gi                          |
When a ModelMesh Serving instance is installed with the `--quickstart` option, the pods shown in rows 1 to 6 are created. However, note that the quickstart-deployed etcd and MinIO pods are intended for development/experimentation, not for production.
(*) `ScaleToZero` is enabled by default, so runtimes will have 0 replicas until an `InferenceService` that uses that runtime is created. Once an `InferenceService` is assigned, the runtime pods will scale up to 2.

When `ScaleToZero` is enabled (the default), deployments for runtime pods are scaled to 0 when there are no `InferenceService`s for that runtime. When `ScaleToZero` is enabled and the first `InferenceService` CR is submitted, ModelMesh Serving will spin up the corresponding built-in runtime pods.

When `ScaleToZero` is disabled, the pods shown in rows 4 to 6 are created (by default, two pods per runtime), which greatly increases the total CPU (request/limit) and total memory (request/limit).
(**) When REST inferencing is enabled via the `restProxy` config parameter, every model serving pod includes an additional container that consumes resources. The default allocation for this proxy container is:
```yaml
resources:
  requests:
    cpu: "50m"
    memory: "96Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
```
The deployed footprint can be significantly reduced in the following ways:

- Individual built-in runtimes can be disabled by setting `disabled: true` in their corresponding `ServingRuntime` resource, if the corresponding model types aren't used.
- The number of pods per runtime can be changed from the default of 2 (e.g. down to 1) via the `podsPerRuntime` global configuration parameter (see configuration); a sketch of such an override follows the commands below. It is recommended to keep this value at a minimum of 2 for production deployments.
- Memory and/or CPU resource allocations can be reduced (or increased) on the primary model server container in any of the built-in `ServingRuntime` resources (container name `triton`, `mlserver`, or `ovms`). This has the effect of adjusting the total capacity available for holding served models in memory.
```shell
> kubectl edit servingruntime triton-2.x
> kubectl edit servingruntime mlserver-1.x
> kubectl edit servingruntime ovms-1.x
```
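For the `podsPerRuntime` override, a user ConfigMap might look like the sketch below. It follows the same (assumed) `model-serving-config` / `config.yaml` convention as the REST proxy example above; consult the configuration page for the authoritative parameter list.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config     # user override ConfigMap in the controller namespace
  namespace: modelmesh-serving
data:
  config.yaml: |
    podsPerRuntime: 1            # default is 2; keep at least 2 for production
```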
Please be aware that:
- Changes made to the built-in runtime resources will likely be reverted when upgrading/re-installing
- Most of this resource allocation behaviour/config will change in future versions to become more dynamic - both the number of pods deployed and the system resources allocated to them
For more details, see the built-in runtime configuration.
The following resources will be created in the namespaces:

- `model-serving-defaults` - ConfigMap holding default values tied to a release; it should not be modified. Configuration can be overridden by creating a user ConfigMap, see configuration.
- `tc-config` - ConfigMap used for some internal coordination.
- `storage-config` - Secret holding config for each of the storage backends from which models can be loaded - see the example.
- `model-serving-etcd` - Secret providing access to the etcd cluster. It is created by the user in the controller namespace (see instructions) and will be automatically created in user namespaces when in cluster scope mode.
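For orientation, a `storage-config` Secret typically maps a storage name to a small JSON blob of connection details. The sketch below assumes an S3-compatible MinIO endpoint with placeholder credentials and bucket; the exact fields for your backend are documented on the storage setup page.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: storage-config
  namespace: modelmesh-serving      # controller or user namespace, depending on scope
stringData:
  localMinIO: |                     # storage name referenced by InferenceServices
    {
      "type": "s3",
      "access_key_id": "<access-key>",
      "secret_access_key": "<secret-key>",
      "endpoint_url": "http://minio:9000",
      "default_bucket": "modelmesh-example-models",
      "region": "us-east-1"
    }
```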
- See the configuration page for details of how to configure system-wide settings via a ConfigMap, either before or after installation.
- See this example walkthrough of deploying a TensorFlow model as an `InferenceService`; a minimal sketch of such a resource is shown below.
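A minimal predictor definition, for orientation only, might look like the following. The model name, format, storage key, and path are placeholders (the storage key refers to an entry in the `storage-config` Secret), and the `serving.kserve.io/deploymentMode: ModelMesh` annotation is what directs the InferenceService to ModelMesh; follow the linked walkthrough for a complete, tested example.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-tensorflow-model
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh   # serve this model through ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow                          # placeholder; match your model's format
      storage:
        key: localMinIO                           # entry in the storage-config Secret
        path: tensorflow/example-model            # placeholder path within the bucket
```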