From b255439d345c0a128c71bbfcd7b8a7a2acb8b595 Mon Sep 17 00:00:00 2001
From: Trevor Royer
Date: Fri, 30 Aug 2024 13:56:34 -0600
Subject: [PATCH 1/2] add documentation for resolving knative timeout issues

fix tense
---
 docs/odh-rhoai/kserve-timeout.md | 19 +++++++++++++++++++
 mkdocs.yml                       |  1 +
 2 files changed, 20 insertions(+)
 create mode 100644 docs/odh-rhoai/kserve-timeout.md

diff --git a/docs/odh-rhoai/kserve-timeout.md b/docs/odh-rhoai/kserve-timeout.md
new file mode 100644
index 00000000..49f02ce0
--- /dev/null
+++ b/docs/odh-rhoai/kserve-timeout.md
@@ -0,0 +1,19 @@
+# KServe Timeout Issues
+
+When deploying large models or when releying on node autoscaling with KServe, KServe may time out before a model has successfully deployed due to the default progress deadline of 10 minutes set by Knative Serving.
+
+When a pod that leverages Knative Serving, as KServe does, takes longer than 10 minutes to deploy, Knative Serving will automatically back off the deployment and mark it as failed. This can happen for a number of reasons, including deploying large models that take longer than 10 minutes to pull from S3, or leveraging node autoscaling to reduce the consumption of expensive GPU nodes.
+
+To resolve this issue, Knative supports an annotation that can be added to a KServe `ServingRuntime` to set a custom progress-deadline for your application:
+
+```yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: my-serving-runtime
+spec:
+  annotations:
+    serving.knative.dev/progress-deadline: 30m
+```
+
+It is important to note that the annotation must be set at `spec.annotations` and not `metadata.annotations`. By setting it in `spec.annotations`, the annotation is copied to the Knative `Service` object created by your KServe `InferenceService`, which allows Knative to use the manually defined progress-deadline.
diff --git a/mkdocs.yml b/mkdocs.yml
index 9d1bdd1c..dbe33a50 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -119,6 +119,7 @@ nav:
     - NVIDIA GPUs: odh-rhoai/nvidia-gpus.md
     - OpenShift Group Management: odh-rhoai/openshift-group-management.md
     - Single stack serving certificate: odh-rhoai/single-stack-serving-certificate.md
+    - KServe Timeout Issues: odh-rhoai/kserve-timeout.md
     - Tools:
       - GPU pruner: odh-rhoai/gpu-pruner.md
       - ODH Tools and Extensions Companion: odh-rhoai/odh-tools-and-extensions-companion.md

From 12095da5b3d9f42d5ef8d63594d43ce5577dd6a0 Mon Sep 17 00:00:00 2001
From: Trevor Royer
Date: Tue, 3 Sep 2024 10:34:21 -0700
Subject: [PATCH 2/2] fix spelling issue

---
 docs/odh-rhoai/kserve-timeout.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/odh-rhoai/kserve-timeout.md b/docs/odh-rhoai/kserve-timeout.md
index 49f02ce0..f994a69b 100644
--- a/docs/odh-rhoai/kserve-timeout.md
+++ b/docs/odh-rhoai/kserve-timeout.md
@@ -1,6 +1,6 @@
 # KServe Timeout Issues
 
-When deploying large models or when releying on node autoscaling with KServe, KServe may time out before a model has successfully deployed due to the default progress deadline of 10 minutes set by Knative Serving.
+When deploying large models or when relying on node autoscaling with KServe, KServe may time out before a model has successfully deployed due to the default progress deadline of 10 minutes set by Knative Serving.
 
 When a pod that leverages Knative Serving, as KServe does, takes longer than 10 minutes to deploy, Knative Serving will automatically back off the deployment and mark it as failed. This can happen for a number of reasons, including deploying large models that take longer than 10 minutes to pull from S3, or leveraging node autoscaling to reduce the consumption of expensive GPU nodes.
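
For reference, the propagation described in the new documentation can be pictured on the generated resource itself. The sketch below is illustrative rather than taken from the patch: the resource name is assumed (KServe derives it from the `InferenceService` name), and depending on the KServe version the annotation may surface on the Knative `Service` metadata, on its revision template, or both.

```yaml
# Illustrative sketch (not part of the patch above): the Knative Service that
# KServe generates for an InferenceService backed by the ServingRuntime.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-inference-service-predictor   # assumed name; KServe derives it from the InferenceService
spec:
  template:
    metadata:
      annotations:
        # Copied through from the ServingRuntime's spec.annotations, so Knative
        # waits up to 30 minutes before backing off and marking the revision failed.
        serving.knative.dev/progress-deadline: 30m
```

If the annotation does not show up on the generated `Service`, it is worth re-checking that it was set under `spec.annotations` of the `ServingRuntime` rather than `metadata.annotations`.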