Merge pull request #100 from strangiato/knative-timeout
add documentation for resolving knative timeout issues
rcarrata authored Sep 3, 2024
2 parents 771c361 + 12095da commit 91619db
Showing 2 changed files with 20 additions and 0 deletions.
19 changes: 19 additions & 0 deletions docs/odh-rhoai/kserve-timeout.md
@@ -0,0 +1,19 @@
# KServe Timeout Issues

When deploying large models or relying on node autoscaling, KServe may time out before a model has successfully deployed, because Knative Serving sets a default progress deadline of 10 minutes.

When a pod that relies on Knative Serving, as KServe pods do, takes longer than 10 minutes to deploy, Knative Serving automatically backs off the deployment and marks it as failed. This can happen for a number of reasons, including deploying large models that take longer than 10 minutes to pull from S3, or using node autoscaling to reduce the consumption of expensive GPU nodes.

To resolve this issue, Knative supports an annotation that can be added to a KServe `ServingRuntime` to set a custom progress deadline for your application:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: my-serving-runtime
spec:
  annotations:
    serving.knative.dev/progress-deadline: 30m
```

Note that the annotation must be set at `spec.annotations`, not `metadata.annotations`. When it is set in `spec.annotations`, the annotation is copied to the Knative `Service` object created by your KServe `InferenceService`. The annotation on the `Service` allows Knative to use the manually defined progress deadline.
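
To make the placement concrete, the following sketch contrasts the two locations on the same `ServingRuntime`. The runtime name is the placeholder used in the example above, and the commented-out block only illustrates the placement that does not work; it is not meant to be applied as-is.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: my-serving-runtime
  # Not here: per the note above, an annotation under metadata.annotations
  # is not copied to the Knative Service.
  # annotations:
  #   serving.knative.dev/progress-deadline: 30m
spec:
  annotations:
    # Here: spec.annotations is copied to the Knative Service.
    serving.knative.dev/progress-deadline: 30m
```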
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -119,6 +119,7 @@ nav:
- NVIDIA GPUs: odh-rhoai/nvidia-gpus.md
- OpenShift Group Management: odh-rhoai/openshift-group-management.md
- Single stack serving certificate: odh-rhoai/single-stack-serving-certificate.md
- KServe Timeout Issues: odh-rhoai/kserve-timeout.md
- Tools:
- GPU pruner: odh-rhoai/gpu-pruner.md
- ODH Tools and Extensions Companion: odh-rhoai/odh-tools-and-extensions-companion.md