diff --git a/docs/howto/upgrade-cluster/index.md b/docs/howto/upgrade-cluster/index.md
index 9d12a501c7..79c36e0011 100644
--- a/docs/howto/upgrade-cluster/index.md
+++ b/docs/howto/upgrade-cluster/index.md
@@ -9,7 +9,21 @@ As of now, we also only have written documentation for how to upgrade Kubernetes
 clusters on AWS.
 ```
 
-## Upgrade policy
+(upgrade-cluster:planning)=
+## Upgrade planning
+
+```{warning}
+We haven't yet established a policy for planning and communicating maintenance
+procedures to community champions and users.
+
+Up until now we have made some k8s cluster upgrades opportunistically,
+especially for clusters that have shown little to no activity during some
+periods. Other cluster upgrades have been scheduled with community champions, and
+some in shared clusters have been announced ahead of time.
+```
+
+(upgrade-cluster:ambition)=
+## Upgrade ambition
 
 1. To keep our k8s cluster's control plane and node pools upgraded to the latest
    _three_ and _four_ [official minor k8s versions] respectively at all times.
@@ -35,5 +49,6 @@ clusters on AWS.
 ```{toctree}
 :maxdepth: 1
 :caption: Upgrading Kubernetes clusters
+upgrade-disruptions.md
 aws.md
 ```
diff --git a/docs/howto/upgrade-cluster/upgrade-disruptions.md b/docs/howto/upgrade-cluster/upgrade-disruptions.md
new file mode 100644
index 0000000000..8f30967fa0
--- /dev/null
+++ b/docs/howto/upgrade-cluster/upgrade-disruptions.md
@@ -0,0 +1,78 @@
+(upgrade-cluster:disruptions)=
+
+# About upgrade disruptions
+
+When we upgrade our Kubernetes clusters we can cause different kinds of
+disruptions. This text provides an overview of them.
+
+## Kubernetes api-server disruption
+
+K8s clusters' control plane (api-server etc.) can be either highly available
+(HA) or not. EKS clusters, AKS clusters, and "regional" GKE clusters are HA, but
+"zonal" GKE clusters are not. A few of our GKE clusters are zonal still, but as
+the cost savings are minimal we only create regional clusters now.
+
+If upgrading a zonal cluster, the single k8s api-server will be temporarily
+unavailable, but that is not a big problem as user servers and JupyterHub will
+remain accessible. The brief disruption is that JupyterHub won't be able to
+start new user servers, and user servers won't be able to create or scale their
+dask-clusters.
+
+## Provider managed workload disruptions
+
+When upgrading a cloud provider managed k8s cluster, it may upgrade some managed
+workload that is part of the k8s cluster, such as calico that enforces NetworkPolicy
+rules. This could cause a disruption for users, but it's not currently known
+to do so.
+
+## Core node pool disruptions
+
+Disruptions to the core node pool are disruptions to workloads running on it,
+and there are a few workloads that when disrupted would disrupt users.
+
+### ingress-nginx-controller pod(s) disruptions
+
+The `support` chart we install in each cluster comes with the `ingress-nginx`
+chart. The `ingress-nginx` chart creates one or more `ingress-nginx-controller`
+pods that are proxying network traffic associated with incoming connections.
+
+To shut down such a pod means to break connections from users working against the
+user servers. A broken connection can be re-established if there is another
+replica of this pod ready to accept a new connection.
+
+We are currently running only one replica of the `ingress-nginx-controller` pod,
+and we won't have issues with this during rolling updates, such as when the
+Deployment's pod template specification is changed or when manually running
+`kubectl rollout restart -n support deploy/support-ingress-nginx-controller`. We
+will, however, have broken connections and user pods unable to establish new
+ones directly if `kubectl delete` is used on this single pod, or `kubectl drain` is
+used on the node.
+
+### hub pod disruptions
+
+Our JupyterHub installations each have a single `hub` pod, and having more isn't
+supported by JupyterHub itself. Due to this, and because it has a disk mounted
+to it that can only be mounted to one pod at a time, it isn't configured
+to do rolling updates.
+
+When the `hub` pod isn't running, users can't visit `/hub` paths, but they can
+still visit `/user` paths and control their already started user server.
+
+### proxy pod disruptions
+
+Our JupyterHub installations each have a single `proxy` pod running
+`configurable-http-proxy`. Having more replicas isn't supported because
+JupyterHub will only update one replica with new proxy routes.
+
+When the `proxy` pod isn't running, users can't visit `/hub`, `/user`, or
+`/service` paths, because they all route through the proxy pod.
+
+When the `proxy` pod has started and become ready, it also needs to be
+re-configured by JupyterHub on how to route traffic arriving at `/user` and
+`/service` paths. This is done during startup and then regularly by JupyterHub
+every five minutes. Due to this, a proxy pod being restarted can cause an outage
+of five minutes.
+
+## User node pool disruptions
+
+Disruptions to a user node pool will disrupt user server pods running on it.