docs: notes on various forms of k8s upgrade disruptions
consideRatio committed May 7, 2024
1 parent d817b75 commit ff45c37
Showing 2 changed files with 91 additions and 3 deletions.
16 changes: 13 additions & 3 deletions docs/howto/upgrade-cluster/index.md
@@ -4,12 +4,21 @@
How we upgrade a Kubernetes cluster is specific to the cloud provider. This
section covers topics in upgrading an existing Kubernetes cluster.

(upgrade-cluster:planning)=
## Upgrade planning

```{warning}
As of now, we also only have written documentation for how to upgrade
Kubernetes clusters on AWS.

We haven't yet established a policy for planning and communicating maintenance
procedures to community champions and users. Up until now we have made some k8s
cluster upgrades opportunistically, especially for clusters that have shown
little to no activity during some periods. Other cluster upgrades have been
scheduled with community champions, and some in shared clusters have been
announced ahead of time.
```

(upgrade-cluster:ambition)=
## Upgrade ambition

1. To keep our k8s clusters' control planes and node pools within the latest
   _three_ and _four_ [official minor k8s versions] respectively at all times.
@@ -35,5 +44,6 @@ clusters on AWS.
```{toctree}
:maxdepth: 1
:caption: Upgrading Kubernetes clusters
upgrade-disruptions.md
aws.md
```
78 changes: 78 additions & 0 deletions docs/howto/upgrade-cluster/upgrade-disruptions.md
@@ -0,0 +1,78 @@
(upgrade-cluster:disruptions)=

# About upgrade disruptions

When we upgrade our Kubernetes clusters we can cause different kinds of
disruptions. This text provides an overview of them.

## Kubernetes api-server disruption

K8s clusters' control planes (api-server etc.) can be either highly available
(HA) or not. EKS clusters, AKS clusters, and "regional" GKE clusters are HA,
but "zonal" GKE clusters are not. A few of our GKE clusters are still zonal,
but as the cost savings are minimal we only create regional clusters now.

If upgrading a zonal cluster, the single k8s api-server will be temporarily
unavailable, but that is not a big problem as user servers and JupyterHub will
remain accessible. The brief disruption is that JupyterHub won't be able to
start new user servers, and user servers won't be able to create or scale their
dask-clusters.

## Provider managed workload disruptions

When upgrading a cloud provider managed k8s cluster, the provider may also
upgrade managed workloads that are part of the cluster, such as calico, which
enforces NetworkPolicy rules. This could in theory cause a disruption for
users, but it isn't currently known to do so.

## Core node pool disruptions

Disruptions to the core node pool are disruptions to the workloads running on
it, and a few of those workloads would in turn disrupt users when disrupted.

### ingress-nginx-controller pod(s) disruptions

The `support` chart we install in each cluster comes with the `ingress-nginx`
chart. The `ingress-nginx` chart creates one or more `ingress-nginx-controller`
pods that are proxying network traffic associated with incoming connections.

Shutting down such a pod breaks the connections of users working against their
user servers. A broken connection can be re-established if another replica of
this pod is ready to accept a new connection.

We are currently running only one replica of the `ingress-nginx-controller`
pod. This doesn't cause issues during rolling updates, such as when the
Deployment's pod template specification is changed or when manually running
`kubectl rollout restart -n support deploy/support-ingress-nginx-controller`.
We will however see broken connections, and users unable to establish new
ones, if `kubectl delete` is used on this single pod or `kubectl drain` is
used on its node.
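
If we were to run two replicas, a PodDisruptionBudget could keep at least one
of them ready during voluntary disruptions such as node drains. Below is a
hypothetical sketch using the plain Kubernetes API, not our actual chart
configuration; the name, namespace, and label selector are assumptions based
on upstream `ingress-nginx` conventions.

```yaml
# Hypothetical sketch: with two controller replicas, keep at least one
# available during voluntary disruptions such as `kubectl drain`.
# The selector labels below are assumed, not taken from our deployed chart.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: support-ingress-nginx-controller
  namespace: support
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
      app.kubernetes.io/component: controller
```

With `minAvailable: 1`, a `kubectl drain` would evict one controller pod at a
time and wait for a replacement to become ready, instead of taking both down.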

### hub pod disruptions

Our JupyterHub installations each have a single `hub` pod, and having more
isn't supported by JupyterHub itself. Due to this, and because the pod mounts
a disk that can only be attached to one pod at a time, it isn't configured to
do rolling updates.
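
In Kubernetes terms, this corresponds to a Deployment using the `Recreate`
strategy, which stops the old pod before starting a new one so the disk can be
re-attached. A minimal sketch of the relevant parts of such a Deployment spec;
the names and image are illustrative, not copied from our actual manifests.

```yaml
# Sketch of a Deployment that, like the hub Deployment, avoids rolling
# updates: the old pod is terminated before the new one starts, so the
# single-attach disk can move to the new pod. Names are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hub
spec:
  replicas: 1
  strategy:
    type: Recreate  # never run old and new pods side by side
  selector:
    matchLabels:
      component: hub
  template:
    metadata:
      labels:
        component: hub
    spec:
      containers:
        - name: hub
          image: jupyterhub/k8s-hub  # illustrative image name
```

The trade-off is a short guaranteed downtime on every upgrade of the `hub`
pod, in exchange for never having two pods contend for the same disk.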

When the `hub` pod isn't running, users can't visit `/hub` paths, but they can
still visit `/user` paths and control their already started user server.

### proxy pod disruptions

Our JupyterHub installations each have a single `proxy` pod running
`configurable-http-proxy`. Having more replicas isn't supported because
JupyterHub will only update one replica with new proxy routes.

When the `proxy` pod isn't running, users can't visit `/hub`, `/user`, or
`/service` paths, because they all route through the proxy pod.

When the `proxy` pod has started and become ready, it also needs to be
re-configured by JupyterHub with the routes that make traffic arrive at
`/user` and `/service` paths. JupyterHub does this during its startup and then
regularly every five minutes. Due to this, a restarted proxy pod can cause an
outage of up to five minutes.
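
If shorter worst-case outages were wanted, the interval at which JupyterHub
reconciles the proxy's routes could in principle be lowered via its traitlets
configuration. A hypothetical z2jh-style values sketch; the
`Proxy.check_routes_interval` trait name (in seconds) is an assumption to
verify against the JupyterHub version in use.

```yaml
# Hypothetical sketch: make JupyterHub reconcile proxy routes more often,
# shrinking the worst-case outage after a proxy pod restart.
# Assumes JupyterHub exposes a Proxy.check_routes_interval trait (seconds).
hub:
  config:
    Proxy:
      check_routes_interval: 60
```

A shorter interval means more frequent route checks against the proxy's API,
which is a minor cost compared to a multi-minute routing outage.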

## User node pool disruptions

Disruptions to a user node pool will disrupt user server pods running on it.
