From 9b6d8c4592d06c99928442b4f0ecb66331b373b8 Mon Sep 17 00:00:00 2001
From: Erik Sundell
Date: Tue, 7 May 2024 13:03:42 +0200
Subject: [PATCH 1/2] docs: notes on various forms of k8s upgrade disruptions

---
 docs/howto/upgrade-cluster/index.md           | 17 +++-
 .../upgrade-cluster/upgrade-disruptions.md    | 78 +++++++++++++++++++
 2 files changed, 94 insertions(+), 1 deletion(-)
 create mode 100644 docs/howto/upgrade-cluster/upgrade-disruptions.md

diff --git a/docs/howto/upgrade-cluster/index.md b/docs/howto/upgrade-cluster/index.md
index 9d12a501c7..79c36e0011 100644
--- a/docs/howto/upgrade-cluster/index.md
+++ b/docs/howto/upgrade-cluster/index.md
@@ -9,7 +9,21 @@ As of now, we also only have written documentation for how to upgrade Kubernetes
 clusters on AWS.
 ```
 
-## Upgrade policy
+(upgrade-cluster:planning)=
+## Upgrade planning
+
+```{warning}
+We haven't yet established a policy for planning and communicating maintenance
+procedures to community champions and users.
+
+Up until now we have made some k8s cluster upgrades opportunistically,
+especially for clusters that have shown little to no activity during some
+periods. Other cluster upgrades has been scheduled with community champions, and
+some in shared clusters has been announced ahead of time.
+```
+
+(upgrade-cluster:ambition)=
+## Upgrade ambition
 
 1. To keep our k8s cluster's control plane and node pools upgraded to the
    latest _three_ and _four_ [official minor k8s versions] respectively at all times.
@@ -35,5 +49,6 @@ clusters on AWS.
 ```{toctree}
 :maxdepth: 1
 :caption: Upgrading Kubernetes clusters
+upgrade-disruptions.md
 aws.md
 ```
diff --git a/docs/howto/upgrade-cluster/upgrade-disruptions.md b/docs/howto/upgrade-cluster/upgrade-disruptions.md
new file mode 100644
index 0000000000..8f30967fa0
--- /dev/null
+++ b/docs/howto/upgrade-cluster/upgrade-disruptions.md
@@ -0,0 +1,78 @@
+(upgrade-cluster:disruptions)=
+
+# About upgrade disruptions
+
+When we upgrade our Kubernetes clusters we can cause different kinds of
+disruptions; this text provides an overview of them.
+
+## Kubernetes api-server disruption
+
+K8s clusters' control plane (api-server etc.) can be either highly available
+(HA) or not. EKS clusters, AKS clusters, and "regional" GKE clusters are HA, but
+"zonal" GKE clusters are not. A few of our GKE clusters are still zonal, but as
+the cost savings are minimal we only create regional clusters now.
+
+If upgrading a zonal cluster, the single k8s api-server will be temporarily
+unavailable, but that is not a big problem as user servers and JupyterHub will
+remains accessible. The brief disruption is that JupyterHub won't be able to
+start new user servers, and user servers won't be able to create or scale their
+dask-clusters.
+
+## Provider managed workload disruptions
+
+When upgrading a cloud provider managed k8s cluster, it may upgrade some managed
+workload that is part of the k8s cluster, such as calico that enforces NetworkPolicy
+rules. Maybe this could cause a disruption for users, but its not currently know
+to do so.
+
+## Core node pool disruptions
+
+Disruptions to the core node pool are disruptions to the workloads running on it,
+and there are a few workloads that when disrupted would disrupt users.
+
+### ingress-nginx-controller pod(s) disruptions
+
+The `support` chart we install in each cluster comes with the `ingress-nginx`
+chart. The `ingress-nginx` chart creates one or more `ingress-nginx-controller`
+pods that are proxying network traffic associated with incoming connections.
+
+To shut down such a pod means to break connections from users working against the
+user servers. A broken connection can be re-established if there is another
+replica of this pod is ready to accept a new connection.
+
+We are currently running only one replica of the `ingress-nginx-controller` pod,
+and we won't have issues with this during rolling updates, such as when the
+Deployment's pod template specification is changed or when manually running
+`kubectl rollout restart -n support deploy/support-ingress-nginx-controller`. We
+will however have broken connections and user pods unable to establish new
+directly if `kubectl delete` is used on this single pod, or `kubectl drain` is
+used on the node.
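+
+As a rough way to inspect the current setup (a sketch; the namespace and
+Deployment name are taken from the `kubectl rollout restart` command above), you
+can check how many controller replicas are configured and whether a
+PodDisruptionBudget covers them:
+
+```bash
+# How many ingress-nginx-controller replicas are declared and ready?
+kubectl get deploy -n support support-ingress-nginx-controller
+
+# Is a PodDisruptionBudget limiting voluntary evictions such as kubectl drain?
+kubectl get pdb -n support
+```
+
+Note that `kubectl drain` goes through the eviction API and respects
+PodDisruptionBudgets, while `kubectl delete` on the pod itself does not.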
+
+### hub pod disruptions
+
+Our JupyterHub installations each has a single `hub` pod, and having more isn't
+supported by JupyterHub itself. Due to this, and because it has a disk mounted
+to it that can only be mounted at the same time to one pod, it isn't configured
+to do rolling updates.
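+
+To confirm this for a given hub (a sketch; it assumes the Deployment is named
+`hub` and `<hub-namespace>` stands in for that hub's namespace), check the
+Deployment's update strategy:
+
+```bash
+# "Recreate" means the old hub pod is stopped before a new one is started,
+# i.e. a brief hub outage on every change rather than a rolling update.
+kubectl get deploy -n <hub-namespace> hub -o jsonpath='{.spec.strategy.type}'
+```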
+
+When the `hub` pod isn't running, users can't visit `/hub` paths, but they can
+still visit `/user` paths and control their already started user server.
+
+### proxy pod disruptions
+
+Our JupyterHub installations each have a single `proxy` pod running
+`configurable-http-proxy`; having more replicas isn't supported because
+JupyterHub will only update one replica with new proxy routes.
+
+When the `proxy` pod isn't running, users can't visit `/hub`, `/user`, or
+`/service` paths, because they all route through the proxy pod.
+
+When the `proxy` pod has started and become ready, it also needs to be
+re-configured by JupyterHub on how to route traffic arriving at `/user` and
+`/service` paths. This is done during startup and then regularly by JupyterHub
+every five minutes. Due to this, a proxy pod being restarted can cause an outage
+of up to five minutes.
+
+## User node pool disruptions
+
+Disruptions to a user node pool will disrupt user server pods running on it.

From 1101abb91ef943d53a8e95055bfc075b2e6e6e42 Mon Sep 17 00:00:00 2001
From: Erik Sundell
Date: Tue, 7 May 2024 13:56:17 +0200
Subject: [PATCH 2/2] docs: spelling and grammar fixes

Co-authored-by: Sarah Gibson <44771837+sgibson91@users.noreply.github.com>
---
 docs/howto/upgrade-cluster/index.md               |  4 ++--
 docs/howto/upgrade-cluster/upgrade-disruptions.md | 14 +++++++-------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/howto/upgrade-cluster/index.md b/docs/howto/upgrade-cluster/index.md
index 79c36e0011..bc05598406 100644
--- a/docs/howto/upgrade-cluster/index.md
+++ b/docs/howto/upgrade-cluster/index.md
@@ -18,8 +18,8 @@ procedures to community champions and users.
 
 Up until now we have made some k8s cluster upgrades opportunistically,
 especially for clusters that have shown little to no activity during some
-periods. Other cluster upgrades has been scheduled with community champions, and
-some in shared clusters has been announced ahead of time.
+periods. Other cluster upgrades have been scheduled with community champions, and
+some in shared clusters have been announced ahead of time.
 ```
 
 (upgrade-cluster:ambition)=
diff --git a/docs/howto/upgrade-cluster/upgrade-disruptions.md b/docs/howto/upgrade-cluster/upgrade-disruptions.md
index 8f30967fa0..75a6a46fbf 100644
--- a/docs/howto/upgrade-cluster/upgrade-disruptions.md
+++ b/docs/howto/upgrade-cluster/upgrade-disruptions.md
@@ -14,7 +14,7 @@ the cost savings are minimal we only create regional clusters now.
 
 If upgrading a zonal cluster, the single k8s api-server will be temporarily
 unavailable, but that is not a big problem as user servers and JupyterHub will
-remains accessible. The brief disruption is that JupyterHub won't be able to
+remain accessible. The brief disruption is that JupyterHub won't be able to
 start new user servers, and user servers won't be able to create or scale their
 dask-clusters.
@@ -22,8 +22,8 @@ dask-clusters.
 
 When upgrading a cloud provider managed k8s cluster, it may upgrade some managed
 workload that is part of the k8s cluster, such as calico that enforces NetworkPolicy
-rules. Maybe this could cause a disruption for users, but its not currently know
-to do so.
+rules. Maybe this could cause a disruption for users, but it's not currently known
+if it does and in what manner.
 
 ## Core node pool disruptions
 
@@ -38,21 +38,21 @@ pods that are proxying network traffic associated with incoming connections.
 
 To shut down such a pod means to break connections from users working against the
 user servers. A broken connection can be re-established if there is another
-replica of this pod is ready to accept a new connection.
+replica of this pod ready to accept a new connection.
 
 We are currently running only one replica of the `ingress-nginx-controller` pod,
 and we won't have issues with this during rolling updates, such as when the
 Deployment's pod template specification is changed or when manually running
 `kubectl rollout restart -n support deploy/support-ingress-nginx-controller`. We
-will however have broken connections and user pods unable to establish new
+will however have broken connections and user pods unable to establish new connections
 directly if `kubectl delete` is used on this single pod, or `kubectl drain` is
 used on the node.
 
 ### hub pod disruptions
 
-Our JupyterHub installations each has a single `hub` pod, and having more isn't
+Our JupyterHub installations each have a single `hub` pod, and having more isn't
 supported by JupyterHub itself. Due to this, and because it has a disk mounted
-to it that can only be mounted at the same time to one pod, it isn't configured
+to it that can only be mounted to one pod at a time, it isn't configured
 to do rolling updates.
 
 When the `hub` pod isn't running, users can't visit `/hub` paths, but they can