From 9b6d8c4592d06c99928442b4f0ecb66331b373b8 Mon Sep 17 00:00:00 2001
From: Erik Sundell
Date: Tue, 7 May 2024 13:03:42 +0200
Subject: [PATCH 1/2] docs: notes on various forms of k8s upgrade disruptions

---
 docs/howto/upgrade-cluster/index.md           | 17 +++-
 .../upgrade-cluster/upgrade-disruptions.md    | 78 +++++++++++++++++++
 2 files changed, 94 insertions(+), 1 deletion(-)
 create mode 100644 docs/howto/upgrade-cluster/upgrade-disruptions.md

diff --git a/docs/howto/upgrade-cluster/index.md b/docs/howto/upgrade-cluster/index.md
index 9d12a501c7..79c36e0011 100644
--- a/docs/howto/upgrade-cluster/index.md
+++ b/docs/howto/upgrade-cluster/index.md
@@ -9,7 +9,21 @@ As of now, we also only have written documentation for how to upgrade Kubernetes
 clusters on AWS.
 ```
 
-## Upgrade policy
+(upgrade-cluster:planning)=
+## Upgrade planning
+
+```{warning}
+We haven't yet established a policy for planning and communicating maintenance
+procedures to community champions and users.
+
+Up until now we have made some k8s cluster upgrades opportunistically,
+especially for clusters that have shown little to no activity during some
+periods. Other cluster upgrades has been scheduled with community champions, and
+some in shared clusters has been announced ahead of time.
+```
+
+(upgrade-cluster:ambition)=
+## Upgrade ambition
 
 1. To keep our k8s cluster's control plane and node pools upgraded to the
    latest _three_ and _four_ [official minor k8s versions] respectively at all times.
@@ -35,5 +49,6 @@ clusters on AWS.
 ```{toctree}
 :maxdepth: 1
 :caption: Upgrading Kubernetes clusters
+upgrade-disruptions.md
 aws.md
 ```
diff --git a/docs/howto/upgrade-cluster/upgrade-disruptions.md b/docs/howto/upgrade-cluster/upgrade-disruptions.md
new file mode 100644
index 0000000000..8f30967fa0
--- /dev/null
+++ b/docs/howto/upgrade-cluster/upgrade-disruptions.md
@@ -0,0 +1,78 @@
+(upgrade-cluster:disruptions)=
+
+# About upgrade disruptions
+
+When we upgrade our Kubernetes clusters we can cause different kinds of
+disruptions; this text provides an overview of them.
+
+## Kubernetes api-server disruption
+
+K8s clusters' control plane (api-server etc.) can be either highly available
+(HA) or not. EKS clusters, AKS clusters, and "regional" GKE clusters are HA, but
+"zonal" GKE clusters are not. A few of our GKE clusters are still zonal, but as
+the cost savings are minimal we only create regional clusters now.
+
+If upgrading a zonal cluster, the single k8s api-server will be temporarily
+unavailable, but that is not a big problem as user servers and JupyterHub will
+remains accessible. The brief disruption is that JupyterHub won't be able to
+start new user servers, and user servers won't be able to create or scale their
+dask-clusters.
+
+## Provider managed workload disruptions
+
+When upgrading a cloud provider managed k8s cluster, it may upgrade some managed
+workload that is part of the k8s cluster, such as calico that enforces NetworkPolicy
+rules. Maybe this could cause a disruption for users, but its not currently know
+to do so.
+
+## Core node pool disruptions
+
+Disruptions to the core node pool are disruptions to the workloads running on it,
+and there are a few workloads that when disrupted would disrupt users.
+
+### ingress-nginx-controller pod(s) disruptions
+
+The `support` chart we install in each cluster comes with the `ingress-nginx`
+chart. The `ingress-nginx` chart creates one or more `ingress-nginx-controller`
+pods that are proxying network traffic associated with incoming connections.
+
+To shut down such a pod means to break connections from users working against the
+user servers. A broken connection can be re-established if there is another
+replica of this pod is ready to accept a new connection.
+
+We are currently running only one replica of the `ingress-nginx-controller` pod,
+and we won't have issues with this during rolling updates, such as when the
+Deployment's pod template specification is changed or when manually running
+`kubectl rollout restart -n support deploy/support-ingress-nginx-controller`. We
+will however have broken connections and user pods unable to establish new
+directly if `kubectl delete` is used on this single pod, or `kubectl drain` is
+used on the node.
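+
+As a rough way to inspect the current setup (a sketch; the namespace and
+Deployment name are taken from the `kubectl rollout restart` command above), you
+can check how many controller replicas are configured and whether a
+PodDisruptionBudget covers them:
+
+```bash
+# How many ingress-nginx-controller replicas are declared and ready?
+kubectl get deploy -n support support-ingress-nginx-controller
+
+# Is a PodDisruptionBudget limiting voluntary evictions such as kubectl drain?
+kubectl get pdb -n support
+```
+
+Note that `kubectl drain` goes through the eviction API and respects
+PodDisruptionBudgets, while `kubectl delete` on the pod itself does not.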
+
+### hub pod disruptions
+
+Our JupyterHub installations each has a single `hub` pod, and having more isn't
+supported by JupyterHub itself. Due to this, and because it has a disk mounted
+to it that can only be mounted at the same time to one pod, it isn't configured
+to do rolling updates.
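+
+To confirm this for a given hub (a sketch; it assumes the Deployment is named
+`hub` and `<hub-namespace>` stands in for that hub's namespace), check the
+Deployment's update strategy:
+
+```bash
+# "Recreate" means the old hub pod is stopped before a new one is started,
+# i.e. a brief hub outage on every change rather than a rolling update.
+kubectl get deploy -n <hub-namespace> hub -o jsonpath='{.spec.strategy.type}'
+```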
+
+When the `hub` pod isn't running, users can't visit `/hub` paths, but they can
+still visit `/user` paths and control their already started user server.
+
+### proxy pod disruptions
+
+Our JupyterHub installations each have a single `proxy` pod running
+`configurable-http-proxy`; having more replicas isn't supported because
+JupyterHub will only update one replica with new proxy routes.
+
+When the `proxy` pod isn't running, users can't visit `/hub`, `/user`, or
+`/service` paths, because they all route through the proxy pod.
+
+When the `proxy` pod has started and become ready, it also needs to be
+re-configured by JupyterHub on how to route traffic arriving at `/user` and
+`/service` paths. This is done during startup and then regularly by JupyterHub
+every five minutes. Due to this, a proxy pod being restarted can cause an outage
+of up to five minutes.
+
+## User node pool disruptions
+
+Disruptions to a user node pool will disrupt user server pods running on it.

From 1101abb91ef943d53a8e95055bfc075b2e6e6e42 Mon Sep 17 00:00:00 2001
From: Erik Sundell
Date: Tue, 7 May 2024 13:56:17 +0200
Subject: [PATCH 2/2] docs: spelling and grammar fixes

Co-authored-by: Sarah Gibson <44771837+sgibson91@users.noreply.github.com>
---
 docs/howto/upgrade-cluster/index.md               |  4 ++--
 docs/howto/upgrade-cluster/upgrade-disruptions.md | 14 +++++++-------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/howto/upgrade-cluster/index.md b/docs/howto/upgrade-cluster/index.md
index 79c36e0011..bc05598406 100644
--- a/docs/howto/upgrade-cluster/index.md
+++ b/docs/howto/upgrade-cluster/index.md
@@ -18,8 +18,8 @@ procedures to community champions and users.
 
 Up until now we have made some k8s cluster upgrades opportunistically,
 especially for clusters that have shown little to no activity during some
-periods. Other cluster upgrades has been scheduled with community champions, and
-some in shared clusters has been announced ahead of time.
+periods. Other cluster upgrades have been scheduled with community champions, and
+some in shared clusters have been announced ahead of time.
 ```
 
 (upgrade-cluster:ambition)=
diff --git a/docs/howto/upgrade-cluster/upgrade-disruptions.md b/docs/howto/upgrade-cluster/upgrade-disruptions.md
index 8f30967fa0..75a6a46fbf 100644
--- a/docs/howto/upgrade-cluster/upgrade-disruptions.md
+++ b/docs/howto/upgrade-cluster/upgrade-disruptions.md
@@ -14,7 +14,7 @@ the cost savings are minimal we only create regional clusters now.
 
 If upgrading a zonal cluster, the single k8s api-server will be temporarily
 unavailable, but that is not a big problem as user servers and JupyterHub will
-remains accessible. The brief disruption is that JupyterHub won't be able to
+remain accessible. The brief disruption is that JupyterHub won't be able to
 start new user servers, and user servers won't be able to create or scale their
 dask-clusters.
@@ -22,8 +22,8 @@ dask-clusters.
 
 When upgrading a cloud provider managed k8s cluster, it may upgrade some managed
 workload that is part of the k8s cluster, such as calico that enforces NetworkPolicy
-rules. Maybe this could cause a disruption for users, but its not currently know
-to do so.
+rules. Maybe this could cause a disruption for users, but it's not currently known
+if it does and in what manner.
 
 ## Core node pool disruptions
 
@@ -38,21 +38,21 @@ pods that are proxying network traffic associated with incoming connections.
 
 To shut down such a pod means to break connections from users working against the
 user servers. A broken connection can be re-established if there is another
-replica of this pod is ready to accept a new connection.
+replica of this pod ready to accept a new connection.
 
 We are currently running only one replica of the `ingress-nginx-controller` pod,
 and we won't have issues with this during rolling updates, such as when the
 Deployment's pod template specification is changed or when manually running
 `kubectl rollout restart -n support deploy/support-ingress-nginx-controller`. We
-will however have broken connections and user pods unable to establish new
+will however have broken connections and user pods unable to establish new connections
 directly if `kubectl delete` is used on this single pod, or `kubectl drain` is
 used on the node.
 
 ### hub pod disruptions
 
-Our JupyterHub installations each has a single `hub` pod, and having more isn't
+Our JupyterHub installations each have a single `hub` pod, and having more isn't
 supported by JupyterHub itself. Due to this, and because it has a disk mounted
-to it that can only be mounted at the same time to one pod, it isn't configured
+to it that can only be mounted to one pod at a time, it isn't configured
 to do rolling updates.
 
 When the `hub` pod isn't running, users can't visit `/hub` paths, but they can