(upgrade-cluster:disruptions)=

# About upgrade disruptions

When we upgrade our Kubernetes clusters we can cause different kinds of
disruptions. This text provides an overview of them.

## Kubernetes api-server disruption

K8s clusters' control plane (api-server etc.) can be either highly available
(HA) or not. EKS clusters, AKS clusters, and "regional" GKE clusters are HA, but
"zonal" GKE clusters are not. A few of our GKE clusters are still zonal, but as
the cost savings are minimal we only create regional clusters now.
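
To tell whether an existing GKE cluster is zonal or regional, one option is to
check its location with `gcloud`. This is a minimal sketch; the project name is
a placeholder.

```bash
# List clusters with their location: a zone (e.g. us-central1-b) means a zonal
# cluster, a region (e.g. us-central1) means a regional (HA) control plane.
gcloud container clusters list \
    --project <gcp-project> \
    --format="table(name,location)"
```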

If upgrading a zonal cluster, the single k8s api-server will be temporarily
unavailable, but that is not a big problem as user servers and JupyterHub will
remain accessible. The brief disruption is that JupyterHub won't be able to
start new user servers, and user servers won't be able to create or scale their
dask clusters.
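
During such a window the distinction can be observed directly: requests to the
k8s api-server fail while the hub and already running user servers keep serving
traffic. A rough sketch, with a placeholder hub domain:

```bash
# Requests to the api-server fail while the zonal control plane is upgrading...
kubectl get nodes

# ...but the hub and running user servers keep serving traffic; this endpoint
# returns the JupyterHub version without authentication.
curl -s https://<hub-domain>/hub/api
```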

## Provider managed workload disruptions

When upgrading a cloud provider managed k8s cluster, the provider may also
upgrade managed workloads that are part of the k8s cluster, such as calico,
which enforces NetworkPolicy rules. This could perhaps cause a disruption for
users, but it is not currently known to do so.

## Core node pool disruptions

Disruptions to the core node pool are disruptions to the workloads running on
it, and a few of those workloads would in turn disrupt users when disrupted.
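
To see which workloads would be affected, one way is to list the pods scheduled
on a given core node. This is a rough sketch; the node name is a placeholder.

```bash
# List the nodes, then the pods running on one of the core nodes.
kubectl get nodes
kubectl get pods --all-namespaces --field-selector spec.nodeName=<core-node-name>
```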

### ingress-nginx-controller pod(s) disruptions

The `support` chart we install in each cluster comes with the `ingress-nginx`
chart. The `ingress-nginx` chart creates one or more `ingress-nginx-controller`
pods that proxy the network traffic of incoming connections.
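
To inspect these pods, something like the following should work, assuming the
standard labels applied by the upstream `ingress-nginx` chart and the `support`
namespace we install the `support` chart into:

```bash
# List the ingress-nginx controller pod(s) and check how many replicas the
# Deployment is configured with.
kubectl get pods -n support -l app.kubernetes.io/name=ingress-nginx
kubectl get deploy -n support support-ingress-nginx-controller
```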

Shutting down such a pod breaks the connections of users working against their
user servers. A broken connection can be re-established if another replica of
this pod is ready to accept a new connection.

We are currently running only one replica of the `ingress-nginx-controller`
pod. This doesn't cause issues during rolling updates, such as when the
Deployment's pod template specification is changed or when manually running
`kubectl rollout restart -n support deploy/support-ingress-nginx-controller`.
We will however have broken connections, and user pods unable to establish new
ones, if `kubectl delete` is used on this single pod or `kubectl drain` is used
on the node.
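
As a sketch of the non-disruptive path, a rolling restart can be issued and then
watched until the new pod is ready, since the Deployment surges up a new pod
before terminating the old one:

```bash
# Rolling restart: a new controller pod is started and becomes ready before the
# old one is terminated, so connections can be re-established.
kubectl rollout restart -n support deploy/support-ingress-nginx-controller
kubectl rollout status -n support deploy/support-ingress-nginx-controller

# By contrast, deleting the single pod directly (or draining its node) leaves a
# window with no ready controller pod, breaking user connections.
# kubectl delete pod -n support -l app.kubernetes.io/name=ingress-nginx
```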

### hub pod disruptions

Our JupyterHub installations each have a single `hub` pod, and having more
isn't supported by JupyterHub itself. Due to this, and because it has a disk
mounted to it that can only be mounted to one pod at a time, it isn't
configured to do rolling updates.
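
A way to confirm this, assuming the default `hub` Deployment name and a
placeholder namespace, is to inspect the Deployment's update strategy, which
with rolling updates disabled as described will show `Recreate` rather than
`RollingUpdate`:

```bash
# Print the hub Deployment's update strategy; "Recreate" means the old pod is
# stopped before a new one is started, so there is a brief hub outage.
kubectl get deploy hub -n <hub-namespace> -o jsonpath='{.spec.strategy.type}'
```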

When the `hub` pod isn't running, users can't visit `/hub` paths, but they can
still visit `/user` paths and keep using their already started user servers.

### proxy pod disruptions

Our JupyterHub installations each have a single `proxy` pod running
`configurable-http-proxy`. Having more replicas isn't supported because
JupyterHub will only update one replica with new proxy routes.

When the `proxy` pod isn't running, users can't visit `/hub`, `/user`, or
`/services` paths, because they all route through the proxy pod.

When the `proxy` pod has started and become ready, it also needs to be
re-configured by JupyterHub on how to route traffic arriving at `/user` and
`/services` paths. This is done during startup and then regularly by JupyterHub
every five minutes. Due to this, a proxy pod being restarted can cause an
outage of up to five minutes.
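
To see whether JupyterHub has re-populated the proxy's routing table after such
a restart, one option is to query `configurable-http-proxy`'s REST API from
inside the cluster, for example from the hub pod where the
`CONFIGPROXY_AUTH_TOKEN` environment variable and the `proxy-api` service (names
as used by the zero-to-jupyterhub chart) are available:

```bash
# List the routes the proxy currently knows about; right after a proxy restart
# this may only contain the default route until JupyterHub re-adds the
# /user and /services routes.
curl -s \
    -H "Authorization: token $CONFIGPROXY_AUTH_TOKEN" \
    http://proxy-api:8001/api/routes
```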

## User node pool disruptions

Disruptions to a user node pool will disrupt the user server pods running on
it.