Merge pull request #5268 from ministryofjustice/rewrite-upgrade-eks-c…
…luster

Add v2 of upgrade cluster runbook
mikebell authored Feb 8, 2024
2 parents ec45ca0 + 92b85c5 commit 45b64b7
Showing 1 changed file with 65 additions and 96 deletions.
161 changes: 65 additions & 96 deletions runbooks/source/upgrade-eks-cluster.html.md.erb
@@ -19,18 +19,6 @@ The EKS version and addons are currently independent of the version of the terra
Therefore, it will not always require an upgrade of the terraform-aws-eks module and/or the addons whenever there is an upgrade of the EKS version.
Please check the changelogs for the terraform-aws-eks module, the EKS version and the addons when planning an upgrade.

## Run the upgrade, via the tools image

The cloud platform [tools image](https://github.com/ministryofjustice/cloud-platform-tools-image.git) has all the software required to run the upgrade.

Start from the root directory of a working copy of the [infrastructure repo](https://github.com/ministryofjustice/cloud-platform-infrastructure.git).

With your environment variables set, launch a bash shell on the tools image:

```bash
make tools-shell
```

## Pre-requisites

Before you begin, there are a few pre-requisites:
@@ -49,136 +37,117 @@ Before you begin, there are a few pre-requisites:

## Upgrade Steps

### Upgrade EKS Terraform Module
### Compatibility Check

The first step of the EKS module upgrade is to identify the major version you want to upgrade to. Review the changes in the [changelog](https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/CHANGELOG.md).
The following areas need to be looked into to determine if there's any additional preparation work to do:

Create a PR in Cloud Platform Infrastructure repository against the [EKS module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf) making the change to the desired [terraform-aws-eks version](https://github.com/terraform-aws-modules/terraform-aws-eks)
* Kubernetes API Deprecations/Removals
* EKS module
* EKS addons
* Components

```
module "eks" {
source = "terraform-aws-modules/eks/aws"
- version = "v16.2.0"
+ version = "v17.1.0"
```
Tools:

Based on the changes in the changelog, you can decide if the upgrade is a breaking change or not.
For Kubernetes API deprecations or removals you can use [kubent](https://github.com/doitintl/kube-no-trouble) and [pluto](https://github.com/FairwindsOps/pluto) to scan the cluster and find if there are any resources impacted in upcoming releases.

#### Upgrade with no breaking changes
From the AWS console you can also see "Upgrade Insights", which has a breakdown of API deprecations and removals. You can drill down into specific versions and see the resources affected. In particular, the User Agent
field here can be useful for tracking down the services calling the API.

- Execute `terraform plan` (or the automated plan pipeline) and review changes. If changes are all as expected, run `terraform apply` to execute the changes.
> Sometimes the User Agent ID isn't clear enough to immediately identify where the affected resource is. If this is the case, it's worth cross-checking components or helm chart versions. Additionally, you can head
over to CloudWatch > Log groups > /aws/eks/[cluster-name] and view the `kube-apiserver-audit` logs, and filter by the `userAgent` field, which can help determine the source of the API calls.

Note: When you run `terraform plan`, if it is only showing launch_template version change as below, executing `terraform apply` will only create a new template version.
Users will need to be notified if their resources are affected by API deprecations or removals.
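
The audit log lookup described above can also be done from a terminal. The following is a minimal sketch, assuming the AWS CLI is configured for the cluster's account. The function names `audit_log_filter` and `find_api_callers` are hypothetical helpers, not part of any existing tooling, and the log group name is whatever CloudWatch shows for your cluster.

```bash
# Build a CloudWatch Logs JSON filter pattern that matches audit events
# whose userAgent field starts with the given prefix.
audit_log_filter() {
  local agent_prefix="$1"
  printf '{ $.userAgent = "%s*" }' "${agent_prefix}"
}

# Search the kube-apiserver-audit log streams of a log group for that pattern.
# Arguments: log group name, user agent prefix (both supplied by the caller).
find_api_callers() {
  local log_group="$1" agent_prefix="$2"
  aws logs filter-log-events \
    --log-group-name "${log_group}" \
    --log-stream-name-prefix kube-apiserver-audit \
    --filter-pattern "$(audit_log_filter "${agent_prefix}")" \
    --max-items 20
}
```

For example, `find_api_callers "<log-group-name>" "kubectl"` would list up to 20 audit events produced by clients whose user agent begins with `kubectl`.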

```
# module.eks.module.node_groups.aws_launch_template.workers["monitoring_ng"] will be updated in-place
~ resource "aws_launch_template" "workers" {
~ default_version = 1 -> (known after apply)
~ latest_version = 1 -> (known after apply)
```
### Preparing for upgrade

For cluster node groups to use the new template version created, you need to run `terraform apply` again, which will trigger a recycle of all the nodes using terraform.
This can be disruptive and can also cause `terraform apply` to time out. Hence, follow the steps below to update the node groups with the new template version.
Communication is an important part of the upgrade procedure. Make sure to update `#ask-cloud-platform` and `#cloud-platform-update` when commencing the upgrade. Create a thread in `#cloud-platform` to keep the team updated on the current status of the upgrade.

To update the node groups with the new template version:
- login to AWS console
- select EKS and select the cluster
- click on the Compute and select the node group
- click on `Change launch template version` option
- select update strategy as `force update` and Update.
Pause the following pipelines:

This will perform a rolling update of all the nodes in the node group. Follow the steps in [Recycle all nodes](#recycle-all-nodes) section to recycle all the nodes.
* bootstrap
* infrastructure-live
* infrastructure-live-2
* infrastructure-manager

#### Upgrade with breaking changes
Update `cluster.tf` in `cloud-platform-infrastructure` with the version of Kubernetes you are upgrading to.

The recent EKS module upgrade from 17 to 18 mentioned breaking changes to the resources. Hence, a non-disruptive process of creating a new node group, moving terraform state,
draining the old node group and finally deleting the old node group was followed.
Run a `tf plan` against the cluster you're upgrading to check that everything is as expected; the only changes should be to resources relating to the version upgrade.

Detailed steps are mentioned in the [google doc](https://docs.google.com/document/d/1Nv1WsqdYMBzjpO8jfmXEqjAY5nZ9GYWUpNNaMJVJyaw/edit?usp=sharing)
> **IMPORTANT:** Do not run `tf apply`; it will most likely time out and fail. Upgrades are carried out manually through the AWS Console.

For any future upgrade where the terraform plan or the changelog shows breaking changes, the procedure needs to be reviewed and modified based on what the breaking changes are.
### Monitoring the upgrade

### Upgrade EKS version
Before you start the upgrade it is useful to have a few monitoring resources up and running so you can catch any issues quickly.

#### Upgrade Control Plane
[k9s](https://k9scli.io/) is a useful tool to have open in a few terminal windows, the following views are helpful:

- Create a PR in Cloud Platform Infrastructure repository against the [EKS module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf) making the change to the desired EKS cluster version.
* nodes - see nodes recycling and coming up with new version
* events - check to see if there are any errors
* pods - you can use vim style searching to see pods in `Error` state.
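
If k9s is not available, roughly the same three views can be approximated with plain `kubectl`. This is a sketch only, assuming `kubectl` already points at the cluster being upgraded; the function names are hypothetical.

```bash
# Watch nodes (and the VERSION column) as they recycle.
watch_nodes() {
  kubectl get nodes --watch
}

# List events across all namespaces, sorted by their last timestamp.
recent_events() {
  kubectl get events --all-namespaces --sort-by=.lastTimestamp
}

# Keep the header row plus any row whose STATUS column is "Error".
# Reads `kubectl get pods --all-namespaces` output from stdin, where
# STATUS is the fourth column.
pods_in_error() {
  awk 'NR == 1 || $4 == "Error"'
}
```

The last helper is used as a filter, for example `kubectl get pods --all-namespaces | pods_in_error`.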

```
module "eks" {
source = "terraform-aws-modules/eks/aws"
- cluster_version = "1.14"
+ cluster_version = "1.15"
When a node group version changes, this will cause all of the nodes to recycle. When AWS recycles the nodes, it will not evict pods if it will break the PDB.
This will cause the node to stall the update and the nodes will **not** continue to recycle.

```
To rectify this, run the script mentioned in [Recycle-all-nodes Gotchas](/recycle-all-nodes.html#gotchas) section.
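
Before reaching for that script, it can help to see which PDBs are currently blocking eviction. The sketch below is a read-only check and is not the script referenced above; it assumes `kubectl` access to the cluster, and the function names are hypothetical.

```bash
# Print namespace, name and currently allowed disruptions for every PDB.
list_pdbs() {
  kubectl get pdb --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed'
}

# Keep only PDBs that allow zero disruptions, printed as namespace/name.
# These are the ones that can stall a node group update.
blocking_pdbs() {
  awk '$3 == "0" { print $1 "/" $2 }'
}
```

Run as `list_pdbs | blocking_pdbs` while the node group is updating; any output identifies workloads whose pods cannot be evicted at that moment.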

- Execute `terraform plan` (or the automated plan pipeline) and review changes. If changes are all as expected, perform the upgrade from the AWS Console EKS Control Plane.
Once the process is completed, [AWS Console](https://eu-west-2.console.aws.amazon.com/eks/home?region=eu-west-2#/clusters) will confirm the Control Plane is on the correct version.
[This](https://kibana.cloud-platform.service.justice.gov.uk/_plugin/kibana/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15d,to:now))&_a=(columns:!(_source),filters:!(),index:'1f29f240-00eb-11ec-8a38-954e9fb3b0ba',interval:auto,query:(language:kuery,query:'%22failed%20to%20assign%20an%20IP%20address%20to%20container%22'),sort:!())) Kibana dashboard is used to monitor the IP assignment for pods when they are rescheduled. If there is a spike in errors then there could be a starvation of IP addresses while scheduling pods.

Note: We don't want to run `terraform apply` to apply the EKS cluster version, as this will trigger a recycle of all the nodes using terraform. This can be disruptive and can also cause `terraform apply` to time out.
### Starting the upgrade

```
$ aws eks describe-cluster --query 'cluster.version' --name manager
"1.15"
$
```
As with preparing for the upgrade, communication is really important. Keep the thread in `#cloud-platform` up to date as much as possible.

![AWS Console](../images/aws-eks-upgrade.png)
#### Increasing coredns pods

#### Upgrade Node Group(s)
To ensure that coredns stays up and running during the cluster upgrade, the number of replicas should be scaled up to 10.
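
One way to do this from the command line is sketched below. It assumes coredns runs as a Deployment named `coredns` in the `kube-system` namespace, which is the usual EKS layout but is not confirmed by this runbook. The `scale_coredns` function and the `KUBECTL` override are hypothetical conveniences.

```bash
# Scale the coredns Deployment to the requested number of replicas.
# Set KUBECTL=echo to print the arguments instead of running kubectl.
scale_coredns() {
  local replicas="$1"
  "${KUBECTL:-kubectl}" --namespace kube-system scale deployment coredns \
    --replicas="${replicas}"
}
```

Calling `scale_coredns 10` scales up for the upgrade. Note the replica count in use beforehand so the same function can restore it when the upgrade is finished.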

The easiest way to upgrade node groups is through AWS Console. We advise to follow the official AWS EKS upgrade instructions from the [Updating a Managed Node Group](https://docs.aws.amazon.com/eks/latest/userguide/update-managed-node-group.html) documentation.
#### Upgrading the control plane

While updating the node group AMI release version, we should also change the launch template version which is created in step 2. To perform both the changes together, select `Update Node Group version` and `Change launch template version` options as shown below. Select update strategy as `force update`; this does not respect pod disruption budgets and forces node restarts.
Log in to the AWS console and select the EKS cluster we're going to upgrade.

![Update Node Group](../images/update-node-group.png)
In the top right corner there should be a button called `Upgrade now`. Click it, ensure the correct Kubernetes version is selected, then press `Update`.

**Testing the upgrade in a test cluster**
Control plane updates usually take 10 minutes to run.
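
Rather than refreshing the console, the update can be waited on from a terminal. This is a sketch under the assumption that the AWS CLI is configured for the right account and region; `wait_for_control_plane` and the `AWS` override are hypothetical.

```bash
# Block until the cluster reports ACTIVE again, then print its version.
# Set AWS=echo to print the commands' arguments instead of calling the CLI.
wait_for_control_plane() {
  local cluster="$1"
  "${AWS:-aws}" eks wait cluster-active --name "${cluster}"
  "${AWS:-aws}" eks describe-cluster --name "${cluster}" \
    --query 'cluster.version' --output text
}
```

For example, `wait_for_control_plane manager` returns once the update has finished and prints the Kubernetes version the control plane is now running.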

Testing the upgrade involves several things and varies depending on the changes involved. Some of the things to consider are:
#### Upgrading the monitoring node group

- Run integration tests
- Monitor Cloudwatch API logs for any failures
- Compare launch template before and after the upgrade and check for any variable changes
- Check for disk space, IP subnet and IP allocation changes for any IP starvation. This might not be obvious in a test cluster, but should be monitored when upgrading live
From the cluster control panel select `Compute` tab.

#### Update kubectl version in tools image
Select `Upgrade now` next to the monitoring node group.

kubectl is supported within one minor version (older or newer) of the cluster version. Update the kubectl version in the cloud platform
[tools image](https://github.com/ministryofjustice/cloud-platform-tools-image.git) to match the current cluster version.
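
The one-minor-version rule can be checked mechanically. The sketch below uses hypothetical helper names and assumes both versions share the same major version and look like `v1.28.3` or `1.28`.

```bash
# Extract the minor number from a version such as "v1.28.3" or "1.28".
minor_of() {
  local version="${1#v}"      # drop an optional leading "v"
  version="${version#*.}"     # drop the major number and first dot
  echo "${version%%.*}"       # drop the patch number, if present
}

# Succeed when the client and cluster are at most one minor version apart.
# Arguments: kubectl client version, cluster version.
within_skew() {
  local client cluster diff
  client="$(minor_of "$1")"
  cluster="$(minor_of "$2")"
  diff=$(( client - cluster ))
  [ "${diff#-}" -le 1 ]       # compare the absolute difference
}
```

For example, `within_skew v1.27.4 1.28` succeeds, while `within_skew v1.26.1 1.28` fails and signals that the tools image needs a newer `kubectl`.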
For update strategy select "Force update"

### Upgrade addon(s)
Click `Update`

We have 3 addons managed through cloud-platform-terraform-eks-add-ons [module](https://github.com/ministryofjustice/cloud-platform-terraform-eks-add-ons).
#### Upgrading the default node group

Before every EKS major version upgrade, check the addon versions and upgrade them if they don't match the EKS major version the cluster is currently on.
From the cluster control panel select `Compute` tab.

After every EKS major version upgrade, check the addon versions and upgrade them if they don't match the EKS major version the cluster was just upgraded to.
Select `Upgrade now` next to the default node group.

The following addons are managed through cloud-platform-terraform-eks-add-ons module.
For update strategy select "Force update"

[managing-kube-proxy](https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html)
Click `Update`

[managing-coredns](https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html)
Once the upgrade has completed notify the Slack channels.

[managing-vpc-cni](https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html)
### Finishing the upgrade

Create a PR in Cloud Platform Infrastructure repository against the cloud-platform-terraform-eks-add-ons [module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf#L192) making the changes to the desired addon versions [here](https://github.com/ministryofjustice/cloud-platform-terraform-eks-add-ons/blob/main/variables.tf#L28-L44). Execute `terraform plan` (or the automated plan pipeline) and review changes. If changes are all as expected, run `terraform apply` to execute the changes.
Create a new pull request in the `cloud-platform-infrastructure` repo with the updated version strings.

### Upgrade AMI version
Unpause the following pipelines in this order and check to make sure no changes are present:

AWS releases new AMI versions for EKS node groups that include Kubernetes patches and security updates. To upgrade the node groups to use the new AMI version:
1. infrastructure-live-2
2. infrastructure-manager
3. infrastructure-live

- login to the AWS console
- Select EKS and select the cluster
- Select the node group and click on `Update AMI version`
- Select the Update Strategy to "Force update" and click on "Update"
If there are no changes for terraform shown in each pipeline then the PR can be merged in.

This will perform a rolling update of all the nodes in the node group. Follow the steps in [Recycle all nodes](#recycle-all-nodes) section to recycle all the nodes.
Unpause the bootstrap pipeline.

### Recycle all nodes
Scale down the coredns pods.

When a node group version changes, this will cause all of the nodes to recycle. When AWS recycles the nodes, it will not evict pods if it will break the PDB.
This will cause the node to stall the update and the nodes will **not** continue to recycle.
### Finishing touches

The `kubectl` version in the `cloud-platform-cli` and `cloud-platform-tools-image` needs updating to match the current Kubernetes version.

To rectify this, run the script mentioned in [Recycle-all-nodes- Gotchas](/recycle-all-nodes.html#gotchas) section.
Documentation used as part of the upgrade should be reviewed and refined if needed.
