Update EKS upgrade runbook notes based on recent 1.25 and eks module upgrades #5195

Merged
merged 8 commits into from
Jan 16, 2024
81 changes: 60 additions & 21 deletions runbooks/source/upgrade-eks-cluster.html.md.erb
@@ -1,7 +1,7 @@
---
title: Upgrade EKS cluster
weight: 53
last_reviewed_on: 2023-10-24
last_reviewed_on: 2024-01-16
review_in: 3 months
---

@@ -12,9 +12,12 @@ The Cloud Platform EKS cluster upgrade consists of three distinct parts:
- Upgrade EKS Terraform Module
- Upgrade EKS version (Control Plane and Node Groups)
- Upgrade addon(s)
- Upgrade AMI version

The Cloud Platform EKS clusters are created using the official [terraform-aws-eks](https://github.com/terraform-aws-modules/terraform-aws-eks) module. The EKS version and addons are currently independent of the version of the terraform-aws-eks module.
Therefore, it will not always require an upgrade of the terraform-aws-eks module and/or the addons whenever there is an upgrade of the EKS version. Please check the changelogs for the terraform-aws-eks module, the EKS version and the addons when planning an upgrade.
The Cloud Platform EKS clusters are created using the official [terraform-aws-eks](https://github.com/terraform-aws-modules/terraform-aws-eks) module.
The EKS version and addons are currently independent of the version of the terraform-aws-eks module.
Therefore, it will not always require an upgrade of the terraform-aws-eks module and/or the addons whenever there is an upgrade of the EKS version.
Please check the changelogs for the terraform-aws-eks module, the EKS version and the addons when planning an upgrade.

## Run the upgrade, via the tools image

@@ -48,9 +51,7 @@ Before you begin, there are a few pre-requisites:

### Upgrade EKS Terraform Module

As mentioned previously; when a new EKS major version is released, it is normally followed by a release of an associated [terraform-aws-eks module](https://github.com/terraform-aws-modules/terraform-aws-eks).

1) The first step of the EKS upgrade is to identify the corresponding module release with the EKS major version you want to upgrade to. Review the changes in the [changelog](https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/CHANGELOG.md). Plan/make any necessary changes or required updates.
The first step of the EKS module upgrade is to identify the major version you want to upgrade to. Review the changes in the [changelog](https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/CHANGELOG.md).

Create a PR in Cloud Platform Infrastructure repository against the [EKS module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf) making the change to the desired [terraform-aws-eks version](https://github.com/terraform-aws-modules/terraform-aws-eks)

@@ -61,9 +62,13 @@ Create a PR in Cloud Platform Infrastructure repository against the [EKS module]
+ version = "v17.1.0"
```

2) Execute `terraform plan` (or the automated plan pipeline) and review changes. If changes are all as expected, run `terraform apply` to execute the changes.
Based on the changes in the changelog, you can decide if the upgrade is a breaking change or not.

#### Upgrade with no breaking changes

Note: When you run `terraform plan`, if it is only showing launch_template version change as below, executing `terraform apply` will only create a new template version. For cluster node groups to use the new template version created, you need to run `terraform apply` again, that will trigger a re-cycle of all the nodes. To avoid the re-cycle of nodes at this stage, we don't run `terraform apply` until we complete the upgrade of node groups along with updating the template version at a later stage.
- Execute `terraform plan` (or the automated plan pipeline) and review changes. If changes are all as expected, run `terraform apply` to execute the changes.

Note: If `terraform plan` shows only a `launch_template` version change, as below, running `terraform apply` will only create a new template version.

```
# module.eks.module.node_groups.aws_launch_template.workers["monitoring_ng"] will be updated in-place
@@ -72,9 +77,31 @@ Note: When you run `terraform plan`, if it is only showing launch_template versi
~ latest_version = 1 -> (known after apply)
```

### Upgrade Control Plane
For the cluster node groups to use the new template version, you would need to run `terraform apply` again, which triggers a recycle of all the nodes through terraform.
This can be disruptive and can also cause `terraform apply` to time out, so follow the steps below to update the node groups with the new template version instead.

To update the node groups with the new template version:
- Log in to the AWS console
- Click on the node group
- Click on the `Change launch template version` option
- Select `force update` as the update strategy and submit.

This will perform a rolling update of all the nodes in the node group. Follow the steps in [Recycle all nodes](#recycle-all-nodes) section to recycle all the nodes.
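
The console steps above have a CLI counterpart in `aws eks update-nodegroup-version`. The following is a minimal sketch, not part of the runbook: the helper name `nodegroup_update_cmd` and all argument values are hypothetical, and it only prints the command so that it can be reviewed before anyone runs it.

```shell
#!/usr/bin/env bash
# Print (do not run) the AWS CLI command that moves a managed node group
# to a given launch template version, forcing the update.
# Arguments: cluster name, node group name, launch template name, version.
# The helper name and every value passed to it are illustrative assumptions.
nodegroup_update_cmd() {
  local cluster="$1" nodegroup="$2" template="$3" version="$4"
  printf 'aws eks update-nodegroup-version --cluster-name %s --nodegroup-name %s --launch-template name=%s,version=%s --force\n' \
    "$cluster" "$nodegroup" "$template" "$version"
}

# Example with placeholder names:
#   nodegroup_update_cmd manager monitoring_ng my-launch-template 2
```

`--force` corresponds to the `force update` strategy in the console, so the same caution about pod disruption applies.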

#### Upgrade with breaking changes

The recent EKS module upgrade from 17 to 18 introduced breaking changes to the resources. Hence, a non-disruptive process was followed: creating a new node group, moving the terraform state,
draining the old node group and finally deleting the old node group.

3) Create a PR in Cloud Platform Infrastructure repository against the [EKS module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf) making the change to the desired EKS cluster version.
Detailed steps are in this [Google doc](https://docs.google.com/document/d/1Nv1WsqdYMBzjpO8jfmXEqjAY5nZ9GYWUpNNaMJVJyaw/edit?usp=sharing).

For any future upgrade where the terraform plan or the changelog shows breaking changes, review and adapt this procedure to fit those specific changes.

### Upgrade EKS version

#### Upgrade Control Plane

- Create a PR in Cloud Platform Infrastructure repository against the [EKS module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf) making the change to the desired EKS cluster version.

```
module "eks" {
@@ -84,12 +111,11 @@ Note: When you run `terraform plan`, if it is only showing launch_template versi

```

4) Execute `terraform plan` (or the automated plan pipeline) and review changes. If changes are all as expected, perform the upgrade from the AWS Console EKS Control Plane.

We don't want to run `terraform apply` to apply the EKS cluster version, as the terraform apply process will take longer and timed out, also to avoid re-cycling of nodes as explained in step 2.

- Execute `terraform plan` (or the automated plan pipeline) and review changes. If changes are all as expected, perform the upgrade from the AWS Console EKS Control Plane.
Once the process is completed, [AWS Console](https://eu-west-2.console.aws.amazon.com/eks/home?region=eu-west-2#/clusters) will confirm the Control Plane is on the correct version.

Note: We don't want to run `terraform apply` to apply the EKS cluster version, as this will trigger a recycle of all the nodes through terraform. This can be disruptive and can also cause `terraform apply` to time out.

```
$ aws eks describe-cluster --query 'cluster.version' --name manager
"1.15"
@@ -98,22 +124,24 @@ $

![AWS Console](../images/aws-eks-upgrade.png)
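
The version check shown with `aws eks describe-cluster` can be wrapped in a small helper that fails when the control plane is not on the expected version. This is a sketch under assumptions: the function name `control_plane_is_on` is hypothetical, and the cluster name and version in the usage line are placeholders.

```shell
#!/usr/bin/env bash
# Compare the control plane version reported by AWS with the expected one.
# Arguments: cluster name, expected version (for example "1.25").
# Returns 0 when they match and 1 otherwise.
control_plane_is_on() {
  local cluster="$1" expected="$2" actual
  # --output text returns the bare version string without JSON quotes.
  actual="$(aws eks describe-cluster --name "$cluster" \
    --query 'cluster.version' --output text)"
  echo "cluster ${cluster} is on ${actual}, expected ${expected}"
  [ "$actual" = "$expected" ]
}

# Example with placeholder values:
#   control_plane_is_on manager 1.25 || echo "upgrade not finished yet"
```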

### Upgrade Node Group(s)
#### Upgrade Node Group(s)

The easiest way to upgrade node groups is through AWS Console. We advise to follow the official AWS EKS upgrade instructions from the [Updating a Managed Node Group](https://docs.aws.amazon.com/eks/latest/userguide/update-managed-node-group.html) documentation.

While updating the node group AMI release version, we should also change the launch template version which is created in step 2. To perform both the changes together, select `Update Node Group version` and `Change launch template version` options as shown below. Select update strategy as `force update`, this does not respect pod disruption budgets and it forces node restarts.

![Update Node Group](../images/update-node-group.png)

### Recycle all nodes
**Testing the upgrade in a test cluster**

When a node group version changes, this will cause all of the nodes to recycle. When AWS recycles the nodes, it will not evict pods if it will break the PDB.
This will cause the node to stall the update and the nodes will **not** continue to recycle.
Testing the upgrade involves several checks, which vary depending on the changes involved. Some of the things to consider are:

To rectify this, run the script mentioned in [Recycle-all-nodes- Gotchas](/recycle-all-nodes.html#gotchas) section.
- Run integration tests
- Monitor CloudWatch API logs for any failures
- Compare the launch template before and after the upgrade and check for any variable changes
- Check disk space, IP subnet and IP allocation changes for any IP starvation. This might not be obvious in a test cluster, but monitor it when upgrading live

### Update kubectl version in tools image
#### Update kubectl version in tools image

kubectl is supported within one minor version (older or newer) of the cluster version. Update the kubectl version in the cloud platform
[tools image](https://github.com/ministryofjustice/cloud-platform-tools-image.git) to match the current cluster version.
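
The one-minor-version rule can be checked mechanically before bumping the tools image. A minimal sketch, assuming the hypothetical helper name `kubectl_within_skew`, that both arguments are plain `major.minor` strings, and that the major versions are the same:

```shell
#!/usr/bin/env bash
# Return 0 when the kubectl client minor version is within one minor
# version (older or newer) of the cluster minor version, 1 otherwise.
# Arguments: client version, cluster version, both as "major.minor".
kubectl_within_skew() {
  # Strip everything up to and including the first dot: "1.25" -> "25".
  local client_minor="${1#*.}"
  local cluster_minor="${2#*.}"
  local diff=$(( client_minor - cluster_minor ))
  # Take the absolute value of the difference.
  if [ "$diff" -lt 0 ]; then diff=$(( -diff )); fi
  [ "$diff" -le 1 ]
}

# Example with placeholder values:
#   kubectl_within_skew 1.24 1.25 && echo "kubectl version is supported"
```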
@@ -122,7 +150,11 @@ kubectl is supported within one minor version (older or newer) of the cluster ve

We have 3 addons managed through cloud-platform-terraform-eks-add-ons [module](https://github.com/ministryofjustice/cloud-platform-terraform-eks-add-ons).

Refer to the below documents to get the addon version to be used with the EKS major version you just upgraded to.
Before every EKS major version upgrade, check whether the addon versions match the EKS major version the cluster is currently on, and upgrade them if they don't.

After every EKS major version upgrade, check whether the addon versions match the EKS major version you just upgraded to, and upgrade them if they don't.

The following addons are managed through cloud-platform-terraform-eks-add-ons [module](

[managing-kube-proxy](https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html)

@@ -131,3 +163,10 @@ Refer to the below documents to get the addon version to be used with the EKS ma
[managing-vpc-cni](https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html)

Create a PR in Cloud Platform Infrastructure repository against the cloud-platform-terraform-eks-add-ons [module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf#L192) making the changes to the desired addon versions [here](https://github.com/ministryofjustice/cloud-platform-terraform-eks-add-ons/blob/main/variables.tf#L28-L44). Execute `terraform plan` (or the automated plan pipeline) and review changes. If changes are all as expected, run `terraform apply` to execute the changes.

### Recycle all nodes

When a node group version changes, all of the nodes are recycled. When AWS recycles the nodes, it will not evict pods if doing so would break a PodDisruptionBudget (PDB).
This stalls the update, and the nodes will **not** continue to recycle.

To rectify this, run the script mentioned in [Recycle-all-nodes- Gotchas](/recycle-all-nodes.html#gotchas) section.
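
Before recycling, it can help to see which PDBs currently allow no disruptions, since those are the ones likely to stall the update. The sketch below is not the script from the Recycle-all-nodes runbook; it assumes a hypothetical helper name, `blocking_pdbs`, and relies only on `kubectl get pdb` and the `status.disruptionsAllowed` field.

```shell
#!/usr/bin/env bash
# List PodDisruptionBudgets that currently allow zero disruptions,
# one "namespace/name" per line.
blocking_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' |
    # Keep only the rows whose third column (disruptionsAllowed) is 0.
    awk -F '\t' '$3 == 0 { print $1 "/" $2 }'
}

# Example:
#   blocking_pdbs
```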