Cluster create/delete runbook update #5199

Merged
merged 3 commits into from
Jan 17, 2024
69 changes: 44 additions & 25 deletions runbooks/source/delete-cluster.html.md.erb
@@ -1,7 +1,7 @@
---
title: Delete a cluster
weight: 55
last_reviewed_on: 2023-11-20
last_reviewed_on: 2024-01-16
review_in: 6 months
---

@@ -13,8 +13,8 @@ In most cases, it is recommended to pass responsibility for deleting a test clus

## Delete the cluster with Concourse `delete-cluster` pipeline

We have a [dedicated pipeline](https://concourse.cloud-platform.service.justice.gov.uk/teams/main/pipelines/delete-cluster) for deleting test clusters. You can configure and trigger this pipeline
against your test cluster for removal by utilising the associated cloud-platform cli `pipeline delete-cluster` [command](https://github.com/ministryofjustice/cloud-platform-cli/blob/19d33d6618013f0f4047a545b5f0d184d3d2fdfb/pkg/commands/pipeline.go).
We have a [dedicated pipeline](https://concourse.cloud-platform.service.justice.gov.uk/teams/main/pipelines/delete-cluster) for deleting test clusters.
You can configure and trigger this pipeline against your test cluster for removal by utilising the associated cloud-platform cli `pipeline delete-cluster` [command](https://github.com/ministryofjustice/cloud-platform-cli/blob/19d33d6618013f0f4047a545b5f0d184d3d2fdfb/pkg/commands/pipeline.go).

In order to use this command, ensure you have the following installed:

@@ -49,46 +49,65 @@ configuration updated
started delete-cluster/delete #123
```

## Delete the cluster locally, using the cli `cluster delete` command
## Delete an EKS cluster manually

Follow these steps to delete the EKS cluster.

To delete a cluster:
First, set the kubectl context for the EKS cluster you are deleting. The easiest way to do this is with the `aws` command:

```
$ export AWS_PROFILE=moj-cp
$ export KUBECONFIG=~/.kube/config
$ export cluster=<cluster-name>
$ aws eks --region eu-west-2 update-kubeconfig --name ${cluster}
```

Start from the root directory of a working copy of the [infrastructure repo].

There is a [delete-cluster command] which will handle deleting your cluster.

The command is entirely non-interactive, and will not prompt you to confirm anything. It just destroys things.
You should see this output:

### First, run `make tools-shell`
```
Added new context arn:aws:eks:eu-west-2:754256621582:cluster/<cluster-name> to .kube/config

> The delete cluster command must *always* be run in a container. This ensures that the environment of the command is fully controlled, and you don't run into problems such as the kubernetes context being changed in another window, or extra environment variables causing unwanted effects.
```
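
Since the steps that follow depend on the kubectl context set here, it may be worth confirming that the active context really is the cluster you intend to delete. The sketch below is one way to do that; `context_matches_cluster` is a hypothetical helper written for illustration, and it assumes the context name has the ARN form shown in the output above.

```bash
#!/usr/bin/env bash
# Sketch only: `context_matches_cluster` is a hypothetical helper, not part of
# kubectl, the aws CLI, or the cloud-platform CLI.

# Succeeds when the context name ends in ":cluster/<name>", which is the form
# of the context that `aws eks update-kubeconfig` reports adding.
context_matches_cluster() {
  local context="$1" cluster="$2"
  [[ -n "$cluster" && "$context" == *":cluster/${cluster}" ]]
}

# Example usage, commented out so that the sketch has no side effects:
# current="$(kubectl config current-context)"
# context_matches_cluster "$current" "$cluster" \
#   || { echo "unexpected context: $current" >&2; exit 1; }
```

Matching on the full `:cluster/<name>` suffix, rather than a substring, avoids treating a cluster whose name merely starts with the same characters as a match.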

Then invoke the command like this:
Then, from the root of a checkout of the `cloud-platform-infrastructure` repository, run
these commands to destroy all cluster components, and delete the terraform workspace:

```
cloud-platform cluster delete --name <cluster-nameto-be-deleted> --dry-run=false
$ cd terraform/aws-accounts/cloud-platform-aws/vpc/eks/components
$ terraform init
$ terraform workspace select ${cluster}
$ terraform destroy
```

Run with `--dry-run=true` to do a dry run (if you don't pass a flag it will default to true), and see what commands would be executed.

You can get more information using:
> The destroy process often gets stuck on prometheus operator. If that happens, running this in a separate window usually works:
> ```
> kubectl -n monitoring delete job prometheus-operator-operator-cleanup
> ```

```
cloud-platform cluster delete --help
$ terraform workspace select default
$ terraform workspace delete ${cluster}
```

If any steps fail:
Change directories and perform the following to destroy the EKS cluster, and delete the terraform workspace.

* Fix the underlying problem
* Re-run the command
```
$ cd .. # working dir is now `eks`
$ terraform init
$ terraform workspace select ${cluster}
$ terraform destroy
$ terraform workspace select default
$ terraform workspace delete ${cluster}
```

## Delete an EKS cluster manually
Change directories and perform the following to destroy the cluster VPC, and delete the terraform workspace.

The steps can be found here - [Delete an EKS Cluster]
```
$ cd .. # working dir is now `vpc`
$ terraform init
$ terraform workspace select ${cluster}
$ terraform destroy
$ terraform workspace select default
$ terraform workspace delete ${cluster}
```
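
The three blocks above repeat the same terraform sequence in `components`, then `eks`, then `vpc`. As a rough sketch of that ordering, the script below prints the commands for each layer by default and only runs them when `DRY_RUN` is set to `false`. The names `destroy_layers`, `run`, `LAYERS` and `DRY_RUN` are assumptions made for this illustration and do not exist in the infrastructure repository; the sketch is meant to be run from the root of the `cloud-platform-infrastructure` checkout.

```bash
#!/usr/bin/env bash
# Sketch only: `destroy_layers`, `run`, LAYERS and DRY_RUN are hypothetical
# names used for illustration. Run from the repository root.

# Layers relative to the repository root, innermost first.
LAYERS=(
  "terraform/aws-accounts/cloud-platform-aws/vpc/eks/components"
  "terraform/aws-accounts/cloud-platform-aws/vpc/eks"
  "terraform/aws-accounts/cloud-platform-aws/vpc"
)

# Run a command, or just print it when DRY_RUN is "true" (the default).
run() {
  if [[ "${DRY_RUN:-true}" == "true" ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

destroy_layers() {
  local cluster="$1" layer
  [[ -n "$cluster" ]] || { echo "usage: destroy_layers <cluster-name>" >&2; return 1; }
  for layer in "${LAYERS[@]}"; do
    # Each layer runs in a subshell so the cd does not leak into the next one.
    # The && chain stops a layer at its first failing command.
    (
      run cd "$layer" &&
      run terraform init &&
      run terraform workspace select "$cluster" &&
      run terraform destroy &&
      run terraform workspace select default &&
      run terraform workspace delete "$cluster"
    ) || return 1   # do not move on to the next layer after a failure
  done
}

# Example usage, commented out so that the sketch has no side effects:
# destroy_layers "$cluster"                # print the commands only
# DRY_RUN=false destroy_layers "$cluster"  # actually run them
```

Stopping at the first failure matters here because the workspace should only be deleted once `terraform destroy` for that layer has succeeded.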

[infrastructure repo]: https://github.com/ministryofjustice/cloud-platform-infrastructure
[delete-cluster command]: https://github.com/ministryofjustice/cloud-platform-cli/blob/19d33d6618013f0f4047a545b5f0d184d3d2fdfb/pkg/cluster/delete.go
162 changes: 22 additions & 140 deletions runbooks/source/eks-cluster.html.md.erb
@@ -1,7 +1,7 @@
---
title: EKS Cluster
weight: 350
last_reviewed_on: 2024-01-09
last_reviewed_on: 2024-01-16
review_in: 3 months
---

@@ -11,7 +11,7 @@ review_in: 3 months

You can create a new EKS test cluster using the [cluster build pipeline].

Alternatively, using the `create-cluster` script.
Alternatively, if you want to create a cluster manually, follow the steps below.

## Pre-requisites

@@ -42,16 +42,10 @@ export AUTH0_CLIENT_ID=
export AUTH0_CLIENT_SECRET=
```

Execute the script inside the [cloud-platform-tool] container from the root of [cloud-platform-infrastructure] repo, run:

```
make tools-shell
```

This will launch the tool container, from there you can run the execute script by providing the desired name of your new cluster. e.g.:
Execute the cloud-platform command to create a new cluster:

```bash
./create-cluster.rb --name mogaal-eks
cloud-platform cluster create --name <cluster-name>
```

Check the pre-requisites and environment variables section of this document before running this script.
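
One way to make that check mechanical is a small guard that fails when a required variable is empty. This is a sketch under assumptions: `require_env` is a hypothetical helper, and only the two `AUTH0_*` variables visible in this diff are listed, so the list would need extending with the rest of the variables from the environment variables section.

```bash
#!/usr/bin/env bash
# Sketch only: `require_env` is a hypothetical helper, not part of the
# cloud-platform CLI.

# Report every named variable that is unset or empty; fail if any is missing.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [[ -z "${!name:-}" ]]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage, commented out so that the sketch has no side effects:
# require_env AUTH0_CLIENT_ID AUTH0_CLIENT_SECRET \
#   && cloud-platform cluster create --name "$cluster"
```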
@@ -60,12 +54,12 @@ NB: Your cluster name must be **no more than 12 characters**. Any longer, and so

See our [cluster naming policy](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/009-Naming-convention-for-clusters.md) for information on how to choose a suitable name for your cluster.
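
The 12-character limit is easy to check before starting a build that takes around half an hour. The sketch below only checks length; `valid_cluster_name` is a hypothetical helper, and it does not encode anything from the naming policy itself.

```bash
#!/usr/bin/env bash
# Sketch only: `valid_cluster_name` is a hypothetical helper that checks the
# length limit and nothing else.

# Succeeds when the name is between 1 and 12 characters long.
valid_cluster_name() {
  local name="$1"
  (( ${#name} >= 1 && ${#name} <= 12 ))
}

# Example usage, commented out so that the sketch has no side effects:
# valid_cluster_name "$cluster" \
#   || { echo "cluster name must be 1-12 characters: $cluster" >&2; exit 1; }
```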

By default, the script will create a `small` cluster. This means the master and worker EC2 instances will be less powerful machine types than in our production cluster.
By default, the script will create a `small` cluster. This means the worker EC2 instances will be less powerful machine types than in our production cluster.

You can see more options to use when creating the cluster by running:

```bash
./create-cluster.rb --help
cloud-platform cluster create --help
```

The script takes around 30 minutes to execute. At the end, you should see output like this:
@@ -161,138 +155,26 @@ terraform workspace new <WorkspaceName>
terraform apply
```

### 4. Delete the EKS cluster

#### Delete the EKS cluster using the script

There is a [destroy-cluster.rb] script which you can use to delete your cluster.

Read the script before using it. Deleting a cluster is something you should be very cautious about, and ensure you know exactly what you're doing.

The script is entirely non-interactive, and will not prompt you to confirm anything. It just destroys things.

First, run `make tools-shell`

> The delete cluster script must *always* be run in a container. This ensures that the environment of the script is fully controlled, and you don't run into problems such as the kubernetes context being changed in another window, or extra environment variables causing unwanted effects.

Then invoke the script like this:

```
./destroy-cluster.rb --name [short cluster name] --yes
```

Run without `--yes` to do a dry run, and see what commands would be executed.

You can get more information using:

```
./destroy-cluster.rb --help
```

If any steps fail:

* Fix the underlying problem
* Edit the script to comment out any sections of the `ClusterDeleter.run` function which you no longer need to run
* Re-run the script

#### Delete the cluster using concourse fly commands

If you prefer to use a Concourse pipeline to destroy the cluster, follow these steps to delete it using Concourse `fly` commands.

First, `cd` to the working copy of the concourse [pipelines repo][pipelines repo]. Make the two changes below to the [eks-create-test-destroy.yaml][create-test-destroy] file.

In the eks-create-test-destroy pipeline definition, comment out the below line in destroy-cluster job.

```
args:
# export $(cat keyval/keyval.properties | grep CLUSTER_NAME )
```

Commenting out this will not set the `CLUSTER_NAME` provided by the create-cluster-run-tests job.

```
./destroy-cluster.rb --name $CLUSTER_NAME --yes
```

Run the below commands updating the `<cluster-name-to-be-deleted>`.

The first fly command will apply the changes made for the [eks-create-test-destroy.yaml][create-test-destroy] file with the hardcoded `CLUSTER_NAME` in the destroy-cluster job

The second command will trigger the destroy-cluster job for the CLUSTER_NAME updated in the destroy-cluster job.

```
fly -t manager sp -p create-test-destroy -c create-test-destroy.yaml
fly -t manager trigger-job -j create-test-destroy/destroy-cluster
```
Note: After the destroy-cluster job has completed successfully, run the [bootstrap pipeline][bootstrap pipleine] to discard the changes made to the [eks-create-test-destroy.yaml][create-test-destroy] file.

```
fly -t manager trigger-job -j bootstrap/bootstrap-pipelines
```

#### Delete the EKS cluster manually

Follow these steps to delete the EKS cluster.

First, set the kubectl context for the EKS cluster you are deleting. The easiest way to do this is with the `aws` command:

```
$ export KUBECONFIG=~/.kube/config
$ export cluster=<cluster-name>
$ aws eks --region eu-west-2 update-kubeconfig --name ${cluster}
```

You should see this output:

```
Added new context arn:aws:eks:eu-west-2:754256621582:cluster/<cluster-name> to .kube/config
## Creating a live-like test cluster

```
When testing cluster upgrades, it is useful to test the procedure on a cluster that is as close to the live cluster as possible. The following steps update an existing test cluster
to a configuration similar to the live cluster.

Then, from the root of a checkout of the `cloud-platform-infrastructure` repository, run
these commands to destroy all cluster components, and delete the terraform workspace:
**Pre-requisites:**

```
$ cd terraform/aws-accounts/cloud-platform-aws/vpc/eks/components
$ terraform init
$ terraform workspace select ${cluster}
$ terraform destroy
```
> The destroy process often gets stuck on prometheus operator. If that happens, running this in a separate window usually works:
> ```
> kubectl -n monitoring delete job prometheus-operator-operator-cleanup
> ```
- a test cluster created using the [cluster build pipeline] or manually
- The environment variables and pre-requisites as described [above](#pre-requisites)

```
$ terraform workspace select default
$ terraform workspace delete ${cluster}
```
**Steps:**

Change directories and perform the following to destroy the EKS cluster, and delete the terraform workspace.

```
$ cd .. # working dir is now `eks`
$ terraform init
$ terraform workspace select ${cluster}
$ terraform destroy
$ terraform workspace select default
$ terraform workspace delete ${cluster}
```

Change directories and perform the following to destroy the cluster VPC, and delete the terraform workspace.

```
$ cd .. # working dir is now `vpc`
$ terraform init
$ terraform workspace select ${cluster}
$ terraform destroy
$ terraform workspace select default
$ terraform workspace delete ${cluster}
```
- Update the node group desired count to the same value as the live cluster (say 50) in the console. Applying this through terraform doesn't work for the desired count
- Set `node_groups_count` to the same value as the live cluster (say 64) and `default_ng_min_count` to 50 in [terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf]
- Apply the terraform code changes to the test cluster
- `cd` to [terraform/aws-accounts/cloud-platform-aws/vpc/eks/components] and enable ecr-exporter, cloudwatch_exporter, velero, overprovisioner and the other components that are installed only on the live cluster
- Apply the terraform code changes to the test cluster
- Update the starter pack count to 40 and apply the terraform code changes to the test cluster
- Set up pingdom alerts for the starter-pack helloworld app

[create a cluster]: https://runbooks.cloud-platform.service.justice.gov.uk/eks-cluster.html#provisioning-eks-clusters
[cluster build pipeline]: https://concourse.cloud-platform.service.justice.gov.uk/teams/main/pipelines/create-cluster
[destroy-cluster.rb]: https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/destroy-cluster.rb
[create-test-destroy]: https://github.com/ministryofjustice/cloud-platform-terraform-concourse/blob/main/pipelines/manager/main/eks-create-test-destroy.yaml
[cloud-platform-tool]: https://github.com/ministryofjustice/cloud-platform-tools-image
[cloud-platform-infrastructure]: https://github.com/ministryofjustice/cloud-platform-infrastructure
[terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf]: https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf
[terraform/aws-accounts/cloud-platform-aws/vpc/eks/components]: https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components