Merge pull request #6279 from ministryofjustice/update-runbooks
update review dates, fix punctuations and tenses
FolarinOyenuga authored Oct 16, 2024
2 parents 9f2652f + bb4164f commit 9bb15da
Showing 13 changed files with 35 additions and 35 deletions.
2 changes: 1 addition & 1 deletion runbooks/source/access-eks-cluster.html.md.erb
@@ -1,7 +1,7 @@
---
title: Access EKS Cluster
weight: 8600
-last_reviewed_on: 2024-07-10
+last_reviewed_on: 2024-10-16
review_in: 3 months
---

14 changes: 7 additions & 7 deletions runbooks/source/change-alias-in-route53.html.md.erb
@@ -1,19 +1,19 @@
---
title: Change load balancer alias to the interface IP's in Route53.
weight: 358
-last_reviewed_on: 2024-07-10
+last_reviewed_on: 2024-10-16
review_in: 3 months
---

# <%= current_page.data.title %>

-This run book is a recovery action to mitigate slow performance of ingress traffic [incident][performance incident] when an interface fails in an availability zone (AZ), clients time out when they attempt to connect to one of unhealthy NLB EIPs
+This runbook is a recovery action to mitigate slow performance of ingress traffic [incident][performance incident] when an interface fails in an availability zone (AZ) and clients time out when they attempt to connect to one of the unhealthy NLB EIPs.

## Request AWS to restart the health check

AWS confirmed the root cause of the [incident][performance incident] as “the health checking subsystem did not correctly detect some of your targets as unhealthy, which resulted in clients timing out when they attempted to connect to one of your NLB EIPs".

-AWS mitigated the impact by restarting the health checking service, which caused the target health to be updated appropriately. Cloud-platform team don't have access to restart the health check service, request AWS to restart it for us.
+AWS mitigated the impact by restarting the health checking service, which caused the target health to be updated appropriately. The cloud-platform team don't have access to restart the health check service; we request AWS to restart it for us.

If restarting still has not resolved the issue, look at changing the load balancer alias.

@@ -47,9 +47,9 @@ a76b4f2b1811e4f7589eaca69c4a46c5-b700f2aa70780ce3.elb.eu-west-2.amazonaws.com ha

2) Find the unhealthy NLB EIP

-Now we have all of the information we need to make a cURL call over to the external load balancer EIPs.
+Now, we have all of the information we need to make a cURL call over to the external load balancer EIPs.
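If you need to list the EIPs behind the NLB hostname again, a plain DNS lookup is enough (a sketch, using the example NLB hostname shown in the output above):

```bash
# Resolve the NLB hostname to its per-AZ EIPs
dig +short a76b4f2b1811e4f7589eaca69c4a46c5-b700f2aa70780ce3.elb.eu-west-2.amazonaws.com
```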

-Run this on 3 EIPs of the NLB. If everything is working correctly it would return OK. If it return "Timeout", then it is most likely an unhealthy external load balancer EIP.
+Run this on 3 EIPs of the NLB. If everything works correctly, it will return OK. If it returns "Timeout", then it is most likely an unhealthy external load balancer EIP.

```
while :; do (curl -o/dev/null -m1 -k -H 'Host: login.yy-0208-0000.cloud-platform.service.justice.gov.uk' https://35.179.65.116 2>/dev/null && echo "OK") || echo "Timeout" ; sleep 1 ; done
@@ -67,11 +67,11 @@ _external_dns.login.yy-0208-0000.cloud-platform.service.justice.gov.uk TXT Weigh
"heritage=external-dns,external-dns/owner=yy-0208-0000,external-dns/resource=ingress/kuberos/kuberos"
```

-Edit the route53 TXT record and update the owner, set the incorrect owner field, so external-dns couldn't revert the information in the A record.
+Edit the route53 TXT record and update the owner: set an incorrect owner field so that external-dns can't revert the information in the A record.
```
"heritage=external-dns,external-dns/owner=yy-CCCC-BBBB,external-dns/resource=ingress/kuberos/kuberos"
```

Edit the "A" record and uncheck the alias option, add 2 healthy IP's in the value filed and save the record. Repeat this on all the hosts using the affected NLB.
Edit the "A" record and uncheck the alias option, add 2 healthy IP's in the value field and save the record. Repeat this on all the hosts using the affected NLB.

[performance incident]: https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#q3-2022-july-september
6 changes: 3 additions & 3 deletions runbooks/source/creating-a-live-like.html.md.erb
@@ -1,7 +1,7 @@
---
title: Creating a live-like Cluster
weight: 350
-last_reviewed_on: 2024-04-10
+last_reviewed_on: 2024-10-16
review_in: 6 months
---

@@ -38,7 +38,7 @@ to the configuration similar to the live cluster.

2. Add the `starter_pack_count = 40` variable to the starter_pack module

-> Sometimes terraform will error out with an unclear error message this is usually due to a low default `ulimit` to fix this you can set `ulimit -n 2048`
+> Sometimes terraform will error out with an unclear error message. This is usually due to a low default `ulimit`. To fix this, you can set `ulimit -n 2048`

3. Run `terraform plan` and confirm that your changes are correct
4. Run `terraform apply` to apply the changes to your test cluster
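Put together, steps 2 to 4 look something like the sketch below, assuming you run it from the directory containing your test cluster's terraform configuration (the `starter_pack_count` variable is added to the starter_pack module before planning):

```bash
# If terraform errors out with an unclear message, raise the open-file limit first (see the note above)
ulimit -n 2048

# After adding `starter_pack_count = 40` to the starter_pack module:
terraform plan   # review the planned changes
terraform apply  # apply them to the test cluster
```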
@@ -52,7 +52,7 @@ See documentation for upgrading a [cluster](upgrade-eks-cluster.html).

* Set up pingdom alerts for starter-pack helloworld and multi-container app

-> When nodes recycle it's possible that the multi-container app will break giving false positives.
+> When nodes recycle, it's possible that the multi-container app will break, giving false positives.

* Useful one-liners
* `watch -n 1 "kubectl get events"` - get all Kubernetes events
10 changes: 5 additions & 5 deletions runbooks/source/custom-default-backend.html.md.erb
@@ -1,6 +1,6 @@
---
title: Custom default-backend
-last_reviewed_on: 2024-07-10
+last_reviewed_on: 2024-10-16
weight: 9000
review_in: 3 months
---
@@ -16,12 +16,12 @@ However, some applications don’t want to use the cloud-platform custom default
## Creating your own custom error page

### 1. Create your docker image
-First create a docker image containing custom HTTP error pages using the [example][ingress-nginx-custom-error-pages] from the ingress-nginx, or [simplified version][cloud-platform-custom-error-pages] created by the cloud platform team.
+First, create a docker image containing custom HTTP error pages using the [example][ingress-nginx-custom-error-pages] from ingress-nginx, or the [simplified version][cloud-platform-custom-error-pages] created by the cloud platform team.

### 2. Creating a service and deployment
Using this [custom-default-backend][customized-default-backend] example from ingress-nginx, create a service and deployment of the error pages container in your namespace.

-To create Deployment and Service manually use this below command:
+To create Deployment and Service manually, use the command below:

```
$ kubectl -n ${namespace} create -f custom-default-backend.yaml
@@ -80,11 +80,11 @@ spec:
port:
number: 4567
```
-> Note - Please change the `ingress-name` and `environment-name` values in the above example, you can get the `environment-name` value from your namespace label "cloud-platform.justice.gov.uk/environment-name". The `colour` should be `green` for ingress in EKS `live` cluster
+> Note - Please change the `ingress-name` and `environment-name` values in the above example. You can get the `environment-name` value from your namespace label "cloud-platform.justice.gov.uk/environment-name". The `colour` should be `green` for ingresses in the EKS `live` cluster.

## Use the platform-level error page

-Some teams want their application to serve their own error page for example 404s, but want to serve cloud platforms custom error page from ingress controller default backend for other error codes like 502,503 and 504, this can be done by using [custom-http-errors][custom-http-error-annotation] annotation in your ingress for error codes teams want to serve the cloud platforms custom error page.
+Some teams want their application to serve its own error page for some errors, for example 404s, but want the cloud platform's custom error page from the ingress controller default backend for other error codes like 502, 503 and 504. This can be done by using the [custom-http-errors][custom-http-error-annotation] annotation in your ingress, listing the error codes for which the cloud platform's custom error page should be served.

Example Ingress file to use the platform-level error page for custom-http-errors: "502,503,504". All other errors except `502,503,504` will be served from the application error page.
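If you would rather not edit the manifest by hand, the same annotation can be applied to an existing ingress with kubectl (a sketch; the namespace and ingress name are placeholders):

```bash
# Serve 502, 503 and 504 from the platform default backend for this ingress
kubectl -n ${namespace} annotate ingress <ingress-name> \
  nginx.ingress.kubernetes.io/custom-http-errors="502,503,504" --overwrite
```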

2 changes: 1 addition & 1 deletion runbooks/source/destroy-concourse-build-data.html.md.erb
@@ -1,7 +1,7 @@
---
title: Destroy Concourse Build Data
weight: 9000
-last_reviewed_on: 2024-07-10
+last_reviewed_on: 2024-10-16
review_in: 3 months
---

2 changes: 1 addition & 1 deletion runbooks/source/divergence-error.html.md.erb
@@ -1,7 +1,7 @@
---
title: How to Investigate Divergence Errors
weight: 210
-last_reviewed_on: 2024-07-10
+last_reviewed_on: 2024-10-16
review_in: 3 months
---

2 changes: 1 addition & 1 deletion runbooks/source/expand.html.md.erb
@@ -1,7 +1,7 @@
---
title: Expanding Persistent Volumes created using StatefulSets
weight: 600
-last_reviewed_on: 2024-07-10
+last_reviewed_on: 2024-10-16
review_in: 3 months
---

2 changes: 1 addition & 1 deletion runbooks/source/helm-repository.html.md.erb
@@ -1,7 +1,7 @@
---
title: Helm Charts Repository
weight: 710
-last_reviewed_on: 2024-07-10
+last_reviewed_on: 2024-10-16
review_in: 3 months
---

2 changes: 1 addition & 1 deletion runbooks/source/leavers-guide.html.md.erb
@@ -1,7 +1,7 @@
---
title: Leavers Guide
weight: 9100
-last_reviewed_on: 2024-07-12
+last_reviewed_on: 2024-10-16
review_in: 3 months
---

2 changes: 1 addition & 1 deletion runbooks/source/monitor-eks-cluster.html.md.erb
@@ -1,7 +1,7 @@
---
title: Monitor EKS Cluster
weight: 70
-last_reviewed_on: 2024-04-10
+last_reviewed_on: 2024-10-16
review_in: 6 months
---

8 changes: 4 additions & 4 deletions runbooks/source/recycle-node.html.md.erb
@@ -1,13 +1,13 @@
---
title: Manually run recycle node command
weight: 250
-last_reviewed_on: 2024-07-10
+last_reviewed_on: 2024-10-10
review_in: 3 months
---

# Recycle-node

-The [recycle-node pipeline][recyclenode-pipeline-definition] runs every day on the `live` cluster, it executes the [cloud-platform cli][recycle-node-cli] command to replace the oldest worker node by:
+The [recycle-node pipeline][recyclenode-pipeline-definition] runs every day on the `live` cluster. It executes the [cloud-platform cli][recycle-node-cli] command to replace the oldest worker node by:

* Cordoning the oldest node
* Draining the node
@@ -19,7 +19,7 @@ To recycle the oldest node on the cluster in your current context:

cloud-platform cluster recycle-node --oldest

-To recycle a given node on the cluster in your current context
+To recycle a given node on the cluster in your current context:

cloud-platform cluster recycle-node --name ip-XXX.XX.XX.XX.eu-west-2.compute.internal

@@ -33,7 +33,7 @@ FATA[0000] node ip-172-20-53-167.eu-west-2.compute.internal is already cordoned,

cloud-platform cluster recycle-node --ignore-label

-The other optional flags are
+The other optional flags are:

```bash
--aws-access-key string aws access key to use
@@ -1,7 +1,7 @@
---
title: Revoke auth0 kubeconfig access token
weight: 275
-last_reviewed_on: 2024-07-10
+last_reviewed_on: 2024-10-16
review_in: 3 months
---

@@ -13,9 +13,9 @@ Use this runbook if we make changes to the Auth0 authorisation process and requi

GitHub is being used as an OIDC provider. Once you've logged in to GitHub, it provides an ID token (valid for 10 hours), which is a signed JWT containing your GitHub username and a list of teams you're in.

-To revoke the tokens you need MOJ organisation administrator access, if you are not a Github admin request some one in the team who are Github admin to do it for you.
+To revoke the tokens, you need MOJ organisation administrator access. If you are not a GitHub admin, ask someone in the team who is a GitHub admin to do it for you.

-Once you logged in as MOJ github Organization administrator, go into [settings](https://github.com/organizations/ministryofjustice/settings/profile), select developer settings and Oauth [Apps](https://github.com/organizations/ministryofjustice/settings/applications) and search for "MOJ Cloud Platforms Auth0 (prod)"
+Once you are logged in as an MOJ GitHub organisation administrator, go into [settings](https://github.com/organizations/ministryofjustice/settings/profile), select Developer settings and OAuth [Apps](https://github.com/organizations/ministryofjustice/settings/applications), and search for "MOJ Cloud Platforms Auth0 (prod)".

Click on the "Revoke all user tokens" button. This will force users to reauthenticate to get a new token.

@@ -57,7 +57,7 @@ $ terraform apply -target=module.kuberos

#### 3) Verifying changes

-In order to verify that the changes were successfully applied
+In order to verify that the changes were successfully applied:

- You can authenticate to the cluster (follow [user guide](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/getting-started/kubectl-config.html#authentication))
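For example, after re-authenticating with a freshly issued token, a simple read against the cluster confirms access (a sketch; use any namespace you can read):

```bash
# Any authenticated read confirms the new kubeconfig token works
kubectl get pods -n <your-namespace>
```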

10 changes: 5 additions & 5 deletions runbooks/source/scheduled-pr-reminders.html.md.erb
@@ -1,7 +1,7 @@
---
title: Scheduled PR Reminders
weight: 9101
-last_reviewed_on: 2024-07-10
+last_reviewed_on: 2024-10-10
review_in: 3 months
---

@@ -11,19 +11,19 @@ Scheduled reminders help the Cloud Platform focus on the most important review r

All reminders are created at the GitHub team level - for the Cloud Platform team, the team is `webops`

-To view all scheduled reminders for team webops;
+To view all scheduled reminders for team webops:

https://github.com/ministryofjustice > Teams > Webops > Settings > Scheduled Reminders

-There is currently 2 reminders setup;
+There are currently 2 reminders set up:

- **cloud-platform-notify**

-Report all open PRs for all cloud-platform-* repos every hour between 9am-5pm UTC Monday to Friday.
+Reports all open PRs for all cloud-platform-* repos every hour between 9am-5pm UTC Monday to Friday.

- **cloud-platform**

-Report all open PRs for the cloud-platform-environments and cloud-platform-infrastructure repos at 9am UTC Monday to Friday.
+Reports all open PRs for the cloud-platform-environments and cloud-platform-infrastructure repos at 9am UTC Monday to Friday.

### Steps required for new repositories

