Commit

Merge branch 'main' into update-runbooks
folarin oyenuga committed Dec 6, 2024
2 parents 128417e + 85e7189 commit 10a34d3
Showing 29 changed files with 298 additions and 270 deletions.
38 changes: 0 additions & 38 deletions .github/workflows/auto-approve-pr.yml

This file was deleted.

2 changes: 1 addition & 1 deletion .github/workflows/dependency-review.yml
@@ -15,7 +15,7 @@ jobs:
- name: Checkout Repository
uses: actions/checkout@v4
- name: Dependency Review
-uses: actions/dependency-review-action@v2
+uses: actions/dependency-review-action@v4
with:
# Possible values: critical, high, moderate, low
fail-on-severity: critical
2 changes: 1 addition & 1 deletion .github/workflows/format-code.yml
@@ -9,6 +9,6 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
-- uses: ministryofjustice/github-actions/code-formatter@e08cbcac12ec9c09d867ab2b803d4ea1a87300ad # v18.2.4
+- uses: ministryofjustice/github-actions/code-formatter@ccf9e3a4a828df1ec741f6c8e6ed9d0acaef3490 # v18.5.0
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
4 changes: 2 additions & 2 deletions .github/workflows/link-checker.yml
@@ -13,15 +13,15 @@ jobs:
- uses: actions/checkout@v4

- name: Link Checker
-uses: lycheeverse/lychee-action@v1.9.1
+uses: lycheeverse/lychee-action@v2.1.0
with:
args: --verbose --no-progress **/*.md **/*.html **/*.erb --accept 200,429,403,400,301,302,401 --exclude-mail
env:
GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}

- name: Create Issue From File
if: env.lychee_exit_code != 0
-uses: peter-evans/create-issue-from-file@v3
+uses: peter-evans/create-issue-from-file@v5
with:
title: Link Checker Report
content-filepath: ./lychee/out.md
2 changes: 1 addition & 1 deletion .github/workflows/publish.yml
@@ -22,7 +22,7 @@ jobs:
apk update && apk add rsync
which rsync
- name: Deploy
-uses: JamesIves/github-pages-deploy-action@v4.3.3
+uses: JamesIves/github-pages-deploy-action@v4.7.1
with:
token: ${{ secrets.PUBLISHING_GIT_TOKEN }}
git-config-name: cloud-platform-moj
129 changes: 66 additions & 63 deletions architecture-decision-record/022-EKS.md

Large diffs are not rendered by default.

7 changes: 5 additions & 2 deletions architecture-decision-record/023-Logging.md
@@ -1,14 +1,17 @@
# 23 Logging

-Date: 02/06/2021
+Date: 11/11/2024

## Status

✅ Accepted

## Context

-Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (Saas hosted by AWS OpenSearch). This allows [service teams](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/access-logs.html#accessing-application-log-data) and Cloud Platform team to use Kibana's search and browse functionality, for the purpose of debug and resolving incidents. All pods' stdout get [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and ElasticSearch stored them for 30 days.
+> Cloud Platform's existing strategy for logs has been to **centralize** them in an ElasticSearch instance (Saas hosted by AWS OpenSearch).
+As of November 2024, we have migrated the logging service over to AWS OpenSearch, with ElasticSearch due for retirement (pending some decisions and actions on how to manage existing data retention on that cluster).
+Service teams can use OpenSearch's [search and browse functionality](https://app-logs.cloud-platform.service.justice.gov.uk/_dashboards/app/home#/) for the purposes of debugging and resolving incidents. All pods' stdout get [shipped using Fluentbit](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#application-log-collection-and-storage) and ElasticSearch stored them for 30 days. The full lifecycle policy configuration for OpenSearch can be viewed [here](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/account/resources/opensearch/ism-policy.json.tpl).

Concerns with existing ElasticSearch logging:

6 changes: 4 additions & 2 deletions architecture-decision-record/026-Managed-Prometheus.md
@@ -1,6 +1,6 @@
# 26 Managed Prometheus

-Date: 2021-10-08
+Date: 2024-11-11

## Status

@@ -67,7 +67,9 @@ We also need to address:

**Sharding**: We could split/shard the Prometheus instance: perhaps dividing into two - tenants and platform. Or if we did multi-cluster we could have one Prometheus instance per cluster. This appears relatively straightforward to do. There would be concern that however we split it, as we scale in the future we'll hit future scaling thresholds, where it will be necessary to change how to divide it into shards, so a bit of planning would be needed.

-**High Availability**: The recommended approach would be to run multiple instances of Prometheus configured the same, scraping the same endpoints independently. [Source](https://prometheus-operator.dev/docs/operator/high-availability/#prometheus) There is a `replicas` option to do this. However for HA we would also need to have a load balancer for the PromQL queries to the Prometheus API, to fail-over if the primary is unresponsive. And it's not clear how this works with duplicate alerts being sent to AlertManager. This doesn't feel like a very paved path, with Prometheus Operator [saying](https://prometheus-operator.dev/docs/operator/high-availability/) "We are currently implementing some of the groundwork to make this possible, and figuring out the best approach to do so, but it is definitely on the roadmap!" - Jan 2017, and not updated since.
+**High Availability**: We are now running Prometheus in HA mode [with 3 replicas](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring/pull/239). Keeping the findings below as we may have some additional elements of HA to consider in the future:
+
+> [Source](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/high-availability.md#prometheus) There is a `replicas` option to do this. However for HA we would also need to have a load balancer for the PromQL queries to the Prometheus API, to fail-over if the primary is unresponsive. And it's not clear how this works with duplicate alerts being sent to AlertManager. This doesn't feel like a very paved path, with Prometheus Operator [saying](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/high-availability.md) "We are currently implementing some of the groundwork to make this possible, and figuring out the best approach to do so, but it is definitely on the roadmap!" - Jan 2017, and not updated since.
**Managed Prometheus**: Using a managed service of prometheus, such as AMP, would address most of these concerns, and is evaluated in detail in the next section.

2 changes: 1 addition & 1 deletion runbooks/source/add-new-opa-policy.html.md.erb
@@ -1,7 +1,7 @@
---
title: Add a new OPA policy
weight: 9000
-last_reviewed_on: 2024-05-24
+last_reviewed_on: 2024-11-25
review_in: 6 months
---

2 changes: 1 addition & 1 deletion runbooks/source/auth0-rotation.html.md.erb
@@ -1,7 +1,7 @@
---
title: Credentials rotation for auth0 apps
weight: 68
-last_reviewed_on: 2024-05-24
+last_reviewed_on: 2024-11-25
review_in: 6 months
---

2 changes: 1 addition & 1 deletion runbooks/source/bastion-node.html.md.erb
@@ -1,7 +1,7 @@
---
title: Create and access bastion node
weight: 97
-last_reviewed_on: 2024-05-24
+last_reviewed_on: 2024-11-25
review_in: 6 months
---
