From cebfafb8949c8166bac02e3486974bd473c2ea39 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Fri, 12 Jan 2024 17:49:45 +0000
Subject: [PATCH] Commit changes made by code formatters
---
 .../006-Use-github-as-user-directory.md     |  2 --
 .../012-One-cluster-for-dev-staging-prod.md | 14 ++++----
 .../021-Multi-cluster.md                    | 36 +++++++++----------
 architecture-decision-record/023-Logging.md | 22 ++++++------
 .../034-EKS-Fargate.md                      |  8 ++---
 runbooks/source/how-we-work.html.md.erb     |  2 +-
 .../upgrade-terraform-version.html.md.erb   |  4 +--
 7 files changed, 43 insertions(+), 45 deletions(-)
diff --git a/architecture-decision-record/006-Use-github-as-user-directory.md b/architecture-decision-record/006-Use-github-as-user-directory.md
index 6f389e34..a34d1b8b 100644
--- a/architecture-decision-record/006-Use-github-as-user-directory.md
+++ b/architecture-decision-record/006-Use-github-as-user-directory.md
@@ -18,14 +18,12 @@ We are proposing that we aim for a "single sign on" approach where users can use
 The current most complete source of this information for people who will be the first users of the cloud platform is GitHub.
 So our proposal is to use GitHub as our initial user directory - authentication for the new services that we are building will be through GitHub.
-
 ## Decision
 We will use GitHub as the identify provider for the cloud platform.
 We will design and build the new cloud platform with the assumption that users will login to all components using a single GitHub id.
-
 ## Consequences
 We will define users and groups in GitHub and use GitHub's integration tools to provide access to other tools that require authentication.
diff --git a/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md b/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md
index 47665a7c..19bc6721 100644
--- a/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md
+++ b/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md
@@ -22,10 +22,10 @@ After consideration of the pros and cons of each approach we went with one clust
 Some important reasons behind this move were:
-* A single k8s cluster can be made powerful enough to run all of our workloads
-* Managing a single cluster keeps our operational overhead and costs to a minimum.
-* Namespaces and RBAC keep different workloads isolated from each other.
-* It would be very hard to keep multiple clusters (dev/staging/prod) from becoming too different to be representative environments
+- A single k8s cluster can be made powerful enough to run all of our workloads
+- Managing a single cluster keeps our operational overhead and costs to a minimum.
+- Namespaces and RBAC keep different workloads isolated from each other.
+- It would be very hard to keep multiple clusters (dev/staging/prod) from becoming too different to be representative environments
 To clarify the last point; to be useful, a development cluster must be as similar as possible to the production cluster. However, given multiple clusters, with different security and other constraints, some 'drift' is inevitable - e.g. the development cluster might be upgraded to a newer kubernetes version before staging and production, or it could have different connectivity into private networks, or different performance constraints from the production cluster.
@@ -39,6 +39,6 @@ If namespace segregation is not sufficient for this, then the whole cloud platfo
 Having a single cluster to maintain works well for us.
-* Service teams know that their development environments accurately reflect the production environments they will eventually create
-* There is no duplication of effort, maintaining multiple, slightly different clusters
-* All services are managed centrally (e.g. ingress controller, centralised log collection via fluentd, centralised monitoring with Prometheus, cluster security policies)
+- Service teams know that their development environments accurately reflect the production environments they will eventually create
+- There is no duplication of effort, maintaining multiple, slightly different clusters
+- All services are managed centrally (e.g. ingress controller, centralised log collection via fluentd, centralised monitoring with Prometheus, cluster security policies)
diff --git a/architecture-decision-record/021-Multi-cluster.md b/architecture-decision-record/021-Multi-cluster.md
index 5911da09..b7b46621 100644
--- a/architecture-decision-record/021-Multi-cluster.md
+++ b/architecture-decision-record/021-Multi-cluster.md
@@ -8,27 +8,27 @@ Date: 2021-05-11
 ## What’s proposed
-We host user apps across *more than one* Kubernetes cluster. Apps could be moved between clusters without too much disruption. Each cluster *may* be further isolated by placing them in separate VPCs or separate AWS accounts.
+We host user apps across _more than one_ Kubernetes cluster. Apps could be moved between clusters without too much disruption. Each cluster _may_ be further isolated by placing them in separate VPCs or separate AWS accounts.
 ## Context
 Service teams' apps currently run on [one Kubernetes cluster](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/012-One-cluster-for-dev-staging-prod.md). That includes their dev/staging/prod environments - they are not split off.
The key reasoning was:
-* Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments
-* Maintaining clusters for each environment is a cost in effort
-* You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod.
+- Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments
+- Maintaining clusters for each environment is a cost in effort
+- You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod.
 (We also have clusters for other purposes: a 'management' cluster for Cloud Platform team's CI/CD and ephemeral 'test' clusters for the Cloud Platform team to test changes to the cluster.)
 However we have seen some problems with using one cluster, and advantages to moving to multi-cluster:
-* Scaling limits
-* Single point of failure
-* Derisk upgrading of k8s
-* Reduce blast radius for security
-* Reduce blast radius of accidental deletion
-* Pre-prod cluster
-* Cattle not pets
+- Scaling limits
+- Single point of failure
+- Derisk upgrading of k8s
+- Reduce blast radius for security
+- Reduce blast radius of accidental deletion
+- Pre-prod cluster
+- Cattle not pets
 ### Scaling limits
@@ -40,11 +40,11 @@ Running everything on a single cluster is a 'single point of failure', which is
 Several elements in the cluster are a single point of failure:
-* ingress (incidents: [1](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-10-06-09-07-intermittent-quot-micro-downtimes-quot-on-various-services-using-dedicated-ingress-controllers) [2](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-04-15-10-58-nginx-tls))
-* external-dns
-* cert manager
-* kiam
-* OPA ([incident](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-02-25-10-58))
+- ingress (incidents: [1](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-10-06-09-07-intermittent-quot-micro-downtimes-quot-on-various-services-using-dedicated-ingress-controllers) [2](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-04-15-10-58-nginx-tls))
+- external-dns
+- cert manager
+- kiam
+- OPA ([incident](https://runbooks.cloud-platform.service.justice.gov.uk/incident-log.html#incident-on-2020-02-25-10-58))
 ### Derisk upgrading of k8s
@@ -76,8 +76,8 @@ Multi-cluster will allow us to put pre-prod environments on a separate cluster t
 If we were to create a fresh cluster, and an app is moved onto it, then there are a lot of impacts:
-* **Kubecfg** - a fresh cluster will have a fresh kubernetes key, which invalidates everyone's kubecfg. This means that service teams will need to obtain a fresh token and add it to their app's CI/CD config and every dev will need to refresh their command-line kubecfg for running kubectl.
-* **IP Addresses** - unless the load balancer instance and elastic IPs are reused, it'll have fresh IP addresses. This will particularly affect devices on mobile networks that accessing our CP-hosted apps, because they often cache the DNS longer than the TTL. And if CP-hosted apps access third party systems and have arranged for our egress IP to be allow-listed in their firewall, then they will not work until that's updated.
+- **Kubecfg** - a fresh cluster will have a fresh kubernetes key, which invalidates everyone's kubecfg. This means that service teams will need to obtain a fresh token and add it to their app's CI/CD config and every dev will need to refresh their command-line kubecfg for running kubectl.
+- **IP Addresses** - unless the load balancer instance and elastic IPs are reused, it'll have fresh IP addresses.
This will particularly affect devices on mobile networks that accessing our CP-hosted apps, because they often cache the DNS longer than the TTL. And if CP-hosted apps access third party systems and have arranged for our egress IP to be allow-listed in their firewall, then they will not work until that's updated.
 ## Steps to achieve it
diff --git a/architecture-decision-record/023-Logging.md b/architecture-decision-record/023-Logging.md
index 3d76e3fb..1eb16149 100644
--- a/architecture-decision-record/023-Logging.md
+++ b/architecture-decision-record/023-Logging.md
@@ -12,10 +12,10 @@ Cloud Platform's existing strategy for logs has been to **centralize** them in a
 Concerns with existing ElasticSearch logging:
-* ElasticSearch costs a lot to run - it uses a lot of memory (for lots of things, although it is disk first for the documents and indexes)
-* CP team doesn't need the power of ElasticSearch very often - rather than use Kibana to look at logs, the CP team mostly uses `kubectl logs`
-* Service teams have access to other teams' logs, which is a concern should personal information be inadvertantly logged
-* Fluentd + AWS OpenSearch combination has no flexibility to parse/define the JSON structure of logs, so all our teams right now have to contend with grabbing the contents of a single log field and parsing it outside ES
+- ElasticSearch costs a lot to run - it uses a lot of memory (for lots of things, although it is disk first for the documents and indexes)
+- CP team doesn't need the power of ElasticSearch very often - rather than use Kibana to look at logs, the CP team mostly uses `kubectl logs`
+- Service teams have access to other teams' logs, which is a concern should personal information be inadvertantly logged
+- Fluentd + AWS OpenSearch combination has no flexibility to parse/define the JSON structure of logs, so all our teams right now have to contend with grabbing the contents of a single log field and parsing it outside ES
 With these concerns in
mind, and the [migration to EKS](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/022-EKS.md) meaning we'd need to reimplement log shipping, we reevaluate this strategy.
@@ -37,11 +37,11 @@ Rather than centralized logging in ES, we'll evaluate different logging solution
 **AWS services for logging** - with the cluster now in EKS, it wouldn't be too much of a leap to centralizing logs in CloudWatch and make use of the AWS managed tools. One one hand it's proprietary to AWS, so adds cost of switching away. But it might be preferable to the cost of running ES, and related tools like GuardDuty and Security Hub, with use across Modernization Platform, is attractive.
-### Observing apps**
+### Observing apps\*\*
-* Loki - seems a good fit. For occasional searches, a disk-based index seems more appropriate - higher latency than memory, but much lower cost to run. (In comparison, ES describes itself as primarily disk based indexes, but it *requires* heavy use of memory.) Could setup an instance per team. Need to evaluate how we'd integrate it, and usability.
-* CloudWatch Logs - possible and low operational overhead - needs further evaluation.
-* Sentry - Some teams have beeing using Sentry for logs, but [Sentry says themself it is better suited to error management](https://sentry.io/vs/logging/), which is a narrower benefit than full logging.
+- Loki - seems a good fit. For occasional searches, a disk-based index seems more appropriate - higher latency than memory, but much lower cost to run. (In comparison, ES describes itself as primarily disk based indexes, but it _requires_ heavy use of memory.) Could setup an instance per team. Need to evaluate how we'd integrate it, and usability.
+- CloudWatch Logs - possible and low operational overhead - needs further evaluation.
+- Sentry - Some teams have beeing using Sentry for logs, but [Sentry says themself it is better suited to error management](https://sentry.io/vs/logging/), which is a narrower benefit than full logging.
 ### Observing the platform
@@ -53,9 +53,9 @@ TBD
 ### Security
-* MLAP was designed for this, but it is stalled, so probably best to manage it ourselves.
-* ElasticSearch does have open source plugins for SIEM scanning. And it offers quick searching needed during a live incident. Maybe we could reduce the amount of data we put in it. But fundamentally it is an expensive option, to get both live searching and long retention period.
-* AWS-native solution using GuardDuty and CloudWatch Logs may provide something analogous.
+- MLAP was designed for this, but it is stalled, so probably best to manage it ourselves.
+- ElasticSearch does have open source plugins for SIEM scanning. And it offers quick searching needed during a live incident. Maybe we could reduce the amount of data we put in it. But fundamentally it is an expensive option, to get both live searching and long retention period.
+- AWS-native solution using GuardDuty and CloudWatch Logs may provide something analogous.
 ## Next steps
diff --git a/architecture-decision-record/034-EKS-Fargate.md b/architecture-decision-record/034-EKS-Fargate.md
index 6e1b829e..21f306d6 100644
--- a/architecture-decision-record/034-EKS-Fargate.md
+++ b/architecture-decision-record/034-EKS-Fargate.md
@@ -14,8 +14,8 @@ Move from EKS managed nodes to EKS Fargate.
 This is really attractive because:
-* to reduce our operational overhead
-* improve security isolation between pods (it uses Firecracker, so we can stop worrying about an attacker managing to escape a container).
+- to reduce our operational overhead
+- improve security isolation between pods (it uses Firecracker, so we can stop worrying about an attacker managing to escape a container).
 However there’s plenty of things we’d need to tackle, to achieve this (copied from [ADR022 EKS - Fargate considerations](https://github.com/ministryofjustice/cloud-platform/blob/main/architecture-decision-record/022-EKS.md#future-fargate-considerations)):
@@ -23,8 +23,8 @@ However there’s plenty of things we’d need to tackle, to achieve this (copie
 **Daemonset functionality** - needs replacement:
-* fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch.
-* prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network
+- fluent-bit - currently used for log shipping to ElasticSearch. AWS provides a managed version of [Fluent Bit on Fargate](https://aws.amazon.com/blogs/containers/fluent-bit-for-amazon-eks-on-aws-fargate-is-here/) which can be configured to ship logs to ElasticSearch.
+- prometheus-node-exporter - currently used to export node metrics to prometheus. In Fargate the node itself is managed by AWS and therefore hidden. However we can [collect some useful metrics about pods running in Fargate from scraping cAdvisor](https://aws.amazon.com/blogs/containers/monitoring-amazon-eks-on-aws-fargate-using-prometheus-and-grafana/), including on CPU, memory, disk and network
 **No EBS support** - Prometheus will run still in a managed node group. Likely other workloads too to consider.
diff --git a/runbooks/source/how-we-work.html.md.erb b/runbooks/source/how-we-work.html.md.erb
index 345f5e1a..d1a24254 100644
--- a/runbooks/source/how-we-work.html.md.erb
+++ b/runbooks/source/how-we-work.html.md.erb
@@ -116,7 +116,7 @@ Instead, when not answering queries and reviewing PRs, the Hammer should work on
 Most of our user-facing documentation is in the [user guide], and documentation for the team is in the [runbooks] site.
-There are also a lot of important `README.md` files like [this one](https://github.com/ministryofjustice/cloud-platform#ministry-of-justice-cloud-platform-master-repo), especially for our terraform modules.
+There are also a lot of important `README.md` files like [this one](https://github.com/ministryofjustice/cloud-platform#ministry-of-justice-cloud-platform-master-repo), especially for our terraform modules.
 We also have code samples like [this](https://github.com/ministryofjustice/cloud-platform-terraform-rds-instance/blob/main/examples/rds-postgresql.tf) for each of our terraform modules.
 It is important to keep all of this up to date as the underlying code changes, so please remember to factor this in when estimating and working on tickets.
diff --git a/runbooks/source/upgrade-terraform-version.html.md.erb b/runbooks/source/upgrade-terraform-version.html.md.erb
index 8dfddc53..03c15c20 100644
--- a/runbooks/source/upgrade-terraform-version.html.md.erb
+++ b/runbooks/source/upgrade-terraform-version.html.md.erb
@@ -118,8 +118,8 @@ When all namespaces in the cloud-platform-environments repository are using the
 - [Remove](https://github.com/ministryofjustice/cloud-platform-environments/commit/b11b0372fe71289e51739395664355014df0e655) the conditional logic in the apply library.
 ### Infrastructure state files
-The Infrastructure state we have in the Cloud Platform is structured in a tree related to its dependency,
-so for example, the [components](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components) state (in the output below) relies heavily on the directory above and so on.
+The Infrastructure state we have in the Cloud Platform is structured in a tree related to its dependency,
+so for example, the [components](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components) state (in the output below) relies heavily on the directory above and so on.
 Here is a snapshot of how our directory looks but this is likely to change:
 ```