diff --git a/runbooks/makefile b/runbooks/makefile
index 5ea2732f..fcc924d4 100644
--- a/runbooks/makefile
+++ b/runbooks/makefile
@@ -1,4 +1,4 @@
-IMAGE := ministryofjustice/tech-docs-github-pages-publisher:v2
+IMAGE := ministryofjustice/tech-docs-github-pages-publisher:v3

 # Use this to run a local instance of the documentation site, while editing
 .PHONY: preview
diff --git a/runbooks/source/incident-log.html.md.erb b/runbooks/source/incident-log.html.md.erb
index 09479a61..bbad7f76 100644
--- a/runbooks/source/incident-log.html.md.erb
+++ b/runbooks/source/incident-log.html.md.erb
@@ -9,9 +9,9 @@ weight: 45

 ## Q3 2023 (July-September)

-- **Mean Time to Repair**: 17h 45m
+- **Mean Time to Repair**: 10h 55m

-- **Mean Time to Resolve**: 31h 21m
+- **Mean Time to Resolve**: 19h 21m

 ### Incident on 2023-09-18 15:12 - Lack of Disk space on nodes

@@ -21,9 +21,9 @@ weight: 45
   - Repaired: 2023-09-18 17:54
   - Resolved 2023-09-20 19:18

-- **Time to repair**: 1h 30m
+- **Time to repair**: 4h 12m

-- **Time to resolve**: 53h 36m
+- **Time to resolve**: 35h 36m

 - **Identified**: User reported that they are seeing [ImagePull errors](https://mojdt.slack.com/archives/C57UPMZLY/p1695042194935169) no space left on device error

@@ -66,9 +66,9 @@ weight: 45
   - Repaired: 2023-08-10 12:28
   - Resolved 2023-08-10 14:47

-- **Time to repair**: 63h 14m
+- **Time to repair**: 33h 14m

-- **Time to resolve**: 65h 33m
+- **Time to resolve**: 35h 33m

 - **Identified**: Users reported in #ask-cloud-platform that they are seeing long periods of missing logs in Kibana.

@@ -96,7 +96,7 @@ weight: 45
 - Push notifications from logging clusters to #lower-priority-alerts [#4704](https://github.com/ministryofjustice/cloud-platform/issues/4704)
 - Add integration test to check that logs are being sent to the logging cluster

-### Incident on 2023-07-25 15:21 Prometheus on live cluster DOWN
+### Incident on 2023-07-25 15:21 - Prometheus on live cluster DOWN

 - **Key events**
   - First detected: 2023-07-25 14:05
@@ -168,11 +168,11 @@ weight: 45

 ## Q2 2023 (April-June)

-- **Mean Time to Repair**:
+- **Mean Time to Repair**: 0h 55m

-- **Mean Time to Resolve**:
+- **Mean Time to Resolve**: 0h 55m

-### Incident on 2023-06-06 11:00 - User services down during recyle of nodes
+### Incident on 2023-06-06 11:00 - User services down

 - **Key events**
   - First detected: 2023-06-06 10:26