
Commit

Updated the incident process to remove diagrams
DominicGriffin committed Oct 31, 2024
1 parent 219c074 commit 5ed8840
Showing 4 changed files with 5 additions and 11 deletions.
Binary file removed source/diagrams/escalation-process-p1-p3.jpeg
Binary file not shown.
Binary file removed source/diagrams/escalation-process-p4.jpeg
Binary file not shown.
Binary file removed source/diagrams/escalation-process-p4.png
Binary file not shown.
16 changes: 5 additions & 11 deletions source/incident_management/incident_process.html.md.erb
@@ -4,23 +4,14 @@ title: Incident Process

# So, you’re having an incident

This document is the GOV.UK PaaS team playbook for managing a technical incident. It covers tasks for the engineering lead and the communications (comms) lead.

## Team roles
**PaaS SREs:** Full time SREs who work on the service day to day. Absences should be staggered to reduce the amount of time where neither PaaS SREs are available.
**Managed Service SREs:** Wider pool of SREs supplied via a managed service contract. Respond to incidents when neither PaaS SRE is available. Manage P1-P3 incidents only, using Team Manual runbook. Escalate to GDS backstop engineers if unable to mitigate the incident using the runbooks.
**GDS Backstop Engineers:** GDS Civil Servants who previously worked on GOV.UK PaaS. These can be escalated to as a last resort, if an incident has not been resolved using runbooks or investigation. Contactable via slack channel #paas-escalation.
![Diagram of SRE capacity plan](/diagrams/sre-escalation-service-capacity.png)

## Incident process
### P1-P3 Process
![Diagram of escalation procedure for p1-p3 incidents](/diagrams/escalation-process-p1-p3.jpeg)

### P4 Process
We do not have an SLA for P4s, as P4s are outside of scope for the Managed Service SREs. If the engineer on support is from the wider Managed Service Pool (and no PaaS SREs are available) then P4s will be paused until a PaaS SRE is available to investigate and remediate.

![Diagram of escalation procedure for p1-p3 incidents](/diagrams/escalation-process-p4.jpeg)

This document is the GOV.UK PaaS team playbook for managing a technical incident. It covers tasks for the engineering lead and the communications (comms) lead.

## Engineering lead tasks

As the engineer on support, you become the engineering lead in an incident, and you’re responsible for declaring an incident. However, other engineers should help as necessary, especially in incidents lasting several hours or more.
@@ -42,6 +33,9 @@ If an incident is ongoing outside of office hours (i.e. an in-hours incident con
3. The #paas-incident channel has a bookmarked hangout link. Join this video call to communicate with the comms lead and talk through what you’re doing and what’s happening.
4. If you decide it’s not an incident after investigating further, you must resolve the incident in PagerDuty. If you are sure it is an incident, [agree on a priority](https://www.cloud.service.gov.uk/support-and-response-times/#response-times-for-services-in-production) for the incident with the comms lead. You can change this priority level later as more information emerges.

### P4 Process
We do not have an SLA for P4s, as P4s are outside of scope for the Managed Service SREs. If the engineer on support is from the wider Managed Service Pool (and no PaaS SREs are available) then P4s will be paused until a PaaS SRE is available to investigate and remediate.

## Communication lead tasks

You are not expected to be involved every time an alert goes off. PagerDuty will call the engineering lead in the event of an alert, and it is the engineering lead’s responsibility to triage and escalate to you as necessary.
