
Incident Response

Hello, friendly neighbourhood L3 responder! The engineering team is extremely sorry that the L1 and L2 responders are AWOL.

Things to check:

Resolving Issues:

Google Cloud Shell

Google helpfully provides a web-based shell that you can use from anywhere for incident response. Once it is open, fetch credentials for the current cluster:

$ gcloud container clusters get-credentials "$(< current-cluster)" --zone=us-west1 --project=balmy-ground-195100
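The `"$(< current-cluster)"` part of the command above is bash syntax that substitutes the contents of a file named `current-cluster`. If that file is missing or empty, gcloud receives an empty cluster name. A minimal sketch of a pre-check, assuming the file lives in the shell's working directory; `read_cluster_name` is a hypothetical helper, not part of gcloud:

```shell
# Hypothetical helper: print the cluster name stored in a file,
# or fail with a message if the file is missing or empty.
# Defaults to ./current-cluster (assumed location).
read_cluster_name() {
  local file="${1:-current-cluster}"
  if [ ! -s "$file" ]; then
    echo "error: '$file' is missing or empty" >&2
    return 1
  fi
  # Remove stray whitespace and newlines so the name can be passed to gcloud as-is.
  tr -d '[:space:]' < "$file"
  echo
}
```

Running `read_cluster_name` before the gcloud command shows which cluster the credentials will be fetched for.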

Checking on the Kubernetes pods:

Run the following command at the Cloud Shell prompt:

$ kubectl get pods

You should see something like:

https://www.dropbox.com/s/nt9vd579eiyq3kh/Screenshot%202018-07-17%2016.28.36.png?dl=0

Interpreting the 'Status' column
  • If almost everything is 'Running', Kubernetes thinks the system is healthy. If we're still down, try restarting the pods as described in the section below.
  • Anything else, such as Terminating or CrashLoopBackOff, indicates that Kubernetes knows the system is unhealthy and is working to fix it.
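When there are many pods, the Status column can be condensed into a count per status. A minimal sketch, assuming the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE); `summarize_status` is a hypothetical helper name, not a kubectl feature:

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print one line per status, most common first, as "<count> <status>".
summarize_status() {
  # Column 3 is STATUS in the default output (assumption: default columns).
  awk '{ count[$3]++ } END { for (s in count) print count[s], s }' | sort -rn
}

# Usage at the Cloud Shell prompt:
#   kubectl get pods --no-headers | summarize_status
```

If 'Running' dominates the output but the system is still down, that matches the first case above; anything else near the top of the list matches the second.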

Restarting Kubernetes pods:

Run the following command:

$ curl -L https://gist.githubusercontent.com/IanConnolly/5cc6625a54ad7b2c49eff61b52729602/raw/21d7d1af8b2c933669b1cd585af120b31281b96c/force-system-restart | bash