Commit

Remove trailing whitespace.
ryanlovett committed Aug 9, 2024
1 parent 26b7f8a commit 7085e9e
Showing 17 changed files with 24 additions and 24 deletions.
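The commit does not record how the trailing whitespace was removed. As a rough illustration only, a cleanup of this kind could be sketched in Python as below; the function names `strip_trailing_whitespace` and `clean_file` are hypothetical and not taken from the repository.

```python
from pathlib import Path


def strip_trailing_whitespace(text: str) -> str:
    """Drop spaces and tabs at the end of each line, keeping line endings."""
    cleaned = []
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")      # line content without its line ending
        newline = line[len(body):]      # the original line ending, if any
        cleaned.append(body.rstrip(" \t") + newline)
    return "".join(cleaned)


def clean_file(path: Path) -> bool:
    """Rewrite the file only if something changed; return True when it did."""
    original = path.read_text(encoding="utf-8")
    cleaned = strip_trailing_whitespace(original)
    if cleaned != original:
        path.write_text(cleaned, encoding="utf-8")
        return True
    return False
```

Under these assumptions, calling `clean_file(p)` for each `p` in `Path("docs/incidents").glob("*.qmd")` would produce one-line changes like the ones shown in the hunks of this commit.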
2 changes: 1 addition & 1 deletion docs/incidents/2017-02-24-autoscaler-incident.qmd
@@ -1,6 +1,6 @@
---
title: Custom Autoscaler gone haywire
-date: 2017-02-24
+date: 2017-02-24
---

## Summary ##
2 changes: 1 addition & 1 deletion docs/incidents/2017-03-06-helm-config-image-mismatch.qmd
@@ -1,6 +1,6 @@
---
title: Non-matching hub image tags cause downtime
-date: 2017-03-06
+date: 2017-03-06
---

## Summary ##
2 changes: 1 addition & 1 deletion docs/incidents/2017-03-20-too-many-volumes.qmd
@@ -1,6 +1,6 @@
---
title: Too many volumes per disk leave students stuck
-date: 2017-03-20
+date: 2017-03-20
---

## Summary ##
2 changes: 1 addition & 1 deletion docs/incidents/2017-03-23-kernel-deaths-incident.qmd
@@ -1,6 +1,6 @@
---
title: Weird upstream ipython bug kills kernels
-date: 2017-03-23
+date: 2017-03-23
---

## Summary ##
2 changes: 1 addition & 1 deletion docs/incidents/2017-04-03-cluster-full-incident.qmd
@@ -1,6 +1,6 @@
---
title: Custom autoscaler does not scale up when it should
-date: 2017-04-03
+date: 2017-04-03
---

## Summary
2 changes: 1 addition & 1 deletion docs/incidents/2017-05-09-gce-billing.qmd
@@ -1,6 +1,6 @@
---
title: Oops we forgot to pay the bill
-date: 2017-05-09
+date: 2017-05-09
---

## Summary
2 changes: 1 addition & 1 deletion docs/incidents/2017-10-10-hung-nodes.qmd
@@ -1,6 +1,6 @@
---
title: Docker dies on a few Azure nodes
-date: 2017-10-10
+date: 2017-10-10
---

## Summary
2 changes: 1 addition & 1 deletion docs/incidents/2017-10-19-course-subscription-canceled.qmd
@@ -1,6 +1,6 @@
---
title: Billing confusion with Azure portal causes summer hub to be lost
-date: 2017-10-19
+date: 2017-10-19
---

## Summary
2 changes: 1 addition & 1 deletion docs/incidents/2018-01-25-helm-chart-upgrade.qmd
@@ -1,6 +1,6 @@
---
title: Accidental merge to prod brings things down
-date: 2018-01-25
+date: 2018-01-25
---

## Summary
2 changes: 1 addition & 1 deletion docs/incidents/2018-01-26-hub-slow-startup.qmd
@@ -1,6 +1,6 @@
---
title: Hub starts up very slow, causing outage for users
-date: 2018-01-26
+date: 2018-01-26
---

## Summary
2 changes: 1 addition & 1 deletion docs/incidents/2018-02-06-hub-db-dir.qmd
@@ -1,6 +1,6 @@
---
title: Azure PD refuses to detach, causing downtime for data100
-date: 2018-02-06
+date: 2018-02-06
---

## Summary
2 changes: 1 addition & 1 deletion docs/incidents/2018-02-28-hung-node.qmd
@@ -1,6 +1,6 @@
---
title: A node hangs, causing a subset of users to report issues
-date: 2018-02-28
+date: 2018-02-28
---

## Summary
2 changes: 1 addition & 1 deletion docs/incidents/2018-06-11-course-subscription-canceled.qmd
@@ -1,6 +1,6 @@
---
title: Azure billing issue causes downtime
-date: 2018-06-11
+date: 2018-06-11
---

## Summary
2 changes: 1 addition & 1 deletion docs/incidents/2019-02-25-k8s-api-server-down.qmd
@@ -1,6 +1,6 @@
---
title: Azure Kubernetes API Server outage causes downtime
-date: 2019-02-25
+date: 2019-02-25
---

## Summary
2 changes: 1 addition & 1 deletion docs/incidents/2019-05-01-service-account-leak.qmd
@@ -1,6 +1,6 @@
---
title: Service Account key leak incident
-date: 2019-05-01
+date: 2019-05-01
---

## Summary
@@ -7,7 +7,7 @@ date: 2022-01-20

[PR 1](https://github.com/berkeley-dsep-infra/datahub/pull/3161) and [PR 2](https://github.com/berkeley-dsep-infra/datahub/pull/3164/commits/a3fc71d5a68b030cda91029b5dbb6c01c0eec8fe) were merged to prod between 2 AM and 2.30 AM PST on 1/20. Difference due to the commits can be viewed [here](https://github.com/berkeley-dsep-infra/datahub/pull/3151/files#diff-72ab2727eb8dffad68933fd8e624ef3126cc0a107685c3f0e16fcee62fc77c76)

-Due to these changes, image rebuild happened which broke multiple hubs which used that image including Datahub, ISchool, R, Data 100 and Data 140 hubs.
+Due to these changes, image rebuild happened which broke multiple hubs which used that image including Datahub, ISchool, R, Data 100 and Data 140 hubs.

One of the dependencies highlighted as part of the image build had an upgrade which resulted in R hub throwing 505 error and Data 100/140 hub throwing "Error starting Kernel". [Yuvi to fill in the right technical information]

@@ -21,7 +21,7 @@ Quick summary of the problem. Update this section as we learn more, answering:
- what went wrong and how we fixed it.
-->

-- R Hub was not accessible for about 6 hours. Issue affected 10+ Stat 20 GSIs planning for their first class of the semester (catering to the needs of 600+ students). Hub went down for few minutes during the instruction.
+- R Hub was not accessible for about 6 hours. Issue affected 10+ Stat 20 GSIs planning for their first class of the semester (catering to the needs of 600+ students). Hub went down for few minutes during the instruction.
- Prob 140 hub was not available till 12.15 AM PST
- Data 100 hub was not available till 12.33 AM. Thankfully, assignments were not due till Friday (1/21)
- Few users in Ischool were affected as they could not access R Studio
@@ -37,7 +37,7 @@ Quick summary of the problem. Update this section as we learn more, answering:

### {{ 06:10 }}

-Andrew Bray (Stat 20 instructor) raised a [github issue](https://github.com/berkeley-dsep-infra/datahub/issues/3166) around 5.45 AM PST.
+Andrew Bray (Stat 20 instructor) raised a [github issue](https://github.com/berkeley-dsep-infra/datahub/issues/3166) around 5.45 AM PST.

### {{ 07:45 }}

@@ -74,13 +74,13 @@ They should focus on the knowledge we've gained and any improvements we should t

Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
-Action items.
+Action items.

## Where we got lucky

These are good things that happened to us but not because we had planned for them.

-- Yuvi was awake at the time when issue was reported and was able to fix the issues immediately.
+- Yuvi was awake at the time when issue was reported and was able to fix the issues immediately.
- Classes using hubs were not completely affected due to this outage (Data 100 did not have assignments due till 1/21 and Stat 20 had few mins of outage during instruction)

## Action items
@@ -91,7 +91,7 @@ These are only sample subheadings. Every action item should have a GitHub issue
### Process/Policy improvements

1. {{[Develop manual testing process](https://github.com/berkeley-dsep-infra/datahub/issues/2953) whenever a PR gets merged to staging of the major hubs (till automated test suites are written)}} [link to github issue](https://github.com/berkeley-dsep-infra/datahub/issues/2953)]
-2. Develop a policy around when to create a new hub and what type of changes get deployed to Datahub!
+2. Develop a policy around when to create a new hub and what type of changes get deployed to Datahub!

### Documentation improvements

6 changes: 3 additions & 3 deletions docs/incidents/2024-core-node-incidents.qmd
@@ -15,7 +15,7 @@ We have spent much time working to debug and track this, including with our frie

After some back and forth w/the upstream maintainers, we received a [forked version](https://github.com/berkeley-dsep-infra/datahub/pull/5501) of the proxy to test.

-During this testing, we triggered some user-facing downtime, as well as the proxy itself crashing and causing small outages.
+During this testing, we triggered some user-facing downtime, as well as the proxy itself crashing and causing small outages.

Another (unrelated) issue that impacted users was that [GKE](https://cloud.google.com/kubernetes-engine) was autoscaling our core pool (where the hub and proxy pods run) node to zero. Since it takes about 10-15m for a new node to spin up, all hubs were inaccessible until the new node was deployed.

@@ -59,7 +59,7 @@ proxy ram 800Mi (steady)
spike on proxy — cpu 181%, mem 1.06Gi --> 1.86Gi

### 16:05:53
-chp healthz readiness probe failure
+chp healthz readiness probe failure

### 16:05:56
chp/javascript runs out of heap “Ineffective mark-compacts near heap limit Allocation failed”
@@ -95,7 +95,7 @@ chp restarts (no heap error)
5.7K 503 errors

### 16:54:15 - 17:15:31
-300 users (slowly decreasing), 3x chp “uncaught exception: write EPIPE”, intermittent 503 errors in spikes of 30, 60, 150, hub latency 2.5sec
+300 users (slowly decreasing), 3x chp “uncaught exception: write EPIPE”, intermittent 503 errors in spikes of 30, 60, 150, hub latency 2.5sec

### 18:47:19 - 18:58:10
~120 users (constant), 3x chp “uncaught exception: write EPIPE”, intermittent 503 errors in spikes of 30, 60, hub latency 3sec
