[Incident] 403 Errors on UofT and Temple Hubs #3137

Closed
5 tasks
jmunroe opened this issue Sep 14, 2023 · 4 comments
@jmunroe
Contributor

jmunroe commented Sep 14, 2023

Summary

Community reports of errors with the UofT hubs at https://jupyter.utoronto.ca/ and https://r.datatools.utoronto.ca/ like:

403 : Forbidden
Sorry, you are not currently authorized to use this hub. Please contact the hub administrator.
First report was just after midnight (Eastern) today. We are able to replicate the issue. Please give it a look, thanks!

Impact on users

Users cannot log into the JupyterHub.

Important information

Tasks and updates

  • Discuss and address incident, leaving comments below with updates
  • Incident has been dealt with or is over
  • Copy/paste the after-action report below and fill in relevant sections
  • Incident title is discoverable and accurate
  • All actionable items in report have linked GitHub Issues
After-action report template
# After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should take.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone}}.

- 2023-09-14 - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)


## What went wrong

_Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]
@jmunroe jmunroe self-assigned this Sep 14, 2023
@jmunroe jmunroe changed the title [Incident] 403 Errors on UofT Hub [Incident] 403 Errors on UofT and Temple Hubs Sep 14, 2023
@jmunroe
Contributor Author

jmunroe commented Sep 14, 2023

An upgrade of the oauthenticator package (15.1 -> 16) last night appears to have caused login issues across many hubs.

That change has now been rolled back.

This incident should now be resolved and students should be able to log in as before. Both UofT and Temple have been notified and the corresponding Freshdesk tickets are now closed.

@jmunroe
Contributor Author

jmunroe commented Sep 14, 2023

This was @jmunroe's first time going through the Incident Response process in the 'incident commander' role. I can confirm that the documentation given at https://team-compass.2i2c.org/projects/managed-hubs/incidents/ was sufficient for me to follow.

My one small hiccup was not seeing immediately how to create the Slack channel. The instructions referred to checking the box for "Create a dedicated Public Slack channel for this incident", and I either missed it or it wasn't there. I spent some time trying to find how to create the incident Slack channel from the PagerDuty web UI before recognizing that there was a button in the #pagerduty-notifications Slack widget tool.

It was clear to me that this constituted an "incident", so I felt I needed to shift focus to it as soon as I learned about it (via Freshdesk email notification). However, I was a bit hesitant to assume I should take on the 'Incident Commander' role. In lifeguarding/first aid contexts, the first person on the scene is automatically in that role until they are officially relieved of it, so I assumed it would be the same here.

@jmunroe
Contributor Author

jmunroe commented Sep 14, 2023

I appreciated @consideRatio being available and willing to assist with this outage even though he was scheduled for leave today.

The PR that caused the outage, #3118, looked like it affected many parts of our infrastructure. We should revisit our testing procedures to improve our likelihood of catching authentication-related errors.

Was the nature of the upgrade such that all hubs needed to be changed over together? I don't understand enough about the details of the z2jh and oauthenticator upgrades to know if they are intimately linked or could have been split into two separate changes.

For testing, I think we need to have at least one 'regular user' (non-privileged, non-admin, non-2i2c) account able to log in for the hubs. This is especially important for our education hubs, where we are often not using GitHub or Google but an education OAuth provider.

For UofT, I have a UTORid that I think would have caught the issue if I had tested it with those credentials. I don't know how we would have verified the Temple hub. My understanding is authentication was still working against Google but the problem was related to these other oauth providers and the changes in oauthenticator 16.0.
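
As a rough illustration of the testing idea above, here is a minimal sketch of an inventory that flags hubs with no regular-user account available for a login test. Everything in it is an assumption made for illustration: the `HubLoginCheck` class, the `hubs_missing_regular_login_test` helper, the `"institutional"` label, the placeholder Temple URL, and the coverage flags are hypothetical and do not describe our actual tooling or accounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HubLoginCheck:
    """One hub and whether a regular-user account exists to verify login."""
    hub_url: str
    auth_provider: str           # free-form label, e.g. "institutional" or "google"
    has_regular_test_user: bool  # a non-admin, non-2i2c account we can log in with


def hubs_missing_regular_login_test(checks):
    """Return the URLs of hubs that no regular-user login test can cover."""
    return [c.hub_url for c in checks if not c.has_regular_test_user]


# Illustrative inventory only: the flags are assumptions, and the Temple URL
# is a placeholder because the real one does not appear in this issue.
checks = [
    HubLoginCheck("https://jupyter.utoronto.ca/", "institutional", True),
    HubLoginCheck("https://r.datatools.utoronto.ca/", "institutional", True),
    HubLoginCheck("https://temple-hub.example.org/", "institutional", False),
]

for url in hubs_missing_regular_login_test(checks):
    print(f"No regular-user login test available for {url}")
```

A list like this would not test authentication by itself, but it could make gaps such as the Temple hub visible before an upgrade is rolled out.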

@jmunroe
Contributor Author

jmunroe commented Sep 14, 2023

@consideRatio (and a few others) are on leave today and tomorrow, so we should revisit this incident early next week to debrief and create an incident report.

@damianavila damianavila moved this to Waiting 🕛 in Sprint Board Sep 25, 2023
@damianavila damianavila moved this from Needs Shaping / Refinement to Waiting in DEPRECATED Engineering and Product Backlog Sep 25, 2023
@github-project-automation github-project-automation bot moved this from Waiting 🕛 to Done 🎉 in Sprint Board Jul 1, 2024