[Incident] Kernel restarts and internal gateway errors for U. Texas hub #1021
Comments
@jameshowison - if you like, feel free to add context about the problems you're seeing here. I opened this issue so that we could track this as an ongoing incident because it sounds more complex than expected.
I think those peak times correspond with my class (which is 11-12:30 Tues/Thur central time). The only approach to replicating is to do what I'm asking the class to do:
Just noting that I just tried this and was able to run all of those commands without an issue. So maybe this is an error that only pops up when a lot of users are doing it at once?
Yes, the issue is definitely transient. Thing is that I don't know what to do once faced with it. If I was doing this on a server directly I'd be looking at logs, but I don't know if/how to access the logs. Is there some way to centralize logs for review, or could I note username and time for later review? I'm trying to do a screen share with a student that has it frequently (including at non-peak times and with a totally new instance from a new github username). That makes me wonder whether it is a client browser issue - is that possible? I'm also going to try to give people instructions for doing queries via IPython SQL magics, in case this is a xeus-sql issue (I think it's only happened in the xeus-sql kernels, but we haven't used ipython kernels yet).
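For reference, a minimal sketch of that IPython SQL-magic fallback, assuming the ipython-sql extension is available on the hub image; the connection URL and table name below are placeholders, not the hub's actual values:

```python
# Load the ipython-sql extension and point it at the sidecar database.
# The connection URL is a guess at the sidecar setup; adjust as needed.
%load_ext sql
%sql postgresql://postgres@localhost:5432/postgres

# Then, in its own cell, the %%sql cell magic runs plain SQL:
%%sql
SELECT *
FROM some_table   -- hypothetical table name
LIMIT 100;
```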
I just did a screen share with the user on the hub with the github username. I tried using the built-in terminal to find logs, but no luck. Googling suggested
I was able to use the admin interface to connect to the server for
I'm sorry you're running into this, @jameshowison - definitely frustrating. Thank you for helping us debug as we go along! Notebook logs are written to I think if we could actually get the value of the
On the postgresql container side, I can see that there's an error message saying 'FATAL: terminating connection due to administrator command' (query), and its frequency looks like this: Looking around, it looks like this might be caused by an external process sending SIGTERM to the database (https://www.postgresql.org/message-id/4564.1284559661@sss.pgh.pa.us). In our case, that might be because postgres is hitting a memory limit. I'm going to bump the limit up and see if that helps - earlier we had just bumped up the limit for the data the kernel can handle, but now I'm going to do that for the database itself.
Increase postgresql limits by a bit - it is possible that the kernel deaths are caused by the postgres container hitting its memory limit and dying. The graph in 2i2c-org#1021 (comment) makes it look like it isn't hitting the limit, but perhaps node memory pressure is making it get killed just past the request (which was 64MB) - so the bump might still help.
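For context, a rough sketch of where such a bump might live if the database sidecar is defined through KubeSpawner's extra_containers - that's an assumption on my part; the real configuration lives in the infrastructure repo, and the container name, image, and sizes below are placeholders:

```python
# jupyterhub_config.py (sketch): raise the sidecar's memory request/limit.
c.KubeSpawner.extra_containers = [
    {
        "name": "postgres-sidecar",           # hypothetical container name
        "image": "postgres:14",               # hypothetical image
        "resources": {
            "requests": {"memory": "128Mi"},  # previously ~64MB per the comment above
            "limits": {"memory": "512Mi"},    # headroom so heavy queries don't get postgres OOM-killed
        },
    },
]
```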
I've also installed jupyterlab-limit-output - while it seems to work in Python, it seems to have no effect on xeus-sql kernel notebooks :(
@jameshowison I've brought in a newer version of xeus-sql with jupyter-xeus/xeus-sql#60 merged, so hopefully the error logs are more helpful next time.
Great. I can just ask everyone to stop and start their server, right? And that will update everything?
@jameshowison nobody else is on, so I think everyone (including you!) has the newest version already. I'm investigating the pgadmin issue. Not sure why that came back...
@jameshowison ok, for the large outputs freezing the browser, I have a partial solution - use the classic notebook. I've enabled https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/limit_output, which seems to work well. Switch to the classic notebook interface using the interface switcher in JupyterLab (top right). If I then run You can also modify your nbgitpuller links to point to classic notebook instead of jupyterlab if you would like, although then you lose a lot of the benefits you get with JupyterLab.
There are basically three issues:
Today, I've put in place fixes for (1) and (2), but not (3). Intermittent issues are always a bit of a mess to debug... I'm hopeful I can put in some more time for (3) tomorrow.
I've also opened deshaw/jupyterlab-limit-output#2 to add support for xeus-sql in jupyterlab-limit-output.
This is really excellent investigative engineering @yuvipanda :-) Would it be helpful for us to document the (as an aside, what a strange world we live in where we can just submit a patch to a random JupyterLab extension developed by DEShaw)
Thanks all. Status here going into class tomorrow is that we have bumped memory on postgres and improved logging with xeus-sql. I intend to use xeus-sql with %%sql magics as a fallback. If we encounter issues I'll take a look at the jupyter logs and flag the user accounts here (so we have a chance of seeing them before the container shuts down after class). I don't read that we have a definite diagnosis of what was causing it; the best assessment is some resource spike on the postgres container, right? My only input there is that I think I was always able to query the postgres sidecar, but I'll check for certain from the terminal if issues show up again. Perhaps it was just transitory - xeus-sql had already terminated, and by the time it came back up postgres was responsive again? One student that was having issues does say that she hasn't seen them in the second half of the day today, which is hopeful!
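If it helps, here is a minimal connectivity probe that could be run from the user server when the error shows up, to confirm whether the sidecar is reachable at that moment. It assumes psycopg2 is installed and that the sidecar accepts passwordless connections on localhost - both assumptions about this setup:

```python
import datetime

import psycopg2

try:
    conn = psycopg2.connect(
        host="localhost", dbname="postgres", user="postgres", connect_timeout=5
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT 1;")
        print(datetime.datetime.now(), "postgres reachable:", cur.fetchone())
    conn.close()
except psycopg2.OperationalError as err:
    # A backend killed mid-session surfaces here, e.g.
    # "FATAL: terminating connection due to administrator command"
    print(datetime.datetime.now(), "postgres NOT reachable:", err)
```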
@jameshowison yeah, current working diagnosis is that it's a resource limit issue on the postgres database. However, the hope is that the additional debug logging in xeus-sql can give us a better answer for why the kernel is crashing - with the previous version, that information was being lost. I'm hopeful that if the kernels crash again, we'll be investigating less blindly.
Hmm - does that error persist? Do you know if others have run into it?
Only me working with it at present, but I'm working in ipython rather than xeus-sql, so I have a backup if it does.
No sign of kernel issues during class today. No repeat of the "Invalid Response Error". I will report as students do their assignments over the remainder of the week and weekend.
@jameshowison GREAT TO HEAR! Is that with xeus-sql or with the ipython magic?
xeus-sql, so that's good!
Have run into another issue. A student's xeus-sql ipynb file is making the server unresponsive. The file is 33.9 MB, so I suspect it contains a number of queries that returned very large results. I downloaded the file and yes, it has some very large output cells. I'm just not sure how to handle this; I have asked that they include LIMIT 100 or something as they work, but we need some guard rails here. There is a discussion on xeus-sql of options for limiting results: jupyter-xeus/xeus-sql#59
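As a stopgap for spotting files like that before they freeze the server, something along these lines could scan a notebook for oversized outputs - a sketch using nbformat, where the filename and the 1 MB threshold are arbitrary:

```python
import json

import nbformat

nb = nbformat.read("student_notebook.ipynb", as_version=4)  # hypothetical filename
for i, cell in enumerate(nb.cells):
    if cell.cell_type != "code":
        continue
    # Measure the serialized size of each cell's outputs.
    size = sum(len(json.dumps(out)) for out in cell.get("outputs", []))
    if size > 1_000_000:  # flag anything over ~1 MB of output
        print(f"cell {i}: ~{size / 1e6:.1f} MB of output")
```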
btw, I removed that large file, but I have a copy if it is ever needed for debugging.
The "Invalid Response: 504" error occurred again this morning, immediately after logging in and creating a new xsql file. Choosing "Restart" and the new xsql kernel seemed to work.
hmmm - maybe it's a race condition of some kind? This only ever happens right after logging in and running for the first time?
Yes, so far I've only seen it on startup. Will report if I see it at other times.
Still have this issue of the Bad Gateway popping up. Sometimes it goes away when I tell it to restart xsql, but just then I had to say "No Kernel" and then start an xsql kernel. Scroll down in http://howisonlab.github.io/datawrangling/class_modules/sql_queries/01_intro/ to see what I mean.
@2i2c-org/tech-team What do we need to move this forward? Do we need the
Maybe @yuvipanda has some ideas here since he set this one up. I seem to remember that he said this suggests there is a websockets issue, but I'm not sure how to debug that. I did find a Stack Overflow post with the same error message, and indeed they mention web sockets in there as well: https://stackoverflow.com/questions/57297987/jupyterhub-kernel-connection-returns-http504-gateway-timeout
I thought the websockets issue was related to this ticket? https://discourse.pangeo.io/t/google-deployment-stuck/2280
I think the current state is:
Is this an accurate summary of current status, @jameshowison? If ipython-sql works fine, then perhaps a longer-term outlook is to move towards making sure ipython-sql works well, and fix pgadmin4...
Hey all - I am going to close out this issue, since I think the hub is in a good-enough steady state, and doesn't require active changes to it. We can spot-check improvements via support channels and new issues.
Summary
We recently deployed the UT Austin hub, which is running a database sidecar for each student (#968). Since the hub came into use, we've gotten reports of intermittent kernel crashes affecting students. This has been difficult to reproduce, but has been happening consistently.
There was originally a FreshDesk ticket about this here: https://2i2c.freshdesk.com/a/tickets/84?lang=en
We thought we had resolved it by increasing the memory limits for each user session, but these issues have persisted. Here are the two major things that we've found:
Here's an issue where we describe some of the problems people have, and how to reproduce them:
#1008 (comment)
Hub information
Timeline (if relevant)
All times in US/Pacific.
2022-02-17
2022-02-18
We conclude that this may be a problem with memory limits students are hitting during queries, and bump their guarantee here: #1008
2022-02-21
We get reports that many students are still running into this problem. It is not reliably replicable, but people are consistently hitting these errors.
ref: #1008 (comment)
TODO: add more as new developments happen
After-action report
These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should make.
What went wrong
Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items.
Where we got lucky
These are good things that happened to us but not because we had planned for them.
Action items
These are only sample subheadings. Every action item should have a GitHub issue
(even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in
infrastructure/, they can be in other repositories.
Process improvements
Documentation improvements
Technical improvements