[Incident] Kernel restarts and internal gateway errors for U. Texas hub #1021
Comments
@jameshowison - if you like, feel free to add context about the problems you're seeing here. I opened this issue so that we could track this as an ongoing incident because it sounds more complex than expected.
I think those peak times correspond with my class (which is 11-12:30 Tues/Thur central time). The only approach to replicating is to do what I'm asking the class to do:
Just noting that I just tried this and was able to run all of those commands without an issue. So maybe this is an error that only pops up when a lot of users are doing it at once?
Yes, the issue is definitely transient. Thing is that I don't know what to do once faced with it. If I was doing this on a server directly I'd be looking at logs, but I don't know if/how to access the logs. Is there some way to centralize logs for review, or could I note username and time for later review? I'm trying to do a screen share with a student that has it frequently (including at non-peak times and with a totally new instance from a new github username). That makes me wonder whether it is a client browser issue - is that possible? I'm also going to try to give people instructions for doing queries via IPython SQL magics, in case this is a xeus-sql issue (I think it's only happened in the xeus-sql kernels, but we haven't used ipython kernels yet).
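For reference, a minimal sketch of that IPython SQL-magic fallback, assuming the ipython-sql extension is available on the hub image; the connection URL and table name below are placeholders, not the hub's actual values:

```python
# Load the ipython-sql extension and point it at the sidecar database.
# The connection URL is a guess at the sidecar setup; adjust as needed.
%load_ext sql
%sql postgresql://postgres@localhost:5432/postgres

# Then, in its own cell, the %%sql cell magic runs plain SQL:
%%sql
SELECT *
FROM some_table   -- hypothetical table name
LIMIT 100;
```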
I just did a screen share with the user on the hub with the github username. I tried using the built-in terminal to find logs, but no luck. Googling suggested
I was able to use the admin interface to connect to the server for
I'm sorry you're running into this, @jameshowison - definitely frustrating. Thank you for helping us debug as we go along! Notebook logs are written to I think if we could actually get the value of the
On the postgresql container side, I can see that there's an error message saying 'FATAL: terminating connection due to administrator command' (query), and its frequency looks like this: Looking around, it looks like this might be caused by an external process sending SIGTERM to the database (https://www.postgresql.org/message-id/4564.1284559661@sss.pgh.pa.us). In our case, that might be because postgres is hitting a memory limit. I'm going to bump the limit up and see if that helps - earlier we had just bumped up the limit for the data the kernel can handle, but now I'm going to do that for the database itself.
Increase postgresql limits by a bit - it is possible that the kernel deaths are caused by the postgres container hitting its memory limit and dying. The graph in 2i2c-org#1021 (comment) makes it look like it isn't hitting the limit, but perhaps node memory pressure is making it get killed just past the request (which was 64MB) - so the bump might still help.
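For context, a rough sketch of where such a bump might live if the database sidecar is defined through KubeSpawner's extra_containers - that's an assumption on my part; the real configuration lives in the infrastructure repo, and the container name, image, and sizes below are placeholders:

```python
# jupyterhub_config.py (sketch): raise the sidecar's memory request/limit.
c.KubeSpawner.extra_containers = [
    {
        "name": "postgres-sidecar",           # hypothetical container name
        "image": "postgres:14",               # hypothetical image
        "resources": {
            "requests": {"memory": "128Mi"},  # previously ~64MB per the comment above
            "limits": {"memory": "512Mi"},    # headroom so heavy queries don't get postgres OOM-killed
        },
    },
]
```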
I've also installed jupyterlab-limit-output - while it seems to work in Python, it seems to have no effect on xeus-sql kernel notebooks :(
@jameshowison I've brought in a newer version of xeus-sql with jupyter-xeus/xeus-sql#60 merged, so hopefully the error logs are more helpful next time.
Great. I can just ask everyone to stop and start their server, right? And that will update everything?
@jameshowison nobody else is on, so I think everyone (including you!) has the newest version already. I'm investigating the pgadmin issue. Not sure why that came back...
@jameshowison ok, for the large outputs freezing the browser, I have a partial solution - use the classic notebook. I've enabled https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/limit_output, which seems to work well. Switch to the classic notebook interface using the interface switcher in JupyterLab (top right). If I then run You can also modify your nbgitpuller links to point to classic notebook instead of jupyterlab if you would like, although then you lose a lot of the benefits you get with JupyterLab.
There are basically three issues:
Today, I've put in place fixes for (1) and (2), but not (3). Intermittent issues are always a bit of a mess to debug... I'm hopeful I can put in some more time for (3) tomorrow.
I've also opened deshaw/jupyterlab-limit-output#2 to add support for xeus-sql in jupyterlab-limit-output.
This is really excellent investigative engineering @yuvipanda :-) Would it be helpful for us to document the (as an aside, what a strange world we live in where we can just submit a patch to a random JupyterLab extension developed by DEShaw)
Thanks all. Status here going into class tomorrow is that we have bumped memory on postgres and improved logging with xeus-sql. I intend to use xeus-sql with %%sql magics as a fallback. If we encounter issues I'll take a look at the jupyter logs and flag the user accounts here (so we have a chance of seeing them before the container shuts down after class). I don't read that we have a definite diagnosis of what was causing it; the best assessment is some resource spike on the postgres container, right? My only input there is that I think I was always able to query the postgres sidecar, but I'll check for certain from the terminal if issues show up again. Perhaps it was just transitory - xeus-sql had already terminated, and by the time it came back up postgres was responsive again? One student that was having issues does say that she hasn't seen them in the second half of the day today, which is hopeful!
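If it helps, here is a minimal connectivity probe that could be run from the user server when the error shows up, to confirm whether the sidecar is reachable at that moment. It assumes psycopg2 is installed and that the sidecar accepts passwordless connections on localhost - both assumptions about this setup:

```python
import datetime

import psycopg2

try:
    conn = psycopg2.connect(
        host="localhost", dbname="postgres", user="postgres", connect_timeout=5
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT 1;")
        print(datetime.datetime.now(), "postgres reachable:", cur.fetchone())
    conn.close()
except psycopg2.OperationalError as err:
    # A backend killed mid-session surfaces here, e.g.
    # "FATAL: terminating connection due to administrator command"
    print(datetime.datetime.now(), "postgres NOT reachable:", err)
```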
@jameshowison yeah, current working diagnosis is that it's a resource limit issue on the postgres database. However, the hope is that the additional debug logging in xeus-sql can give us a better answer for why the kernel is crashing - with the previous version, that information was being lost. I'm hopeful that if the kernels crash again, we'll be investigating less blindly.
Hmm - does that error persist? Do you know if others have run into it?
Only me working with it at present, but I'm working in ipython rather than xeus-sql, so I have a backup if it does.
No sign of kernel issues during class today. No repeat of the "Invalid Response Error". I will report as students do their assignments over the remainder of the week and weekend.
@jameshowison GREAT TO HEAR! Is that with xeus-sql or with the ipython magic?
xeus-sql, so that's good!
Have run into another issue. A student's xeus-sql ipynb file is making the server unresponsive. The file is 33.9 MB, so I suspect it contains a number of queries that returned very large results. I downloaded the file and yes, it has some very large output cells. I'm just not sure how to handle this; I have asked that they include LIMIT 100 or something as they work, but we need some guard rails here. There is a discussion on xeus-sql of options for limiting results: jupyter-xeus/xeus-sql#59
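As a stopgap for spotting files like that before they freeze the server, something along these lines could scan a notebook for oversized outputs - a sketch using nbformat, where the filename and the 1 MB threshold are arbitrary:

```python
import json

import nbformat

nb = nbformat.read("student_notebook.ipynb", as_version=4)  # hypothetical filename
for i, cell in enumerate(nb.cells):
    if cell.cell_type != "code":
        continue
    # Measure the serialized size of each cell's outputs.
    size = sum(len(json.dumps(out)) for out in cell.get("outputs", []))
    if size > 1_000_000:  # flag anything over ~1 MB of output
        print(f"cell {i}: ~{size / 1e6:.1f} MB of output")
```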
btw, I removed that large file, but I have a copy if it is ever needed for debugging.
The "Invalid Response: 504" error occurred again this morning, immediately after logging in and creating a new xsql file. Choosing "Restart" and the new xsql kernel seemed to work.
hmmm - maybe it's a race condition of some kind? This only ever happens right after logging in and running for the first time?
Yes, so far I've only seen it on startup. Will report if I see it at other times.
Still have this issue of the Bad Gateway popping up. Sometimes it goes away when I tell it to restart xsql, but just then I had to say "No Kernel" and then start an xsql kernel. Scroll down in http://howisonlab.github.io/datawrangling/class_modules/sql_queries/01_intro/ to see what I mean.
@2i2c-org/tech-team What do we need to move this forward? Do we need the
Maybe @yuvipanda has some ideas here since he set this one up. I seem to remember that he said this suggests there is a websockets issue, but I'm not sure how to debug that. I did find a Stack Overflow post with the same error message, and indeed they mention web sockets in there as well: https://stackoverflow.com/questions/57297987/jupyterhub-kernel-connection-returns-http504-gateway-timeout
I thought the websockets issue was related to this ticket? https://discourse.pangeo.io/t/google-deployment-stuck/2280
I think the current state is:
Is this an accurate summary of current status, @jameshowison? If ipython-sql works fine, then perhaps a longer-term outlook is to move towards making sure ipython-sql works well, and fix pgadmin4...
Hey all - I am going to close out this issue, since I think the hub is in a good-enough steady state, and doesn't require active changes to it. We can spot-check improvements via support channels and new issues.
Summary
We recently deployed the UT Austin hub, which is running a database sidecar for each student (#968). Since the hub came into use, we've gotten reports of intermittent kernel crashes affecting students. This has been difficult to reproduce, but has been happening consistently.
There was originally a FreshDesk ticket about this here: https://2i2c.freshdesk.com/a/tickets/84?lang=en
We thought we had resolved it by increasing the memory limits for each user session, but these issues have persisted. Here are the two major things that we've found:
Here's an issue where we describe some of the problems people have, and how to reproduce them:
#1008 (comment)
Hub information
Timeline (if relevant)
All times in US/Pacific.
2022-02-17
2022-02-18
We conclude that this may be a problem with memory limits students are hitting during queries, and bump their guarantee here: #1008
2022-02-21
We get reports that many students are still running into this problem. It is not reliably replicable, but people are consistently hitting these errors.
ref: #1008 (comment)
TODO: add more as new developments happen
After-action report
These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should make.
What went wrong
Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items.
Where we got lucky
These are good things that happened to us but not because we had planned for them.
Action items
These are only sample subheadings. Every action item should have a GitHub issue
(even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in
infrastructure/, they can be in other repositories.
Process improvements
Documentation improvements
Technical improvements