Advice needed for long running jobs on hubs #184

Closed
jmunroe opened this issue May 16, 2023 · 2 comments

@jmunroe
Contributor

jmunroe commented May 16, 2023

Context

Original source: https://2i2c.freshdesk.com/a/tickets/706

A community asked:

Can you point me to the documentation for how a pod will auto shutdown? We occasionally run long processes that may take a week to finish. There is some confusion among our developers about whether jobs have been killed early by hitting memory limits (I can see on Grafana that this has happened a few times) or whether there is some other 'pod idle detection'.

My response was

Regarding auto shutdown for either Jupyter server instances or kernels, the relevant docs are

https://docs.2i2c.org/admin/howto/control-user-server/#stop-user-servers-after-inactivity

and

https://infrastructure.2i2c.org/sre-guide/manage-k8s/culling/#configure-culling

In particular:

Stop user servers after inactivity
To ensure efficient resource usage, user servers without interactive usage for a period of time (default 1h) are automatically stopped (via jupyterhub-idle-culler). This means your notebook server might be stopped for inactivity even if you have a long running process in the notebook. This timeout can be configured.

While the 1h default is good for most interactive sessions, I don't think changing it to a very long timeout (like 168h) makes sense for long-running processes: there is too much risk that a server will be inadvertently left running.
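
For context, here is a minimal sketch of how that timeout is wired up when jupyterhub-idle-culler is registered directly as a JupyterHub managed service (on z2jh-based hubs such as basehub, the chart generates equivalent configuration from its `cull.*` Helm values, e.g. `cull.timeout`); the 3600 value mirrors the 1h default and is illustrative only:

```python
# jupyterhub_config.py -- sketch of registering jupyterhub-idle-culler as a
# hub-managed service; z2jh generates equivalent config from its cull.* values.
import sys

c.JupyterHub.load_roles = [
    {
        "name": "jupyterhub-idle-culler-role",
        "scopes": [
            "list:users",
            "read:users:activity",
            "read:servers",
            "delete:servers",
        ],
        # grant this role's permissions to the culler service
        "services": ["jupyterhub-idle-culler-service"],
    }
]

c.JupyterHub.services = [
    {
        "name": "jupyterhub-idle-culler-service",
        "command": [
            sys.executable,
            "-m", "jupyterhub_idle_culler",
            "--timeout=3600",  # seconds of inactivity before a user server is stopped
        ],
    }
]
```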

I recall some discussion (perhaps on Slack?) about solutions other groups have used for submitting long-running jobs on a JupyterHub instance, but I can't find anything in the service documentation.

Proposal

Could @2i2c-org/engineering please provide guidance on the original Freshdesk ticket?

Once we have resolved it for this particular community, we should add advice to our service documentation on how to submit long-running jobs on our infrastructure.

Updates and actions

No response

@consideRatio
Contributor

consideRatio commented May 17, 2023

For QCL, I think they would benefit from having all culling logic disabled to avoid issues - but we should warn them that they need to shut down their own servers when they aren't using them.

If they have very expensive machines running for long durations, and those runs incorrectly fail along the way due to culling, I expect that to be the far bigger cost.

Related

Action points

  • Investigate basehub's jupyterhub-idle-culler configuration in jupyterhub
    jupyterhub-idle-culler is not configured explicitly in basehub, but it is enabled by default in z2jh, which culls servers with no activity reported in the last hour
  • Investigate basehub's kernel culling configuration in user servers
    Busy kernels are not culled by the kernel culling, but idle kernels are culled after one hour of idling
  • Investigate if kernel culling is something you opt in to or out of, so that we understand the consequences of removing config via basehub
    It seems that cull_idle_timeout defaults to 0, which means culling of idle kernels is disabled.
    We are setting it to 3600, which means that a server with a long running job will lose its kernel state after 3600 seconds of kernel inactivity (see the sketch after this list).
  • Read up on what Min helped me understand once in Additions to how it works, and a simple "keep alive" strategy jupyterhub/jupyterhub-idle-culler#55
  • Make a decision on how best to help QCL avoid possible disruption of long running jobs
    • Idea 1: disable kernel culling to avoid losing state after a long computation completes
    • Idea 2: disable jupyterhub-idle-culler to avoid losing the server after a period of inactivity
  • Consider if and how we want to update our docs and default config for basehub
    Yes. But I'll open a separate issue about it.
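
For reference on the kernel-culling items above, here is a sketch of the relevant notebook-server / jupyter_server traitlets; the values shown are illustrative, with cull_idle_timeout = 0 corresponding to Idea 1 (kernel culling disabled) and 3600 corresponding to the current 1h setting:

```python
# jupyter_server_config.py -- sketch of the kernel culling knobs discussed above.
# Values are illustrative, not a recommendation.
c = get_config()  # noqa

# 0 disables idle-kernel culling (the upstream default, and what "Idea 1" amounts to);
# basehub currently sets this to 3600, i.e. idle kernels are culled after 1h.
c.MappingKernelManager.cull_idle_timeout = 0

# How often, in seconds, to check for idle kernels to cull.
c.MappingKernelManager.cull_interval = 300

# Busy kernels (e.g. a week-long computation still executing) are only culled if this is True.
c.MappingKernelManager.cull_busy = False

# If True, idle kernels are culled even while a browser is still connected to them.
c.MappingKernelManager.cull_connected = False
```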

@consideRatio consideRatio self-assigned this May 17, 2023
@consideRatio
Contributor

Advice provided. I'll probably reconfigure something for QCL as a follow-up, so I have re-assigned myself to the support ticket.
