Advice needed for long running jobs on hubs #184

Closed
jmunroe opened this issue May 16, 2023 · 2 comments

@jmunroe
Contributor

jmunroe commented May 16, 2023

Context

Original source: https://2i2c.freshdesk.com/a/tickets/706

A community asked:

Can you point me to the documentation for how a pod will auto shutdown? We occasionally run long processes that may take a week to finish. There is some confusion among our developers about whether jobs have been killed early by hitting memory limits (I can see on Grafana that this has happened a few times) or whether there is some other 'pod idle detection'.

My response was

Regarding auto shutdown for either Jupyter server instances or kernels, the relevant docs are

https://docs.2i2c.org/admin/howto/control-user-server/#stop-user-servers-after-inactivity

and

https://infrastructure.2i2c.org/sre-guide/manage-k8s/culling/#configure-culling

In particular:

Stop user servers after inactivity
To ensure efficient resource usage, user servers without interactive usage for a period of time (default 1h) are automatically stopped (via jupyterhub-idle-culler). This means your notebook server might be stopped for inactivity even if you have a long running process in the notebook. This timeout can be configured.

While the 1h default is good for most interactive sessions, I don't think changing it to a very long timeout (like 168h) makes sense for long-running processes: there is too much risk that a server will be inadvertently left running.
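
For context, here is a minimal sketch of how that timeout is wired up when jupyterhub-idle-culler is registered directly as a JupyterHub managed service (on z2jh-based hubs such as basehub, the chart generates equivalent configuration from its `cull.*` Helm values, e.g. `cull.timeout`); the 3600 value mirrors the 1h default and is illustrative only:

```python
# jupyterhub_config.py -- sketch of registering jupyterhub-idle-culler as a
# hub-managed service; z2jh generates equivalent config from its cull.* values.
import sys

c.JupyterHub.load_roles = [
    {
        "name": "jupyterhub-idle-culler-role",
        "scopes": [
            "list:users",
            "read:users:activity",
            "read:servers",
            "delete:servers",
        ],
        # grant this role's permissions to the culler service
        "services": ["jupyterhub-idle-culler-service"],
    }
]

c.JupyterHub.services = [
    {
        "name": "jupyterhub-idle-culler-service",
        "command": [
            sys.executable,
            "-m", "jupyterhub_idle_culler",
            "--timeout=3600",  # seconds of inactivity before a user server is stopped
        ],
    }
]
```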

I recall some discussion (perhaps on Slack?) about solutions other groups have used for submitting long-running jobs on a JupyterHub instance, but I can't find anything in the service documentation.

Proposal

Could @2i2c-org/engineering please provide guidance on the original Freshdesk ticket?

Once we have resolved it for this particular community, we should add advice to our service documentation on how to submit long-running jobs on our infrastructure.

Updates and actions

No response

@consideRatio
Contributor

consideRatio commented May 17, 2023

For QCL, I think they would benefit from having all culling logic disabled to avoid issues - but we should warn them that they need to shut down their own servers when they aren't using them.

If they have very expensive machines running for long durations, and those runs incorrectly fail along the way due to culling, I expect that to be the far bigger cost.

Related

Action points

  • Investigate basehub's jupyterhub-idle-culler configuration in jupyterhub
    jupyterhub-idle-culler is not configured explicitly in basehub, but it is enabled by default in z2jh, which culls servers with no activity reported in the last hour
  • Investigate basehub's kernel culling configuration in user servers
    Busy kernels are not culled by the kernel culling, but idle kernels are culled after one hour of idling
  • Investigate if kernel culling is something you opt in to or out of, so that we understand the consequences of removing config via basehub
    It seems that cull_idle_timeout defaults to 0, which means culling of idle kernels is disabled.
    We are setting it to 3600, which means that a server with a long running job will lose its kernel state after 3600 seconds of kernel inactivity (see the sketch after this list).
  • Read up on what Min helped me understand once in Additions to how it works, and a simple "keep alive" strategy jupyterhub/jupyterhub-idle-culler#55
  • Make a decision on how best to help QCL avoid possible disruption of long running jobs
    • Idea 1: disable kernel culling to avoid losing state after a long computation completes
    • Idea 2: disable jupyterhub-idle-culler to avoid losing the server after a period of inactivity
  • Consider if and how we want to update our docs and default config for basehub
    Yes. But I'll open a separate issue about it.
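
For reference on the kernel-culling items above, here is a sketch of the relevant notebook-server / jupyter_server traitlets; the values shown are illustrative, with cull_idle_timeout = 0 corresponding to Idea 1 (kernel culling disabled) and 3600 corresponding to the current 1h setting:

```python
# jupyter_server_config.py -- sketch of the kernel culling knobs discussed above.
# Values are illustrative, not a recommendation.
c = get_config()  # noqa

# 0 disables idle-kernel culling (the upstream default, and what "Idea 1" amounts to);
# basehub currently sets this to 3600, i.e. idle kernels are culled after 1h.
c.MappingKernelManager.cull_idle_timeout = 0

# How often, in seconds, to check for idle kernels to cull.
c.MappingKernelManager.cull_interval = 300

# Busy kernels (e.g. a week-long computation still executing) are only culled if this is True.
c.MappingKernelManager.cull_busy = False

# If True, idle kernels are culled even while a browser is still connected to them.
c.MappingKernelManager.cull_connected = False
```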

@consideRatio consideRatio self-assigned this May 17, 2023
@consideRatio
Contributor

Advice provided. I'll probably reconfigure something for QCL as a follow-up, so I have re-assigned myself to the support ticket.
