Worker startup timeout leads to inconsistent cluster state #620

Open
zmbc opened this issue Dec 28, 2023 · 3 comments

@zmbc

zmbc commented Dec 28, 2023

Describe the issue: I am using dask_jobqueue on a Slurm cluster. I noticed that, frequently, a cluster that I scaled up to 50 jobs would only actually scale to 45 or so. If I called wait_for_workers, it would hang indefinitely. The Slurm logs showed that the workers that never joined had failed with Reason: failure-to-start-<class 'asyncio.exceptions.TimeoutError'>. However, the SlurmCluster object didn't seem to be picking up that these jobs had failed, and continued to list them as running.

Minimal Complete Verifiable Example: I have not yet been able to reproduce this consistently, since the timeouts are intermittent. Any pointers on how to trigger this timeout on demand would be appreciated and would help me create an MCVE.
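
In the meantime, here is roughly the pattern I'm running. The resource parameters below are placeholders rather than a reproducer, since the failure only shows up intermittently when many jobs start at once:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; the exact values don't seem to matter,
# only that many worker jobs start at roughly the same time.
cluster = SLURMCluster(cores=1, memory="4GB", walltime="01:00:00")
cluster.scale(jobs=50)  # ask Slurm for 50 worker jobs

client = Client(cluster)

# A few jobs die during startup with
#   Reason: failure-to-start-<class 'asyncio.exceptions.TimeoutError'>
# in their Slurm logs, but the cluster still lists them, so this hangs:
client.wait_for_workers(50)
```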

Anything else we need to know?: Nope

Environment:

  • Dask version:
  • Python version:
  • Operating System: Linux
  • Install method (conda, pip, source): pip
@guillaumeeb
Member

Hi @zmbc,

I've seen that on some HPC clusters. Somehow, launching a lot of jobs that use the same software environment at the same time can slow down Worker startup.

It is true that, currently, dask-jobqueue does not enforce the number of Workers that actually started, because there is no way to tell whether Workers are missing due to queue congestion, the walltime being reached, or, in your case, a start-up failure. But I might be missing something; what do you mean by:

SlurmCluster object didn't seem to be picking up that these jobs had failed, and continued to list them as running.

There is probably a way of increasing the Worker start-up timeout, though.
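
For example, something along these lines might be worth trying. I'm not certain which of these settings governs the failure-to-start timeout you are seeing, so treat it as a sketch to experiment with rather than a confirmed fix:

```python
import dask
from dask_jobqueue import SLURMCluster

# Give Workers more time to connect to the Scheduler before giving up.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})

cluster = SLURMCluster(
    cores=1,
    memory="4GB",
    # Seconds a Worker waits for the Scheduler before closing itself.
    death_timeout=120,
)
```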

@zmbc
Author

zmbc commented Jan 17, 2024

Somehow, launching a lot of jobs that use the same software environment at the same time can slow down Worker startup.

Ah, that sounds like a good theory for why the timeouts occur: the software environment itself (the code files) is stored on a network file system and is being read concurrently from a large number of different nodes.

because there is no way to tell whether Workers are missing due to queue congestion, the walltime being reached, or, in your case, a start-up failure

Really? On Slurm, it is trivial to check this: use squeue or sacct to see whether the job is in the failed state. A failed job always needs to be restarted; a pending job doesn't. In fact, I have automated this as a workaround.
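
For what it's worth, the core of my workaround looks roughly like this; how I collect the Slurm job IDs from the cluster object, and the resubmission step itself, are simplified away here:

```python
import subprocess

# States that mean the worker job will never join the cluster.
FAILED_STATES = {"FAILED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY", "CANCELLED"}

def failed_jobs(job_ids):
    """Return the subset of job_ids that Slurm reports as failed."""
    out = subprocess.run(
        ["sacct", "-n", "-X", "-o", "JobID,State", "-j", ",".join(job_ids)],
        capture_output=True, text=True, check=True,
    ).stdout
    failed = set()
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] in FAILED_STATES:
            failed.add(fields[0])
    return failed
```

Any job ID returned here corresponds to a worker that will never join, so those are the ones I resubmit.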

@guillaumeeb
Member

Really? On Slurm, it is trivial to check this: use squeue or sacct to see whether the job is in the failed state. A failed job always needs to be restarted; a pending job doesn't. In fact, I have automated this as a workaround.

Yes, I know that there are ways to do this in all the Scheduler abstractions. But it might not be so simple to implement in this package; there has already been discussion just about retrieving the real "HPC Scheduler" status: #11.

It would be really nice to have contributions here!
