Worker startup timeout leads to inconsistent cluster state #620
Hi @zmbc, I've seen that on some HPC clusters. Somehow, launching a lot of jobs that use the same software environment at the same time can slow down Worker startup. It is true that currently, dask-jobqueue does not enforce the number of Workers that actually started, because there is no way to tell whether Workers are missing due to queue congestion, a reached walltime, or, in your case, a start-up failure. But I might be missing something; what do you mean by:
There is probably a way of increasing the Worker start-up timeout, though.
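For example, something along these lines might help. This is only a sketch, and it is an assumption on my side that the connection timeout and the `death_timeout` option are the knobs relevant to the failure-to-start error you are seeing; the resource numbers are placeholders:

```python
import dask
from dask_jobqueue import SLURMCluster

# Give workers more time to reach the scheduler (the default connect
# timeout is 30s). Whether this is the timeout behind the
# failure-to-start error is an assumption, not something verified here.
dask.config.set({"distributed.comm.timeouts.connect": "120s"})

cluster = SLURMCluster(
    cores=4,            # placeholder resources
    memory="16GB",
    walltime="01:00:00",
    death_timeout=120,  # seconds a worker waits for the scheduler before giving up
)
cluster.scale(jobs=50)
```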
Ah, that sounds like a good theory on why the timeouts occur: the software environment itself (the code files) is probably stored on a network file system and is being read concurrently on a large number of different nodes.
Really? On Slurm, it is trivial to check this: use
Yes, I know there are ways to do this in all Scheduler abstractions. But it might not be that simple to implement in this package; there has already been discussion just about retrieving the real "HPC Scheduler" status: #11. It would be really nice to have contributions here!
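In the meantime, nothing prevents you from cross-checking manually. A rough sketch of what that could look like, assuming the default `dask-worker` job name and a Slurm version that supports `squeue --me` (adapt the command to your site); dask-jobqueue itself does not do this, hence #11:

```python
import subprocess

def compare_slurm_and_dask(cluster):
    # Ask Slurm for this user's dask-worker jobs, one "<jobid> <state>" per line.
    out = subprocess.run(
        ["squeue", "--me", "--name=dask-worker", "-h", "-o", "%i %T"],
        capture_output=True, text=True, check=True,
    ).stdout
    jobs = [line.split() for line in out.splitlines() if line.strip()]
    running = sum(1 for _, state in jobs if state == "RUNNING")

    # What the Dask scheduler actually sees.
    connected = len(cluster.scheduler_info["workers"])

    print(f"Slurm: {len(jobs)} dask-worker jobs submitted, {running} running")
    print(f"Dask scheduler: {connected} workers connected")
```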
Describe the issue: I am using dask_jobqueue on a Slurm cluster. I noticed that frequently, a cluster that I scaled up to 50 jobs would only actually scale to 45 or so. If I called `wait_for_workers`, it would hang indefinitely. The Slurm logs showed that the workers that never joined had failed with `Reason: failure-to-start-<class 'asyncio.exceptions.TimeoutError'>`. However, the SlurmCluster object didn't seem to be picking up that these jobs had failed, and continued to list them as running.

Minimal Complete Verifiable Example: I have yet to be able to consistently reproduce this, since the timeouts are intermittent. Any pointers on how to cause this timeout on command would be appreciated, and would help me create an MCVE.
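For reference, the setup looks roughly like the sketch below. It is not a reproducer (the timeouts are intermittent and the resource numbers are placeholders), but with a bounded wait the problem at least surfaces as an exception rather than an indefinite hang:

```python
import asyncio

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=50)
client = Client(cluster)

try:
    # Without a timeout, this hangs forever when some jobs die with the
    # failure-to-start TimeoutError instead of ever joining the scheduler.
    client.wait_for_workers(n_workers=50, timeout=600)
except (TimeoutError, asyncio.TimeoutError):
    seen = len(cluster.scheduler_info["workers"])
    print(f"Only {seen}/50 workers joined; the remaining jobs likely "
          "failed to start (see the Slurm logs)")
```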
Anything else we need to know?: Nope
Environment: