Worker startup timeout leads to inconsistent cluster state #620

Open
zmbc opened this issue Dec 28, 2023 · 3 comments

@zmbc

zmbc commented Dec 28, 2023

Describe the issue: I am using dask_jobqueue on a Slurm cluster. I noticed that, frequently, a cluster that I scaled up to 50 jobs would only actually scale to 45 or so. If I called wait_for_workers, it would hang indefinitely. The Slurm logs showed that the workers that never joined had failed with Reason: failure-to-start-<class 'asyncio.exceptions.TimeoutError'>. However, the SlurmCluster object didn't seem to be picking up that these jobs had failed, and continued to list them as running.

Minimal Complete Verifiable Example: I have not yet been able to reproduce this consistently, since the timeouts are intermittent. Any pointers on how to trigger this timeout on demand would be appreciated and would help me create an MCVE.
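
In the meantime, here is roughly the pattern I'm running. The resource parameters below are placeholders rather than a reproducer, since the failure only shows up intermittently when many jobs start at once:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; the exact values don't seem to matter,
# only that many worker jobs start at roughly the same time.
cluster = SLURMCluster(cores=1, memory="4GB", walltime="01:00:00")
cluster.scale(jobs=50)  # ask Slurm for 50 worker jobs

client = Client(cluster)

# A few jobs die during startup with
#   Reason: failure-to-start-<class 'asyncio.exceptions.TimeoutError'>
# in their Slurm logs, but the cluster still lists them, so this hangs:
client.wait_for_workers(50)
```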

Anything else we need to know?: Nope

Environment:

  • Dask version:
  • Python version:
  • Operating System: Linux
  • Install method (conda, pip, source): pip
@guillaumeeb
Member

Hi @zmbc,

I've seen that on some HPC clusters. Somehow, launching a lot of jobs that use the same software environment at the same time can slow down Worker startup.

It is true that, currently, dask-jobqueue does not enforce the number of Workers that actually started, because there is no way to tell whether Workers are missing due to queue congestion, the walltime being reached, or, in your case, a start-up failure. But I might be missing something; what do you mean by:

SlurmCluster object didn't seem to be picking up that these jobs had failed, and continued to list them as running.

There is probably a way of increasing the Worker start-up timeout, though.
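
For example, something along these lines might be worth trying. I'm not certain which of these settings governs the failure-to-start timeout you are seeing, so treat it as a sketch to experiment with rather than a confirmed fix:

```python
import dask
from dask_jobqueue import SLURMCluster

# Give Workers more time to connect to the Scheduler before giving up.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})

cluster = SLURMCluster(
    cores=1,
    memory="4GB",
    # Seconds a Worker waits for the Scheduler before closing itself.
    death_timeout=120,
)
```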

@zmbc
Author

zmbc commented Jan 17, 2024

Somehow, launching a lot of jobs that use the same software environment at the same time can slow down Worker startup.

Ah, that sounds like a good theory for why the timeouts occur: the software environment itself (the code files) is stored on a network file system and is being read concurrently from a large number of different nodes.

because there is no way to tell whether Workers are missing due to queue congestion, the walltime being reached, or, in your case, a start-up failure

Really? On Slurm, it is trivial to check this: use squeue or sacct to see whether the job is in the failed state. A failed job always needs to be restarted; a pending job doesn't. In fact, I have automated this as a workaround.
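
For what it's worth, the core of my workaround looks roughly like this; how I collect the Slurm job IDs from the cluster object, and the resubmission step itself, are simplified away here:

```python
import subprocess

# States that mean the worker job will never join the cluster.
FAILED_STATES = {"FAILED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY", "CANCELLED"}

def failed_jobs(job_ids):
    """Return the subset of job_ids that Slurm reports as failed."""
    out = subprocess.run(
        ["sacct", "-n", "-X", "-o", "JobID,State", "-j", ",".join(job_ids)],
        capture_output=True, text=True, check=True,
    ).stdout
    failed = set()
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] in FAILED_STATES:
            failed.add(fields[0])
    return failed
```

Any job ID returned here corresponds to a worker that will never join, so those are the ones I resubmit.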

@guillaumeeb
Member

Really? On Slurm, it is trivial to check this: use squeue or sacct to see whether the job is in the failed state. A failed job always needs to be restarted; a pending job doesn't. In fact, I have automated this as a workaround.

Yes, I know that there are ways to do this in all the Scheduler abstractions. But it might not be so simple to implement in this package; there has already been discussion just about retrieving the real "HPC Scheduler" status: #11.

It would be really nice to have contributions here!
