Increase minimum timeout to wait for workers in CI (#1192) (#1193)
We have been getting timeouts while waiting for workers in CI that are not reproducible locally. The likely cause is congestion that makes worker spin-up take longer in CI. This change therefore introduces an environment variable, DASK_CUDA_WAIT_WORKERS_MIN_TIMEOUT, that controls the minimum timeout, and doubles the minimum timeout in CI.

Authors:
   - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
   - GALI PREM SAGAR (https://github.com/galipremsagar)
   - Ray Douglass (https://github.com/raydouglass)
pentschev authored Jun 6, 2023
1 parent 59c1553 commit cdb38ad
Showing 2 changed files with 6 additions and 1 deletion.
1 change: 1 addition & 0 deletions ci/test_python.sh
@@ -41,6 +41,7 @@ set +e
rapids-logger "pytest dask-cuda"
pushd dask_cuda
DASK_CUDA_TEST_SINGLE_GPU=1 \
DASK_CUDA_WAIT_WORKERS_MIN_TIMEOUT=20 \
UCXPY_IFNAME=eth0 \
UCX_WARN_UNUSED_ENV_VARS=n \
UCX_MEMTYPE_CACHE=n \
6 changes: 5 additions & 1 deletion dask_cuda/utils.py
@@ -446,7 +446,9 @@ def wait_workers(
client: distributed.Client
Instance of client, used to query for number of workers connected.
min_timeout: float
Minimum number of seconds to wait before timeout.
Minimum number of seconds to wait before timeout. This value may be
overridden by setting the `DASK_CUDA_WAIT_WORKERS_MIN_TIMEOUT` with
a positive integer.
seconds_per_gpu: float
Seconds to wait for each GPU on the system. For example, if its
value is 2 and there is a total of 8 GPUs (workers) being started,
@@ -463,6 +465,8 @@ def wait_workers(
-------
True if all workers were started, False if a timeout occurs.
"""
min_timeout_env = os.environ.get("DASK_CUDA_WAIT_WORKERS_MIN_TIMEOUT", None)
min_timeout = min_timeout if min_timeout_env is None else int(min_timeout_env)
n_gpus = n_gpus or get_n_gpus()
timeout = max(min_timeout, seconds_per_gpu * n_gpus)
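
The override logic in this hunk can be exercised in isolation. The following is a minimal sketch, not the dask_cuda implementation: `effective_timeout` and its `environ` parameter are hypothetical names introduced for illustration, and the default values for `min_timeout` and `seconds_per_gpu` are assumptions. Only the environment variable name and the two-line override come from the diff.

```python
import os

ENV_VAR = "DASK_CUDA_WAIT_WORKERS_MIN_TIMEOUT"


def effective_timeout(min_timeout=10, seconds_per_gpu=2, n_gpus=1, environ=None):
    """Return the timeout in seconds that wait_workers would use.

    Hypothetical helper: mirrors the override added in this commit, but
    takes the environment as an argument so it can be tested without
    touching os.environ. Default values are assumptions.
    """
    environ = os.environ if environ is None else environ
    min_timeout_env = environ.get(ENV_VAR, None)
    # When set, the environment variable replaces the min_timeout argument.
    # int() raises ValueError for non-integer strings such as "20.5".
    min_timeout = min_timeout if min_timeout_env is None else int(min_timeout_env)
    # The per-GPU budget still wins when it exceeds the minimum.
    return max(min_timeout, seconds_per_gpu * n_gpus)


if __name__ == "__main__":
    print(effective_timeout(environ={}))
    print(effective_timeout(environ={ENV_VAR: "20"}))
    print(effective_timeout(n_gpus=16, environ={ENV_VAR: "20"}))
```

Note that the variable replaces the `min_timeout` argument rather than acting as a floor on it, so a caller passing a larger `min_timeout` would still get the environment value.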
