
Increase default work-stealing interval by 10x #8997

Merged · 5 commits merged into dask:main on Jan 30, 2025

Conversation

hendrikmakait
Member

On a normal cloud setup, the stealing interval is barely large enough to accommodate the round trips required for moving tasks, not to mention fetching dependencies or performing actual work. I'm increasing the interval to one second (10x), which gives a little more time for actual progress to be made.
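For anyone who wants to experiment with a different value, something like this should work (a minimal sketch, assuming the existing `distributed.scheduler.work-stealing-interval` config key):

```python
# Sketch: overriding the work-stealing interval explicitly via Dask's config.
import dask

dask.config.set({"distributed.scheduler.work-stealing-interval": "1s"})

# or equivalently in ~/.config/dask/distributed.yaml:
# distributed:
#   scheduler:
#     work-stealing-interval: 1s
```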

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait hendrikmakait force-pushed the increase-stealing-interval branch from e5c2909 to 241ab33 on January 30, 2025 11:57
Member

@jacobtomlinson jacobtomlinson left a comment


This seems reasonable. I wonder what impact this might have on other deployments like HPC: are things still the same there, or is communication faster so this is less noticeable?

cc @guillaumeeb

@hendrikmakait
Member Author

hendrikmakait commented Jan 30, 2025

I think this change should be generally beneficial. IIRC, our current advice is that tasks should take at least 100 ms to keep overhead from becoming too large. With the current default, that would mean we balance tasks after every iteration, which seems like overkill, especially given that balancing isn't cheap.
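A rough back-of-the-envelope sketch of that ratio (the 100 ms task duration is just the documented rule of thumb, not a measurement):

```python
# How many balance passes run per task at the old vs. new default interval,
# assuming tasks take the recommended minimum of ~100 ms each.
task_duration = 0.1   # seconds
old_interval = 0.1    # previous default: 100 ms
new_interval = 1.0    # new default: 1 s

print(task_duration / old_interval)  # 1.0 -> one balance pass per task
print(task_duration / new_interval)  # 0.1 -> one balance pass per ~10 tasks
```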

@hendrikmakait
Member Author

FWIW, the impact is somewhat hard to establish because of a bug in the balancing logic that I will address in another PR.

@fjetter
Member

fjetter commented Jan 30, 2025

To expand on this a bit, I think an appropriate lower bound for the stealing interval is something like this:

stealing interval ≳ 3 × (network latency + server latency) + C × average task duration

  • 3: The number of network hops and server responses: check in with the designated victim, the victim replies, and if the victim confirms, assign the task to the thief.
  • C: An arbitrary constant (roughly 10 in this case, assuming very fast tasks) that scales the task duration. Rebalancing significantly faster than the task duration is probably complete overkill, since we'd run the balance step much more often than anything in the system changes.

Especially when very busy, the server latency can be quite high (possibly hundreds of ms; the scheduler and workers may be busy for different reasons).

1s sounds like a reasonable default.
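Plugging some illustrative (assumed, not measured) numbers into that bound:

```python
# Illustrative sketch of the lower bound above; all values are assumptions.
network_latency = 0.01    # ~10 ms round trip on a typical cloud network
server_latency = 0.05     # scheduler/worker processing time when busy
avg_task_duration = 0.1   # recommended minimum task duration (~100 ms)
C = 10                    # arbitrary scaling constant for very fast tasks

lower_bound = 3 * (network_latency + server_latency) + C * avg_task_duration
print(f"{lower_bound:.2f} s")  # 1.18 s, in the ballpark of the new 1 s default
```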

Contributor

github-actions bot commented Jan 30, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  ±0      27 suites  ±0   11h 39m 22s ⏱️ +7m 46s
 4 116 tests ±0   3 997 ✅ −5    111 💤 ±0   7 ❌ +4   1 🔥 +1
51 616 runs  +1  49 310 ✅ −6  2 297 💤 +1   8 ❌ +5   1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 7133528. ± Comparison against base commit fd3722d.

This pull request removes 1 test and adds 1. Note that renamed tests count towards both.
Removed: distributed.tests.test_steal ‑ test_parse_stealing_interval[None-100]
Added:   distributed.tests.test_steal ‑ test_parse_stealing_interval[None-1000]

♻️ This comment has been updated with latest results.

Member

@jacobtomlinson jacobtomlinson left a comment


Thanks for diving deeper here, sounds good to me!

@hendrikmakait hendrikmakait merged commit 5589049 into dask:main Jan 30, 2025
25 of 33 checks passed
@hendrikmakait hendrikmakait deleted the increase-stealing-interval branch January 30, 2025 16:57
@guillaumeeb
Member

> This seems reasonable. I wonder what impact this might have on other deployments like HPC: are things still the same there, or is communication faster so this is less noticeable?

Just answering as I've been tagged: even on HPC, Dask is generally using TCP over IB, which means high bandwidth, but not the low latency you could get with the native IB protocol. Nevertheless, I think that 1 s between work-stealing passes is still short enough for all workflows intended for Dask!
