
Increase default work-stealing interval by 10x #8997

Merged · 5 commits merged into dask:main on Jan 30, 2025

Conversation

hendrikmakait
Member

On a normal cloud setup, the stealing interval is barely large enough to accommodate the round trips required for moving tasks, not to mention fetching dependencies or performing actual work. I'm increasing the interval to one second (10x), which gives a little more time for actual progress to be made.
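For anyone who wants to experiment with a different value, something like this should work (a minimal sketch, assuming the existing `distributed.scheduler.work-stealing-interval` config key):

```python
# Sketch: overriding the work-stealing interval explicitly via Dask's config.
import dask

dask.config.set({"distributed.scheduler.work-stealing-interval": "1s"})

# or equivalently in ~/.config/dask/distributed.yaml:
# distributed:
#   scheduler:
#     work-stealing-interval: 1s
```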

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait hendrikmakait force-pushed the increase-stealing-interval branch from e5c2909 to 241ab33 on January 30, 2025 11:57
Member

@jacobtomlinson jacobtomlinson left a comment


This seems reasonable. I wonder what impact this might have on other deployments like HPC: are things still the same there, or is communication faster so this is less noticeable?

cc @guillaumeeb

@hendrikmakait
Member Author

hendrikmakait commented Jan 30, 2025

I think this change should be generally beneficial. IIRC, our current advice is that tasks should take at least 100 ms to keep overhead from becoming too large. With the current default, that would mean we balance tasks after every iteration, which seems like overkill, especially given that balancing isn't cheap.
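A rough back-of-the-envelope sketch of that ratio (the 100 ms task duration is just the documented rule of thumb, not a measurement):

```python
# How many balance passes run per task at the old vs. new default interval,
# assuming tasks take the recommended minimum of ~100 ms each.
task_duration = 0.1   # seconds
old_interval = 0.1    # previous default: 100 ms
new_interval = 1.0    # new default: 1 s

print(task_duration / old_interval)  # 1.0 -> one balance pass per task
print(task_duration / new_interval)  # 0.1 -> one balance pass per ~10 tasks
```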

@hendrikmakait
Member Author

FWIW, the impact is somewhat hard to establish because of a bug in the balancing logic that I will address in another PR.

@fjetter
Member

fjetter commented Jan 30, 2025

To expand on this a bit, I think an appropriate lower bound for the stealing interval is something like this:

stealing interval ≳ 3 × (network latency + server latency) + C × average task duration

  • 3: The number of network hops and server responses: check in with the designated victim, the victim replies, and if the victim confirms, assign the task to the thief.
  • C: An arbitrary constant (roughly 10 in this case, assuming very fast tasks) that scales the task duration. Rebalancing significantly faster than the task duration is probably complete overkill, since we'd run the balance step much more often than anything in the system changes.

Especially when very busy, the server latency can be quite high (possibly hundreds of ms; the scheduler and workers may be busy for different reasons).

1s sounds like a reasonable default.
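Plugging some illustrative (assumed, not measured) numbers into that bound:

```python
# Illustrative sketch of the lower bound above; all values are assumptions.
network_latency = 0.01    # ~10 ms round trip on a typical cloud network
server_latency = 0.05     # scheduler/worker processing time when busy
avg_task_duration = 0.1   # recommended minimum task duration (~100 ms)
C = 10                    # arbitrary scaling constant for very fast tasks

lower_bound = 3 * (network_latency + server_latency) + C * avg_task_duration
print(f"{lower_bound:.2f} s")  # 1.18 s, in the ballpark of the new 1 s default
```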

Contributor

github-actions bot commented Jan 30, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  ±0      27 suites  ±0   11h 39m 22s ⏱️ +7m 46s
 4 116 tests ±0   3 997 ✅ −5    111 💤 ±0   7 ❌ +4   1 🔥 +1
51 616 runs  +1  49 310 ✅ −6  2 297 💤 +1   8 ❌ +5   1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 7133528. ± Comparison against base commit fd3722d.

This pull request removes 1 test and adds 1. Note that renamed tests count towards both.
Removed: distributed.tests.test_steal ‑ test_parse_stealing_interval[None-100]
Added:   distributed.tests.test_steal ‑ test_parse_stealing_interval[None-1000]

♻️ This comment has been updated with latest results.

Member

@jacobtomlinson jacobtomlinson left a comment


Thanks for diving deeper here, sounds good to me!

@hendrikmakait hendrikmakait merged commit 5589049 into dask:main Jan 30, 2025
25 of 33 checks passed
@hendrikmakait hendrikmakait deleted the increase-stealing-interval branch January 30, 2025 16:57
@guillaumeeb
Member

> This seems reasonable. I wonder what impact this might have on other deployments like HPC: are things still the same there, or is communication faster so this is less noticeable?

Just answering as I've been tagged: even on HPC, Dask is generally using TCP over IB, which means high bandwidth, but not the low latency you could get with the native IB protocol. Nevertheless, I think that 1 s between work-stealing passes is still short enough for all workflows intended for Dask!
