-
-
Notifications
You must be signed in to change notification settings - Fork 723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase default work-stealing interval by 10x #8997
Increase default work-stealing interval by 10x #8997
Conversation
e5c2909
to
241ab33
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable. I wonder what impact this might have on other deployments like HPC, are things still the same? Or is communication faster there so this is less noticible?
cc @guillaumeeb
I think this change should be generally beneficial. IIRC, our current advice is that tasks should take at least 100 ms to avoid overhead from becoming too large. With the current default that would mean that we balance tasks after every iteration. This seems like overkill also given that balancing isn't cheap. |
FWIW, the impact is somewhat hard to establish because of a bug in the balancing logic that I will address in another PR. |
To expand on this a bit, I think an appropriate lower bound for the stealing interval is something like this
Especially when very busy, the server latency can be quite high (hundreds ms? scheduler and workers may be busy for different reasons). 1s sounds like a reasonable default. |
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 27 files ±0 27 suites ±0 11h 39m 22s ⏱️ + 7m 46s For more details on these failures and errors, see this check. Results for commit 7133528. ± Comparison against base commit fd3722d. This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for diving deeper here, sounds good to me!
Just answering as I've been tagged: even on HPC, Dask is generally using TCP over IB, which means high bandwith, but not the latency you could have with real IB protocol. Nevertheless, I thing that 1s between work stealing is really short enough for all workflows intended for Dask! |
On a normal cloud setup, the staling interval is barely large enough to accomodate the roundtrips required for moving the tasks, not to mention fetching dependencies or performing actual work. I'm increasing the interval to one second (10x) which gives a little more time for actual progress to be made.
pre-commit run --all-files