[Serve] Add per-request replica failure cache in LB to reduce redundant retries #3916
base: master
Conversation
Thanks for the fix @cblmemo! Looks mostly good to me.
Thanks for the update @cblmemo!
Co-authored-by: Zhanghao Wu <[email protected]>
@Michaelvll bump for this
bump for review @Michaelvll
sky/serve/load_balancer.py
Outdated
while True:
    retry_cnt += 1
    with self._client_pool_lock:
        # If all replicas have failed, clear the record and retry them
        # again, as some of them might be transient networking issues.
        if (len(failed_replica_urls) ==
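For context, here is a minimal self-contained sketch of the per-request retry pattern this snippet belongs to. Everything beyond the lines quoted above — the right-hand side of the comparison, the selection and send helpers, and the retry limit — is assumed for illustration and is not the PR's actual code:

```python
import random


def send_with_failure_cache(ready_replica_urls, send_request, max_retries=3):
    """Send one request, skipping replicas that already failed for it.

    `send_request` and `max_retries` are illustrative stand-ins, not
    names from the PR.
    """
    if not ready_replica_urls:
        raise RuntimeError('No ready replicas.')
    # Per-request cache of replicas that failed while serving this request.
    failed_replica_urls = set()
    retry_cnt = 0
    while True:
        retry_cnt += 1
        # If all replicas have failed, clear the record and retry them
        # again, as some of the failures might be transient networking
        # issues.
        if len(failed_replica_urls) == len(ready_replica_urls):
            failed_replica_urls.clear()
        candidates = [url for url in ready_replica_urls
                      if url not in failed_replica_urls]
        url = random.choice(candidates)
        try:
            return send_request(url)
        except ConnectionError:
            failed_replica_urls.add(url)
            if retry_cnt >= max_retries:
                raise
```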
I am wondering how effective the new failed_replica_urls
is compared to the original globally incremented index. Can we simulate the case where a replica goes down under different loads, and measure the success rate / latency?
Sorry, I did not mean to approve this PR. I feel we need some benchmarks to see whether this handling is actually effective.
This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
I conducted a local simulation benchmark, and here are some results:
For details, please check the benchmark code: andylizf/skypilot-lb-benchmark. Cc'ing @cblmemo
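As a rough illustration of what such a simulation measures, here is a toy sketch (assumed parameters; not the linked benchmark code, and the baseline is naive resampling rather than the original global round-robin index). It models a stale LB view in which some dead replicas still look ready:

```python
import random


def simulate(num_replicas=5, num_down=2, max_retries=3, trials=100_000,
             use_failure_cache=True):
    """Return the fraction of requests that succeed within max_retries."""
    replicas = list(range(num_replicas))
    # Stale view: the LB still believes these replicas are ready.
    dead = set(range(num_down))
    successes = 0
    for _ in range(trials):
        failed = set()  # Per-request failure cache.
        for _ in range(max_retries):
            if use_failure_cache:
                pool = [r for r in replicas if r not in failed]
            else:
                pool = replicas
            choice = random.choice(pool)
            if choice not in dead:
                successes += 1
                break
            failed.add(choice)
    return successes / trials


print('with per-request cache  :', simulate(use_failure_cache=True))
print('without cache (resample):', simulate(use_failure_cache=False))
```

With these defaults the cached variant always finds a live replica within three retries, while naive resampling fails on roughly (2/5)^3 ≈ 6% of requests.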
Previously, a retry in our load balancer simply selected any ready replica, which could pick the same replica that failed last time; its status might not yet be updated in the LB, since the sync with the controller only happens every 20s. This PR avoids replicas that have already failed for the current request and tries other replicas instead.
Also, update the controller-LB sync interval to 10s to keep it aligned with the probing interval (which is also 10s).
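For reference, a sketch of the interval change; the constant names below are assumptions for illustration and may not match the actual definitions in sky/serve:

```python
# Assumed constant names, for illustration only.
# Controller -> LB replica-info sync: reduced from 20s to 10s so that a
# replica marked down by probing is propagated to the LB within one
# probe cycle.
CONTROLLER_LB_SYNC_INTERVAL_SECONDS = 10
# Replica readiness-probe interval on the controller (unchanged).
READINESS_PROBE_INTERVAL_SECONDS = 10
```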
Tested (run the relevant ones):
- bash format.sh
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py::test_fill_in_the_name
- conda deactivate; bash -i tests/backward_compatibility_tests.sh