Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Add per-request replica failure cache in LB to reduce redundant retries #3916

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Sep 5, 2024

Previously, the retry on our load balancer is just select a ready replica, which is possible to select the same replica that failed last time, and its status might not be updated to the LB as the sync with controller only happens every 20s. This PR try to avoid those replica that failed once and trying other instead.

Also, update the controller LB sync interval to 10s to keep align with the probing interval (which is also 10s).

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @cblmemo! Looks mostly good to me.

sky/serve/load_balancer.py Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @cblmemo!

sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
sky/serve/load_balancer.py Outdated Show resolved Hide resolved
@cblmemo cblmemo requested a review from Michaelvll September 19, 2024 19:50
@cblmemo
Copy link
Collaborator Author

cblmemo commented Oct 1, 2024

@Michaelvll bump for this

@cblmemo
Copy link
Collaborator Author

cblmemo commented Oct 10, 2024

bump for review @Michaelvll

sky/serve/load_balancer.py Outdated Show resolved Hide resolved
while True:
retry_cnt += 1
with self._client_pool_lock:
# If all replicas are failed, clear the record and retry them
# again as some of them might be transient networking issues.
if (len(failed_replica_urls) ==
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering how effective the new failed_repica_urls is compared to the original globally increased index? Can we simulate the case when a replica went down and there is a different load and see the success rate / latency?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, did not mean to approve this PR. I feel we need some benchmark to see whether this handling is actually effective

Copy link
Contributor

github-actions bot commented Feb 9, 2025

This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Feb 9, 2025
@cblmemo cblmemo removed the Stale label Feb 10, 2025
@andylizf andylizf self-assigned this Feb 15, 2025
@andylizf andylizf changed the title [Serve] Not using previously failed replica when retry a failed request [Serve] Add per-request replica failure cache in LB to reduce redundant retries Feb 15, 2025
@andylizf
Copy link
Collaborator

andylizf commented Feb 15, 2025

I conducted a local simulation benchmark, and here are some results:

  • With purely random failures, the new per-request failed-replica skipping shows a small improvement (~+1%–2% success).
  • In scenarios with a permanently failing node, the new approach achieves higher success (up to +2%–3%) and slightly lower latency.
  • Under combined spot-like downscaling and a permanently failing node, the new approach shows a clear edge (+1.8% success rate improvement in our tests).

For details, please check the benchmark code: andylizf/skypilot-lb-benchmark. Cc'ing @cblmemo

@andylizf andylizf requested a review from Michaelvll February 15, 2025 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants