[Serve] Add per-request replica failure cache in LB to reduce redundant retries #3916

cblmemo · 2024-09-05T20:49:38Z

Previously, a retry on our load balancer simply selected a ready replica, so it could pick the same replica that had just failed. That replica's status might not yet be updated on the LB, because the sync with the controller only happens every 20s. This PR avoids replicas that have already failed for the current request and tries other replicas instead.

Also, update the controller-LB sync interval to 10s to keep it aligned with the probing interval (which is also 10s).
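
The retry behavior described above can be illustrated with a minimal sketch. This is not the code in sky/serve/load_balancer.py: the function names `select_replica` and `send_with_retry`, the `try_send` callback, and the `start_index` parameter are hypothetical and chosen only for illustration. The name `failed_replica_urls` follows the diff in this PR, and the set-difference check mirrors what a later commit title in this PR describes ("use set diff to handle ready_replicas concurrently shrinking").

```python
from typing import Callable, List, Optional, Set


def select_replica(ready_replicas: List[str], failed_replica_urls: Set[str],
                   start_index: int) -> Optional[str]:
    """Pick a ready replica that has not failed for the current request.

    Hypothetical helper: scans round-robin from start_index and skips
    any replica recorded in failed_replica_urls.
    """
    if not ready_replicas:
        return None
    # If every ready replica has already failed, clear the record and
    # retry them, since some failures may be transient. A set difference
    # is used instead of a length comparison so that stale entries (for
    # replicas no longer in the ready list) do not hide this case.
    if not set(ready_replicas) - failed_replica_urls:
        failed_replica_urls.clear()
    n = len(ready_replicas)
    for offset in range(n):
        url = ready_replicas[(start_index + offset) % n]
        if url not in failed_replica_urls:
            return url
    return None


def send_with_retry(ready_replicas: List[str],
                    try_send: Callable[[str], bool],
                    max_retries: int = 3) -> Optional[str]:
    """Try up to max_retries replicas; return the URL that succeeded."""
    # The failure record is scoped to one request, not shared globally.
    failed_replica_urls: Set[str] = set()
    for attempt in range(max_retries):
        url = select_replica(ready_replicas, failed_replica_urls, attempt)
        if url is None:
            return None
        if try_send(url):
            return url
        failed_replica_urls.add(url)
    return None
```

Because the failure record lives only for the duration of one request, a replica that failed once is not penalized for other requests; the controller sync remains the source of truth for which replicas are ready.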

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Michaelvll

Thanks for the fix @cblmemo! Looks mostly good to me.

sky/serve/load_balancer.py

Michaelvll

Thanks for the update @cblmemo!

sky/serve/load_balancer.py

Co-authored-by: Zhanghao Wu <[email protected]>

cblmemo · 2024-10-01T23:01:03Z

@Michaelvll bump for this

cblmemo · 2024-10-10T00:23:54Z

bump for review @Michaelvll

sky/serve/load_balancer.py

Michaelvll · 2024-10-11T07:21:07Z

sky/serve/load_balancer.py

        while True:
            retry_cnt += 1
            with self._client_pool_lock:
+                # If all replicas are failed, clear the record and retry them
+                # again as some of them might be transient networking issues.
+                if (len(failed_replica_urls) ==


I am wondering how effective the new failed_replica_urls is compared to the original globally increased index. Can we simulate the case where a replica goes down under different loads and compare the success rate / latency?

Michaelvll

Sorry, I did not mean to approve this PR. I feel we need a benchmark to see whether this handling is actually effective.

github-actions · 2025-02-09T02:00:49Z

This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

andylizf · 2025-02-15T10:47:25Z

I conducted a local simulation benchmark, and here are some results:

With purely random failures, the new per-request failed-replica skipping shows a small improvement (~+1%–2% success).
In scenarios with a permanently failing node, the new approach achieves higher success (up to +2%–3%) and slightly lower latency.
Under combined spot-like downscaling and a permanently failing node, the new approach shows a clear edge (+1.8% success rate improvement in our tests).

For details, please check the benchmark code: andylizf/skypilot-lb-benchmark. Cc'ing @cblmemo
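
For readers who want a feel for this kind of comparison, here is a minimal, self-contained sketch. It is not the code from the linked benchmark repository and does not reproduce its numbers. All names and parameters (`simulate`, `skip_failed`, `dead`, `transient_fail_prob`) are hypothetical. As a simplifying assumption, the baseline picks a replica uniformly at random on each attempt, which roughly models a shared round-robin index that other concurrent requests advance between retries.

```python
import random
from typing import FrozenSet, Set


def simulate(skip_failed: bool,
             num_requests: int = 10000,
             num_replicas: int = 4,
             dead: FrozenSet[int] = frozenset({0}),
             transient_fail_prob: float = 0.1,
             max_retries: int = 3,
             seed: int = 0) -> float:
    """Return the fraction of requests that succeed within max_retries.

    Replicas in `dead` always fail (a permanently failing node that the
    LB still believes is ready); the others fail with probability
    transient_fail_prob on each attempt.
    """
    rng = random.Random(seed)
    replicas = list(range(num_replicas))
    successes = 0
    for _ in range(num_requests):
        failed: Set[int] = set()  # per-request failure record
        for _ in range(max_retries):
            if skip_failed:
                candidates = [r for r in replicas if r not in failed]
                if not candidates:
                    # Everything failed once: clear the record and retry.
                    failed.clear()
                    candidates = replicas
            else:
                candidates = replicas
            replica = rng.choice(candidates)
            ok = replica not in dead and rng.random() >= transient_fail_prob
            if ok:
                successes += 1
                break
            failed.add(replica)
    return successes / num_requests


if __name__ == "__main__":
    print("baseline success rate:   ", simulate(skip_failed=False))
    print("skip-failed success rate:", simulate(skip_failed=True))
```

Under these assumptions, the skipping strategy should come out ahead because a request never spends a second attempt on the permanently failing replica. Latency, load, and spot-like downscaling are not modeled here.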

cblmemo added 2 commits September 5, 2024 13:47

fix

847d78e

change to 10

97e944f

Michaelvll reviewed Sep 9, 2024

sky/serve/load_balancer.py (resolved)

cblmemo added 4 commits September 10, 2024 23:04

upd

56ccf63

todo

39444aa

upd comments

0f4c9ea

comment

f94491c

Michaelvll reviewed Sep 13, 2024

sky/serve/load_balancer.py (outdated, resolved)

sky/serve/load_balancer.py (outdated, resolved)

sky/serve/load_balancer.py (outdated, resolved)

cblmemo and others added 3 commits September 12, 2024 22:44

apply suggestions from code review

2898816

Update sky/serve/load_balancer.py

a6980b7

Co-authored-by: Zhanghao Wu <[email protected]>

format

615181c

cblmemo requested a review from Michaelvll September 19, 2024 19:50

Michaelvll approved these changes Oct 11, 2024

Michaelvll requested changes Oct 11, 2024

github-actions bot added the Stale label Feb 9, 2025

cblmemo removed the Stale label Feb 10, 2025

andylizf self-assigned this Feb 15, 2025

andylizf changed the title ~~[Serve] Not using previously failed replica when retry a failed request~~ [Serve] Add per-request replica failure cache in LB to reduce redundant retries Feb 15, 2025

andylizf added 5 commits February 15, 2025 08:09

fix: use set diff to handle ready_replicas concurrently shrinking

c126ac5

refactor: less comments

bdb5699

refactor: on_event('startup') deprecated

4fb611e

Merge branch 'master' into serve-fix-retry

a6558be

fix: register lifespan

39facb2

andylizf requested a review from Michaelvll February 15, 2025 10:52
