[API server] Accelerate start by slowly starting workers #4885
base: master
Conversation
Signed-off-by: Aylei <[email protected]>
/quicktest-core
ref: #4886
Thanks for the fix @aylei! It looks mostly good to me. The strategy of exponentially starting workers may need some discussion.
workers_left -= current_batch
if workers_left <= 0:
    break
time.sleep(delay)
Just wondering if the delay is a bit too large?
The delay is intentionally conservative to ensure that the processes in the previous batch have completed the CPU-intensive module loading phase before we launch the next batch, so we don't have to bother probing the process status and CPU usage.
Profiling results were added to the PR description.
if workers_left <= 0:
    break
time.sleep(delay)
batch_size *= 2
Just wondering why we need to increase the batch size exponentially. It seems that without a congestion control mechanism like TCP's, the exponential backoff will still overwhelm the CPU at the end. Can we just use a batch size equal to the number of CPU cores, or half of the CPU cores?
> the exponential backoff will still overwhelm the CPU at the end
Good catch! I think we can keep the exponential thing and cap the batch size to half of the CPU cores.
> Just wondering why we need to increase the batch size exponentially
Just a heuristic idea to start slow and accelerate fast, since the server had significant CPU contention at start time and I don't want the time to reach max parallelism to be O(N) in the number of CPU cores.
Actually I think this case has some similarities with congestion control: both aim for maximum throughput without overwhelming the CPU/bandwidth. The difference is that TCP slow start has a feedback mechanism based on whether the packets sent are ACKed, whereas here probing the launched processes feels like overkill, so I just picked a delay based on profiling and hope that all previously launched processes can complete the CPU-intensive module loading phase within that delay.
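For illustration, here is a minimal sketch of the capped slow-start idea discussed above (the `launch_worker` callback and parameter names are hypothetical; this is not the code in this PR):

```python
import os
import time


def slow_start(launch_worker, num_workers, delay=2.0):
    """Launch workers in exponentially growing batches, capped at half the cores."""
    max_batch = max(1, (os.cpu_count() or 2) // 2)
    batch_size = 1
    workers_left = num_workers
    while workers_left > 0:
        current_batch = min(batch_size, workers_left, max_batch)
        for _ in range(current_batch):
            launch_worker()
        workers_left -= current_batch
        if workers_left <= 0:
            break
        # Give the current batch time to finish the CPU-intensive module
        # loading phase before launching the next, larger batch.
        time.sleep(delay)
        batch_size = min(batch_size * 2, max_batch)
```

Capping `batch_size` at half the cores keeps the later batches from saturating the CPU, as suggested in the comment above, while the exponential growth still reaches max parallelism in O(log N) batches.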
if processes_left <= 0:
    break
time.sleep(self.slow_start_delay)
batch_size *= 2
Similar to the comment above.
@@ -0,0 +1,73 @@
"""Uvicorn wrapper for Sky server.""" |
"""Uvicorn wrapper for Sky server.""" | |
"""Uvicorn wrapper for SkyPilot API server.""" |
Would be good to include the benchmark numbers in the docstr as well. : )
Sounds good!
if sub_proc.is_alive():
    sub_proc.terminate()
    sub_proc.join()
Wondering if we need to clean up the dead processes.
For #4843, the number of uvicorn workers and SkyPilot's background long workers is proportional to the available CPU cores, which causes issues if the machine is not dedicated to our API server, e.g. the machine has 128 cores but the API server can only get 10 of them due to contention. The CPU pressure of launching 128 + 256 worker processes simultaneously would severely slow down the startup.
This PR introduces a TCP-like slow start for uvicorn workers and long workers. The launch delay is set to a fixed value based on profiling (2 seconds now) for simplicity.
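As a rough sketch of how such a launcher could be structured (not the PR's actual implementation; the `worker_fn` entrypoint and class name are hypothetical), including the shutdown path touched on in the review:

```python
import multiprocessing
import time


class SlowStartLauncher:
    """Starts worker processes in exponentially growing batches with a fixed delay."""

    def __init__(self, worker_fn, num_workers, slow_start_delay=2.0):
        self.worker_fn = worker_fn  # hypothetical worker entrypoint
        self.num_workers = num_workers
        self.slow_start_delay = slow_start_delay
        self.procs = []

    def start(self):
        batch_size = 1
        processes_left = self.num_workers
        while processes_left > 0:
            current_batch = min(batch_size, processes_left)
            for _ in range(current_batch):
                proc = multiprocessing.Process(target=self.worker_fn)
                proc.start()
                self.procs.append(proc)
            processes_left -= current_batch
            if processes_left <= 0:
                break
            # Fixed delay chosen from profiling, so the previous batch can
            # finish module loading before the next batch starts.
            time.sleep(self.slow_start_delay)
            batch_size *= 2

    def terminate(self):
        # Terminate live children and join every child, including ones that
        # already exited, so dead processes are reaped as well.
        for sub_proc in self.procs:
            if sub_proc.is_alive():
                sub_proc.terminate()
            sub_proc.join()
```

Joining every child in `terminate()` addresses the review question above about cleaning up dead processes.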
Tested in a simulated high-resource-contention environment on my 12-core laptop.
This PR also accelerates the start time in environments without any resource contention: on my laptop, the start time went from ~7s to ~3.5s, a ~2x speedup in daily usage.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh
Appendix
Profiling: launching 1 uvicorn server process takes about ~1s (`sky api start`).
Launching 12 uvicorn server processes, each takes about 2~3s (`sky api start --deploy`).