[API server] Accelerate start by slowly starting workers #4885

Open
wants to merge 2 commits into base: master

Conversation

@aylei (Collaborator) commented Mar 5, 2025

For #4843: the number of uvicorn workers and SkyPilot's background long workers is proportional to the available CPU cores, which causes issues if the machine is not dedicated to our API server, e.g. the machine has 128 cores but the API server can only get 10 of them due to contention. The CPU pressure of launching 128 + 256 worker processes simultaneously would severely slow down the startup.

This PR introduces a TCP-slow-start-like launch for uvicorn workers and long workers: workers are started in small batches that grow over time instead of all at once. The delay between batches is set to a fixed value based on profiling (2 seconds for now) for simplicity.
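For illustration, a minimal sketch of the slow-start idea (the names launch_worker and total_workers are illustrative, not the exact code in this PR; the 2-second delay matches the profiled value mentioned above):

import time

SLOW_START_DELAY = 2.0  # seconds, fixed value chosen from profiling


def slow_start_launch(launch_worker, total_workers: int) -> None:
    """Launch workers in exponentially growing batches with a fixed delay."""
    workers_left = total_workers
    batch_size = 1
    while True:
        current_batch = min(batch_size, workers_left)
        for _ in range(current_batch):
            launch_worker()
        workers_left -= current_batch
        if workers_left <= 0:
            break
        # Give the previous batch time to finish its CPU-heavy module loading
        # before launching more processes.
        time.sleep(SLOW_START_DELAY)
        batch_size *= 2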

Tested in a simulated high-resource-contention environment on my 12-core laptop:

# Trick the API server into thinking the env has 64 CPU cores and 256 GB of memory
$ export SKYPILOT_POD_CPU_CORE_LIMIT=64
$ export SKYPILOT_POD_MEMORY_GB_LIMIT=256

# master, timeout
/usr/bin/time sky api start --deploy
sky.exceptions.ApiServerConnectionError: Could not connect to SkyPilot API server at http://127.0.0.1:46580. Please ensure that the server is running. Try: curl http://127.0.0.1:46580/api/health

During handling of the above exception, another exception occurred:
....

# PR branch, start in 4.2s
/usr/bin/time sky api start --deploy
Failed to connect to SkyPilot API server at http://0.0.0.0:46580. Starting a local server.
✓ SkyPilot API server started.
├── SkyPilot API server: http://0.0.0.0:46580
└── View API server logs at: ~/.sky/api_server/server.log
        4,20 real         1,06 user         0,59 sys

This PR also accelerates the start for environments without any resource contention: the start time went from ~7s to ~3.5s on my laptop, roughly a 2x speedup in daily usage:

# master
$ /usr/bin/time sky api start --deploy
...
        6,96 real         0,54 user         0,56 sys

# PR branch
$ /usr/bin/time sky api start --deploy
...
        3,28 real         0,63 user         0,56 sys

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (see above)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Appendix

Profiling: launching 1 uvicorn server process takes about 1s (sky api start):

[profiling screenshot]

Launching 12 uvicorn server processes, each takes about 2-3s (sky api start --deploy):

[profiling screenshot]

@aylei (Collaborator, Author) commented Mar 5, 2025

/quicktest-core

@aylei aylei marked this pull request as ready for review March 5, 2025 14:23
@aylei (Collaborator, Author) commented Mar 5, 2025

ref: #4886

@aylei aylei requested a review from Michaelvll March 5, 2025 16:20
@Michaelvll (Collaborator) left a comment

Thanks for the fix @aylei! It looks mostly good to me. The strategy of exponentially starting workers may need some discussion.

workers_left -= current_batch
if workers_left <= 0:
    break
time.sleep(delay)
Collaborator:

Just wondering if the delay is a bit too large?

Collaborator (Author):

The delay is intentionally conservative, to ensure that the processes in the previous batch have completed the CPU-intensive module-loading phase by the time we launch the next batch, so that we don't have to bother probing process status and CPU usage.

Collaborator (Author):

Profiling results were added to the PR description

if workers_left <= 0:
    break
time.sleep(delay)
batch_size *= 2
Collaborator:

Just wondering why we need to increase the batch size exponentially. It seems that without TCP-style congestion control, the exponential growth will still overwhelm the CPU at the end. Can we just use a batch size equal to the number of CPU cores, or half of the CPU cores?

Collaborator (Author):

the exponential backoff will still overwhelm the CPU at the end

Good catch! I think we can keep the exponential growth and cap the batch size at half of the CPU cores.

Just wondering why we need to increase the batch size exponentially

Just a heuristic to start slow and accelerate fast: the server has significant CPU contention at start time, and I don't want the time to reach max parallelism to be O(N) in the number of CPU cores.

Actually, I think this case has some similarities with congestion control: both aim for maximum throughput without overwhelming the CPU/bandwidth. The difference is that TCP slow start has a feedback mechanism based on whether the sent packets are ACKed. Here, I feel it would be overkill to probe the launched processes, so I just picked a delay based on profiling and rely on all previously launched processes completing the CPU-intensive module-loading phase within that delay.
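For illustration, capping the exponential growth at half of the CPU cores could look roughly like this (a sketch only; os.cpu_count() stands in for however the server detects its available cores):

import os


def next_batch_size(batch_size: int) -> int:
    """Double the batch size, but never exceed half of the CPU cores."""
    max_batch = max(1, (os.cpu_count() or 1) // 2)
    return min(batch_size * 2, max_batch)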

if processes_left <= 0:
    break
time.sleep(self.slow_start_delay)
batch_size *= 2
Collaborator:

Similar to the comment above.

@@ -0,0 +1,73 @@
"""Uvicorn wrapper for Sky server."""
Collaborator:

Suggested change:
- """Uvicorn wrapper for Sky server."""
+ """Uvicorn wrapper for SkyPilot API server."""

It would be good to include the benchmark numbers in the docstring as well. :)

Collaborator (Author):

Sounds good!

Comment on lines +1123 to +1125
if sub_proc.is_alive():
    sub_proc.terminate()
    sub_proc.join()
Collaborator:

Wondering if we need to clean up the dead processes.
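One way to also reap children that have already exited would be to join() every subprocess and only terminate() the ones still alive, along these lines (a sketch, not the PR's actual code; sub_procs is assumed to be a list of multiprocessing.Process objects):

def cleanup(sub_procs):
    """Terminate live children and join all of them, reaping exited ones."""
    for sub_proc in sub_procs:
        if sub_proc.is_alive():
            sub_proc.terminate()
        # join() even processes that have already exited, so they are reaped
        # and their resources are released.
        sub_proc.join()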
