[API server] Accelerate start by slowly starting workers #4885
base: master
Conversation
Signed-off-by: Aylei <[email protected]>
/quicktest-core
ref: #4886
Thanks for the fix @aylei! It looks mostly good to me. The strategy of exponentially starting workers may need some discussion.
workers_left -= current_batch
if workers_left <= 0:
    break
time.sleep(delay)
Just wondering if the delay is a bit too large?
The delay is intentionally conservative to ensure that the processes in the previous batch have completed the CPU-intensive module loading phase before we launch the next batch, so we don't have to bother probing the process status and CPU usage.
Profiling results were added to the PR description.
if workers_left <= 0:
    break
time.sleep(delay)
batch_size *= 2
Just wondering why we need to increase the batch size exponentially. It seems that without a congestion control mechanism like TCP's, the exponential backoff will still overwhelm the CPU at the end. Can we just use a batch size equal to the number of CPU cores, or half of the CPU cores?
> the exponential backoff will still overwhelm the CPU at the end
Good catch! I think we can keep the exponential thing and cap the batch size to half of the CPU cores.
> Just wondering why we need to increase the batch size exponentially
Just a heuristic idea to start slow and accelerate fast, since the server had significant CPU contention at start time and I don't want the time to reach max parallelism to be O(N) in the number of CPU cores.
Actually I think this case has some similarities with congestion control: both aim for maximum throughput without overwhelming the CPU/bandwidth. The difference is that TCP slow start has a feedback mechanism based on whether the packets sent are ACKed, whereas here probing the launched processes feels like overkill, so I just picked a delay based on profiling and hope that all previously launched processes can complete the CPU-intensive module loading phase within that delay.
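For illustration, here is a minimal sketch of the capped slow-start idea discussed above (the `launch_worker` callback and parameter names are hypothetical; this is not the code in this PR):

```python
import os
import time


def slow_start(launch_worker, num_workers, delay=2.0):
    """Launch workers in exponentially growing batches, capped at half the cores."""
    max_batch = max(1, (os.cpu_count() or 2) // 2)
    batch_size = 1
    workers_left = num_workers
    while workers_left > 0:
        current_batch = min(batch_size, workers_left, max_batch)
        for _ in range(current_batch):
            launch_worker()
        workers_left -= current_batch
        if workers_left <= 0:
            break
        # Give the current batch time to finish the CPU-intensive module
        # loading phase before launching the next, larger batch.
        time.sleep(delay)
        batch_size = min(batch_size * 2, max_batch)
```

Capping `batch_size` at half the cores keeps the later batches from saturating the CPU, as suggested in the comment above, while the exponential growth still reaches max parallelism in O(log N) batches.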
if processes_left <= 0:
    break
time.sleep(self.slow_start_delay)
batch_size *= 2
Similar to the comment above.
@@ -0,0 +1,73 @@
"""Uvicorn wrapper for Sky server.""" |
"""Uvicorn wrapper for Sky server.""" | |
"""Uvicorn wrapper for SkyPilot API server.""" |
Would be good to include the benchmark numbers in the docstr as well. : )
Sounds good!
if sub_proc.is_alive():
    sub_proc.terminate()
    sub_proc.join()
Wondering if we need to clean up the dead processes.
For #4843, the number of uvicorn workers and SkyPilot's background long workers is proportional to the available CPU cores, which causes issues if the machine is not dedicated to our API server, e.g. the machine has 128 cores but the API server can only get 10 of them due to contention. The CPU pressure of launching 128 + 256 worker processes simultaneously would severely slow down the startup.
This PR introduces a TCP-like slow start for uvicorn workers and long workers. The launch delay is set to a fixed value based on profiling (2 seconds now) for simplicity.
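As a rough sketch of how such a launcher could be structured (not the PR's actual implementation; the `worker_fn` entrypoint and class name are hypothetical), including the shutdown path touched on in the review:

```python
import multiprocessing
import time


class SlowStartLauncher:
    """Starts worker processes in exponentially growing batches with a fixed delay."""

    def __init__(self, worker_fn, num_workers, slow_start_delay=2.0):
        self.worker_fn = worker_fn  # hypothetical worker entrypoint
        self.num_workers = num_workers
        self.slow_start_delay = slow_start_delay
        self.procs = []

    def start(self):
        batch_size = 1
        processes_left = self.num_workers
        while processes_left > 0:
            current_batch = min(batch_size, processes_left)
            for _ in range(current_batch):
                proc = multiprocessing.Process(target=self.worker_fn)
                proc.start()
                self.procs.append(proc)
            processes_left -= current_batch
            if processes_left <= 0:
                break
            # Fixed delay chosen from profiling, so the previous batch can
            # finish module loading before the next batch starts.
            time.sleep(self.slow_start_delay)
            batch_size *= 2

    def terminate(self):
        # Terminate live children and join every child, including ones that
        # already exited, so dead processes are reaped as well.
        for sub_proc in self.procs:
            if sub_proc.is_alive():
                sub_proc.terminate()
            sub_proc.join()
```

Joining every child in `terminate()` addresses the review question above about cleaning up dead processes.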
Tested in a simulated high-resource-contention environment on my 12-core laptop.
This PR also accelerates the start time in environments without any resource contention: on my laptop, the start time went from ~7s to ~3.5s, a ~2x speedup in daily usage.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh
Appendix
Profiling: launching 1 uvicorn server process takes about ~1s (`sky api start`).
Launching 12 uvicorn server processes, each takes about 2~3s (`sky api start --deploy`).