-
Notifications
You must be signed in to change notification settings - Fork 583
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[API Server] Fix admin policy enforcement on validate
and optimize
#4820
Changes from 10 commits
616bfbf
57a6ab2
714e65c
6423aad
5010494
0b65617
d37644b
c5164dc
41f4d26
aa33534
20828be
6362328
6f94e3b
78169ce
0d9da0d
478b2a4
37cd372
0c7c078
07f2f3d
2382722
ff27bcc
6691dc1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,6 +12,7 @@ | |
|
||
import pydantic | ||
|
||
from sky import admin_policy | ||
from sky import serve | ||
from sky import sky_logging | ||
from sky import skypilot_config | ||
|
@@ -116,12 +117,14 @@ class CheckBody(RequestBody): | |
class ValidateBody(RequestBody): | ||
"""The request body for the validate endpoint.""" | ||
dag: str | ||
request_options: admin_policy.RequestOptions | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can this be made optional and we can error out on the server side if There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep, made optional now. When None, we will now follow the default behavior in (Current behavior is it will silently accept request_options as None, but we should update that behavior in a future PR. That's a deeper change since |
||
|
||
|
||
class OptimizeBody(RequestBody): | ||
"""The request body for the optimize endpoint.""" | ||
dag: str | ||
minimize: common_lib.OptimizeTarget = common_lib.OptimizeTarget.COST | ||
request_options: admin_policy.RequestOptions | ||
|
||
def to_kwargs(self) -> Dict[str, Any]: | ||
# Import here to avoid requirement of the whole SkyPilot dependency on | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,10 +24,8 @@ | |
from sky import check as sky_check | ||
from sky import clouds | ||
from sky import core | ||
from sky import exceptions | ||
from sky import execution | ||
from sky import global_user_state | ||
from sky import optimizer | ||
from sky import sky_logging | ||
from sky.clouds import service_catalog | ||
from sky.data import storage_utils | ||
|
@@ -44,7 +42,6 @@ | |
from sky.usage import usage_lib | ||
from sky.utils import common as common_lib | ||
from sky.utils import common_utils | ||
from sky.utils import dag_utils | ||
from sky.utils import status_lib | ||
|
||
# pylint: disable=ungrouped-imports | ||
|
@@ -254,24 +251,26 @@ async def list_accelerator_counts( | |
|
||
|
||
@app.post('/validate') | ||
async def validate(validate_body: payloads.ValidateBody) -> None: | ||
async def validate(request: fastapi.Request, | ||
validate_body: payloads.ValidateBody) -> None: | ||
"""Validates the user's DAG.""" | ||
# TODO(SKY-1035): validate if existing cluster satisfies the requested | ||
# resources, e.g. sky exec --gpus V100:8 existing-cluster-with-no-gpus | ||
|
||
# We make the validation a separate request as it may require expensive | ||
# network calls if an admin policy is applied. | ||
# TODO: Our current launch process is broken down into three calls: | ||
romilbhardwaj marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# validate, optimize, and launch. This requires us to apply the admin policy | ||
# in each step, which may be an expensive operation. We should consolidate | ||
# these into a single call or have a TTL cache for (task, admin_policy) | ||
# pairs. | ||
logger.debug(f'Validating tasks: {validate_body.dag}') | ||
try: | ||
dag = dag_utils.load_chain_dag_from_yaml_str(validate_body.dag) | ||
for task in dag.tasks: | ||
# Will validate workdir and file_mounts in the backend, as those | ||
# need to be validated after the files are uploaded to the SkyPilot | ||
# API server with `upload_mounts_to_api_server`. | ||
task.validate_name() | ||
task.validate_run() | ||
for r in task.resources: | ||
r.validate() | ||
except Exception as e: # pylint: disable=broad-except | ||
raise fastapi.HTTPException( | ||
status_code=400, detail=exceptions.serialize_exception(e)) from e | ||
executor.schedule_request(request_id=request.state.request_id, | ||
request_name='validate', | ||
request_body=validate_body, | ||
ignore_return_value=True, | ||
func=core.validate_dag, | ||
schedule_type=requests_lib.ScheduleType.SHORT) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We need to double check how much additional overhead we add to a normal launch with this modification. It may not worth it to slow down all the launch requests just for the admin policy, which is a less common codepath : ) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for flagging, it does add a significant penalty:
I'm going to revert this change in favor of keeping the codepath fast for the common case when there's no admin policy. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Benchmark script:
|
||
|
||
|
||
@app.post('/optimize') | ||
|
@@ -283,7 +282,7 @@ async def optimize(optimize_body: payloads.OptimizeBody, | |
request_name='optimize', | ||
request_body=optimize_body, | ||
ignore_return_value=True, | ||
func=optimizer.Optimizer.optimize, | ||
func=core.optimize, | ||
schedule_type=requests_lib.ScheduleType.SHORT, | ||
) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we emphasize that these are only required when
admin_policy
is specified, or more aggressively we just make this:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep good idea, I was also thinking about it but wasn't sure if we want to expose
admin_policy.RequestOptions
in the SDK to end users. Should be ok.