
Commit

A Workaround for Launching Non-root Customized Docker Images on RunPod (#4683)

* refactor: document the `docker_config` existence conditions

* feat: support user-specified ssh username for runpod docker

* fix: format and list Resources

* fix: do not copy resources if docker_ssh_username does not exist

* fix: add a space

* style: format

* refactor: naming, with runpod stressed and ssh understated

* docs: also mention this env in task spec

* docs: apply suggestions from code review

Co-authored-by: Tian Xia <[email protected]>

* docs: mention this `env` as a note

* docs: remove issue

---------

Co-authored-by: Tian Xia <[email protected]>
andylizf and cblmemo authored Feb 13, 2025
1 parent 21214ce commit 57137e4
Showing 9 changed files with 89 additions and 5 deletions.
15 changes: 14 additions & 1 deletion docs/source/examples/docker-containers.rst
@@ -10,7 +10,7 @@ SkyPilot can run a container either as a task, or as the runtime environment of

.. note::

Running docker containers is `not supported on RunPod <https://docs.runpod.io/references/faq#can-i-run-my-own-docker-daemon-on-runpod>`_. To use RunPod, either use your docker image (the username should be ``root`` for RunPod) :ref:`as a runtime environment <docker-containers-as-runtime-environments>` or use ``setup`` and ``run`` to configure your environment. See `GitHub issue <https://github.com/skypilot-org/skypilot/issues/3096#issuecomment-2150559797>`_ for more.
Running docker containers is `not supported on RunPod <https://docs.runpod.io/references/faq#can-i-run-my-own-docker-daemon-on-runpod>`_. To use RunPod, either use your docker image :ref:`as a runtime environment <docker-containers-as-runtime-environments>` or use ``setup`` and ``run`` to configure your environment. See `GitHub issue <https://github.com/skypilot-org/skypilot/issues/3096#issuecomment-2150559797>`_ for more.


.. _docker-containers-as-tasks:
@@ -122,6 +122,19 @@ For example, to use the :code:`ubuntu:20.04` image from Docker Hub:
run: |
# Commands to run inside the container
.. note::

  For **non-root** docker images on RunPod, you must manually set the
  :code:`SKYPILOT_RUNPOD_DOCKER_USERNAME` environment variable to match the
  login user of the docker image (set by the last ``USER`` instruction in the
  Dockerfile).

  You can set this environment variable in the :code:`envs` section of your
  task YAML file:

  .. code-block:: yaml

    envs:
      SKYPILOT_RUNPOD_DOCKER_USERNAME: <ssh-user>

  This is a workaround for a RunPod limitation: the login user of a created
  pod cannot be queried, and even ``runpodctl`` hardcodes ``root`` for SSH
  access. On other clouds, the login users of the created docker containers
  are fetched and used automatically.

As another example, here's how to use `NVIDIA's PyTorch NGC Container <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`_:

.. code-block:: yaml
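The note above covers the YAML route. For completeness, here is a minimal Python-API sketch of the same workaround; it is illustrative only, assumes the top-level `sky` package exposes `Task`, `Resources`, and `RunPod`, and uses a placeholder image name and username.

import sky

# Placeholder non-root image; 'myuser' is the login user set by the last
# USER instruction in its Dockerfile.
task = sky.Task(run='nvidia-smi')
task.set_resources(
    sky.Resources(cloud=sky.RunPod(),
                  image_id='docker:myuser/my-image:latest'))

# The workaround from this commit: tell SkyPilot which user to SSH in as.
task.update_envs({'SKYPILOT_RUNPOD_DOCKER_USERNAME': 'myuser'})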
4 changes: 4 additions & 0 deletions docs/source/reference/yaml-spec.rst
@@ -285,6 +285,10 @@ Available fields:
# Values set here can be overridden by a CLI flag:
# `sky launch/exec --env ENV=val` (if ENV is present).
#
  # For a customized non-root docker image on RunPod, you need to set
# `SKYPILOT_RUNPOD_DOCKER_USERNAME` to specify the login username for the
# docker image. See :ref:`docker-containers-as-runtime-environments` for more.
#
# If you want to use a docker image as runtime environment in a private
# registry, you can specify your username, password, and registry server as
# task environment variable. For example:
6 changes: 6 additions & 0 deletions sky/clouds/runpod.py
@@ -177,13 +177,19 @@ def make_deploy_resources_variables(
hourly_cost = self.instance_type_to_hourly_cost(
instance_type=instance_type, use_spot=use_spot)

# default to root
docker_username_for_runpod = (resources.docker_username_for_runpod
if resources.docker_username_for_runpod
is not None else 'root')

return {
'instance_type': instance_type,
'custom_resources': custom_resources,
'region': region.name,
'image_id': image_id,
'use_spot': use_spot,
'bid_per_gpu': str(hourly_cost),
'docker_username_for_runpod': docker_username_for_runpod,
}

def _get_feasible_launchable_resources(
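For orientation, this is the shape of the deploy-variables dict that `make_deploy_resources_variables` now returns and that later feeds `runpod-ray.yml.j2` (see the template change below). The keys come from the return statement above; every value is an illustrative placeholder rather than output from a real launch.

# All values below are made up for illustration; only the keys are real.
deploy_vars = {
    'instance_type': '1x_A100-80GB_SECURE',
    'custom_resources': '{"A100-80GB": 1}',
    'region': 'CA-MTL-1',
    'image_id': 'docker:myuser/my-image:latest',
    'use_spot': False,
    'bid_per_gpu': '1.99',
    # New in this commit; falls back to 'root' when the env var is unset.
    'docker_username_for_runpod': 'myuser',
}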
7 changes: 7 additions & 0 deletions sky/provision/provisioner.py
@@ -450,6 +450,13 @@ def _post_provision_setup(
logger.info(f'{indent_str}{colorama.Style.DIM}{vm_str}{plural} {verb} '
f'up.{colorama.Style.RESET_ALL}')

    # The cluster config guarantees that `docker_config` is not set for
    # docker-native clouds, i.e. clouds that provide docker containers
    # instead of full VMs (e.g., Kubernetes and RunPod), since running
    # docker inside their docker virtualization would need special handling.
    # For those clouds, the docker image settings are applied when
    # provisioning the cluster. See
    # provision/{cloud}/instance.py:get_cluster_info for more details.
if docker_config:
status.update(
ux_utils.spinner_message(
7 changes: 4 additions & 3 deletions sky/provision/runpod/utils.py
@@ -186,7 +186,7 @@ def delete_pod_template(template_name: str) -> None:
runpod.runpod.api.graphql.run_graphql_query(
f'mutation {{deleteTemplate(templateName: "{template_name}")}}')
except runpod.runpod.error.QueryError as e:
logger.warning(f'Failed to delete template {template_name}: {e}'
logger.warning(f'Failed to delete template {template_name}: {e} '
'Please delete it manually.')


@@ -195,8 +195,9 @@ def delete_register_auth(registry_auth_id: str) -> None:
try:
runpod.runpod.delete_container_registry_auth(registry_auth_id)
except runpod.runpod.error.QueryError as e:
logger.warning(f'Failed to delete registry auth {registry_auth_id}: {e}'
'Please delete it manually.')
logger.warning(
f'Failed to delete registry auth {registry_auth_id}: {e} '
'Please delete it manually.')


def _create_template_for_docker_login(
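Both warning fixes above address the same pitfall: adjacent string literals in Python are concatenated with no separator, so without the trailing space the error text and the follow-up sentence run together. A standalone illustration:

# Implicit concatenation of adjacent string literals inserts nothing between
# them, which is why the trailing space matters.
broken = ('Failed to delete template my-template: timeout'
          'Please delete it manually.')
fixed = ('Failed to delete template my-template: timeout '
         'Please delete it manually.')
print(broken)  # ...timeoutPlease delete it manually.
print(fixed)   # ...timeout Please delete it manually.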
26 changes: 26 additions & 0 deletions sky/resources.py
@@ -67,6 +67,7 @@ def __init__(
# Internal use only.
# pylint: disable=invalid-name
_docker_login_config: Optional[docker_utils.DockerLoginConfig] = None,
_docker_username_for_runpod: Optional[str] = None,
_is_image_managed: Optional[bool] = None,
_requires_fuse: Optional[bool] = None,
_cluster_config_overrides: Optional[Dict[str, Any]] = None,
@@ -148,6 +149,9 @@ def __init__(
_docker_login_config: the docker configuration to use. This includes
the docker username, password, and registry server. If None, skip
docker login.
_docker_username_for_runpod: the login username for the docker
containers. This is used by RunPod to set the ssh user for the
docker containers.
_requires_fuse: whether the task requires FUSE mounting support. This
is used internally by certain cloud implementations to do additional
setup for FUSE mounting. This flag also safeguards against using
@@ -234,6 +238,12 @@ def __init__(

self._docker_login_config = _docker_login_config

# TODO(andyl): This ctor param seems to be unused.
# We always use `Task.set_resources` and `Resources.copy` to set the
        # `docker_username_for_runpod`. But to keep consistency with
# `_docker_login_config`, we keep it here.
self._docker_username_for_runpod = _docker_username_for_runpod

self._requires_fuse = _requires_fuse

self._cluster_config_overrides = _cluster_config_overrides
@@ -479,6 +489,10 @@ def cluster_config_overrides(self) -> Dict[str, Any]:
def requires_fuse(self, value: Optional[bool]) -> None:
self._requires_fuse = value

@property
def docker_username_for_runpod(self) -> Optional[str]:
return self._docker_username_for_runpod

def _set_cpus(
self,
cpus: Union[None, int, float, str],
@@ -1065,6 +1079,10 @@ def make_deploy_variables(self, cluster_name: resources_utils.ClusterName,
cloud_specific_variables = self.cloud.make_deploy_resources_variables(
self, cluster_name, region, zones, num_nodes, dryrun)

        # TODO(andyl): Should we print a warning if a user's envs share names
        # with cloud-specific variables that have no effect because the task
        # is not running on that particular cloud?

# Docker run options
docker_run_options = skypilot_config.get_nested(
('docker', 'run_options'),
@@ -1277,6 +1295,9 @@ def copy(self, **override) -> 'Resources':
labels=override.pop('labels', self.labels),
_docker_login_config=override.pop('_docker_login_config',
self._docker_login_config),
_docker_username_for_runpod=override.pop(
'_docker_username_for_runpod',
self._docker_username_for_runpod),
_is_image_managed=override.pop('_is_image_managed',
self._is_image_managed),
_requires_fuse=override.pop('_requires_fuse', self._requires_fuse),
@@ -1438,6 +1459,8 @@ def _from_yaml_config_single(cls, config: Dict[str, str]) -> 'Resources':
resources_fields['labels'] = config.pop('labels', None)
resources_fields['_docker_login_config'] = config.pop(
'_docker_login_config', None)
resources_fields['_docker_username_for_runpod'] = config.pop(
'_docker_username_for_runpod', None)
resources_fields['_is_image_managed'] = config.pop(
'_is_image_managed', None)
resources_fields['_requires_fuse'] = config.pop('_requires_fuse', None)
@@ -1486,6 +1509,9 @@ def add_if_not_none(key, value):
if self._docker_login_config is not None:
config['_docker_login_config'] = dataclasses.asdict(
self._docker_login_config)
if self._docker_username_for_runpod is not None:
config['_docker_username_for_runpod'] = (
self._docker_username_for_runpod)
add_if_not_none('_cluster_config_overrides',
self._cluster_config_overrides)
if self._is_image_managed is not None:
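A minimal sketch of how the new private field behaves on the `Resources` API changed above. Illustrative only: it assumes `sky.Resources` and `sky.RunPod` are importable from the top-level package, that a docker `image_id` plus RunPod validates at construction time, and that `to_yaml_config` is the method containing the `add_if_not_none` helper shown above.

import sky

r = sky.Resources(cloud=sky.RunPod(),
                  image_id='docker:myuser/my-image:latest')
assert r.docker_username_for_runpod is None  # not set via the constructor here

# copy() is how the field actually gets populated (see sky/task.py below).
r2 = r.copy(_docker_username_for_runpod='myuser')
assert r2.docker_username_for_runpod == 'myuser'

# The field round-trips through the YAML config helpers shown above.
assert r2.to_yaml_config()['_docker_username_for_runpod'] == 'myuser'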
2 changes: 2 additions & 0 deletions sky/skylet/constants.py
@@ -110,6 +110,8 @@
DOCKER_SERVER_ENV_VAR,
}

RUNPOD_DOCKER_USERNAME_ENV_VAR = 'SKYPILOT_RUNPOD_DOCKER_USERNAME'

# Commands to disable GPU ECC, which can improve GPU performance by 30% for
# some workloads. This will only be applied when a user specifies
# `nvidia_gpus.disable_ecc: true` in ~/.sky/config.yaml.
25 changes: 25 additions & 0 deletions sky/task.py
@@ -121,6 +121,9 @@ def _check_docker_login_config(task_envs: Dict[str, str]) -> bool:
If any of the docker login env vars is set, all of them must be set.
Returns:
True if there is a valid docker login config in task_envs.
False otherwise.
Raises:
ValueError: if any of the docker login env vars is set, but not all of
them are set.
@@ -168,6 +171,23 @@ def _add_docker_login_config(resources: 'resources_lib.Resources'):
return type(resources)(new_resources)


def _with_docker_username_for_runpod(
resources: Union[Set['resources_lib.Resources'],
List['resources_lib.Resources']],
task_envs: Dict[str, str],
) -> Union[Set['resources_lib.Resources'], List['resources_lib.Resources']]:
docker_username_for_runpod = task_envs.get(
constants.RUNPOD_DOCKER_USERNAME_ENV_VAR)

    # Do not call r.copy() if docker_username_for_runpod is None, to prevent
    # a `DummyResources` instance from becoming a `Resources` instance.
if docker_username_for_runpod is None:
return resources
return (type(resources)(
r.copy(_docker_username_for_runpod=docker_username_for_runpod)
for r in resources))


class Task:
"""Task: a computation to be run on the cloud."""

@@ -582,6 +602,8 @@ def update_envs(
if _check_docker_login_config(self._envs):
self.resources = _with_docker_login_config(self.resources,
self._envs)
self.resources = _with_docker_username_for_runpod(
self.resources, self._envs)
return self

@property
@@ -647,6 +669,9 @@ def set_resources(
resources = {resources}
# TODO(woosuk): Check if the resources are None.
self.resources = _with_docker_login_config(resources, self.envs)
        # This only has an effect on RunPod.
self.resources = _with_docker_username_for_runpod(
self.resources, self.envs)

# Evaluate if the task requires FUSE and set the requires_fuse flag
for _, storage_obj in self.storage_mounts.items():
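Putting the task-level changes together, a hedged end-to-end sketch: setting the environment variable propagates the username into every `Resources` attached to the task, with an effect only on RunPod. The image name and username are placeholders.

import sky

task = sky.Task(run='echo hello')
task.set_resources(
    sky.Resources(cloud=sky.RunPod(),
                  image_id='docker:myuser/my-image:latest'))
assert list(task.resources)[0].docker_username_for_runpod is None

# update_envs() re-runs _with_docker_username_for_runpod over the resources.
task.update_envs({'SKYPILOT_RUNPOD_DOCKER_USERNAME': 'myuser'})
assert list(task.resources)[0].docker_username_for_runpod == 'myuser'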
2 changes: 1 addition & 1 deletion sky/templates/runpod-ray.yml.j2
@@ -25,7 +25,7 @@ provider:
{%- endif %}

auth:
ssh_user: root
ssh_user: {{docker_username_for_runpod}}
ssh_private_key: {{ssh_private_key}}

available_node_types:
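To see what the one-line template change produces, here is a small rendering sketch using jinja2 (already a SkyPilot dependency); only the `auth` block is reproduced.

import jinja2

auth_block = 'auth:\n  ssh_user: {{docker_username_for_runpod}}\n'
template = jinja2.Template(auth_block)

# Default case (no SKYPILOT_RUNPOD_DOCKER_USERNAME set): same as the old
# hardcoded value.
print(template.render(docker_username_for_runpod='root'))
# auth:
#   ssh_user: root

# Non-root image whose last USER instruction is 'myuser':
print(template.render(docker_username_for_runpod='myuser'))
# auth:
#   ssh_user: myuser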
