[Docs] Multiple k8s support #4586

Merged
merged 28 commits into from
Feb 3, 2025
Changes from 11 commits
Commits
8ca950c
Show multiple kubernetes in the optimizer table
Michaelvll Sep 27, 2024
de8a688
Add docs for multiple kubernetes
Michaelvll Sep 27, 2024
55c26c3
Add dynamic update
Michaelvll Sep 27, 2024
92b4ded
format
Michaelvll Sep 27, 2024
161ceff
Merge branch 'master' of github.com:skypilot-org/skypilot into multi-…
Michaelvll Sep 27, 2024
6a48757
Add new button
Michaelvll Sep 27, 2024
867a239
Add to index
Michaelvll Sep 27, 2024
fe4c8c4
fix
Michaelvll Sep 27, 2024
aee21f5
Merge branch 'master' of github.com:skypilot-org/skypilot into multi-…
Michaelvll Jan 17, 2025
254b393
Add figure for multi-k8s docs
Michaelvll Jan 22, 2025
903a7ee
Fix new badge
Michaelvll Jan 22, 2025
b6f6c0f
Update docs/source/reference/kubernetes/multi-kubernetes.rst
Michaelvll Jan 31, 2025
fd097f4
update
Michaelvll Jan 31, 2025
8b53611
Merge branch 'master' of github.com:skypilot-org/skypilot into multi-…
Michaelvll Jan 31, 2025
03d9122
Update docs/source/reference/kubernetes/multi-kubernetes.rst
Michaelvll Jan 31, 2025
c6cf614
update
Michaelvll Jan 31, 2025
e70d837
Merge branch 'multi-k8s-docs-v2' of github.com:skypilot-org/skypilot …
Michaelvll Jan 31, 2025
bd6eeb6
fix
Michaelvll Jan 31, 2025
e358437
fix
Michaelvll Jan 31, 2025
9dd445a
Update docs/source/reference/kubernetes/multi-kubernetes.rst
Michaelvll Feb 3, 2025
9a906a4
rename
Michaelvll Feb 3, 2025
935c086
Update docs/source/reference/kubernetes/multi-kubernetes.rst
Michaelvll Feb 3, 2025
bb0a981
Update docs/source/reference/kubernetes/multi-kubernetes.rst
Michaelvll Feb 3, 2025
34d6bbe
Update docs/source/reference/kubernetes/multi-kubernetes.rst
Michaelvll Feb 3, 2025
2797067
Merge branch 'multi-k8s-docs-v2' of github.com:skypilot-org/skypilot …
Michaelvll Feb 3, 2025
4368e4f
revert to console, fix comment color
Michaelvll Feb 3, 2025
7aac76a
new badge
Michaelvll Feb 3, 2025
eb5b161
use comma instead
Michaelvll Feb 3, 2025
1 change: 1 addition & 0 deletions docs/source/_static/custom.js
@@ -29,6 +29,7 @@ document.addEventListener('DOMContentLoaded', () => {
{ selector: '.toctree-l1 > a', text: 'Many Parallel Jobs' },
{ selector: '.toctree-l1 > a', text: 'Admin Policy Enforcement' },
{ selector: '.toctree-l1 > a', text: 'Using Existing Machines' },
{ selector: '.toctree-l1 > a', text: 'Multiple Kubernetes Clusters' },
];
newItems.forEach(({ selector, text }) => {
document.querySelectorAll(selector).forEach((el) => {
1 change: 1 addition & 0 deletions docs/source/docs/index.rst
@@ -156,6 +156,7 @@ Read the research:
../reservations/reservations
Using Existing Machines <../reservations/existing-machines>
../reference/kubernetes/index
../reference/kubernetes/multi-kubernetes

.. toctree::
:hidden:
1 change: 1 addition & 0 deletions docs/source/images/multi-kubernetes.svg
131 changes: 131 additions & 0 deletions docs/source/reference/kubernetes/multi-kubernetes.rst
@@ -0,0 +1,131 @@
.. _multi-kubernetes:

Multiple Kubernetes Clusters
=============================


SkyPilot allows you to manage dev pods, jobs, and services across multiple Kubernetes clusters through a single pane of glass.

You may have multiple Kubernetes clusters for different:

* **Use cases:** e.g., a production cluster and a development/testing cluster.
* **Regions or clouds:** e.g., US and EU regions; or AWS and Lambda clouds.
* **Accelerators:** e.g., an NVIDIA H100 cluster and a Google TPU cluster.
* **Configurations:** e.g., a small cluster for a single node and a large cluster for multiple nodes.
* **Kubernetes versions:** e.g., to upgrade a cluster from Kubernetes 1.20 to 1.21, you may create a new Kubernetes cluster to avoid downtime or unexpected errors.


.. image:: /images/multi-kubernetes.svg


Set Up Credentials for Multiple Kubernetes Clusters
---------------------------------------------------

To work with multiple Kubernetes clusters, you need to ensure you have the necessary credentials for each cluster.
Check that your local ``~/.kube/config`` file has the credentials for each cluster. For setting up clusters and their credentials,
see :ref:`kubernetes-setup-deploy`.

For example, a ``~/.kube/config`` file may look like this:

.. code-block:: yaml

  apiVersion: v1
  clusters:
  - cluster:
      certificate-authority-data:
      ...
      server: https://xx.xx.xx.xx:45819
    name: my-h100-cluster
  - cluster:
      certificate-authority-data:
      ...
      server: https://yy.yy.yy.yy:45819
    name: my-tpu-cluster
  contexts:
  - context:
      cluster: my-h100-cluster
      user: my-h100-cluster
    name: my-h100-cluster
  - context:
      cluster: my-tpu-cluster
      namespace: my-namespace
      user: my-tpu-cluster
    name: my-tpu-cluster
  current-context: my-h100-cluster
  ...


In this example, we have two Kubernetes clusters: ``my-h100-cluster`` and ``my-tpu-cluster``, and each Kubernetes cluster has a context for it.
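
As a quick sanity check, the context names that can later be passed to ``--region`` can be listed programmatically. The following is a minimal sketch, not part of SkyPilot: the helper names ``context_names`` and ``load_kubeconfig`` are hypothetical, and it assumes ``kubectl`` is on your ``PATH`` so that ``kubectl config view -o json`` can print the merged kubeconfig as JSON.

```python
import json
import subprocess


def context_names(kubeconfig: dict) -> list[str]:
    """Return the context names defined in a parsed kubeconfig."""
    # `contexts` can be missing or null when no context is defined.
    return [ctx["name"] for ctx in kubeconfig.get("contexts") or []]


def load_kubeconfig() -> dict:
    """Load the merged kubeconfig as JSON (assumes kubectl is installed)."""
    result = subprocess.run(
        ["kubectl", "config", "view", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Same shape as the example kubeconfig above, with credentials omitted.
example = {
    "contexts": [
        {"name": "my-h100-cluster",
         "context": {"cluster": "my-h100-cluster", "user": "my-h100-cluster"}},
        {"name": "my-tpu-cluster",
         "context": {"cluster": "my-tpu-cluster", "namespace": "my-namespace",
                     "user": "my-tpu-cluster"}},
    ],
    "current-context": "my-h100-cluster",
}
print(context_names(example))  # ['my-h100-cluster', 'my-tpu-cluster']
```

On a real machine, ``context_names(load_kubeconfig())`` would list the contexts in your own ``~/.kube/config``.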

Point to a Kubernetes Cluster and Launch
Collaborator

I think we need to have a section on adding contexts to allowed_contexts. If allowed_contexts is not set, I believe --region will not work; we will only use the active context.

Collaborator

Might be good to clearly mark section titles:

Step 1 - Set Up Credentials for Multiple Kubernetes Clusters

Step 2 - Configure SkyPilot to access multiple clusters

Then have sections on launching, show-gpus, failover etc.

Collaborator

As the final check, we have the users run sky check kubernetes as a way to verify which contexts are available to SkyPilot.

Collaborator Author

Good point! Updated the doc. PTAL.

-----------------------------------------

SkyPilot borrows the ``region`` concept from clouds to denote a Kubernetes cluster. You can point to a Kubernetes cluster
by passing that cluster's context name to ``--region``.

.. code-block:: console

    # Check the GPUs available in a Kubernetes cluster
    $ sky show-gpus --cloud kubernetes --region my-h100-cluster

    Kubernetes GPUs (Context: my-h100-cluster)
    GPU   QTY_PER_NODE            TOTAL_GPUS  TOTAL_FREE_GPUS
    H100  1, 2, 3, 4, 5, 6, 7, 8  8           8

    Kubernetes per node GPU availability
    NODE_NAME             GPU_NAME  TOTAL_GPUS  FREE_GPUS
    my-h100-cluster-hbzn  H100      8           8
    my-h100-cluster-w5x7  None      0           0

When launching a SkyPilot cluster or task, you can likewise pass a context name to ``--region`` to choose the Kubernetes cluster it runs in.

.. code-block:: console

    $ sky launch --cloud kubernetes --region my-tpu-cluster echo 'Hello World'


.. note::

    When you don't specify a region, SkyPilot will use the current context.


Failover across Multiple Kubernetes Clusters
--------------------------------------------

SkyPilot can fail over across multiple Kubernetes clusters. This is useful when you want to launch a task in whichever cluster has GPUs available.

Unlike with cloud providers, SkyPilot does not fail over across regions (contexts) by default, because different
Kubernetes clusters may serve different purposes.

To enable failover, list the contexts to use under ``kubernetes.allowed_contexts`` in the SkyPilot config file, ``~/.sky/config.yaml`` (see the config YAML spec: :ref:`config-yaml`).

.. code-block:: yaml

    kubernetes:
      allowed_contexts:
        - my-h100-cluster-gke
        - my-h100-cluster-eks

With this global config, SkyPilot will fail over through the Kubernetes clusters in ``allowed_contexts``, in the
order in which they are specified.


.. code-block:: console

    $ sky launch --cloud kubernetes echo 'Hello World'

    Considered resources (1 node):
    ------------------------------------------------------------------------------------------------------------
     CLOUD        INSTANCE           vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE           COST ($)   CHOSEN
    ------------------------------------------------------------------------------------------------------------
     Kubernetes   2CPU--8GB--1H100   2       8         H100:1         my-h100-cluster-gke   0.00          ✔
     Kubernetes   2CPU--8GB--1H100   2       8         H100:1         my-h100-cluster-eks   0.00
    ------------------------------------------------------------------------------------------------------------
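
The ordering behavior described in this section can be illustrated with a small sketch. This is not SkyPilot's implementation. ``candidate_contexts``, ``launch_with_failover``, and the ``try_launch`` callback are hypothetical names, and the logic only mirrors what the text states: an explicit ``--region`` pins a single context, otherwise ``allowed_contexts`` is tried in order, otherwise the current context is used.

```python
from typing import Callable, Optional, Sequence


def candidate_contexts(
    region: Optional[str],
    allowed_contexts: Sequence[str],
    current_context: str,
) -> list[str]:
    """Return the contexts to try, in order (illustrative only)."""
    if region is not None:
        # An explicit region (context name) pins the launch to one cluster.
        return [region]
    if allowed_contexts:
        # Keep the order in which the contexts were listed in the config.
        return list(allowed_contexts)
    # With no region and no allowed_contexts, use the current context.
    return [current_context]


def launch_with_failover(
    candidates: Sequence[str],
    try_launch: Callable[[str], bool],
) -> Optional[str]:
    """Try each context in order; return the first one that succeeds."""
    for context in candidates:
        if try_launch(context):
            return context
    return None


# Simulated availability: the GKE cluster has no free GPUs, the EKS one does.
free_gpus = {"my-h100-cluster-gke": 0, "my-h100-cluster-eks": 8}
candidates = candidate_contexts(
    region=None,
    allowed_contexts=["my-h100-cluster-gke", "my-h100-cluster-eks"],
    current_context="my-h100-cluster-gke",
)
chosen = launch_with_failover(candidates, lambda ctx: free_gpus[ctx] > 0)
print(chosen)  # my-h100-cluster-eks
```

In this simulated run the first listed context is tried first and, because it has no free GPUs, the launch falls through to the second one.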



Dynamically Update Kubernetes Clusters to Use
----------------------------------------------

To see how to dynamically update Kubernetes clusters to use, refer to :ref:`dynamic-kubernetes-contexts-update-policy`.