Soperator – Kubernetes Operator for Slurm


Run Slurm in Kubernetes and enjoy the benefits of both systems. You can learn more about Soperator, its prerequisites, and architecture in the Medium article.

Slurm in Kubernetes

💡 Rationale

Both Slurm and Kubernetes can serve as workload managers for distributed model training and high-performance computing (HPC) in general.

Each of these systems has its strengths and weaknesses, and the trade-offs between them are significant. Slurm offers advanced and effective scheduling, granular hardware control, and accounting, but lacks universality. On the other hand, Kubernetes can be used for purposes other than training (e.g. inference) and provides good auto-scaling and self-healing capabilities. For a detailed comparison, see the Nebius AI blog post.

Unfortunately, there has been no straightforward way to combine the benefits of both solutions. And since many big tech companies use Kubernetes as their default infrastructure layer without supporting a dedicated model training system, some ML engineers don't even have a choice.

That's why we decided to marry these systems, taking a "Kubernetes-first" approach. We implemented a Kubernetes operator, which is a software component that runs and manages Slurm clusters as Kubernetes resources.

Solution Architecture

This allowed us to reuse the autoscaling and self-healing of Kubernetes in Slurm and to implement some unique features, while keeping the usual way of interacting with Slurm.

⭐ Features

Shared root filesystem

When users interact with a Slurm cluster, they see a shared filesystem as their root "/" directory. With this approach, you can keep using Slurm in a familiar way (e.g. you don't need to run all jobs in containers).

It also means that you don't have to keep nodes in an identical state. Changes that you make on one node (e.g. installing new software packages, creating Linux users, writing job outputs, or downloading datasets) are immediately reflected on all other nodes.

GPU health checks

If your Kubernetes cluster contains NVIDIA GPUs, the operator will perform regular GPU health checks. If a Slurm node fails a health check, the operator “drains” it, so that new jobs are not scheduled on it.
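
To see which nodes are currently drained, you can ask Slurm directly. The following is a minimal sketch and not part of Soperator: it parses the output of `sinfo -N -h -o "%N %T"` (one node name and state per line) and returns the nodes whose state starts with `drain`. The helper names `drained_nodes` and `query_drained_nodes` are hypothetical.

```python
import subprocess


def drained_nodes(sinfo_output: str) -> list[str]:
    """Return sorted names of nodes whose state is drained or draining.

    Expects text produced by: sinfo -N -h -o "%N %T"
    (node-oriented, no header, node name followed by its state).
    """
    nodes = set()
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        # States may carry suffixes such as "*" (node not responding).
        if state.lower().startswith("drain"):
            nodes.add(name)
    return sorted(nodes)


def query_drained_nodes() -> list[str]:
    """Run sinfo (e.g. on a login node) and return the drained nodes."""
    out = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        check=True, capture_output=True, text=True,
    ).stdout
    return drained_nodes(out)
```

Keeping the parsing separate from the `sinfo` call makes the logic easy to check without a running cluster.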

Easy scaling

Each stage of building an ML product has its own requirements for computing resources.

Soperator allows Slurm to reuse the unique Kubernetes feature of scaling automatically according to current needs. You can simply change a single value in the YAML manifest and watch the cluster change in size.

High availability

Kubernetes comes with some level of HA out of the box. If a pod or container, such as a Slurm controller, fails, Kubernetes recreates it.

Soperator takes this even further, continuously bringing the entire cluster up to the desired state.

Isolation of user actions

All user actions are isolated within a dedicated container-like environment, so that an action can't break the Slurm cluster itself by accident. This defines a clear boundary between operator and user responsibility.

Accounting

Slurm's accounting system records detailed job information such as:

  • CPU and memory consumption
  • User and group identities
  • Job start/end times
  • Resource requests and allocations

This helps cluster administrators and users monitor resource utilization, enforce quotas, and generate usage reports for performance optimization or billing purposes.
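
As an illustration, the records listed above can be pulled with Slurm's `sacct` command and turned into dictionaries for reporting. This is a sketch under assumptions, not Soperator functionality: the helper names are hypothetical, the field list is one possible selection of standard `sacct` fields, and it assumes accounting is enabled for the cluster.

```python
import subprocess

# Standard sacct field names; adjust the selection as needed.
FIELDS = ["JobID", "User", "Group", "Start", "End",
          "AllocCPUS", "ReqMem", "MaxRSS", "State"]


def parse_sacct(text: str) -> list[dict[str, str]]:
    """Parse `sacct --parsable2` output (pipe-delimited, header row first)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, ln.split("|"))) for ln in lines[1:]]


def jobs_since(start: str) -> list[dict[str, str]]:
    """Return accounting records for jobs started since `start`.

    `start` is passed to --starttime, e.g. an ISO date like "2024-10-01".
    """
    out = subprocess.run(
        ["sacct", "--parsable2",
         f"--format={','.join(FIELDS)}",
         f"--starttime={start}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_sacct(out)
```

Note that `sacct` reports job steps (such as `42.batch`) as separate rows, and some fields are empty on some rows; the parser keeps them as empty strings.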

❌ Limitations

  • GPUs are required. Although support for CPU-only clusters or partitions seems pretty straightforward, we haven't implemented it yet.
  • Single-partition clusters. Slurm's ability to split clusters into several partitions isn't supported yet.
  • Software versions. The list of software versions we currently support is quite short.
    • Linux: Ubuntu 20.04 and 22.04.
    • Slurm: versions 23.11.6 and 24.05.3.
    • CUDA: version 12.2.2.
    • Kubernetes: >= 1.28.
    • Versions of some preinstalled software packages can't be changed.

🚀 Installation

The steps required to deploy Soperator to your Kubernetes cluster depend on whether you are using Kubernetes on premises or in a cloud.

Nebius AI

For Nebius AI, we provide a Terraform recipe that creates all the required resources for you.

Everything specific to Nebius AI is contained in a separate repository: nebius/soperator-terraform.

Other clouds and on-premises

Important

When using Soperator, it is important that the CNI supports preserving the client source IP. If kube-proxy is configured in IPVS mode, or if you're using CNI plugins like kube-router or Antrea Proxy, the operator will not work. Soperator has been tested with the Cilium network plugin running in kube-proxy replacement mode.
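
One of these conditions, kube-proxy running in IPVS mode, can be checked before installation. The sketch below is an assumption-laden helper, not part of Soperator: it expects the text of a `KubeProxyConfiguration` document, which kubeadm-style clusters typically store in the `kube-proxy` ConfigMap in the `kube-system` namespace under the `config.conf` key (other distributions may keep it elsewhere). The function name is hypothetical.

```python
import re


def kube_proxy_mode(config_text: str) -> str:
    """Extract the top-level `mode` from a KubeProxyConfiguration document.

    Returns an empty string when the field is absent or empty, which means
    kube-proxy falls back to its platform default.
    """
    match = re.search(r'^mode:\s*"?([A-Za-z]*)"?\s*$', config_text,
                      flags=re.MULTILINE)
    return match.group(1) if match else ""


def is_ipvs(config_text: str) -> bool:
    """True if kube-proxy is explicitly configured for IPVS mode."""
    return kube_proxy_mode(config_text).lower() == "ipvs"
```

If `is_ipvs` returns `True`, the cluster falls into the unsupported configuration described above.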

In general, you need to follow these steps:

  1. Decide on the shared storage technology you would like to use. At least one shared filesystem is necessary, because it will store the environment shared by Slurm nodes. The only thing Soperator requires is the PVC name. Consider using NFS as the simplest option, or something more advanced like OpenEBS or GlusterFS.
  2. Install the NVIDIA GPU Operator.
  3. If you use InfiniBand, install the NVIDIA Network Operator.
  4. Install Soperator by applying the soperator Helm chart.
  5. Create a Slurm cluster by applying the slurm-cluster Helm chart. The namespace must have the same name as the Slurm cluster.
  6. Wait until the slurm.nebius.ai/SlurmCluster resource becomes Available.
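
The final step above can be scripted. The sketch below rests on two assumptions that this README does not confirm: that the resource can be fetched with `kubectl get slurmcluster.slurm.nebius.ai`, and that its status reports availability through either a `status.phase` string or a `status.conditions` entry of type `Available`. Both helper names are hypothetical.

```python
import json
import subprocess


def is_available(resource: dict) -> bool:
    """Return True if the resource's status reports it as Available.

    ASSUMPTION: the status layout is not documented here. This checks two
    common Kubernetes conventions: a `status.phase` string and a
    `status.conditions` entry of type "Available" with status "True".
    """
    status = resource.get("status") or {}
    if status.get("phase") == "Available":
        return True
    return any(
        cond.get("type") == "Available" and cond.get("status") == "True"
        for cond in status.get("conditions") or []
    )


def fetch_cluster(name: str, namespace: str) -> dict:
    """Fetch the SlurmCluster resource as a dictionary via kubectl."""
    # ASSUMPTION: the CRD is addressable as slurmcluster.slurm.nebius.ai.
    out = subprocess.run(
        ["kubectl", "get", "slurmcluster.slurm.nebius.ai", name,
         "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling `is_available(fetch_cluster(name, namespace))` in a loop with a short sleep gives a simple readiness wait.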

Warning

Although Soperator should in principle be compatible with any Kubernetes installation, we haven't tested it anywhere outside Nebius AI, so it's likely that something won't work out of the box or will require additional configuration. If you run into problems, create an issue in this repository, and we will help you install Soperator on your Kubernetes cluster and update these docs.

📈 Future plans

  • 💡 On-demand nodes. The easy scaling can be improved further by provisioning new Kubernetes nodes only when there are queued jobs that need them.
  • 💡 Network topology-aware job scheduling. Thanks to the Slurm topology feature, we can support detailed configuration of the network topology for more efficient scheduling.
  • 💡 Automatic replacement of poorly performing nodes. Currently, Soperator just drains the Slurm nodes that have problems. We plan to replace such nodes automatically.
  • 💡 More system checks. Soperator only checks GPUs at the moment, but there are more things to check: software issues, storage performance, network connectivity, etc. So we're going to continue adding new checks.
  • 💡 Jail backups. This means backing up the shared storage to improve durability.
  • 💡 Automatic external checkpointing. We are considering using NVIDIA's cuda-checkpoint for dumping and resuming job processes externally.

📚 Documentation

The detailed documentation is located in the docs directory of this repository.

🤬 Feedback

If you like this project, star it on GitHub. That shows us the community is interested in it, and we will continue developing it further, openly and publicly.

If you fail to install Soperator on your Kubernetes cluster or encounter any other issue with it, create a GitHub issue and describe your problem in detail. We will try to help.

Note

This project is very new and quite raw: it was started in May 2024. Although it already works stably on Nebius AI, this may not be the case for other clouds.

🤝 Contribution

Unfortunately, at the moment we don't have development docs for outside developers who want to participate in this project. If you are interested in contributing, create a GitHub issue, and we'll figure something out.

Also, take a look at our list of future plans. The tasks we are currently working on are marked there, and one of them may be exactly what you need.

🏛 License

Soperator itself is licensed under Apache 2.0, a permissive free software license that allows you to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions under specific terms.

Please note that various pieces of software it installs in your cluster may have other licenses.