There are many container/accelerator orchestration solutions, many of which are open source.
So far I have been working with SLURM:
- SLURM (Simple Linux Utility for Resource Management), which you're almost guaranteed to find in HPC environments and which most cloud providers typically support. It has been around for more than two decades.
The other most popular orchestrator is Kubernetes:
- Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. Here is a good comparison between SLURM and K8s.
Here are various other less popular but still very capable orchestration solutions:
- dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
- SkyPilot is a framework for running AI and batch workloads on any infra, offering unified execution, high cost savings, and high GPU availability.
- OpenHPC provides a variety of common, pre-built ingredients required to deploy and manage an HPC Linux cluster including provisioning tools, resource management, I/O clients, runtimes, development tools, containers, and a variety of scientific libraries.
- run.ai was acquired by NVIDIA and is planned to be open sourced soon.
- Docker Swarm is a container orchestration tool.
- IBM Platform Load Sharing Facility (LSF) Suites is a workload management platform and job scheduler for distributed high performance computing (HPC).