Orchestration

There are many container/accelerator orchestration solutions, many of which are open source.

So far I have been working with SLURM:

  • SLURM - Simple Linux Utility for Resource Management, which you will find in most HPC environments and which most cloud providers typically support. It has been around for more than two decades.

The other most popular orchestrator is Kubernetes.

Here are various other less popular, but still very mighty, orchestration solutions:

  • dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
  • SkyPilot is a framework for running AI and batch workloads on any infra, offering unified execution, high cost savings, and high GPU availability.
  • OpenHPC provides a variety of common, pre-built ingredients required to deploy and manage an HPC Linux cluster including provisioning tools, resource management, I/O clients, runtimes, development tools, containers, and a variety of scientific libraries.
  • run.ai was acquired by NVIDIA and is planned to be open sourced.
  • Docker Swarm is a container orchestration tool.
  • IBM Platform Load Sharing Facility (LSF) Suites is a workload management platform and job scheduler for distributed high performance computing (HPC).