As of May 14, 2024, we have deprecated this solution in favor of github.com/crusoecloud/slurm. This solution will remain available in a read-only mode; however, we are not able to provide ongoing maintenance for this solution.
This is a reference design implementation of SLURM on Crusoe Cloud. It supports multiple partitions and specific nodegroups within those partitions. The cluster also supports a cluster autoscaler that provisions instances on Crusoe based on demand on the cluster. The Terraform script main.tf is the main entry point; it provisions only the headnode, which then uses the SLURM Power Plugin to start additional compute nodes based on the jobs submitted to it.
The Terraform script simply provisions a headnode; the headnode-bootstrap.sh script then performs the following:
- Scans for ephemeral drives. If more than one drive is present, the drives are mounted as RAID0 at /raid0; on instances with a single local NVMe ephemeral drive, the drive is mounted at /nvme. The scratch directory is placed inside that path.
- Sets up an NFS server at /nfs/slurm, which provides the SLURM binaries, libraries, and helper code to the ephemeral compute nodes.
- Downloads and installs the SLURM source tree. The SLURM version is controlled by the bootstrap script to ensure it is supported on Crusoe. Changing the version in the repo is NOT supported unless it has been validated by Crusoe.
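The drive-handling rule described above can be sketched in shell as follows. This is an illustrative sketch, not the contents of headnode-bootstrap.sh: the function name `choose_scratch_mount` is hypothetical, and how the real script detects and counts drives is not shown in this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a count of ephemeral drives to the mount point
# described above. It only prints a path; it does not touch any disks.
choose_scratch_mount() {
  local drive_count="$1"
  if [ "$drive_count" -gt 1 ]; then
    # Several drives: they are striped together as RAID0.
    echo "/raid0"
  elif [ "$drive_count" -eq 1 ]; then
    # A single local NVMe drive is mounted on its own.
    echo "/nvme"
  else
    echo "no ephemeral drives found" >&2
    return 1
  fi
}

# Example: the scratch directory lives inside the chosen mount point.
mount_point="$(choose_scratch_mount 4)"
echo "${mount_point}/scratch"
```

Keeping the decision in one small function makes the two cases easy to check without provisioning an instance.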
The deployment includes support for enroot and Pyxis, which are purpose-built for native container orchestration within SLURM and allow container images to run across the cluster.
All enroot images are stored in the /scratch directory of each node in the cluster. To add credentials for accessing various registries, edit the $HOME/enroot/.credentials file.
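As a rough illustration of editing the credentials file: enroot reads registry credentials in a netrc-style format, one `machine ... login ... password ...` line per registry. The helper name `add_enroot_credential` below is hypothetical, and the registry and token in the usage comment are placeholders rather than values taken from this deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append one registry entry to an enroot credentials file.
# Arguments: credentials file, registry host, login, password or token.
add_enroot_credential() {
  local cred_file="$1" registry="$2" login="$3" secret="$4"
  mkdir -p "$(dirname "$cred_file")"
  # One netrc-style line per registry.
  printf 'machine %s login %s password %s\n' "$registry" "$login" "$secret" >> "$cred_file"
  # Keep the secrets readable by the owner only.
  chmod 600 "$cred_file"
}

# Usage (placeholders, not real credentials):
# add_enroot_credential "$HOME/enroot/.credentials" nvcr.io '$oauthtoken' '<NGC_API_KEY>'
```

Note the single quotes around `$oauthtoken` in the usage line: for NVIDIA's registry that string is the literal login name, so the shell must not expand it.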
The headnode hosts a Telegraf-Prometheus-Grafana (TPG) stack. Each worker runs Telegraf and exposes a /metrics endpoint, which the Prometheus instance on the headnode polls.
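To spot-check that a worker is exposing metrics, something like the following could be run from the headnode. This is a sketch under assumptions: port 9273 is the usual default for Telegraf's Prometheus client output, but the port configured in this deployment is not stated here, and the function names and the `worker-1` hostname are hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: Telegraf's prometheus_client output listens on port 9273
# (its usual default); override METRICS_PORT if this deployment differs.
METRICS_PORT="${METRICS_PORT:-9273}"

# Build the URL that the headnode Prometheus would poll for one worker.
metrics_url() {
  local worker_host="$1"
  echo "http://${worker_host}:${METRICS_PORT}/metrics"
}

# Fetch the first few metric lines from a worker.
check_worker_metrics() {
  curl --silent --fail --max-time 5 "$(metrics_url "$1")" | head -n 5
}

# Usage:
# check_worker_metrics worker-1
```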
Step 1. Install Terraform
On the client machine from which you will deploy the headnode of the cluster, install Terraform following the instructions here.
Step 2. Install the Crusoe Cloud CLI
Install the Crusoe Cloud CLI following these instructions, then set up authentication by creating SSH keys and API tokens.
Step 3. Clone the repo and create a variables.tf file
git clone https://github.com/crusoecloud/crusoe-hpc-slurm.git
cd crusoe-hpc-slurm
Your variables.tf should contain the following:
variable "access_key" {
  description = "Crusoe API Access Key"
  type        = string
  default     = "<ACCESS_KEY>"
}

variable "secret_key" {
  description = "Crusoe API Secret Key"
  type        = string
  default     = "<SECRET_KEY>"
}
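Because the keys are declared as ordinary Terraform input variables, one alternative to writing real keys into the default fields is to pass them through environment variables. Terraform reads any environment variable named TF_VAR_<variable name>, and a value supplied this way takes precedence over the default. A minimal sketch with placeholder values, assuming main.tf consumes these two variables:

```shell
#!/usr/bin/env bash
# Terraform picks up input variables from TF_VAR_<variable name>.
# The values below are placeholders; substitute your own keys.
export TF_VAR_access_key="<ACCESS_KEY>"
export TF_VAR_secret_key="<SECRET_KEY>"
```

Set this way, the keys stay out of any file that might be committed along with the repo.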
Step 4. In the main.tf file, replace the local values with the path to your private SSH key and the string of your public key, and choose an instance type for the headnode:
locals {
  my_ssh_privkey_path    = "/Users/amrragab/.ssh/id_ed25519"
  my_ssh_pubkey          = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIdc3Aaj8RP7ru1oSxUuehTRkpYfvxTxpvyJEZqlqyze [email protected]"
  headnode_instance_type = "a100-80gb.1x"
}
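To collect the two SSH values for the locals block, a small helper along these lines may be convenient. The function name `print_ssh_locals` is hypothetical, and it assumes the public key sits next to the private key with a .pub suffix, which is where ssh-keygen writes it by default.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given a private key path, print the two SSH values
# that the locals block expects, ready to paste into main.tf.
print_ssh_locals() {
  local privkey_path="$1"
  local pubkey_path="${privkey_path}.pub"
  if [ ! -f "$privkey_path" ] || [ ! -f "$pubkey_path" ]; then
    echo "missing key pair at ${privkey_path}" >&2
    return 1
  fi
  printf 'my_ssh_privkey_path="%s"\n' "$privkey_path"
  printf 'my_ssh_pubkey="%s"\n' "$(cat "$pubkey_path")"
}

# Usage:
# ssh-keygen -t ed25519 -f "$HOME/.ssh/id_ed25519"   # only if you have no key yet
# print_ssh_locals "$HOME/.ssh/id_ed25519"
```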
Step 5. Run the Terraform script
terraform init
terraform plan
terraform apply
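A slightly more defensive variant of the same three commands saves the plan to a file and applies exactly that plan, so what gets provisioned matches what was reviewed. This is a sketch rather than part of the repo: the function name `deploy_headnode`, the `TF_BIN` variable, and the plan file name `headnode.tfplan` are all hypothetical.

```shell
#!/usr/bin/env bash
# TF_BIN lets you point at a specific terraform binary; defaults to "terraform".
TF_BIN="${TF_BIN:-terraform}"

# Run init, write the plan to a file, then apply that saved plan.
# Each step runs only if the previous one succeeded.
deploy_headnode() {
  "$TF_BIN" init &&
  "$TF_BIN" plan -out=headnode.tfplan &&
  "$TF_BIN" apply headnode.tfplan
}

# Usage (from the crusoe-hpc-slurm directory):
# deploy_headnode
```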