Skip to content

Template machine learning project using wandb, hydra-zen and submitit on Slurm with Apptainer

License

Notifications You must be signed in to change notification settings

marvinsxtr/ml-project-template

Repository files navigation

ML Project Template

This is a template project for ML experimentation using wandb, hydra-zen, submitit on a Slurm cluster using Docker and Apptainer for containerization.

NOTE: This template is optimized for the specific setup of the ML Group cluster but may be easily adapted to similar settings.

Highlights

  • Python environment in Docker via uv
  • Logging and visualizations via Weights and Biases
  • Reproducibility and modular type-checked configs via hydra-zen
  • Submit Slurm jobs and sweeps directly from Python via submitit
  • No .def or .sh files needed for Apptainer/Slurm

Setup

To be able to run Slurm from within Apptainer, you first have to add the following lines to your .zshrc/.bashrc file:

export APPTAINER_BIND=/opt/slurm-23.2,/opt/slurm,/etc/slurm,/etc/munge,/var/log/munge,/var/run/munge,/lib/x86_64-linux-gnu
export APPTAINERENV_APPEND_PATH=/opt/slurm/bin:/opt/slurm/sbin

You can then use the given Dockerfile to start a shell via

apptainer shell docker://ghcr.io/marvinsxtr/ml-project-template:main

Note: This may take a few minutes on the first run.

WandB Logging

Logging to WandB is optional for running local jobs but mandatory for jobs submitted to the cluster.

WandB is enabled by specifying an API key, the project and entity in a .env file in the root of the repository. You can take the following snippet as a template:

WANDB_API_KEY=
WANDB_ENTITY=
WANDB_PROJECT=

Local

You can run a script locally via

python src/runs/main.py

Hydra should automatically generate a config.yaml in the outputs/<date>/<time>/.hydra folder which you can use to reproduce the same run later. Using the command line arguments, you can override or switch out parts of this config as you will see in the following sections.

To log to WandB, add cfg/wandb=base:

python src/runs/main.py cfg/wandb=base

In order to use WandB in offline mode, add cfg.wandb.mode=offline:

python src/runs/main.py cfg/wandb=base cfg.wandb.mode=offline

Single Job

To run the command as a job in the cluster, run

python src/runs/main.py cfg/job=base

This will automatically add WandB logging for you. See src/configs/runs/base.py to configure the job to your needs.

Distributed Sweep

Run a sweep over two seeds using multiple nodes:

python src/runs/main.py cfg/job=sweep

This will automatically add WandB logging for you. See src/configs/runs/base.py to configure the sweep to your needs.

Docker Image

The Docker image can be built for linux/amd64 by running

docker buildx build -t ml-project-template .

When using VSCode, the Docker image is automatically built when using a Dev Container.

Acknowledgements

This template is based on a previous example project.

About

Template machine learning project using wandb, hydra-zen and submitit on Slurm with Apptainer

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages