This benchmark uses the DLIO tool from Argonne National Laboratory to simulate writing a checkpoint from training a large language model.
The model definition is "moderately close" to a real checkpoint: it uses torch.save() to write tensors of the proper size to the selected storage.
Each checkpoint comes in at around 440 GB and is made up of 352 files, which represent the model state and the state of each of the model's 126 layers.
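The exact file layout is driven by DLIO's checkpoint configuration, but the write path itself is simple: each rank serializes its shard of state with torch.save(). The sketch below shows that pattern only; the function name, file names, and tensor shapes are illustrative and much smaller than what the benchmark actually writes.

import torch

# Illustrative sketch: persist one rank's shard of a checkpoint with
# torch.save(). File names and tensor shapes here are hypothetical.
def write_checkpoint_shard(rank, out_dir):
    model_state = {"embedding.weight": torch.empty(4096, 4096)}
    layer_state = {f"layer_{i}.weight": torch.empty(4096, 4096) for i in range(2)}
    torch.save(model_state, f"{out_dir}/model_state_rank{rank}.pt")
    torch.save(layer_state, f"{out_dir}/layer_state_rank{rank}.pt")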
Install DLIO from https://github.com/wvaske/dlio_benchmark. Be sure to run "pip install -r requirements.txt" and "pip install dlio_benchmark".
Install an MPI library that provides mpirun (for example, Open MPI).
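DLIO launches one process per rank through MPI, so it is worth confirming that mpirun and the Python MPI bindings work together before starting the server. A quick check, assuming mpi4py is installed (DLIO's requirements typically provide it):

# mpi_check.py -- sanity check that mpirun can launch multiple Python ranks
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")

Run it with "mpirun -n 8 python mpi_check.py"; you should see eight distinct ranks print.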
Start the RPC server.
chmod +x checkpoint_server.py
# Environment Variables may not be required and are dependent on your
# operating environment
OMPI_ALLOW_RUN_AS_ROOT=1 OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
mpirun -n 8 ./checkpoint_server.py --framework dlio --model llama3-405b \
--checkpoint-location <checkpoint_location>
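Conceptually, the server keeps the MPI ranks resident with the model tensors defined and waits for a client request; when one arrives, every rank writes its shard, and the server replies once the checkpoint is complete. The sketch below is a hypothetical, much-simplified version of that loop (a plain socket is used here for illustration; the real checkpoint_server.py, its RPC protocol, and its file layout may differ):

import os
import socket
import torch
from mpi4py import MPI

CHECKPOINT_DIR = "/tmp/checkpoint"   # hypothetical; stands in for --checkpoint-location
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def write_shard(step):
    # Placeholder tensor; the benchmark writes tensors sized for llama3-405b.
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    torch.save({"weights": torch.empty(1024, 1024)},
               f"{CHECKPOINT_DIR}/step{step}_rank{rank}.pt")

if rank == 0:
    # Rank 0 listens for client requests and broadcasts the step number.
    srv = socket.socket()
    srv.bind(("0.0.0.0", 8080))
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        step = int(conn.recv(64).decode() or 0)
        comm.bcast(step, root=0)        # tell every rank to checkpoint
        write_shard(step)
        comm.Barrier()                  # wait until all shards are on disk
        conn.sendall(b"done")
        conn.close()
else:
    while True:
        step = comm.bcast(None, root=0)
        write_shard(step)
        comm.Barrier()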
When the server prints the model_state and layer_state information, it is ready for execution.
The following command runs the checkpoint client; each step writes one complete checkpoint (the example below runs 55 steps).
chmod +x checkpoint_client.py
# Default IP, port and steps are localhost, 8080, and 1
./checkpoint_client.py --server-ip 127.0.0.1 --port 8080 --num-steps 55
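On the client side, each step amounts to one request to the server and a wait for its acknowledgement that the checkpoint finished. A hypothetical client matching the server sketch above (the real checkpoint_client.py and its protocol may differ):

import socket

def run_steps(server_ip="127.0.0.1", port=8080, num_steps=1):
    for step in range(num_steps):
        # One request per step; the server replies once the checkpoint is written.
        with socket.create_connection((server_ip, port)) as conn:
            conn.sendall(str(step).encode())
            print(f"step {step}: {conn.recv(64).decode()}")

if __name__ == "__main__":
    run_steps(num_steps=55)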