# Arceus

*Developed alongside Ishaan Dey, Rajan Agarwal, and Josh Yan*

Hi everyone! Welcome to our distributed training framework, built from first principles and leveraging model parallelism to optimize training on Apple M-series clusters. Arceus now supports model parallelism across multiple devices on the same local network!

## Run

To run this version of Arceus, please first initialize a virtual environment and install all required modules:

```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

To run Arceus across multiple machines on your local network, first initialize the Flask server to handle device registration and training orchestration:

```
python -m api.server
```
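Before registering devices from other machines, it can help to confirm the server is reachable from each of them. Below is a minimal sketch of such a check; it assumes the server listens on port 4000, the port used in the API examples that follow.

```python
# Quick reachability check for the orchestration server.
# Assumption: the server listens on port 4000 (the port used in the curl examples below).
import socket

def server_is_up(host: str, port: int = 4000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the server can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print(server_is_up("127.0.0.1"))  # replace with the Flask server's LAN IP
```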

## Training Options

### 1. Distributed Neural Network Training

First, register a training job through the API, specifying the model and dataset configurations:

```
curl -X POST http://localhost:4000/api/jobs \
     -H "Content-Type: application/json" \
     -d '{
       "model_config": {
         "layer_sizes": [784, 128, 64, 10]
       },
       "dataset_config": {
         "name": "MNIST",
         "batch_size": 256,
         "val_batch_size": 1000,
         "normalize": {
           "mean": [0.1307],
           "std": [0.3081]
         }
       }
     }'
```

This returns a response with a job ID.
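If you prefer to drive the API from Python rather than curl, the same registration can be scripted with the requests library. This is only a sketch: it assumes requests is installed, and the exact name of the job-ID field in the response is an assumption, so inspect the returned JSON to confirm it.

```python
# Register a neural-network training job via the API (equivalent to the curl call above).
import requests

API_URL = "http://localhost:4000"  # replace with the Flask server's address

payload = {
    "model_config": {"layer_sizes": [784, 128, 64, 10]},
    "dataset_config": {
        "name": "MNIST",
        "batch_size": 256,
        "val_batch_size": 1000,
        "normalize": {"mean": [0.1307], "std": [0.3081]},
    },
}

resp = requests.post(f"{API_URL}/api/jobs", json=payload)
resp.raise_for_status()
print(resp.json())  # contains the job ID; the field name (e.g. "job_id") is assumed
```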

Now, register your devices. You need at least one device, and at most as many devices as your model has layers. On each of your devices, run:

```
python -m nn.device_client --api-url <flask-server-ip> --job-id <job-id>
```

Then, initialize your devices with the layers of the model by specifying the job:

```
curl -X POST http://<flask-server-ip>/api/network/initialize/<job-id>
```

Finally, train with the following request:

```
curl -X POST http://<flask-server-ip>/api/network/train/<job-id> \
     -H "Content-Type: application/json" \
     -d '{"epochs": 10, "learning_rate": 0.1}'
```

### 2. Transformer Model Training

To train a transformer model (GPT), first register a training job with the transformer configuration:

```
curl -X POST http://localhost:4000/api/jobs \
     -H "Content-Type: application/json" \
     -d '{
       "model_config": {
         "type": "transformer"
       },
       "dataset_config": {
         "source": "gpt/data.txt"
       }
     }'
```

Then start training directly (no device registration or initialization needed):

```
curl -X POST http://<flask-server-ip>/api/network/train/<job-id> \
     -H "Content-Type: application/json" \
     -d '{}'
```

The transformer model will use the hyperparameters defined in gpt/model.py and train on the data from the specified source file.
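The transformer path can likewise be driven end to end from Python. Again a sketch: requests is assumed to be installed, and the job-ID field name in the registration response is an assumption.

```python
# Register a transformer job and start training (no device registration or initialization needed).
import requests

API_URL = "http://localhost:4000"  # replace with the Flask server's address

resp = requests.post(
    f"{API_URL}/api/jobs",
    json={
        "model_config": {"type": "transformer"},
        "dataset_config": {"source": "gpt/data.txt"},
    },
)
resp.raise_for_status()
job_id = resp.json().get("job_id")  # field name assumed; check the actual response

train = requests.post(f"{API_URL}/api/network/train/{job_id}", json={})
train.raise_for_status()
print(train.json())
```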

## Common Messages

You will see a message on your server when training starts:

```
127.0.0.1 - - [26/Nov/2024 10:40:56] "POST /api/network/train HTTP/1.1" 200 -

Starting distributed training across devices...
Learning rate: 0.1
Quantization bits: 8
Device: mps
```

If that didn't work and you got an SSL certificate error, run the following (for the macOS Python 3.11 installation) and then try again:

```
cd /Applications/Python\ 3.11/
./Install\ Certificates.command
```