enable non-distributed training and MPS support #769

peter-sk · 2024-12-21T09:26:36Z

OLMo currently supports only DDP or FSDP distributed training on CUDA accelerators.

This PR does two things:

- It adds a single-accelerator mode (`distributed_strategy: single`) through a simple wrapper module.
- It adds support for MPS, allowing development and debugging on ARM64 Mac machines.
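To illustrate the kind of device choice a single-accelerator mode has to make, here is a minimal sketch. It is not the code from this PR: the helper name `select_device`, its parameters, and the CUDA-then-MPS-then-CPU preference order are all assumptions made for illustration. The availability flags are passed in explicitly so the logic can be exercised without any accelerator.

```python
from typing import Optional


def select_device(
    cuda_available: bool,
    mps_available: bool,
    requested: Optional[str] = None,
) -> str:
    """Pick a device name for a single-accelerator run (hypothetical helper)."""
    # CPU is always treated as available; the other two depend on the host.
    available = {"cuda": cuda_available, "mps": mps_available, "cpu": True}

    if requested is not None:
        # An explicit request is honoured only if that backend is usable.
        if not available.get(requested, False):
            raise ValueError(f"requested device {requested!r} is not available")
        return requested

    # Assumed preference order: CUDA first, then MPS (Apple silicon), then CPU.
    for name in ("cuda", "mps"):
        if available[name]:
            return name
    return "cpu"
```

In a PyTorch program the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the returned name can be passed to `torch.device(...)`.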

The primary motivation is to be able to make and test code changes without tying up GPU resources. In addition, tiny models can actually be trained on a sufficiently equipped system.

peter-sk added 3 commits December 21, 2024 10:18

enabling single accelerator training (e.g. for the MPS backend on Macs)

bb9bd69

removed code duplication

4b32b63

backward compatibility for checkpoints

311286c

peter-sk mentioned this pull request Dec 21, 2024

Single Accelerator training and MPS support (PR #769) #770

Closed

peter-sk added 2 commits December 21, 2024 11:15

reversed logic to ensure checkpointing is unsharded for single accelerator

539f64a

should probably do this

d8f68ea

aman-17 assigned dirkgr and aman-17 Jan 10, 2025
