gaudi support training #330

JamesKunstle · 2024-11-12T07:49:00Z

reorg main function; add 'hpu' option w/o implementation
implement training for HPU

JamesKunstle · 2024-11-12T07:50:46Z

Builds on changes from #329. Still needs to be tested.

The `main` training function is broken down into smaller functions for readability and testability. `WORLD_SIZE`, `LOCAL_RANK`, and `RANK` are extracted into global constants, since they are set by the multiprocessing launcher that manages the run (torchrun, in our case). HPU configuration options and checks are also added.

Signed-off-by: James Kunstle <[email protected]>

HPU cards (Gaudi 2 and 3) can't use the Accelerate code path. This contribution adds the training setup and loop for FSDP-only training, with the minor modifications that HPUs specifically require.

Signed-off-by: James Kunstle <[email protected]>
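To make the split between the two code paths concrete, here is a rough sketch of how a device option could select between them. Every name in it (`TrainingPath`, `select_training_path`, `SUPPORTED_DEVICES`) is hypothetical and not taken from this PR. Pairing `cuda` with the Accelerate path and the `nccl` backend is likewise an assumption; `hccl` is the collective backend name that Gaudi uses with `torch.distributed`.

```python
from dataclasses import dataclass

# Hypothetical set of accepted values for the device option.
SUPPORTED_DEVICES = ("cuda", "hpu")


@dataclass(frozen=True)
class TrainingPath:
    """Describes which training setup a device should use (illustrative only)."""

    device: str
    use_accelerate: bool  # False means the FSDP-only setup and loop
    dist_backend: str     # backend name passed to torch.distributed


def select_training_path(device: str) -> TrainingPath:
    """Map a device option to a training path, rejecting unknown devices."""
    if device == "cuda":
        # Assumption: the existing path keeps using Accelerate.
        return TrainingPath("cuda", use_accelerate=True, dist_backend="nccl")
    if device == "hpu":
        # Gaudi cards skip Accelerate and train with FSDP directly.
        return TrainingPath("hpu", use_accelerate=False, dist_backend="hccl")
    raise ValueError(
        f"unsupported device {device!r}; expected one of {SUPPORTED_DEVICES}"
    )
```

Checking the option up front means an unsupported device fails before any process group or model is created.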

mergify bot added the ci-failure label Nov 12, 2024

This was linked to issues Nov 12, 2024

Intel Gaudi Multi-GPU, single-node training #207

Closed

Implement multi-hpu training with FSDP for Gaudi 3 cards #294

Open

JamesKunstle force-pushed the gaudi-support-training branch from c79c2e1 to f9192d1 Compare November 12, 2024 22:57

mergify bot added ci-failure and removed ci-failure labels Nov 12, 2024

JamesKunstle added 2 commits November 12, 2024 15:06

implement training for HPU

aa192e8

HPU cards (Gaudi 2 and 3) can't use the Accelerate code path. This contribution adds the training setup and loop for FSDP-only training, with the minor modifications that HPUs specifically require.

Signed-off-by: James Kunstle <[email protected]>

JamesKunstle force-pushed the gaudi-support-training branch from f9192d1 to aa192e8 Compare November 12, 2024 23:06
