
Small fixes when resuming training #245

Merged: 6 commits into main, Nov 21, 2024

Conversation

NouamaneTazi (Member) commented on Nov 19, 2024

  • Adapted the config so that training can be resumed with nanotron.
  • Made sure optimizer states reshard correctly when going from TP=1 to TP>1.
  • Fixed the optimizer states dtype when loading from a checkpoint: load directly in fp32 instead of loading in bf16 and then recasting to fp32 (see the first sketch below).
  • Fixed a small bug where the LR was set to initial_lr on the first step instead of resuming from the last LR (never trust PyTorch too much); see the second sketch below.
  • Small fix for resuming training with DP>1 and TP>1.
  • Fixed loading of non-sharded optimizer states.

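As a rough illustration of the optimizer-dtype fix, the sketch below (not nanotron's actual code; the helper name and checkpoint layout are assumptions) casts optimizer state tensors to fp32 at load time instead of materializing them in bf16 and recasting afterwards, which would silently discard precision:

```python
import torch

def load_optim_state_fp32(state_path: str) -> dict:
    """Hypothetical helper: load an optimizer state_dict with fp32 states.

    Assumes the standard PyTorch layout {"state": {...}, "param_groups": [...]};
    nanotron's real checkpoint format may differ.
    """
    state = torch.load(state_path, map_location="cpu")
    for param_state in state.get("state", {}).values():
        for key, value in param_state.items():
            if torch.is_tensor(value) and value.is_floating_point():
                # Cast directly at load time; never round-trip through bf16,
                # since bf16 -> fp32 recasting cannot recover lost mantissa bits.
                param_state[key] = value.to(torch.float32)
    return state
```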
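And a minimal, PyTorch-only sketch of the LR-resume issue: if only the optimizer state is restored, the scheduler falls back to `initial_lr`, while restoring the scheduler state as well continues from the last LR. The model, scheduler, and step count here are placeholders, not nanotron code:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)
optimizer = AdamW(model.parameters(), lr=3e-4)
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: max(0.1, 1.0 - step / 1000))

# Pretend we trained for 100 steps before checkpointing.
for _ in range(100):
    optimizer.step()
    scheduler.step()

ckpt = {"optimizer": optimizer.state_dict(), "lr_scheduler": scheduler.state_dict()}

# On resume, restore BOTH state dicts. Skipping the scheduler state makes the
# first resumed step use initial_lr instead of the last LR, which is the
# symptom this PR fixes.
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["lr_scheduler"])
print(scheduler.get_last_lr())  # continues from the LR reached at step 100
```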
NouamaneTazi merged commit 42040ae into main on Nov 21, 2024 (3 of 4 checks passed).
NouamaneTazi deleted the nouamane/fix-optim-states-resuming branch on Nov 21, 2024 at 12:25.