
Small fixes when resuming training #245

Merged: 6 commits into main, Nov 21, 2024

Conversation

NouamaneTazi (Member) commented on Nov 19, 2024

  • Adapted the config so that training can be resumed with nanotron.
  • Made sure optimizer states reshard correctly when going from TP=1 to TP>1.
  • Fixed the optimizer states dtype when loading from a checkpoint: load directly in fp32 instead of loading in bf16 and then recasting to fp32 (see the first sketch below).
  • Fixed a small bug where the LR was set to initial_lr on the first step instead of resuming from the last LR (never trust PyTorch too much); see the second sketch below.
  • Small fix for resuming training with DP>1 and TP>1.
  • Fixed loading of non-sharded optimizer states.

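As a rough illustration of the optimizer-dtype fix, the sketch below (not nanotron's actual code; the helper name and checkpoint layout are assumptions) casts optimizer state tensors to fp32 at load time instead of materializing them in bf16 and recasting afterwards, which would silently discard precision:

```python
import torch

def load_optim_state_fp32(state_path: str) -> dict:
    """Hypothetical helper: load an optimizer state_dict with fp32 states.

    Assumes the standard PyTorch layout {"state": {...}, "param_groups": [...]};
    nanotron's real checkpoint format may differ.
    """
    state = torch.load(state_path, map_location="cpu")
    for param_state in state.get("state", {}).values():
        for key, value in param_state.items():
            if torch.is_tensor(value) and value.is_floating_point():
                # Cast directly at load time; never round-trip through bf16,
                # since bf16 -> fp32 recasting cannot recover lost mantissa bits.
                param_state[key] = value.to(torch.float32)
    return state
```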
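And a minimal, PyTorch-only sketch of the LR-resume issue: if only the optimizer state is restored, the scheduler falls back to `initial_lr`, while restoring the scheduler state as well continues from the last LR. The model, scheduler, and step count here are placeholders, not nanotron code:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)
optimizer = AdamW(model.parameters(), lr=3e-4)
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: max(0.1, 1.0 - step / 1000))

# Pretend we trained for 100 steps before checkpointing.
for _ in range(100):
    optimizer.step()
    scheduler.step()

ckpt = {"optimizer": optimizer.state_dict(), "lr_scheduler": scheduler.state_dict()}

# On resume, restore BOTH state dicts. Skipping the scheduler state makes the
# first resumed step use initial_lr instead of the last LR, which is the
# symptom this PR fixes.
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["lr_scheduler"])
print(scheduler.get_last_lr())  # continues from the LR reached at step 100
```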
NouamaneTazi merged commit 42040ae into main on Nov 21, 2024 (3 of 4 checks passed).
NouamaneTazi deleted the nouamane/fix-optim-states-resuming branch on Nov 21, 2024 at 12:25.