ZeRO-3 + MP8 Universal Checkpoint #6724

Open
jeromeku opened this issue Nov 7, 2024 · 0 comments

jeromeku commented Nov 7, 2024

Is it possible to convert a model trained using ZeRO-3 and MP=8 to a universal checkpoint?

Tracing through the universal checkpointing conversion tool (ds_to_universal), I found that the model states remain unmerged, with 8 model-parallel shards for each data-parallel rank. For example, with world_size = 2048 there are 2048 model state files, zero_pp_rank_{0-255}_{0-7}, both before and after the conversion.
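To make the file count concrete, here is a minimal sketch of the layout described above. It only enumerates shard labels using the shorthand from this issue (zero_pp_rank_{dp}_{mp}); the helper name `expected_model_state_shards` is hypothetical, and the real on-disk file names may carry extra prefixes or suffixes.

```python
def expected_model_state_shards(world_size: int, mp_degree: int) -> list[str]:
    """List shard labels in the zero_pp_rank_{dp}_{mp} shorthand used in this issue.

    Assumes world_size = dp_degree * mp_degree (no pipeline parallelism).
    """
    if world_size % mp_degree != 0:
        raise ValueError("world_size must be divisible by mp_degree")
    dp_degree = world_size // mp_degree
    return [
        f"zero_pp_rank_{dp}_{mp}"
        for dp in range(dp_degree)
        for mp in range(mp_degree)
    ]


shards = expected_model_state_shards(world_size=2048, mp_degree=8)
print(len(shards), shards[0], shards[-1])
```

With world_size = 2048 and MP = 8 this gives 256 data-parallel ranks times 8 model-parallel ranks, which matches the 2048 files seen before and after conversion.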

When converting a model trained with ZeRO <= 2 and MP > 1, the model state files are merged into a single file through merge_tp_slices.

If this is not possible, how would one extract and merge only the model states of a ZeRO-3 + MP checkpoint (along both the ZeRO-3 and model-parallel partitions) into a single file?
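To illustrate what I mean by merging along both partitions, here is a toy sketch in plain Python. It is not DeepSpeed code and does not read checkpoint files: it assumes each model-parallel slice of a 2-D weight was flattened, zero-padded, and split evenly across data-parallel ranks, and that the slice dimension is known. All function names are hypothetical, and the real checkpoint layout may differ.

```python
def merge_zero3_partitions(partitions, numel):
    """Concatenate the flat per-DP-rank partitions and drop trailing padding."""
    flat = [x for part in partitions for x in part]
    return flat[:numel]


def reshape_2d(flat, rows, cols):
    """Reshape a flat list into a rows x cols nested list."""
    if len(flat) != rows * cols:
        raise ValueError("element count does not match the target shape")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def concat_mp_slices(slices, dim):
    """Concatenate 2-D model-parallel slices along dim 0 (rows) or dim 1 (columns)."""
    if dim == 0:
        return [row for s in slices for row in s]
    if dim == 1:
        return [sum((s[r] for s in slices), []) for r in range(len(slices[0]))]
    raise ValueError("dim must be 0 or 1")


# Toy case: a 4x2 weight, MP=2 split along dim 0, DP=3 with zero padding.
# zero3_parts[mp][dp] is the flat partition held by that (mp, dp) rank.
zero3_parts = [
    [[0, 1], [2, 3], [0, 0]],  # mp rank 0: 4 real values + 2 padding
    [[4, 5], [6, 7], [0, 0]],  # mp rank 1: 4 real values + 2 padding
]

mp_slices = [
    reshape_2d(merge_zero3_partitions(parts, numel=4), rows=2, cols=2)
    for parts in zero3_parts
]
full_weight = concat_mp_slices(mp_slices, dim=0)
print(full_weight)
```

The order matters here: the ZeRO-3 partitions are merged first within each model-parallel rank, and only then are the resulting slices concatenated across model-parallel ranks.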

The zero_to_fp32 script does not work here, since it only handles ZeRO-2 and ZeRO-3 checkpoints without model parallelism.
