Add more options to the unshard_checkpoint function to help scale #145
Conversation
chunks = []
for chunk_num, key in enumerate(metadata.state_dict_metadata.keys()):
    if key.startswith(f"{prefix}."):
        chunks.append((path / f"chunk-{chunk_num:05d}.{ext}", [key]))
thought: it could be nice to name it as chunk-{key}, maybe with some cleanup of the key. Then one could easily examine the weights of specific parameters. Would possibly be slightly cleaner for hypothetical future converter logic too.
done: bc79ab5
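For illustration, a rough sketch of the kind of key cleanup that naming scheme implies (a hypothetical helper, not the code from bc79ab5):

import re

def chunk_filename_for_key(key: str, ext: str) -> str:
    # Replace characters that are awkward in filenames so each chunk file is
    # named after the parameter it holds, e.g.
    # "model.blocks.0.attention.w_q.weight" -> "chunk-model_blocks_0_attention_w_q_weight.safetensors"
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", key)
    return f"chunk-{cleaned}.{ext}"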
I was getting worried about unsharding really big checkpoints, like for the 32B, which we'll need to do soon. The main issue at the moment is that in order to unshard we need to load the entire model (or optimizer state) into memory, which clearly isn't scalable. So I've added an option to unshard the checkpoint into chunks of a given size, which helps scale because only a single chunk (which could be as small as a single tensor) needs to be loaded into memory at a time. Each chunk gets written to a unique file. I think HuggingFace does something similar.
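A hedged sketch of how the new option might be used; the import path, argument names, and the chunk-size parameter shown here are assumptions, not necessarily the final API:

# Usage sketch only -- argument names are assumptions, not the final API.
from olmo_core.distributed.checkpoint import unshard_checkpoint

unshard_checkpoint(
    dir="/path/to/sharded/checkpoint",
    target_dir="/path/to/unsharded/output",
    optim=False,                    # chunked unsharding of optimizer state isn't supported yet
    chunk_size_bytes=2 * 1024**3,   # assumed name: write ~2 GB chunks, one in memory at a time
)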
Note: this is not supported for optimizer state yet. But, speaking of optimizer state, this PR also adds a function called load_keys() for loading (and unsharding) specific keys from a checkpoint. So if you want to inspect part of the optimizer state, you could use that function without having to unshard the whole optimizer state.
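For example (the signature, return type, and optimizer-state key layout shown here are all assumptions):

# Usage sketch only -- the exact signature and key names are assumptions.
from olmo_core.distributed.checkpoint import load_keys

# Load (and unshard) just one piece of optimizer state instead of the whole thing.
keys = ["optim.state.model.blocks.0.attention.w_q.weight.exp_avg_sq"]
for key, tensor in zip(keys, load_keys("/path/to/sharded/checkpoint", keys)):
    print(key, tuple(tensor.shape), tensor.dtype)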