[QUESTION] Asynchronous Checkpoint Saving #1095
Replies: 11 comments
-
It has the format `__{global_rank}_{writer_id}.distcp`. The global rank is straightforward: it's the global rank in the default process group created by PyTorch. We create multiple writer processes to write the checkpoint asynchronously, and the writer process ID simply indicates which writer the corresponding checkpoint shard came from.
The previous checkpoint format in Megatron-LM was changed with the introduction of the PyTorch distributed checkpoint (`torch_dist`) format.
Megatron-Core doesn't provide a converter. A production framework based on Megatron-Core, such as NeMo, may have one.
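To make the naming concrete, here is a small sketch that groups the `.distcp` shards in a checkpoint directory by global rank, assuming the `__{global_rank}_{writer_id}.distcp` pattern described above (the directory path is just an example):

```python
# Minimal sketch: parse shard names of the form
# "__{global_rank}_{writer_id}.distcp" described above.
import re
from pathlib import Path

PATTERN = re.compile(r"__(\d+)_(\d+)\.distcp")

def shards_by_rank(ckpt_dir):
    """Map each global rank to the writer-process IDs it used."""
    shards = {}
    for f in Path(ckpt_dir).iterdir():
        m = PATTERN.fullmatch(f.name)
        if m:
            rank, writer = int(m.group(1)), int(m.group(2))
            shards.setdefault(rank, []).append(writer)
    return shards

# e.g. {0: [0, 1], 1: [0, 1], ...} with two writer processes per rank
print(shards_by_rank("checkpoints/iter_0001000"))
```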
-
I noticed the files are named like `__0_0.distcp`.
It is common that we already have a ckpt in the legacy format (such as `model_optim_rng.pt`).
Could you please provide a demo link for converting a legacy ckpt to the new format?
-
Recently, we've introduced fully-parallel saving (FPS), which distributes the save work across data-parallel ranks instead of having every rank redundantly write its replicated shards.
You can easily find examples in NeMo.
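For anyone looking for a starting point before digging into NeMo, here is a rough sketch of wiring FPS into a Megatron-Core `dist_checkpointing` save. The wrapper name matches `megatron/core/dist_checkpointing/strategies/fully_parallel.py`, but treat the exact signatures, the default-strategy helper, and the process-group choice as assumptions to check against your Megatron version:

```python
# Rough sketch: fully-parallel save with Megatron-Core dist_checkpointing.
# Signatures and helper names are assumptions based on
# megatron/core/dist_checkpointing; verify against your Megatron version.
import torch.distributed as dist
from megatron.core import dist_checkpointing
from megatron.core.dist_checkpointing.serialization import (
    get_default_save_sharded_strategy,
)
from megatron.core.dist_checkpointing.strategies.fully_parallel import (
    FullyParallelSaveStrategyWrapper,
)

def save_fully_parallel(model, ckpt_dir, data_parallel_group=None):
    # FPS spreads the write work across ranks in the given group instead
    # of every rank saving its replicated shards itself.
    sharded_state_dict = model.sharded_state_dict()
    strategy = FullyParallelSaveStrategyWrapper(
        get_default_save_sharded_strategy(),
        parallelization_group=data_parallel_group or dist.group.WORLD,
    )
    dist_checkpointing.save(sharded_state_dict, ckpt_dir, strategy)
```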
-
I tested it and the results are as follows; please correct me if I misunderstand anything.
From the NeMo docs, there is a demo script for converting a ckpt trained with Megatron-LM into the NeMo-compatible format, and another script for converting the NeMo format into the HuggingFace format. This link is the support matrix, and a way to convert directly from the MLM format to the HF format is not listed there. So is there any easy way to convert a dist ckpt from MLM into the HF format? Thanks for your kind help @sbak5
-
@zhaoyang-star Could you please provide a reference for the script you used to convert to NeMo? I'm encountering this issue: `File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 380, in load_from_checkpoint`. Thanks!
-
@syx11237744 This link is about converting from MLM to NeMo, but I haven't tested it yet.
-
Thank you! I'm using this script, but I'm encountering the above error. Do you know of any other methods for converting to the HuggingFace format besides the one mentioned above?
-
@sbak5 @lmcafee-nvidia It is great to see that there are conversion tools: tools/checkpoint/convert.py in the Megatron-LM repo. Are there any docs for converting from `torch_dist` to the legacy `torch` format?
-
There are no converters from MLM directly to HF; you have to go through NeMo. Converting from torch_dist to torch is a step backwards and is not recommended. However, if you need it for some reason, I believe the recent tools/checkpoint/convert.py should already support such a conversion.
-
Any updates about the conversion? The NeMo converter does not support the distcp format; apparently it uses the legacy format. Code
-
Useful: DCP to Torch
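For reference, recent PyTorch releases ship this converter in `torch.distributed.checkpoint.format_utils`. A minimal sketch (both paths are placeholders):

```python
# Convert a PyTorch DCP checkpoint directory (the .distcp shards) into a
# single file loadable with torch.load, and back. Available in recent
# PyTorch releases; the paths below are placeholders.
from torch.distributed.checkpoint.format_utils import (
    dcp_to_torch_save,
    torch_save_to_dcp,
)

# .distcp shard directory -> single torch.save file
dcp_to_torch_save("checkpoints/iter_0001000", "checkpoints/model.pt")

# and the reverse direction
torch_save_to_dcp("checkpoints/model.pt", "checkpoints/dcp_dir")
```

Note that a Megatron `torch_dist` checkpoint also keeps common (non-sharded) state next to the `.distcp` shards, so this may only recover the sharded tensors rather than a checkpoint Megatron can load directly.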
-
I saw Megatron-LM has supported asynchronous checkpoint saving since v0.7.0.
@sbak5 I did some tests on this feature and saw that it helps a lot. I tried to dive into it and found the ckpt format has changed a lot compared to synchronous saving.
Just 3 questions:
1. What is the meaning of the file names `__0_0.distcp` and `__0_1.distcp`?
2. There is no readme or blog about this feature. Could you please explain it?
3. Is there a converter between the new format and the legacy format files `distrib_optim.pt` and `model_optim_rng.pt`?
Thanks for your help ^_^