[QUESTION] Asynchronous Checkpoint Saving #1095
Replies: 11 comments
-
It has the format `__{global_rank}_{writer_id}.distcp`. The global rank is straightforward: it's the global rank in the default process group created by PyTorch. We create multiple writer processes to write the checkpoint asynchronously, and the writer process ID simply indicates which writer the corresponding checkpoint shard came from.
The previous checkpoint format in Megatron-LM was changed with the introduction of the PyTorch distributed checkpoint (`torch_dist`) format.
Megatron-Core doesn't provide a converter. A production framework based on Megatron-Core, such as NeMo, may have one.
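To make the naming concrete, here is a small sketch that groups the `.distcp` shards in a checkpoint directory by global rank, assuming the `__{global_rank}_{writer_id}.distcp` pattern described above (the directory path is just an example):

```python
# Minimal sketch: parse shard names of the form
# "__{global_rank}_{writer_id}.distcp" described above.
import re
from pathlib import Path

PATTERN = re.compile(r"__(\d+)_(\d+)\.distcp")

def shards_by_rank(ckpt_dir):
    """Map each global rank to the writer-process IDs it used."""
    shards = {}
    for f in Path(ckpt_dir).iterdir():
        m = PATTERN.fullmatch(f.name)
        if m:
            rank, writer = int(m.group(1)), int(m.group(2))
            shards.setdefault(rank, []).append(writer)
    return shards

# e.g. {0: [0, 1], 1: [0, 1], ...} with two writer processes per rank
print(shards_by_rank("checkpoints/iter_0001000"))
```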
-
I noticed the files are named like `__0_0.distcp`.
It is common that we already have a ckpt in the legacy format (such as `model_optim_rng.pt`).
Could you please provide a demo link for converting a legacy ckpt to the new format?
-
Recently, we've introduced fully-parallel saving (FPS), which distributes the save work across data-parallel ranks instead of having every rank redundantly write its replicated shards.
You can easily find examples in NeMo.
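For anyone looking for a starting point before digging into NeMo, here is a rough sketch of wiring FPS into a Megatron-Core `dist_checkpointing` save. The wrapper name matches `megatron/core/dist_checkpointing/strategies/fully_parallel.py`, but treat the exact signatures, the default-strategy helper, and the process-group choice as assumptions to check against your Megatron version:

```python
# Rough sketch: fully-parallel save with Megatron-Core dist_checkpointing.
# Signatures and helper names are assumptions based on
# megatron/core/dist_checkpointing; verify against your Megatron version.
import torch.distributed as dist
from megatron.core import dist_checkpointing
from megatron.core.dist_checkpointing.serialization import (
    get_default_save_sharded_strategy,
)
from megatron.core.dist_checkpointing.strategies.fully_parallel import (
    FullyParallelSaveStrategyWrapper,
)

def save_fully_parallel(model, ckpt_dir, data_parallel_group=None):
    # FPS spreads the write work across ranks in the given group instead
    # of every rank saving its replicated shards itself.
    sharded_state_dict = model.sharded_state_dict()
    strategy = FullyParallelSaveStrategyWrapper(
        get_default_save_sharded_strategy(),
        parallelization_group=data_parallel_group or dist.group.WORLD,
    )
    dist_checkpointing.save(sharded_state_dict, ckpt_dir, strategy)
```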
-
I tested it and the results are as follows; please correct me if I misunderstand anything.
From the NeMo docs, there is a demo script for converting a ckpt trained with Megatron-LM into the NeMo-compatible format, and another script for converting the NeMo format into the HuggingFace format. This link is the support matrix, and a way to convert directly from the MLM format to the HF format is not listed there. So is there any easy way to convert a dist ckpt from MLM into the HF format? Thanks for your kind help @sbak5
-
@zhaoyang-star Could you please provide a reference for the script you used to convert to NeMo? I'm encountering this issue: `File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 380, in load_from_checkpoint`. Thanks!
-
@syx11237744 This link is about converting from MLM to NeMo, but I haven't tested it yet.
-
Thank you! I'm using this script, but I'm encountering the above error. Do you know of any other methods for converting to the HuggingFace format besides the one mentioned above?
-
@sbak5 @lmcafee-nvidia It is great to see that there are conversion tools: tools/checkpoint/convert.py in the Megatron-LM repo. Are there any docs for converting from `torch_dist` to the legacy `torch` format?
-
There are no converters from MLM directly to HF; you have to go through NeMo. Converting from torch_dist to torch is a step backwards and is not recommended. However, if you need it for some reason, I believe the recent tools/checkpoint/convert.py should already support such a conversion.
-
Any updates about the conversion? The NeMo converter does not support the distcp format; apparently it uses the legacy format. Code
-
Useful: DCP to Torch
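For reference, recent PyTorch releases ship this converter in `torch.distributed.checkpoint.format_utils`. A minimal sketch (both paths are placeholders):

```python
# Convert a PyTorch DCP checkpoint directory (the .distcp shards) into a
# single file loadable with torch.load, and back. Available in recent
# PyTorch releases; the paths below are placeholders.
from torch.distributed.checkpoint.format_utils import (
    dcp_to_torch_save,
    torch_save_to_dcp,
)

# .distcp shard directory -> single torch.save file
dcp_to_torch_save("checkpoints/iter_0001000", "checkpoints/model.pt")

# and the reverse direction
torch_save_to_dcp("checkpoints/model.pt", "checkpoints/dcp_dir")
```

Note that a Megatron `torch_dist` checkpoint also keeps common (non-sharded) state next to the `.distcp` shards, so this may only recover the sharded tensors rather than a checkpoint Megatron can load directly.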
-
I saw Megatron-LM has supported asynchronous checkpoint saving since v0.7.0.
@sbak5 I did some tests on this feature and saw that it helps a lot. I tried to dive into it and found the ckpt format has changed a lot compared to synchronous saving.
Just 3 questions:
1. What is the meaning of the file names `__0_0.distcp` and `__0_1.distcp`?
2. There is no readme or blog about this feature. Could you please explain it?
3. Is there a converter between the new format and the legacy format files `distrib_optim.pt` and `model_optim_rng.pt`?
Thanks for your help ^_^