Fix checkpoint IO bug once https://github.com/NVIDIA/NeMo/pull/1024 lands #134

Open
jstjohn opened this issue Sep 4, 2024 · 1 comment
Labels
bug Something isn't working

Comments

Collaborator

jstjohn commented Sep 4, 2024

Describe the bug
PR 10241 in NeMo makes a few improvements to checkpointing. In particular, it:

  1. separates artifacts and checkpoint weights into separate directories
  2. adds an option to only save model weights (skipping optimizer states) at the end of training (defaults to true)
  3. removes the save_best_model arg, because it wasn't working anyway and having it present was misleading.

If there's demand for that arg on your end, we can add it to our list to support it properly. As of that PR, saving the best checkpoint is dropped.

Make a change like the following in our checkpoint settings: NVIDIA/NeMo@10e4f88
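
The linked commit is the reference for the actual change. As a rough sketch only, assuming our checkpoint settings are gathered as a plain keyword dict before being handed to the checkpoint callback, dropping the removed argument could look like the following. The helper name `drop_removed_checkpoint_args` is hypothetical, and every key in the example dict other than `save_best_model` is a placeholder rather than something taken from the PR.

```python
from typing import Any, Dict

# Arguments this issue says are removed upstream.
# Only `save_best_model` is named in the issue; nothing else is assumed here.
REMOVED_CHECKPOINT_ARGS = ("save_best_model",)


def drop_removed_checkpoint_args(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `settings` without the arguments removed upstream.

    Hypothetical helper for illustration; it is not part of NeMo.
    """
    return {
        name: value
        for name, value in settings.items()
        if name not in REMOVED_CHECKPOINT_ARGS
    }


# Hypothetical settings dict: keys other than `save_best_model` are placeholders.
old_settings = {"save_best_model": True, "save_top_k": 2, "monitor": "val_loss"}
new_settings = drop_removed_checkpoint_args(old_settings)
```

The helper returns a new dict instead of mutating its input, so the old settings stay available for comparison while migrating.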

This is tracked on the nemo side (internal to NVIDIA) with: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/issues/648

jstjohn added the bug label Sep 4, 2024
Collaborator Author

jstjohn commented Sep 4, 2024

@skothenhill-nv FYI they're dropping save best since it's broken.
