Add ability to save intermediate checkpoints as caikit models #229

Open
gkumbhat opened this issue Oct 10, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@gkumbhat
Collaborator

Description

Currently the model is only saved after training has fully completed and the save command has been executed. However, there are scenarios where we might want to cancel an in-progress training run, for example because the model already looks "good enough" based on loss or eval metrics. In those cases we currently lose the model entirely.

Implementing this raises several problems we need to work through, both from caikit-nlp's perspective and from caikit's perspective:

  1. caikit-nlp: we don't have the output directory information at training time. We could add that parameter to train, but in the usual module design it does not get passed on to the save function, which means we would have to supply it at both train time and save time (see the sketch after this list).
  2. caikit-nlp: The model.save function from transformers and other underlying libraries creates a model that is not directly recognizable by caikit. One way to add this capability might be to call our save function from within the underlying model.save while training is still running, which can get very messy in multi-processing scenarios.
  3. caikit: In the server context, the cancellation request comes from the user and is handled by caikit.runtime, so part of the solution may need to live at the runtime level.
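For context, below is a minimal sketch of the train/save split that the first problem refers to. All class and helper names are illustrative assumptions, not the actual caikit-nlp module API; the point is only that the output directory first appears at save time, after training has already finished.

```python
import json
import os


class TextGenerationModule:
    """Illustrative stand-in for a caikit-nlp module (not the real API)."""

    def __init__(self, weights):
        self.weights = weights

    @classmethod
    def train(cls, train_data, num_epochs=3):
        weights = {}
        for epoch in range(num_epochs):
            # Long-running fine-tuning loop: no output directory is known
            # here, so an intermediate checkpoint has nowhere to be written
            # in caikit's model format.
            weights[f"epoch_{epoch}"] = len(train_data)
        return cls(weights)

    def save(self, model_path):
        # model_path only becomes available once train() has fully completed
        # and the caller explicitly invokes save().
        os.makedirs(model_path, exist_ok=True)
        with open(os.path.join(model_path, "config.json"), "w") as f:
            json.dump(self.weights, f)
```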

Some potential options could be:

  1. caikit-nlp: Add signal handlers inside the train function to catch the termination request and set a flag; a destructor could then check that flag and call the save function (see the sketch after this list).
  2. caikit: After receiving the cancellation signal, caikit could automatically call the save function. This would require proper error handling for unsavable models in caikit-nlp's .train function, so it would likely need some changes to the .train design, especially around saving checkpoints and knowing whether there is a checkpoint to save.
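As a rough illustration of option 1, here is a hedged sketch of a train loop that installs a SIGTERM handler, sets a flag when a termination request arrives, and stops early so whatever checkpoint exists can still be saved. All names are assumptions made for illustration, not existing caikit-nlp code.

```python
import signal


def train_with_cancellation(train_data, num_epochs=10):
    cancelled = {"flag": False}

    def _handle_termination(signum, frame):
        # Only record the request; the training loop decides when it is
        # safe to stop and hand back a partial checkpoint.
        cancelled["flag"] = True

    previous = signal.getsignal(signal.SIGTERM)
    signal.signal(signal.SIGTERM, _handle_termination)
    try:
        checkpoint = None
        for epoch in range(num_epochs):
            # Stand-in for one epoch of real training work.
            checkpoint = {"epoch": epoch, "samples_seen": len(train_data)}
            if cancelled["flag"]:
                # Stop early; the caller (or a destructor, as suggested
                # above) can then call save() on this partial checkpoint.
                break
        return checkpoint
    finally:
        signal.signal(
            signal.SIGTERM,
            previous if previous is not None else signal.SIG_DFL,
        )
```

Note that Python only allows signal handlers to be installed from the main thread of the main interpreter, which ties back into the multi-processing concern raised in the problems above.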
@gkumbhat gkumbhat changed the title Add ability to save checkpoints as caikit models Add ability to save intermediate checkpoints as caikit models Oct 10, 2023
@gkumbhat gkumbhat added the enhancement New feature or request label Oct 10, 2023
Projects
Status: ToDo