Currently, the model is only saved once we have trained the full model and executed the save command. However, there are scenarios where training is in progress and we might want to cancel it because the model already looks "good enough" (based on loss or eval metrics). In those cases, we currently lose the model entirely.
There are multiple problems we need to figure out with this kind of implementation, both from caikit-nlp's perspective and caikit's perspective. Issues like:
caikit-nlp: we don't get the output directory information at training time. We could add that parameter, but it doesn't get passed on to the save function in the usual design of the modules, which means we would have to pass it both at train time and save time.
caikit-nlp: The model.save function from transformers and other underlying libraries creates a model that is not directly recognizable by caikit. So one solution for adding this capability might involve calling the caikit save function around model.save while training is in progress, which can become very messy in multi-processing scenarios.
caikit: In the server context, the cancellation request comes from the user and is handled by caikit.runtime, so the solution might need to live at the runtime level.
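To make the train/save parameter split concrete, here is a minimal sketch of the usual module design (the class and field names are illustrative, not the actual caikit-nlp API): the output directory only becomes available at save time, so an intermediate checkpoint produced during train has nowhere sanctioned to go.

```python
import json
import os

# Hypothetical sketch of the usual caikit-style module design (names are
# illustrative, not the real caikit-nlp API). train() receives no output
# directory; persistence only happens later when the caller invokes save().
class TextGenerationModule:
    def __init__(self, model=None):
        self.model = model

    @classmethod
    def train(cls, base_model: str, train_data: list) -> "TextGenerationModule":
        # ... long-running training loop; no output_dir is available here,
        # so a mid-training checkpoint has no designated save location.
        trained = {"base": base_model, "steps": len(train_data)}
        return cls(model=trained)

    def save(self, output_dir: str) -> None:
        # output_dir only arrives here, after training has fully completed.
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "model.json"), "w") as f:
            json.dump(self.model, f)
```

Adding output_dir as a train parameter would let train write checkpoints, but as noted above it would then have to be supplied twice, once at train time and again at save time.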
Some potential options could be:
caikit-nlp: Add signal handlers inside the train function to handle a termination request and set a flag; a destructor could then check that flag and call the save function.
caikit: After getting the cancellation signal, caikit could automatically call the save function. But this would require proper error handling in caikit-nlp's .train function for models that are not yet savable, so it might require some changes to the .train function design, especially around saving checkpoints and knowing whether there is a checkpoint to save.
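The signal-handler option above could be sketched roughly like this (illustrative only, not a proposed implementation; note that signal.signal only works in the main thread, which matters for the multi-processing concerns mentioned above):

```python
import signal

# Sketch of the signal-handler option: a SIGTERM handler sets a flag, the
# training loop checks it between steps and stops early, and save() raises
# if no checkpoint was ever produced (the error-handling caikit would need).
class CancellableTrainer:
    def __init__(self):
        self.cancelled = False
        self.checkpoint = None

    def _handle_term(self, signum, frame):
        self.cancelled = True

    def train(self, steps: int):
        # signal.signal must be called from the main thread.
        previous = signal.signal(signal.SIGTERM, self._handle_term)
        try:
            for step in range(steps):
                # ... one training step ...
                self.checkpoint = {"step": step}  # last completed step
                if self.cancelled:
                    break
        finally:
            signal.signal(signal.SIGTERM, previous)  # restore prior handler

    def save(self, path: str):
        if self.checkpoint is None:
            raise RuntimeError("no checkpoint to save")
        # ... convert the checkpoint into a caikit-loadable format here ...
```

Checking the flag only between steps keeps the handler itself trivial (no saving from inside a signal handler), at the cost of cancellation latency of up to one training step.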
gkumbhat changed the title from "Add ability to save checkpoints as caikit models" to "Add ability to save intermediate checkpoints as caikit models" on Oct 10, 2023