Currently, the model is only saved once we have trained the full model and executed the save command. However, there are scenarios where training is in progress and we might want to cancel it because the model already looks "good enough" (based on loss or eval metrics). In those cases, we currently lose the model entirely.
There are multiple problems we need to figure out with this kind of implementation, both from caikit-nlp's perspective and caikit's perspective. Issues like:
caikit-nlp: we don't get the output directory information at training time. We could add that parameter, but it doesn't get passed on to the save function in the usual design of the modules, which means we would have to pass it both at train time and save time.
caikit-nlp: The model.save function from transformers and other underlying libraries creates a model that is not directly recognizable by caikit. So one solution for adding this capability might involve calling the caikit save function around model.save while training is in progress, which can become very messy in multi-processing scenarios.
caikit: In the server context, the cancellation request comes from the user and is handled by caikit.runtime, so the solution might need to live at the runtime level.
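To make the train/save parameter split concrete, here is a minimal sketch of the usual module design (the class and field names are illustrative, not the actual caikit-nlp API): the output directory only becomes available at save time, so an intermediate checkpoint produced during train has nowhere sanctioned to go.

```python
import json
import os

# Hypothetical sketch of the usual caikit-style module design (names are
# illustrative, not the real caikit-nlp API). train() receives no output
# directory; persistence only happens later when the caller invokes save().
class TextGenerationModule:
    def __init__(self, model=None):
        self.model = model

    @classmethod
    def train(cls, base_model: str, train_data: list) -> "TextGenerationModule":
        # ... long-running training loop; no output_dir is available here,
        # so a mid-training checkpoint has no designated save location.
        trained = {"base": base_model, "steps": len(train_data)}
        return cls(model=trained)

    def save(self, output_dir: str) -> None:
        # output_dir only arrives here, after training has fully completed.
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "model.json"), "w") as f:
            json.dump(self.model, f)
```

Adding output_dir as a train parameter would let train write checkpoints, but as noted above it would then have to be supplied twice, once at train time and again at save time.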
Some potential options could be:
caikit-nlp: Add signal handlers inside the train function to handle a termination request and set a flag; a destructor could then check that flag and call the save function.
caikit: After getting the cancellation signal, caikit could automatically call the save function. But this would require proper error handling in caikit-nlp's .train function for models that are not yet savable, so it might require some changes to the .train function design, especially around saving checkpoints and knowing whether there is a checkpoint to save.
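The signal-handler option above could be sketched roughly like this (illustrative only, not a proposed implementation; note that signal.signal only works in the main thread, which matters for the multi-processing concerns mentioned above):

```python
import signal

# Sketch of the signal-handler option: a SIGTERM handler sets a flag, the
# training loop checks it between steps and stops early, and save() raises
# if no checkpoint was ever produced (the error-handling caikit would need).
class CancellableTrainer:
    def __init__(self):
        self.cancelled = False
        self.checkpoint = None

    def _handle_term(self, signum, frame):
        self.cancelled = True

    def train(self, steps: int):
        # signal.signal must be called from the main thread.
        previous = signal.signal(signal.SIGTERM, self._handle_term)
        try:
            for step in range(steps):
                # ... one training step ...
                self.checkpoint = {"step": step}  # last completed step
                if self.cancelled:
                    break
        finally:
            signal.signal(signal.SIGTERM, previous)  # restore prior handler

    def save(self, path: str):
        if self.checkpoint is None:
            raise RuntimeError("no checkpoint to save")
        # ... convert the checkpoint into a caikit-loadable format here ...
```

Checking the flag only between steps keeps the handler itself trivial (no saving from inside a signal handler), at the cost of cancellation latency of up to one training step.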
gkumbhat changed the title from "Add ability to save checkpoints as caikit models" to "Add ability to save intermediate checkpoints as caikit models" on Oct 10, 2023