-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved handling of SIGTERM #660
Conversation
When marian receives signal SIGTERM and exits gracefully (save model & exit), it should then exit with a non-zero exit code, to signal to any parent process that it did not exit "naturally".
…atus after SIGTERM.
…marian-dev into ug-graceful-shutdown
…ning. Prior to this bug fix, BatchGenerator::fetchBatches, which runs in a separate thread, would ignore SIGTERM during training (training uses a custom signal handler for SIGTERM, which simply sets a global flag, to enable graceful shutdown (i.e., save models and current state of training before shutting down). The changes in this commit also facilitate custom handling of other signals in the future by providing a general singal handler for all signals with a signal number below 32 (setSignalFlag) and a generic flag checking function (getSignalFlag(sig)) for checking such flags.
…code easier This does not introduce any new functionality, just moves code around, so that future PRs are easier to compare. Moving old GraphGroup code to training/deprecated. Once it is clear there is nothing in there that's worth saving, this will be deleted. Replace -Ofast with -O3 and make sure ffinite-math is turned off.
…in/max quantization to avoid overflow 1. Change the weight matrix quantization to use 7-bit min/max quantization -> This resolves all the overflow issue, because weight and activations are quantized by min/max range. 2. Clip fp16 quantization to avoid overflow 3. Fix windows build errors (cmake options, vcproj file) 4. int8 pack model (encoder -> fp16)
…encoders as well For int8 quantized model, use int8 quantization for encoders as well. The quality difference between fp16 encoder and int8 encoder is small, but they have quite amount of speed difference.
* Add basic support for TSV inputs * Fix mini-batch-fit for TSV inputs * Abort if shuffling data from stdin * Fix terminating training with data from STDIN * Allow creating vocabs from TSV files * Add comments; clean creation of vocabs from TSV files * Guess --tsv-size based on the model type * Add shortcut for STDIN inputs * Rename --tsv-size to --tsv-fields * Allow only one 'stdin' in --train-sets * Properly create separate vocabularies from a TSV file * Clearer logging message * Add error message for wrong number of valid sets if --tsv is used * Use --no-shuffle instead of --shuffle in the error message * Fix continuing training from STDIN * Update CHANGELOG * Support both 'stdin' and '-' * Guess --tsv-fields from dim-vocabs if special:model.yml available * Update error messages * Move variable outside the loop * Refactorize utils::splitTsv; add unit tests * Support '-' as stdin; refactorize; add comments * Abort if excessive field(s) in the TSV input * Add a TODO on passing one vocab with fully-tied embeddings * Remove the unit test with excessive tab-separated fields
* fix 0 * nan behavior in concatention * bump patch * change epsilon to margin
* Refactorize processPaths * Fix relative paths for shortlist and sqlite options * Rename InterpolateEnvVars to interpolateEnvVars * Update CHANGELOG
…anch Cherry pick a few improvements/fixes from Frank's branch * Adds Frank's fix for label-based mini-batch sizing from Frank's current experimental branch. * Also copies minor improvements and a few comments.
* Fix server build with current boost, move simple-websocket-server to submodule * Change submodule to marian-nmt/Simple-WebSocket-Server * Update submodule simple-websocket-server Co-authored-by: Gleb Tv <[email protected]>
* Use cblas_sgemm_batch when available * Merge with master, add comments and describe contribution
…ning. Prior to this bug fix, BatchGenerator::fetchBatches, which runs in a separate thread, would ignore SIGTERM during training (training uses a custom signal handler for SIGTERM, which simply sets a global flag, to enable graceful shutdown (i.e., save models and current state of training before shutting down). The changes in this commit also facilitate custom handling of other signals in the future by providing a general singal handler for all signals with a signal number below 32 (setSignalFlag) and a generic flag checking function (getSignalFlag(sig)) for checking such flags.
…in/max quantization to avoid overflow 1. Change the weight matrix quantization to use 7-bit min/max quantization -> This resolves all the overflow issue, because weight and activations are quantized by min/max range. 2. Clip fp16 quantization to avoid overflow 3. Fix windows build errors (cmake options, vcproj file) 4. int8 pack model (encoder -> fp16)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work. Thanks.
src/data/batch_generator.h
Outdated
if (bufferedBatches_.empty() // i.e., end of Epoch | ||
|| getSignalFlag(SIGTERM)) { // process received SIGTERM, abandon ship ... | ||
return nullptr; | ||
if(runAsync_) { // by default we will run in asynchronous mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment to self and @frankseide: this is git being bad at doing a three-way diff. The runAsync_
part is from the master
branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ugermann I see what you meant with the 3-way diff.
@frankseide this looks good to me now and ready for another look from you. |
Cosmetic fixes as per code review. - removed superfluous empty line - but default value for --sigterm on a new line
Move option --segterm to General Options for training, as suggested by @snukky.
Remove log message from signal handler. Note that the log still records that training was interrupted by a signal both in the log message in Scheduler::finished() and in the exit code at the end of marian_train.cpp.
Fix formatting of --sigterm option specification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor feedback.
One is that the term "graceful" is not concise. E.g. users who send SIGTERM do not request a graceful exit. Marian gracefully exits (i.e. cleans up GPU resources etc.) when training is done. What the user requests is an early exit. Let us rethink the naming.
src/common/config_parser.cpp
Outdated
// (e.g., dump model but continue training upon SIGUSR1, or report current | ||
// training status upon SIGINFO.) | ||
cli.add<std::string>("--sigterm", | ||
"What to do with SIGTERM: 'graceful' => save and exit (default); 'immediate' => exit immediately.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"graceful" is not descriptive. Please say in the option string what it does, e.g. "save-and-exit"
"immediate" could be "terminate"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strings changed, but SIGTERM is widely considered a request for a graceful shutdown.
See here: https://qph.fs.quoracdn.net/main-qimg-1180ef2465c309928b02481f02580c6a
and here: https://www.booleanworld.com/kill-process-linux/#:~:text=By%20default%2C%20all%20the%20process,gracefully%22.
and here: python-trio/snekomatic#6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In all of these examples, graceful shutdown means to leave the world in a well-defined state (e.g. finish in-flight requests to a service). The examples do not suggest saving state, and note that the PR saves state in a way that is inconsistent with configured behavior, because the configuration says to save every save-freq
updates, and not at a time of receiving SIGTERM, which is a random update.
In my mind, you don't want a "graceful shutdown" as given in these examples, but more like an "early exit" from the training process.
Alternatively, you may be wanting "preemption," i.e. save state that allows to restart the training. Your PR accomplishes that as well (right?), and such feature would be useful in our internal server farm as well when running jobs in low-priority preemptable mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saving state to resume later is what Microsoft incidentally calls 'graceful' preemption, as opposed to 'immediate' preemption: https://docs.microsoft.com/en-us/powershell/high-performance-computing/understanding-policy-configuration?view=hpc19-ps#:~:text=Immediate%20preemption%20(Default)%3A%20Take,allocated%20to%20another%20job%20immediately.&text=Graceful%20preemption%3A%20Take%20resources%20from,that%20work%20is%20not%20lost.
src/common/signal_handling.cpp
Outdated
bool getSignalFlag(const int sig) { | ||
// sig_atomic_t has 32 bits. We don't accommodate signals beyond that. | ||
ABORT_IF(sig >= 32, "Signal out of range (must be < 32, is {}).", sig); | ||
return sigflags_ & (1<<sig); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&
is an integer operation, but you return bool
.
Can you please also explain what this function does? It seems it does two things: It converts the signal, but then also masks it with a class member. I cannot understand the meaning of the function without a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explanation added in code.
src/common/signal_handling.cpp
Outdated
|
||
bool getSignalFlag(const int sig) { | ||
// sig_atomic_t has 32 bits. We don't accommodate signals beyond that. | ||
ABORT_IF(sig >= 32, "Signal out of range (must be < 32, is {}).", sig); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this check? I suggest to remove both the comment and the ABORT, since sig_atomic_t
is, by its name, clearly meant for signals, and therefore defines the valid signal ranges. Marian is not an operating system, we don't need to hedge against every possible invalid argument. You can also remove the check and comment in setSignalFlag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sig_atomic_t is meant for signals in the sense that certain operations on it are guaranteed to be atomic even during signal handling. Essentially it is an int type that I'm using here as a bit field, so that we have a generic signal handler that we can install for arbitrary signals (lower than 30). That check is a check against programming errors. Signals can be higher than 30, but this code does not accommodate installing setSignalFlag for such signals.
src/common/signal_handling.cpp
Outdated
|
||
void requestGracefulExit(int sig) { | ||
setSignalFlag(sig); // keep track of triggering signal | ||
gracefulExitRequested_ = 1; // set flag to exit gracefully |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gracefulExitRequested() returns a bool, but here you set it to an integer. Please choose one (bool
if possible)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This relies on implicit type conversion from int types to bool. sig_atomic_t is specifically meant to be used in signal handlers. This is run-of-the-mill signal handling, see "Compliant Solution (Writing volatile sig_atomic_t)" here: https://wiki.sei.cmu.edu/confluence/display/c/SIG31-C.+Do+not+access+shared+objects+in+signal+handlers.
src/common/signal_handling.h
Outdated
|
||
/// General purpose signal handler that simply sets a flag when a signal is received. | ||
// (only for SIGNAL No. < 32). | ||
void setSignalFlag(const int sig); // custom handler (set flag) for sig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to not use const
in these function signatures. It exposes an internal behavior of the function to the caller. And the functions are so simple that adding const
does not make them easier to udnerstand or verify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Const removed.
@@ -16,6 +16,7 @@ template <class ModelWrapper> | |||
class Train : public ModelTask { | |||
private: | |||
Ptr<Options> options_; | |||
void installCustomSignalHandlers(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it installing multiple signal handlers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially in the future. I prefer to keep API changes rare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename to installEarlyExitSignalHandler()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly disagree on this one. installCustomSignalHandlers() is like parseOptions() a generic concept: install whatever custom signal handlers may be specified in the options. Either my pull request goes in as-is now, or I'll withdraw it. @emjotde
- Update comments - Explicit conversion from int to bool in functions that return bool.
@frankseide I think I have now addressed all your concerns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename "graceful exit" to "early exit," as "graceful exit" means something else. After that change, I agree to merge.
@@ -16,6 +16,7 @@ template <class ModelWrapper> | |||
class Train : public ModelTask { | |||
private: | |||
Ptr<Options> options_; | |||
void installCustomSignalHandlers(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename to installEarlyExitSignalHandler()
@frankseide renamed 'graceful exit' to 'save and exit'. |
@frankseide can you take another look if you concerns are satisfied? If yes, please approve, I will merge then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last change, then this can go in.
I will rename this quickly and pull it in. |
@snukky I love the new continuous integration :) |
Aren't the checks a necessary condition before merging? Now it allows me to merge despite the checks not having finished yet after my update? Should that not be a blocker? |
@ugermann merged, thanks. |
Description
How to test
start training, then send SIGTERM to process, it should save the current state of affairs and exit
Checklist