* Return exit code 15 (SIGTERM) after SIGTERM. When marian receives SIGTERM and exits gracefully (save model and exit), it should exit with a non-zero exit code to signal to any parent process that it did not exit "naturally".
* Added explanatory comment about exiting marian_train with non-zero status after SIGTERM.
* Bug fix: better handling of SIGTERM for graceful shutdown during training. Prior to this bug fix, BatchGenerator::fetchBatches, which runs in a separate thread, would ignore SIGTERM during training. (Training uses a custom signal handler for SIGTERM, which simply sets a global flag, to enable graceful shutdown, i.e., saving models and the current state of training before shutting down.) The changes in this commit also facilitate custom handling of other signals in the future by providing a general signal handler for all signals with a signal number below 32 (setSignalFlag) and a generic flag-checking function (getSignalFlag(sig)) for checking such flags.
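The exit-status behaviour described in the first bullet can be illustrated with a minimal sketch. Everything here is hypothetical and only mirrors the commit message: `sketch_exit`, `onSigterm`, and `trainAndReport` are illustrative names, not Marian code, and the signal is raised from inside the loop purely to simulate an external SIGTERM.

```cpp
#include <cassert>
#include <csignal>

namespace sketch_exit {  // hypothetical namespace, not part of Marian

// Flag set by the handler; volatile sig_atomic_t is the type the C and C++
// standards allow a signal handler to write to.
volatile std::sig_atomic_t gotSigterm = 0;

// Handler: does nothing except set the flag.
void onSigterm(int) { gotSigterm = 1; }

// Hypothetical stand-in for a training run. Returns the status the process
// would exit with: 0 after a natural finish, SIGTERM after a graceful stop.
// raiseAt simulates an external SIGTERM arriving at that update (-1: never).
int trainAndReport(int maxUpdates, int raiseAt) {
  assert(maxUpdates >= 0);
  gotSigterm = 0;
  std::signal(SIGTERM, onSigterm);
  for (int update = 0; update < maxUpdates; ++update) {
    if (update == raiseAt)
      std::raise(SIGTERM);  // simulated signal; handler runs before raise returns
    if (gotSigterm) {
      // A real trainer would save the model and training state here.
      return SIGTERM;  // non-zero: tells the parent the run was cut short
    }
  }
  return 0;  // natural end of training
}

}  // namespace sketch_exit
```

A caller would pass the returned value on as the process exit status, for example `return sketch_exit::trainAndReport(1000, -1);` from `main`, so that a wrapping script can distinguish a completed run from an interrupted one.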
Showing 10 changed files with 144 additions and 63 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
#include "common/logging.h" | ||
#include "signal_handling.h" | ||
|
||
// The simplest (and recommended) way to handle signals is to simply set a flag | ||
// in the signal handler and check that flag later. | ||
// | ||
// We provide setSignalFlag as the most generic signal handler. This handler uses a | ||
// single sig_atomic_t as a bit field. On Linux, sig_atomic_t is equivalent to a signed int, | ||
// theoretically providing 32 binary flags; in practice, the signals for which we are most | ||
// likely to install signal handlers are | ||
// - SIGTERM (15): which by default signals the request for a graceful shutdown | ||
// - SIGUSR1 (10): intended for custom use, default action in Linux is termination | ||
// - SIGUSR2 (12): intended for custom use, default action in Linux is termination | ||
// - SIGINT (2): interrupt from the console | ||
// Just to be safe, we accommodate signals up to signal No. 30. | ||
|
||
// In addition, we also provide requestSaveAndExit() and saveAndExitRequested() as a signal | ||
// handler/checker for graceful shutdown requests during training. | ||
constexpr int maxSignalForSetSignalFlag{30}; | ||
|
||
// Make sure sig_atomic_t is large enough as a bit field for our purposes. | ||
// That said, I'm not aware of any platform where this would be a problem. | ||
static_assert(SIG_ATOMIC_MAX > (1U<<maxSignalForSetSignalFlag), | ||
"sig_atomic_t is too small for signal flags on this platform."); | ||
|
||
namespace marian{ | ||
volatile std::sig_atomic_t sigflags_{0}; | ||
volatile std::sig_atomic_t saveAndExit_{0}; | ||
|
||
void setSignalFlag(int sig) { | ||
// sigflags_ is an int type serving as a bit field for flags corresponding | ||
// to signals (lower than or equal to maxSignalForSetSignalFlag). We set the | ||
// flag by a bitwise OR (|=) of the bit field and an int value with exactly | ||
// one bit set (1<<sig). | ||
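if (sig >= 0 && sig <= maxSignalForSetSignalFlag) // shifting by an out-of-range value is undefined | ||
sigflags_ |= (1<<sig); | ||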
} | ||
|
||
// Check if the flag for the signal sig is set in the bit field sigflags_ | ||
bool getSignalFlag(const int sig) { | ||
ABORT_IF(sig < 0 || sig > maxSignalForSetSignalFlag, | ||
"Signal out of range (must be >= 0 and <= {}, is {}).", maxSignalForSetSignalFlag, sig); | ||
// Do bitwise AND between sigflags_ and an int value that has exactly one bit set that | ||
// corresponds to the signal in question. If the bit is set (see setSignalFlag above), | ||
// the bitwise AND will return a non-zero integer, if it is not set, the result will | ||
// be zero. | ||
return (sigflags_ & (1<<sig)) != 0; | ||
} | ||
|
||
void requestSaveAndExit(int sig) { | ||
setSignalFlag(sig); // keep track of triggering signal | ||
saveAndExit_ = 1; // set flag to exit gracefully | ||
} | ||
|
||
bool saveAndExitRequested() { | ||
return saveAndExit_ == 1; | ||
} | ||
|
||
} |
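The bit-field scheme used by setSignalFlag and getSignalFlag can be tried in isolation with a sketch like the following. It is a simplified stand-in under stated assumptions, not the file above: the names `sketch_flags`, `kMaxSignal`, `setFlag`, `getFlag`, and `demoRaiseAndCheck` are hypothetical, a plain `assert` takes the place of Marian's `ABORT_IF`, and the handler includes a range guard so that the shift is always well defined.

```cpp
#include <cassert>
#include <csignal>

namespace sketch_flags {  // hypothetical namespace, not part of Marian

constexpr int kMaxSignal = 30;  // mirrors maxSignalForSetSignalFlag
volatile std::sig_atomic_t flags = 0;

// Signal handler: sets bit number `sig` in the bit field and does nothing else.
void setFlag(int sig) {
  if (sig >= 0 && sig <= kMaxSignal)
    flags = flags | (1 << sig);
}

// Checker, called from normal (non-handler) code, so asserting is fine here.
bool getFlag(int sig) {
  assert(sig >= 0 && sig <= kMaxSignal);
  return (flags & (1 << sig)) != 0;
}

// Installs the handler for SIGTERM, raises the signal in-process, and reports
// whether the flag went from unset to set.
bool demoRaiseAndCheck() {
  std::signal(SIGTERM, setFlag);
  const bool before = getFlag(SIGTERM);
  std::raise(SIGTERM);  // handler runs before raise() returns
  const bool after = getFlag(SIGTERM);
  return !before && after;
}

}  // namespace sketch_flags
```

Because each signal owns one bit, a single handler can be installed for several signals and the checker can still tell them apart, e.g. `getFlag(SIGINT)` stays false after only SIGTERM was raised.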
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
#pragma once | ||
#include <csignal> | ||
#include <string> | ||
|
||
// SIGNAL HANDLING | ||
|
||
// The signal handlers (and checkers) here are implemented in line with the recommendations | ||
// for signal handling in the SEI CERT C Coding Standard, specifically | ||
// | ||
// - SIG30-C: | ||
// https://wiki.sei.cmu.edu/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers | ||
// | ||
// - SIG31-C: | ||
// https://wiki.sei.cmu.edu/confluence/display/c/SIG31-C.+Do+not+access+shared+objects+in+signal+handlers | ||
// | ||
// The exact behavior of 'graceful exit' depends on the application; for training, it means 'save model and exit', | ||
// for a server (not implemented yet): 'block new requests but serve pending requests and then exit'. | ||
// | ||
// Graceful exit for training is useful for training on clusters with time limits on jobs. Slurm, for example, can be | ||
// set up to send a custom signal at a set time before the end of the time slot, giving Marian time to save its current | ||
// state before getting killed. | ||
|
||
namespace marian { | ||
|
||
|
||
/// Request graceful exit (signal handler) | ||
void requestSaveAndExit(int sig); | ||
|
||
/// Check if graceful exit was requested. | ||
bool saveAndExitRequested(); | ||
|
||
/// General purpose signal handler that simply sets a flag when a signal is received. | ||
// (only for signal numbers <= 30). | ||
void setSignalFlag(int sig); // custom handler (set flag) for sig | ||
|
||
/// Check if a setSignalFlag was triggered for this signal | ||
bool getSignalFlag(int sig); | ||
|
||
} // End of namespace marian |
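As a rough illustration of how the declared functions fit together, the sketch below installs one "save and exit" handler for two signals and polls the request flag once per update. It re-implements simplified versions of the functions locally so that it compiles on its own; `sketch_graceful`, `triggeredBy`, and `trainUntilSignalled` are hypothetical names, and the in-loop `std::raise` merely simulates a signal that a scheduler or user would normally send from outside.

```cpp
#include <cassert>
#include <csignal>

namespace sketch_graceful {  // hypothetical namespace, not part of Marian

volatile std::sig_atomic_t sigflags = 0;     // one bit per signal number
volatile std::sig_atomic_t saveAndExit = 0;  // set once a graceful exit is requested

// Handler: remember which signal arrived and request a graceful exit.
void requestSaveAndExit(int sig) {
  if (sig >= 0 && sig <= 30)
    sigflags = sigflags | (1 << sig);
  saveAndExit = 1;
}

bool saveAndExitRequested() { return saveAndExit == 1; }

// Simplified counterpart of getSignalFlag.
bool triggeredBy(int sig) {
  assert(sig >= 0 && sig <= 30);
  return (sigflags & (1 << sig)) != 0;
}

// Hypothetical training loop: polls the flag before every update and returns
// the number of updates completed. `sig` is raised after update number
// `signalAtUpdate` to simulate an external request.
int trainUntilSignalled(int maxUpdates, int signalAtUpdate, int sig) {
  std::signal(SIGTERM, requestSaveAndExit);
  std::signal(SIGINT, requestSaveAndExit);
  int done = 0;
  while (done < maxUpdates && !saveAndExitRequested()) {
    ++done;  // one training update would happen here
    if (done == signalAtUpdate)
      std::raise(sig);
  }
  // A real trainer would save the model and training state at this point.
  return done;
}

}  // namespace sketch_graceful
```

Polling a flag between updates keeps the handler itself trivial, which is the point of the SIG30-C and SIG31-C recommendations cited in the header comment; the actual saving happens in ordinary code once the loop notices the request.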