Skip to content

Commit

Permalink
Improved handling of SIGTERM (#660)
Browse files Browse the repository at this point in the history
* Return exit code 15 (SIGTERM) after SIGTERM.
When marian receives signal SIGTERM and exits gracefully (save model & exit),
it should then exit with a non-zero exit code, to signal to any parent process
that it did not exit "naturally".
* Added explanatory comment about exiting marian_train with non-zero status after SIGTERM.
* Bug fix: better handling of SIGTERM for graceful shutdown during training.
Prior to this bug fix, BatchGenerator::fetchBatches, which runs in a separate
thread, would ignore SIGTERM during training (training uses a custom signal handler
for SIGTERM, which simply sets a global flag, to enable graceful shutdown (i.e.,
save models and current state of training before shutting down).

The changes in this commit also facilitate custom handling of other signals in the
future by providing a general singal handler for all signals with a signal number
below 32 (setSignalFlag) and a generic flag checking function (getSignalFlag(sig))
for checking such flags.
  • Loading branch information
ugermann authored Sep 1, 2020
1 parent 96738fe commit 044b416
Show file tree
Hide file tree
Showing 10 changed files with 144 additions and 63 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Properly record cmake variables in the cmake build directory instead of the source tree.
- Added default "none" for option shuffle in BatchGenerator, so that it works in executables where shuffle is not an option.
- Added a few missing header files in shortlist.h and beam_search.h.
- Improved handling for receiving SIGTERM during training. By default, SIGTERM triggers 'save (now) and exit'. Prior to this fix, batch pre-fetching did not check for this sigal, potentially delaying exit considerably. It now pays attention to that. Also, the default behaviour of save-and-exit can now be disabled on the command line with --sigterm exit-immediately.

### Changed
- Move Simple-WebSocket-Server to submodule
Expand Down
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ add_library(marian STATIC
common/filesystem.cpp
common/file_stream.cpp
common/file_utils.cpp
common/signal_handling.cpp
common/types.cpp

data/alignment.cpp
Expand Down Expand Up @@ -99,7 +100,6 @@ add_library(marian STATIC
training/graph_group_singleton.cpp
training/validator.cpp
training/communicator.cpp
training/scheduler.cpp

# this is only compiled to catch build errors, but not linked
microsoft/quicksand.cpp
Expand Down
11 changes: 5 additions & 6 deletions src/command/marian_train.cpp
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
#include <signal.h>
#include "marian.h"

#include "common/signal_handling.h"
#include "training/graph_group_async.h"
#include "training/graph_group_singleton.h"
#include "training/graph_group_sync.h"
Expand Down Expand Up @@ -42,14 +43,12 @@ int main(int argc, char** argv) {
New<Train<AsyncGraphGroup>>(options)->run();
}
}

// If we exit due to SIGTERM, exit with 128 + the signal number, as suggested
// for bash in http://tldp.org/LDP/abs/html/exitcodes.html. This allows parent
// If we exit due to a graceful exit request via SIGTERM, exit with 128 + SIGTERM,
// as suggested for bash in http://tldp.org/LDP/abs/html/exitcodes.html. This allows parent
// scripts to determine if training terminated naturally or via SIGTERM.
// Whith this approach we can accommodate additional signals in the future.
// An alternative would be to return 124, which is what the timeout command
// An alternative would be to exit with code 124, which is what the timeout command
// returns for timeout -s SIGTERM <seconds> ...., because exiting after SIGTERM
// is not technically a fatal error (which is what the 128+x convention usually
// stands for).
return getSigtermFlag() ? (128 + SIGTERM) : 0;
exit(getSignalFlag(SIGTERM) ? 128 + SIGTERM : EXIT_SUCCESS);
}
9 changes: 9 additions & 0 deletions src/common/config_parser.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -143,6 +143,15 @@ void ConfigParser::addOptionsGeneral(cli::CLIWrapper& cli) {
cli.add<std::string>("--dump-config",
"Dump current (modified) configuration to stdout and exit. Possible values: full, minimal, expand")
->implicit_val("full");
if(mode_ == cli::mode::training) {
// --sigterm is deliberately not a boolean, to allow for a consistent
// pattern of specifying custom signal handling in the future.
// (e.g., dump model but continue training upon SIGUSR1, or report current
// training status upon SIGINFO.)
cli.add<std::string>("--sigterm",
"What to do with SIGTERM: save-and-exit or exit-immediately.",
"save-and-exit");
}
// clang-format on
}

Expand Down
58 changes: 58 additions & 0 deletions src/common/signal_handling.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
#include "common/logging.h"
#include "signal_handling.h"

// The simplest (and recommended) way to handle signals is to simply set a flag
// in the signal handler and check that flag later.
//
// We provide setSignalFlag as the most generic signal handler. This handler uses a
// single sig_atomic_t as a bit field. On Linux, sig_atomic_t is equivalent to a signed int,
// theoretically providing 32 binary flags; in practice, most likely signals for which we may
// want to install signal handlers are
// - SIGTERM (15): which by default signals the request for a graceful shutdown
// - SIGUSR1 (10): intended for custom use, default action in Linux is termination
// - SIGUSR2 (12): intended for custom use, default action in Linux is termination
// - SIGINT (2): interrupt from the console
// Just to be safe, we accommodate signals up to signal No. 30.

// In addition, we also provide requestSaveAndExit() and saveAndExit() as a signal
// handler/checker for graceful shutdown requests during training.
constexpr int maxSignalForSetSignalFlag{30};

// Make sure sig_atomic_t is large enough as a bit field for our purposes.
// That said, I'm not aware of any platform where this would be a problem.
static_assert(SIG_ATOMIC_MAX > (1U<<maxSignalForSetSignalFlag),
"sig_atomic_type is too small for signal flags on this platform.");

namespace marian{
volatile std::sig_atomic_t sigflags_{0};
volatile std::sig_atomic_t saveAndExit_{0};

void setSignalFlag(int sig) {
// sigflags_ is an int type serving as a bit filed for flags corresponding
// to signals (lower or equeal to maxSignalForSetSignalFlag). We set the
// flag by a binary or (|=) of the bit field and an int value with exactly
// one bit set (s^sig).
sigflags_ |= (1<<sig);
}

// Check if the flag for the signal sig is set in the bit field sigflags_
bool getSignalFlag(const int sig) {
ABORT_IF(sig > maxSignalForSetSignalFlag,
"Signal out of range (must be < {}, is {}).", maxSignalForSetSignalFlag, sig);
// Do bitwise AND between sigflags_ and an int value that has exactly one bit set that
// corresponds to the signal in question. If the bit is set (see setSignalFlag above),
// the bitwise AND will return a non-zero integer, if it is not set, the result will
// be zero.
return (sigflags_ & (1<<sig)) != 0;
}

void requestSaveAndExit(int sig) {
setSignalFlag(sig); // keep track of triggering signal
saveAndExit_ = 1; // set flag to exit gracefully
}

bool saveAndExitRequested() {
return saveAndExit_ == 1;
}

}
39 changes: 39 additions & 0 deletions src/common/signal_handling.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
#pragma once
#include <csignal>
#include <string>

// SIGNAL HANDLING

// The signal handlers (and checkers) here are implemented in line with with the recommendations
// for signal handling in the SEI CERT C Coding Standard, specifically
//
// - SIG30-C:
// https://wiki.sei.cmu.edu/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers
//
// - SIG31-C:
// https://wiki.sei.cmu.edu/confluence/display/c/SIG31-C.+Do+not+access+shared+objects+in+signal+handlers
//
// The exact behavior of 'graceful exit' depends on the application; for training, it means 'save model and exit',
// for a server (not implemented yet): 'block new requests but serve pending requests and then exit'.
//
// Graceful exit for training is useful for training on clusters with time limits on jobs. Slurm, for example, can be
// set up to send a custom signal at a set time before the end of the time slot, giving Marian time to save its current
// state before getting killed.

namespace marian {


/// Request graceful exit (signal handler)
void requestSaveAndExit(int sig);

/// Check if graceful exit was requested.
bool saveAndExitRequested();

/// General purpose signal handler that simply sets a flag when a signal is received.
// (only for SIGNAL No. < 32).
void setSignalFlag(int sig); // custom handler (set flag) for sig

/// Check if a setSignalFlag was triggered for this signal
bool getSignalFlag(int sig);

} // End of namespace marian
9 changes: 7 additions & 2 deletions src/data/batch_generator.h
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
#pragma once

#include "common/options.h"
#include "common/signal_handling.h"
#include "data/batch_stats.h"
#include "data/rng_engine.h"
#include "training/training_state.h"
Expand Down Expand Up @@ -136,6 +137,8 @@ class BatchGenerator : public RNGEngine {
}
size_t sets = 0;
while(current_ != data_->end() && maxiBatch->size() < maxSize) { // loop over data
if (saveAndExitRequested()) // stop generating batches
return std::deque<BatchPtr>();
maxiBatch->push(*current_);
sets = current_->size();
// do not consume more than required for the maxi batch as this causes
Expand All @@ -161,6 +164,8 @@ class BatchGenerator : public RNGEngine {
if (stats_)
cachedStatsIter = stats_->begin();
while(!maxiBatch->empty()) { // while there are sentences in the queue
if (saveAndExitRequested()) // stop generating batches
return std::deque<BatchPtr>();
// push item onto batch
batchVector.push_back(maxiBatch->top());
maxiBatch->pop(); // fetch next-shortest
Expand Down Expand Up @@ -249,15 +254,15 @@ class BatchGenerator : public RNGEngine {
"If you have changed the training corpus, add --no-restore-corpus to the training command and run it again.");
bufferedBatches_ = std::move(futureBufferedBatches_.get());
// if bg thread returns an empty swath, we hit the end of the epoch
if (bufferedBatches_.empty()) {
if (bufferedBatches_.empty() || saveAndExitRequested()) {
return nullptr;
}
// and kick off the next bg operation
fetchBatchesAsync();
} else { // don't spawn any threads, i.e. batch fetching is blocking.
bufferedBatches_ = fetchBatches();
// if bufferedBatches is empty we hit the end of the epoch
if (bufferedBatches_.empty()) {
if (bufferedBatches_.empty() || saveAndExitRequested()) {
return nullptr;
}
}
Expand Down
43 changes: 0 additions & 43 deletions src/training/scheduler.cpp

This file was deleted.

19 changes: 8 additions & 11 deletions src/training/scheduler.h
Original file line number Diff line number Diff line change
@@ -1,16 +1,14 @@
#pragma once

#include "common/options.h"
#include "common/signal_handling.h"
#include "training/training_state.h"
#include "training/validator.h"
#include "training/communicator.h"
#include "layers/loss.h"

namespace marian {

bool getSigtermFlag();
void installSignalHandlers();

class Scheduler : public TrainingObserver {
private:
Ptr<Options> options_;
Expand Down Expand Up @@ -154,11 +152,10 @@ class Scheduler : public TrainingObserver {
: options_(options), state_(state) {
ABORT_IF(state_->factor != 1, "state.factor unexpectedly not 1 at this point??");
updateLearningRate(*state);
installSignalHandlers();
}

bool keepGoing() {
if(getSigtermFlag()) // received signal SIGERM => exit gracefully
if(saveAndExitRequested()) // via SIGTERM
return false;

// stop if it reached the maximum number of epochs
Expand Down Expand Up @@ -192,13 +189,12 @@ class Scheduler : public TrainingObserver {

void started() { LOG(info, "Training started"); }
void finished() {
if (getSigtermFlag())
LOG(info, "Training interrupted (SIGTERM).");
if (saveAndExitRequested())
LOG(info, "Training interrupted (via signal).");
else
LOG(info, "Training finished");
}


void addValidator(Ptr<ValidatorBase> validator) {
validators_.push_back(validator);

Expand All @@ -223,9 +219,10 @@ class Scheduler : public TrainingObserver {

void validate(const std::vector<Ptr<ExpressionGraph>>& graphs,
bool isFinal = false) {
// Do not validate if already validated (for instance, after the model is
// loaded) or if validation is scheduled for another update, or when signal SIGTERM was received
if(getSigtermFlag() // SIGTERM was received
// Do not validate if already validated (for instance, after the model is loaded)
// or if validation is scheduled for another update, or when a graceful shutdown
// was requested.
if(saveAndExitRequested()
|| state_->validated // already validated (in resumed training, for example)
|| (!state_->enteredNewPeriodOf(options_->get<std::string>("valid-freq")) && !isFinal)) // not now
return;
Expand Down
16 changes: 16 additions & 0 deletions src/training/training.h
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ template <class ModelWrapper>
class Train : public ModelTask {
private:
Ptr<Options> options_;
void installCustomSignalHandlers();

public:
Train(Ptr<Options> options) : options_(options) {}
Expand Down Expand Up @@ -77,6 +78,9 @@ class Train : public ModelTask {
bool restored = !options_->get<bool>("no-restore-corpus")
&& batchGenerator->restore(trainState);

// We only want custom behavior once training starts.
installCustomSignalHandlers();

// -- main training loop
scheduler->started();
while(scheduler->keepGoing()) {
Expand Down Expand Up @@ -107,4 +111,16 @@ class Train : public ModelTask {
finalizeMPI(std::move(mpi));
}
};

template <class ModelWrapper>
void Train<ModelWrapper>::installCustomSignalHandlers(){
const std::string sigTermAction = options_->get<std::string>("sigterm");
if (sigTermAction == "save-and-exit") {
LOG(debug, "Will save before exiting upon SIGTERM.");
signal(SIGTERM, requestSaveAndExit);
}
else if (sigTermAction != "exit-immediately")
ABORT("Unrecognized value '{}' for --sigterm", sigTermAction);
}

} // namespace marian

0 comments on commit 044b416

Please sign in to comment.