Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved handling of SIGTERM #660

Merged
merged 72 commits into from
Sep 1, 2020
Merged

Improved handling of SIGTERM #660

merged 72 commits into from
Sep 1, 2020

Conversation

ugermann
Copy link
Collaborator

@ugermann ugermann commented Jun 17, 2020

Description

  • moved signal handling out of scheduler code (was already global there, but still in the files)
  • pay attention to the signal in the threads that prepare batches in the background, which can take quite a while and prevented Marian from shutting down quickly after receiving SIGTERM in some circumstances

How to test

start training, then send SIGTERM to process, it should save the current state of affairs and exit

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

ugermann and others added 30 commits November 19, 2019 22:49
When marian receives signal SIGTERM and exits gracefully (save model & exit),
it should then exit with a non-zero exit code, to signal to any parent process
that it did not exit "naturally".
…ning.

Prior to this bug fix, BatchGenerator::fetchBatches, which runs in a separate
thread, would ignore SIGTERM during training (training uses a custom signal handler
for SIGTERM, which simply sets a global flag, to enable graceful shutdown (i.e.,
save models and current state of training before shutting down).

The changes in this commit also facilitate custom handling of other signals in the
future by providing a general singal handler for all signals with a signal number
below 32 (setSignalFlag) and a generic flag checking function (getSignalFlag(sig))
for checking such flags.
…code easier

This does not introduce any new functionality, just moves code around, so that future PRs are easier to compare. Moving old GraphGroup code to training/deprecated. Once it is clear there is nothing in there that's worth saving, this will be deleted.

Replace -Ofast with -O3 and make sure ffinite-math is turned off.
…in/max quantization to avoid overflow

1. Change the weight matrix quantization to use 7-bit min/max quantization
-> This resolves all the overflow issue, because weight and activations are quantized by min/max range.
2. Clip fp16 quantization to avoid overflow
3. Fix windows build errors (cmake options, vcproj file)
4. int8 pack model (encoder -> fp16)
…encoders as well

For int8 quantized model, use int8 quantization for encoders as well. The quality difference between fp16 encoder and int8 encoder is small, but they have quite amount of speed difference.
* Add basic support for TSV inputs
* Fix mini-batch-fit for TSV inputs
* Abort if shuffling data from stdin
* Fix terminating training with data from STDIN
* Allow creating vocabs from TSV files
* Add comments; clean creation of vocabs from TSV files
* Guess --tsv-size based on the model type
* Add shortcut for STDIN inputs
* Rename --tsv-size to --tsv-fields
* Allow only one 'stdin' in --train-sets
* Properly create separate vocabularies from a TSV file
* Clearer logging message
* Add error message for wrong number of valid sets if --tsv is used
* Use --no-shuffle instead of --shuffle in the error message
* Fix continuing training from STDIN
* Update CHANGELOG
* Support both 'stdin' and '-'
* Guess --tsv-fields from dim-vocabs if special:model.yml available
* Update error messages
* Move variable outside the loop
* Refactorize utils::splitTsv; add unit tests
* Support '-' as stdin; refactorize; add comments
* Abort if excessive field(s) in the TSV input
* Add a TODO on passing one vocab with fully-tied embeddings
* Remove the unit test with excessive tab-separated fields
* fix 0 * nan behavior in concatention
* bump patch
* change epsilon to margin
* Refactorize processPaths
* Fix relative paths for shortlist and sqlite options
* Rename InterpolateEnvVars to interpolateEnvVars
* Update CHANGELOG
…anch

Cherry pick a few improvements/fixes from Frank's branch
* Adds Frank's fix for label-based mini-batch sizing from Frank's current experimental branch.
* Also copies minor improvements and a few comments.
* python3 shebang from #620
* Add changelog entry for python3 change
* Fix server build with current boost, move simple-websocket-server to submodule
* Change submodule to marian-nmt/Simple-WebSocket-Server
* Update submodule simple-websocket-server

Co-authored-by: Gleb Tv <[email protected]>
* Use cblas_sgemm_batch when available
* Merge with master, add comments and describe contribution
…ning.

Prior to this bug fix, BatchGenerator::fetchBatches, which runs in a separate
thread, would ignore SIGTERM during training (training uses a custom signal handler
for SIGTERM, which simply sets a global flag, to enable graceful shutdown (i.e.,
save models and current state of training before shutting down).

The changes in this commit also facilitate custom handling of other signals in the
future by providing a general singal handler for all signals with a signal number
below 32 (setSignalFlag) and a generic flag checking function (getSignalFlag(sig))
for checking such flags.
…in/max quantization to avoid overflow

1. Change the weight matrix quantization to use 7-bit min/max quantization
-> This resolves all the overflow issue, because weight and activations are quantized by min/max range.
2. Clip fp16 quantization to avoid overflow
3. Fix windows build errors (cmake options, vcproj file)
4. int8 pack model (encoder -> fp16)
Copy link
Member

@emjotde emjotde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work. Thanks.

src/common/config_parser.cpp Outdated Show resolved Hide resolved
src/common/config_parser.cpp Outdated Show resolved Hide resolved
src/common/config_parser.cpp Outdated Show resolved Hide resolved
src/common/signal_handling.cpp Outdated Show resolved Hide resolved
if (bufferedBatches_.empty() // i.e., end of Epoch
|| getSignalFlag(SIGTERM)) { // process received SIGTERM, abandon ship ...
return nullptr;
if(runAsync_) { // by default we will run in asynchronous mode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment to self and @frankseide: this is git being bad at doing a three-way diff. The runAsync_ part is from the master branch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ugermann I see what you meant with the 3-way diff.

src/data/batch_generator.h Outdated Show resolved Hide resolved
src/data/batch_generator.h Outdated Show resolved Hide resolved
@emjotde
Copy link
Member

emjotde commented Aug 26, 2020

@frankseide this looks good to me now and ready for another look from you.

Cosmetic fixes as per code review.

- removed superfluous empty line
- but default value for --sigterm on a new line
Move option --segterm to General Options for training, as suggested by @snukky.
Remove log message from signal handler.

Note that the log still records that training was interrupted by a signal both in the log message in Scheduler::finished() and in the exit code at the end of marian_train.cpp.
Fix formatting of --sigterm option specification.
Copy link
Contributor

@frankseide frankseide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor feedback.

One is that the term "graceful" is not concise. E.g. users who send SIGTERM do not request a graceful exit. Marian gracefully exits (i.e. cleans up GPU resources etc.) when training is done. What the user requests is an early exit. Let us rethink the naming.

// (e.g., dump model but continue training upon SIGUSR1, or report current
// training status upon SIGINFO.)
cli.add<std::string>("--sigterm",
"What to do with SIGTERM: 'graceful' => save and exit (default); 'immediate' => exit immediately.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"graceful" is not descriptive. Please say in the option string what it does, e.g. "save-and-exit"

"immediate" could be "terminate"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all of these examples, graceful shutdown means to leave the world in a well-defined state (e.g. finish in-flight requests to a service). The examples do not suggest saving state, and note that the PR saves state in a way that is inconsistent with configured behavior, because the configuration says to save every save-freq updates, and not at a time of receiving SIGTERM, which is a random update.

In my mind, you don't want a "graceful shutdown" as given in these examples, but more like an "early exit" from the training process.

Alternatively, you may be wanting "preemption," i.e. save state that allows to restart the training. Your PR accomplishes that as well (right?), and such feature would be useful in our internal server farm as well when running jobs in low-priority preemptable mode.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool getSignalFlag(const int sig) {
// sig_atomic_t has 32 bits. We don't accommodate signals beyond that.
ABORT_IF(sig >= 32, "Signal out of range (must be < 32, is {}).", sig);
return sigflags_ & (1<<sig);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

& is an integer operation, but you return bool.

Can you please also explain what this function does? It seems it does two things: It converts the signal, but then also masks it with a class member. I cannot understand the meaning of the function without a comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation added in code.


bool getSignalFlag(const int sig) {
// sig_atomic_t has 32 bits. We don't accommodate signals beyond that.
ABORT_IF(sig >= 32, "Signal out of range (must be < 32, is {}).", sig);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this check? I suggest to remove both the comment and the ABORT, since sig_atomic_t is, by its name, clearly meant for signals, and therefore defines the valid signal ranges. Marian is not an operating system, we don't need to hedge against every possible invalid argument. You can also remove the check and comment in setSignalFlag

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sig_atomic_t is meant for signals in the sense that certain operations on it are guaranteed to be atomic even during signal handling. Essentially it is an int type that I'm using here as a bit field, so that we have a generic signal handler that we can install for arbitrary signals (lower than 30). That check is a check against programming errors. Signals can be higher than 30, but this code does not accommodate installing setSignalFlag for such signals.


void requestGracefulExit(int sig) {
setSignalFlag(sig); // keep track of triggering signal
gracefulExitRequested_ = 1; // set flag to exit gracefully
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gracefulExitRequested() returns a bool, but here you set it to an integer. Please choose one (bool if possible)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relies on implicit type conversion from int types to bool. sig_atomic_t is specifically meant to be used in signal handlers. This is run-of-the-mill signal handling, see "Compliant Solution (Writing volatile sig_atomic_t)" here: https://wiki.sei.cmu.edu/confluence/display/c/SIG31-C.+Do+not+access+shared+objects+in+signal+handlers.

src/common/signal_handling.h Show resolved Hide resolved

/// General purpose signal handler that simply sets a flag when a signal is received.
// (only for SIGNAL No. < 32).
void setSignalFlag(const int sig); // custom handler (set flag) for sig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to not use const in these function signatures. It exposes an internal behavior of the function to the caller. And the functions are so simple that adding const does not make them easier to udnerstand or verify.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Const removed.

@@ -16,6 +16,7 @@ template <class ModelWrapper>
class Train : public ModelTask {
private:
Ptr<Options> options_;
void installCustomSignalHandlers();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it installing multiple signal handlers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially in the future. I prefer to keep API changes rare.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename to installEarlyExitSignalHandler()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly disagree on this one. installCustomSignalHandlers() is like parseOptions() a generic concept: install whatever custom signal handlers may be specified in the options. Either my pull request goes in as-is now, or I'll withdraw it. @emjotde

@ugermann
Copy link
Collaborator Author

@frankseide I think I have now addressed all your concerns.

Copy link
Contributor

@frankseide frankseide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename "graceful exit" to "early exit," as "graceful exit" means something else. After that change, I agree to merge.

CHANGELOG.md Outdated Show resolved Hide resolved
src/common/signal_handling.cpp Outdated Show resolved Hide resolved
src/training/scheduler.h Outdated Show resolved Hide resolved
@@ -16,6 +16,7 @@ template <class ModelWrapper>
class Train : public ModelTask {
private:
Ptr<Options> options_;
void installCustomSignalHandlers();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename to installEarlyExitSignalHandler()

@ugermann ugermann changed the title Improved signal handling for graceful shutdown Improved handling of SIGTERM Aug 28, 2020
@ugermann
Copy link
Collaborator Author

@frankseide renamed 'graceful exit' to 'save and exit'.

@emjotde
Copy link
Member

emjotde commented Aug 31, 2020

@frankseide can you take another look if you concerns are satisfied? If yes, please approve, I will merge then.

Copy link
Contributor

@frankseide frankseide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last change, then this can go in.

src/training/scheduler.h Outdated Show resolved Hide resolved
@emjotde
Copy link
Member

emjotde commented Sep 1, 2020

I will rename this quickly and pull it in.

@emjotde
Copy link
Member

emjotde commented Sep 1, 2020

@snukky I love the new continuous integration :)

@emjotde
Copy link
Member

emjotde commented Sep 1, 2020

Aren't the checks a necessary condition before merging? Now it allows me to merge despite the checks not having finished yet after my update? Should that not be a blocker?

@emjotde emjotde merged commit 044b416 into master Sep 1, 2020
@emjotde
Copy link
Member

emjotde commented Sep 1, 2020

@ugermann merged, thanks.

@snukky snukky deleted the ug-graceful-shutdown branch January 24, 2022 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants