Client agent hangs indefinitely during shutdown #18129
Labels
hcc/jira
stage/accepted
theme/client
type/bug
Nomad version
v1.5.5
Operating system and Environment details
Unix
Issue
There is a potential race between client shutdown and an allocation being preempted (e.g. due to priority, or due to a job update) that causes the client to hang indefinitely.
While in this state, the servers still consider the client "up", so allocations continue to be scheduled onto it even though it will not actually run them. Those allocations are therefore delayed.
Relevant Code
The rough shape of this race is as follows:
New allocrunner's `Run` function is blocked on the previous allocation becoming terminal
When the client receives a new allocation that replaces an old allocation on the client, it creates an alloc watcher that watches the old allocation(s) which the new allocation replaces:
nomad/client/client.go
Lines 2602 to 2621 in 3d63bc6
This causes the new allocrunner's upstream-allocs prestart hook to block waiting for the old allocations to become terminal:
nomad/client/allocrunner/upstream_allocs_hook.go
Lines 30 to 31 in 3d63bc6
nomad/client/allocwatcher/alloc_watcher.go
Lines 274 to 277 in 3d63bc6
Which blocks the allocrunner's `Run` function from making progress:
nomad/client/allocrunner/alloc_runner.go
Line 343 in 3d63bc6
This in turn blocks the allocrunner's `Shutdown` function from finishing:
nomad/client/allocrunner/alloc_runner.go
Line 1172 in 3d63bc6
Which in turn blocks shutdown of the client:
nomad/client/client.go
Lines 858 to 872 in 3d63bc6
Race causes previous allocation to be left in non-terminal state
There are a couple of places in a taskrunner's `Run` function where a shutdown of the taskrunner can race with a kill:
nomad/client/allocrunner/taskrunner/task_runner.go
Lines 600 to 605 in 3d63bc6
nomad/client/allocrunner/taskrunner/task_runner.go
Lines 672 to 678 in 3d63bc6
My understanding is that in Go, if multiple cases of a `select` statement are ready at the time of evaluation, there is no guarantee which case will be picked. So, in all of these cases, if both the `shutdownCtx` and `killCtx` are `done`, we might end up taking the `shutdown` case and returning immediately without correctly finishing the `kill`.
Most importantly, the part which marks the task as `dead` is not reached:
nomad/client/allocrunner/taskrunner/task_runner.go
Line 698 in 3d63bc6
Since the task is still considered live, the allocation is not transitioned to the complete/failed state. This means that the upstream-allocs hook of the new allocrunner would remain blocked.
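The non-determinism described above can be observed with a small standalone program. This is a sketch under assumptions, not the task runner's code: `runOnce` and `countOutcomes` are hypothetical names, and "marking the task dead" is reduced to a boolean return value. When both contexts are already cancelled, Go's `select` chooses among the ready cases pseudo-randomly, so some iterations return through the shutdown branch.

```go
package main

import (
	"context"
	"fmt"
)

// runOnce mimics the shape of the racy wait. It returns true only when the
// kill branch is taken, which stands in for the path that marks the task dead.
func runOnce(killCtx, shutdownCtx context.Context) (markedDead bool) {
	select {
	case <-killCtx.Done():
		return true
	case <-shutdownCtx.Done():
		return false
	}
}

// countOutcomes runs the wait n times with both contexts already cancelled
// and counts how often each branch was taken.
func countOutcomes(n int) (dead, skipped int) {
	killCtx, kill := context.WithCancel(context.Background())
	shutdownCtx, shutdown := context.WithCancel(context.Background())
	kill()
	shutdown()

	for i := 0; i < n; i++ {
		if runOnce(killCtx, shutdownCtx) {
			dead++
		} else {
			skipped++
		}
	}
	return dead, skipped
}

func main() {
	dead, skipped := countOutcomes(10000)
	fmt.Printf("marked dead: %d, returned early: %d\n", dead, skipped)
}
```

Every iteration counted as "returned early" corresponds to a task that was killed but never marked dead in this simplified model.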
Reproduction steps
First, to increase the chance of the race described above happening, I injected a sleep after the prestart hooks to simulate the hooks taking some time to complete, increasing the window for the race. I also added a bunch more `shutdownCtx` cases to increase the probability of hitting that branch:
Then I ran the following:
Here are some interesting bits of the client agent's logs:
I also managed to repro a similar race in the part that blocks waiting for the task to finish. Similarly, I injected a sleep to increase the probability of a race:
I also managed to repro this by updating a job, rather than by preemption.
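The effect of adding extra `shutdownCtx` cases, as described in the reproduction steps, can be illustrated on its own. The following is an illustrative sketch rather than the patch used in the reproduction: `shutdownShare`, `killCh`, and `shutdownCh` are hypothetical names, and plain channels stand in for the contexts. Because `select` picks uniformly among ready cases, repeating one channel in several cases should raise the share of selections that land on it.

```go
package main

import "fmt"

// shutdownShare returns the fraction of n selects that took a shutdown case
// when one kill case competes with three identical shutdown cases and both
// channels are already closed.
func shutdownShare(n int) float64 {
	killCh := make(chan struct{})
	shutdownCh := make(chan struct{})
	close(killCh)
	close(shutdownCh)

	taken := 0
	for i := 0; i < n; i++ {
		select {
		case <-killCh:
			// kill branch: not counted
		case <-shutdownCh:
			taken++
		case <-shutdownCh:
			taken++
		case <-shutdownCh:
			taken++
		}
	}
	return float64(taken) / float64(n)
}

func main() {
	fmt.Printf("shutdown branch share: %.2f\n", shutdownShare(100000))
}
```

With three of the four ready cases reading from the shutdown channel, the share is expected to sit near three quarters rather than one half, which makes the race much easier to hit.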
Expected Result
The client should shut down in a timely manner.
Actual Result
The client blocks during shutdown, hanging indefinitely.
While in this state, the client is still considered "up" by the server, so the server continues to schedule new allocations onto it. But because the client is stuck, it never actually tries to run them, which leaves the allocations stuck in the pending state.