
Client agent hangs indefinitely during shutdown #18129

Open
marvinchin opened this issue Aug 2, 2023 · 1 comment
Labels
hcc/jira · stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/client · type/bug

Comments

@marvinchin
Contributor

Nomad version

v1.5.5

Operating system and Environment details

Unix

Issue

There is a potential race between client shutdown and an allocation being preempted (e.g. due to priority, or due to a job update) that causes the client to hang indefinitely.

While in this state the client is still considered "up" by the servers, so allocations continue to be scheduled onto it even though it will never actually run them. This delays those allocations.

Relevant Code

The rough shape of this race is as follows (a minimal sketch of the resulting deadlock follows the list):

  1. Client receives a new allocation that replaces another old allocation that is local to the client
  2. The allocrunner blocks waiting on the termination of the old allocation
  3. Client shuts down the allocrunner for the old allocation, which in turn shuts down the taskrunner for its tasks
  4. Client receives a shutdown request
  5. Depending on where in their lifecycle the taskrunners shut down in (3) are when the shutdown request (4) arrives, the task might be left in a non-terminal state, so (2) is never satisfied
  6. Shutdown of the client blocks on shutdown of all allocrunners; since (2) blocks indefinitely, the new allocrunner never shuts down, and so the client never shuts down either
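
The end state is a deadlock: client shutdown waits on the new allocrunner, whose prerun waits on the old allocation becoming terminal, which never happens. A minimal Go sketch of that shape, with purely illustrative names (not real Nomad types):

package main

import (
	"fmt"
	"time"
)

func main() {
	prevTerminal := make(chan struct{}) // closed when the old alloc reaches a terminal state
	shutdownDone := make(chan struct{}) // closed when the new allocrunner finishes shutting down

	// New allocrunner: prerun blocks until the previous alloc is terminal (step 2).
	go func() {
		<-prevTerminal // never closes if the old task is left non-terminal (step 5)
		close(shutdownDone)
	}()

	// Client shutdown: waits on every allocrunner before returning (step 6).
	select {
	case <-shutdownDone:
		fmt.Println("client shut down cleanly")
	case <-time.After(2 * time.Second):
		fmt.Println("client shutdown is stuck waiting on the new allocrunner")
	}
}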

New allocrunner's Run function is blocked on the previous allocation becoming terminal

When the client receives a new allocation that replaces an old allocation on the client, it creates an alloc watcher that watches the old allocation(s) which the new allocation replaces:

nomad/client/client.go

Lines 2602 to 2621 in 3d63bc6

var preemptedAllocs map[string]allocwatcher.AllocRunnerMeta
if len(alloc.PreemptedAllocations) > 0 {
	preemptedAllocs = make(map[string]allocwatcher.AllocRunnerMeta)
	for _, palloc := range alloc.PreemptedAllocations {
		preemptedAllocs[palloc] = c.allocs[palloc]
	}
}
// Since only the Client has access to other AllocRunners and the RPC
// client, create the previous allocation watcher here.
watcherConfig := allocwatcher.Config{
	Alloc:            alloc,
	PreviousRunner:   c.allocs[alloc.PreviousAllocation],
	PreemptedRunners: preemptedAllocs,
	RPC:              c,
	Config:           c.GetConfig(),
	MigrateToken:     migrateToken,
	Logger:           c.logger,
}
prevAllocWatcher, prevAllocMigrator := allocwatcher.NewAllocWatcher(watcherConfig)

This causes the new allocrunner's upstream-allocs prestart hook to block waiting for the old allocations to become terminal:

// Wait for a previous alloc - if any - to terminate
return h.allocWatcher.Wait(context.Background())

case prevAlloc, ok := <-p.prevListener.Ch():
	if !ok || prevAlloc.Terminated() {
		return nil
	}

Which blocks the allocrunner's Run function from making progress:

if err := ar.prerun(); err != nil {

This in turn blocks the allocrunner's Shutdown function from finishing, which in turn blocks shutdown of the client:

nomad/client/client.go

Lines 858 to 872 in 3d63bc6

arGroup := group{}
if c.GetConfig().DevMode {
	// In DevMode destroy all the running allocations.
	for _, ar := range c.getAllocRunners() {
		ar.Destroy()
		arGroup.AddCh(ar.DestroyCh())
	}
} else {
	// In normal mode call shutdown
	for _, ar := range c.getAllocRunners() {
		ar.Shutdown()
		arGroup.AddCh(ar.ShutdownCh())
	}
}
arGroup.Wait()
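
For reference, here is a hypothetical stand-in for the group helper used above, assuming it simply blocks until every registered channel has closed (not the real implementation): one allocrunner whose ShutdownCh never closes is enough to stall Wait, and with it the whole client shutdown.

package main

import (
	"fmt"
	"sync"
	"time"
)

type group struct{ wg sync.WaitGroup }

// AddCh registers a channel; Wait returns only after every registered
// channel has closed.
func (g *group) AddCh(ch <-chan struct{}) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		<-ch
	}()
}

func (g *group) Wait() { g.wg.Wait() }

func main() {
	healthy := make(chan struct{})
	stuck := make(chan struct{}) // never closed, like the blocked allocrunner's ShutdownCh

	g := &group{}
	g.AddCh(healthy)
	g.AddCh(stuck)
	close(healthy)

	done := make(chan struct{})
	go func() { g.Wait(); close(done) }()

	select {
	case <-done:
		fmt.Println("all allocrunners shut down")
	case <-time.After(2 * time.Second):
		fmt.Println("shutdown still waiting on the stuck allocrunner")
	}
}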

Race causes previous allocation to be left in non-terminal state

There are a couple of places in a taskrunner's Run function where a shutdown of the taskrunner can race with a kill:

select {
case <-tr.killCtx.Done():
	break MAIN
case <-tr.shutdownCtx.Done():
	// TaskRunner was told to exit immediately
	return

case <-tr.killCtx.Done():
	tr.logger.Trace("task killed between restarts", "delay", restartDelay)
	break MAIN
case <-tr.shutdownCtx.Done():
	// TaskRunner was told to exit immediately
	tr.logger.Trace("gracefully shutting down during restart delay")
	return

case <-tr.killCtx.Done():
	tr.logger.Trace("task killed between restarts", "delay", restartDelay)
	break MAIN
case <-tr.shutdownCtx.Done():
	// TaskRunner was told to exit immediately
	tr.logger.Trace("gracefully shutting down during restart delay")
	return

My understanding is that in Go, when multiple cases of a select statement are ready at the time of evaluation, there is no guarantee as to which case will be chosen. So, in all of these places, if the shutdownCtx and the killCtx are both done, we might end up taking the shutdown case and returning immediately without correctly finishing the kill.
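
A quick standalone demonstration of that select behaviour (nothing Nomad-specific): with both channels closed, the iterations split roughly evenly between the two branches.

package main

import "fmt"

func main() {
	killed := make(chan struct{})
	shutdown := make(chan struct{})
	close(killed)
	close(shutdown)

	// Both cases are always ready; Go picks one pseudo-randomly each time.
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		select {
		case <-killed:
			counts["kill"]++
		case <-shutdown:
			counts["shutdown"]++
		}
	}
	fmt.Println(counts) // e.g. map[kill:497 shutdown:503]
}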

Most importantly, the part which marks the task as dead is not reached:

tr.UpdateState(structs.TaskStateDead, nil)

Since the task is still considered live, the allocation is not transitioned to the complete/failed state. This means that the upstream-allocs hook of the new allocrunner would remain blocked.
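
Note also that the upstream-allocs hook calls Wait with context.Background(), so there is no cancellation path out of that wait; the only exit is the previous allocation reporting a terminal state. A hypothetical, simplified sketch of the contract implied by the snippets above (waitForPrevious and allocUpdate are illustrative names, not the real allocwatcher API):

package main

import (
	"context"
	"fmt"
)

// allocUpdate carries just the piece of previous-alloc state this sketch cares about.
type allocUpdate struct{ terminal bool }

// waitForPrevious returns only once the previous allocation reports a
// terminal state, or the context is cancelled. With context.Background()
// the cancellation branch can never fire.
func waitForPrevious(ctx context.Context, updates <-chan allocUpdate) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case u, ok := <-updates:
			if !ok || u.terminal {
				return nil
			}
		}
	}
}

func main() {
	updates := make(chan allocUpdate, 1)
	updates <- allocUpdate{terminal: true} // without this update the call below blocks forever
	fmt.Println(waitForPrevious(context.Background(), updates))
}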

Reproduction steps

First, to widen the window for the race described above, I injected a sleep after the prestart hooks to simulate the hooks taking some time to complete. I also added a bunch more shutdownCtx cases to increase the probability of hitting that branch:

diff --git a/client/allocrunner/taskrunner/task_runner.go b/client/allocrunner/taskrunner/task_runner.go
index aae6a31..5143327 100644
--- a/client/allocrunner/taskrunner/task_runner.go
+++ b/client/allocrunner/taskrunner/task_runner.go
@@ -597,9 +597,22 @@ MAIN:
                        goto RESTART
                }
  
+               tr.logger.Info("Sleeping a bit to simulate prestart hooks taking a while to finish")
+               time.Sleep(time.Second * 30)
+               tr.logger.Info("Done sleeping")
+
                select {
                case <-tr.killCtx.Done():
                        break MAIN
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
                case <-tr.shutdownCtx.Done():
                        // TaskRunner was told to exit immediately
                        return

Then I ran the following:

# Start server
$ ./usr/bin/nomad agent -config server.conf
...
 
# Enable preemption
$ ./usr/bin/nomad operator scheduler set-config -preempt-batch-scheduler=true
Scheduler configuration updated!
 
 
# Start (patched) client
$ ./nomad-patched agent -config client-1.conf
...
 
# Submit low-priority job that uses port 10000
$ cat sleep-low-prio.nomad
job "low-prio" {
  datacenters = ["dc1"]
  type        = "batch"
  priority = 1
 
  group "test-group" {
    task "test-task" {
      driver = "raw_exec"
 
      config {
        command = "/usr/bin/sleep"
        args    = ["inf"]
      }
 
      kill_timeout = "10s"
    }
 
    network {
      port "test" {
        static = 10000
      }
    }
 
    reschedule {
      attempts  = 0
      unlimited = false
    }
  }
}
$ ./usr/bin/nomad job run sleep-low-prio.nomad
==> 2023-07-05T12:07:33+08:00: Monitoring evaluation "08e1dc0b"
    2023-07-05T12:07:33+08:00: Evaluation triggered by job "low-prio"
    2023-07-05T12:07:34+08:00: Allocation "471a4855" created: node "f72e6eda", group "test-group"
    2023-07-05T12:07:34+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-07-05T12:07:34+08:00: Evaluation "08e1dc0b" finished with status "complete"
 
# While it is sleeping, submit high-priority job that uses same port
$ cat sleep-high-prio.nomad
job "high-prio" {
  datacenters = ["dc1"]
  type        = "batch"
  priority = 100
 
  group "test-group" {
    task "test-task" {
      driver = "raw_exec"
 
      config {
        command = "/usr/bin/sleep"
        args    = ["inf"]
      }
 
      kill_timeout = "1m"
    }
 
    network {
      port "test" {
        static = 10000
      }
    }
 
    reschedule {
      attempts  = 0
      unlimited = false
    }
  }
}
 
$ ./usr/bin/nomad job run sleep-high-prio.nomad
==> 2023-07-05T12:07:39+08:00: Monitoring evaluation "44c29be6"
    2023-07-05T12:07:39+08:00: Evaluation triggered by job "high-prio"
    2023-07-05T12:07:40+08:00: Allocation "673f0b58" created: node "f72e6eda", group "test-group"
    2023-07-05T12:07:40+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-07-05T12:07:40+08:00: Evaluation "44c29be6" finished with status "complete"
 
# Once the client has received the high-priority job, signal the client. Note that at this point we are still running (or at least pretending to run) the pre-start hooks in the task runner for the low-priority job.
...
 
# Observe that it hangs
...

Here are some interesting bits of the client agent's logs:

    ...
    # Low-priority alloc is waiting before the select
    2023-07-05T12:07:33.609+0800 [INFO]  client.alloc_runner.task_runner: am waiting to simulate race: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa task=test-task
    ...
    # High-priority alloc arrives and waits on the low-priority alloc to terminate
    2023-07-05T12:07:39.044+0800 [DEBUG] client.alloc_watcher: waiting for previous alloc to terminate: alloc_id=673f0b58-e039-31be-e9bb-35d864c079ce previous_alloc=471a4855-155b-c5a8-ad31-8e84f58b7faa
    ...
    # Kill requested for low-priority alloc
    2023-07-05T12:07:39.044+0800 [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa task=test-task
    2023-07-05T12:07:39.044+0800 [TRACE] client.alloc_runner.task_runner: Kill event: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa task=test-task event_type=Killing event_reason=""
    2023-07-05T12:07:39.044+0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa task=test-task type=Killing msg="Sent interrupt. Waiting 10s before force killing" failed=false
    ...
    # Agent is signalled
^C==> Caught signal: interrupt
    2023-07-05T12:07:42.127+0800 [INFO]  agent: requesting shutdown
    2023-07-05T12:07:42.127+0800 [INFO]  client: shutting down
    # Shutdown requested for low-priority alloc
    2023-07-05T12:07:42.127+0800 [TRACE] client.alloc_runner: shutting down: alloc_id=673f0b58-e039-31be-e9bb-35d864c079ce
    2023-07-05T12:07:42.127+0800 [TRACE] client.alloc_runner.task_runner: shutting down: alloc_id=673f0b58-e039-31be-e9bb-35d864c079ce task=test-task
    ...
    # Finished sleeping. At this point the task has been asked both to be killed and to shut down, triggering the race
    2023-07-05T12:08:03.610+0800 [INFO]  client.alloc_runner.task_runner: am before select: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa task=test-task
    ...
    # Low-priority alloc is in its final iteration of handleTaskStateUpdates
    2023-07-05T12:08:03.610+0800 [TRACE] client.alloc_runner: handling task state update: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa done=true
    ...
    # Final update for the alloc is sent; note that the client_status is "running", which is non-terminal
    2023-07-05T12:08:03.610+0800 [TRACE] client.alloc_runner: sending updated alloc: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa client_status=running desired_status=""
    ...

I also managed to repro a similar race in the part that blocks waiting for the task to finish. Similarly, I injected a sleep to increase the probability of a race:

diff --git a/client/allocrunner/taskrunner/task_runner.go b/client/allocrunner/taskrunner/task_runner.go
index aae6a31..f14327a 100644
--- a/client/allocrunner/taskrunner/task_runner.go
+++ b/client/allocrunner/taskrunner/task_runner.go
@@ -630,6 +630,9 @@ MAIN:
                        if resultCh, err := handle.WaitCh(context.Background()); err != nil {
                                tr.logger.Error("wait task failed", "error", err)
                        } else {
+                               tr.logger.Info("Sleeping a bit to simulate prestart hooks taking a while to finish")
+                               time.Sleep(time.Second * 30)
+                               tr.logger.Info("Done sleeping")
                                select {
                                case <-tr.killCtx.Done():
                                        // We can go through the normal should restart check since

I also managed to repro this by updating a job, rather than by preemption.

Expected Result

The client should shut down in a timely manner.

Actual Result

The client blocks during shutdown, hanging indefinitely.

During this state, the client is still considered "up" by the server, and so the server continues to schedule new allocations onto the client. But because the client is stuck, it never actually tries to run them, leaving those allocations stuck in the pending state.

@tgross
Member

tgross commented Aug 2, 2023

This is great debugging @marvinchin, thanks! I suspect we can probably hit a similar problem with ephemeral data migrations. I'll mark this issue for roadmapping.

@tgross tgross added theme/client stage/accepted Confirmed, and intend to work on. No timeline committment though. labels Aug 2, 2023
@tgross tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024