
Client agent hangs indefinitely during shutdown #18129

Open
marvinchin opened this issue Aug 2, 2023 · 1 comment
Labels
hcc/jira · stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/client · type/bug

Comments

@marvinchin
Contributor

Nomad version

v1.5.5

Operating system and Environment details

Unix

Issue

There is a potential race between client shutdown and an allocation being preempted (e.g. due to priority, or due to a job update) that causes the client to hang indefinitely.

While in this state the client is still considered "up" by the servers, so allocations continue to be scheduled onto it even though it will never actually run them. This delays those allocations.

Relevant Code

The rough shape of this race is as follows (a minimal sketch of the resulting deadlock follows the list):

  1. Client receives a new allocation that replaces another old allocation that is local to the client
  2. The allocrunner blocks waiting on the termination of the old allocation
  3. Client shuts down the allocrunner for the old allocation, which in turn shuts down the taskrunner for its tasks
  4. Client receives a shutdown request
  5. Depending on where in their lifecycle the taskrunners shut down in (3) are when the shutdown request (4) arrives, the task might be left in a non-terminal state, so (2) is never satisfied
  6. Shutdown of the client blocks on shutdown of all allocrunners; since (2) blocks indefinitely, the new allocrunner never shuts down, and so the client never shuts down either
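
The end state is a deadlock: client shutdown waits on the new allocrunner, whose prerun waits on the old allocation becoming terminal, which never happens. A minimal Go sketch of that shape, with purely illustrative names (not real Nomad types):

package main

import (
	"fmt"
	"time"
)

func main() {
	prevTerminal := make(chan struct{}) // closed when the old alloc reaches a terminal state
	shutdownDone := make(chan struct{}) // closed when the new allocrunner finishes shutting down

	// New allocrunner: prerun blocks until the previous alloc is terminal (step 2).
	go func() {
		<-prevTerminal // never closes if the old task is left non-terminal (step 5)
		close(shutdownDone)
	}()

	// Client shutdown: waits on every allocrunner before returning (step 6).
	select {
	case <-shutdownDone:
		fmt.Println("client shut down cleanly")
	case <-time.After(2 * time.Second):
		fmt.Println("client shutdown is stuck waiting on the new allocrunner")
	}
}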

New allocrunner's Run function is blocked on the previous allocation becoming terminal

When the client receives a new allocation that replaces an old allocation on the client, it creates an alloc watcher that watches the old allocation(s) which the new allocation replaces:

nomad/client/client.go

Lines 2602 to 2621 in 3d63bc6

var preemptedAllocs map[string]allocwatcher.AllocRunnerMeta
if len(alloc.PreemptedAllocations) > 0 {
	preemptedAllocs = make(map[string]allocwatcher.AllocRunnerMeta)
	for _, palloc := range alloc.PreemptedAllocations {
		preemptedAllocs[palloc] = c.allocs[palloc]
	}
}
// Since only the Client has access to other AllocRunners and the RPC
// client, create the previous allocation watcher here.
watcherConfig := allocwatcher.Config{
	Alloc:            alloc,
	PreviousRunner:   c.allocs[alloc.PreviousAllocation],
	PreemptedRunners: preemptedAllocs,
	RPC:              c,
	Config:           c.GetConfig(),
	MigrateToken:     migrateToken,
	Logger:           c.logger,
}
prevAllocWatcher, prevAllocMigrator := allocwatcher.NewAllocWatcher(watcherConfig)

This causes the new allocrunner's upstream-allocs prestart hook to block waiting for the old allocations to become terminal:

// Wait for a previous alloc - if any - to terminate
return h.allocWatcher.Wait(context.Background())

case prevAlloc, ok := <-p.prevListener.Ch():
	if !ok || prevAlloc.Terminated() {
		return nil
	}

Which blocks the allocrunner's Run function from making progress:

if err := ar.prerun(); err != nil {

This in turn blocks the allocrunner's Shutdown function from finishing, which in turn blocks shutdown of the client:

nomad/client/client.go

Lines 858 to 872 in 3d63bc6

arGroup := group{}
if c.GetConfig().DevMode {
	// In DevMode destroy all the running allocations.
	for _, ar := range c.getAllocRunners() {
		ar.Destroy()
		arGroup.AddCh(ar.DestroyCh())
	}
} else {
	// In normal mode call shutdown
	for _, ar := range c.getAllocRunners() {
		ar.Shutdown()
		arGroup.AddCh(ar.ShutdownCh())
	}
}
arGroup.Wait()
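
For reference, here is a hypothetical stand-in for the group helper used above, assuming it simply blocks until every registered channel has closed (not the real implementation): one allocrunner whose ShutdownCh never closes is enough to stall Wait, and with it the whole client shutdown.

package main

import (
	"fmt"
	"sync"
	"time"
)

type group struct{ wg sync.WaitGroup }

// AddCh registers a channel; Wait returns only after every registered
// channel has closed.
func (g *group) AddCh(ch <-chan struct{}) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		<-ch
	}()
}

func (g *group) Wait() { g.wg.Wait() }

func main() {
	healthy := make(chan struct{})
	stuck := make(chan struct{}) // never closed, like the blocked allocrunner's ShutdownCh

	g := &group{}
	g.AddCh(healthy)
	g.AddCh(stuck)
	close(healthy)

	done := make(chan struct{})
	go func() { g.Wait(); close(done) }()

	select {
	case <-done:
		fmt.Println("all allocrunners shut down")
	case <-time.After(2 * time.Second):
		fmt.Println("shutdown still waiting on the stuck allocrunner")
	}
}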

Race causes previous allocation to be left in non-terminal state

There are a couple of places in a taskrunner's Run function where a shutdown of the taskrunner can race with a kill:

select {
case <-tr.killCtx.Done():
	break MAIN
case <-tr.shutdownCtx.Done():
	// TaskRunner was told to exit immediately
	return

case <-tr.killCtx.Done():
	tr.logger.Trace("task killed between restarts", "delay", restartDelay)
	break MAIN
case <-tr.shutdownCtx.Done():
	// TaskRunner was told to exit immediately
	tr.logger.Trace("gracefully shutting down during restart delay")
	return

case <-tr.killCtx.Done():
	tr.logger.Trace("task killed between restarts", "delay", restartDelay)
	break MAIN
case <-tr.shutdownCtx.Done():
	// TaskRunner was told to exit immediately
	tr.logger.Trace("gracefully shutting down during restart delay")
	return

My understanding is that in Go, when multiple cases of a select statement are ready at the time of evaluation, there is no guarantee as to which case will be chosen. So, in all of these places, if the shutdownCtx and the killCtx are both done, we might end up taking the shutdown case and returning immediately without correctly finishing the kill.
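
A quick standalone demonstration of that select behaviour (nothing Nomad-specific): with both channels closed, the iterations split roughly evenly between the two branches.

package main

import "fmt"

func main() {
	killed := make(chan struct{})
	shutdown := make(chan struct{})
	close(killed)
	close(shutdown)

	// Both cases are always ready; Go picks one pseudo-randomly each time.
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		select {
		case <-killed:
			counts["kill"]++
		case <-shutdown:
			counts["shutdown"]++
		}
	}
	fmt.Println(counts) // e.g. map[kill:497 shutdown:503]
}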

Most importantly, the part which marks the task as dead is not reached:

tr.UpdateState(structs.TaskStateDead, nil)

Since the task is still considered live, the allocation is not transitioned to the complete/failed state. This means that the upstream-allocs hook of the new allocrunner would remain blocked.
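
Note also that the upstream-allocs hook calls Wait with context.Background(), so there is no cancellation path out of that wait; the only exit is the previous allocation reporting a terminal state. A hypothetical, simplified sketch of the contract implied by the snippets above (waitForPrevious and allocUpdate are illustrative names, not the real allocwatcher API):

package main

import (
	"context"
	"fmt"
)

// allocUpdate carries just the piece of previous-alloc state this sketch cares about.
type allocUpdate struct{ terminal bool }

// waitForPrevious returns only once the previous allocation reports a
// terminal state, or the context is cancelled. With context.Background()
// the cancellation branch can never fire.
func waitForPrevious(ctx context.Context, updates <-chan allocUpdate) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case u, ok := <-updates:
			if !ok || u.terminal {
				return nil
			}
		}
	}
}

func main() {
	updates := make(chan allocUpdate, 1)
	updates <- allocUpdate{terminal: true} // without this update the call below blocks forever
	fmt.Println(waitForPrevious(context.Background(), updates))
}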

Reproduction steps

First, to widen the window for the race described above, I injected a sleep after the prestart hooks to simulate the hooks taking some time to complete. I also added a bunch more shutdownCtx cases to increase the probability of hitting that branch:

diff --git a/client/allocrunner/taskrunner/task_runner.go b/client/allocrunner/taskrunner/task_runner.go
index aae6a31..5143327 100644
--- a/client/allocrunner/taskrunner/task_runner.go
+++ b/client/allocrunner/taskrunner/task_runner.go
@@ -597,9 +597,22 @@ MAIN:
                        goto RESTART
                }
  
+               tr.logger.Info("Sleeping a bit to simulate prestart hooks taking a while to finish")
+               time.Sleep(time.Second * 30)
+               tr.logger.Info("Done sleeping")
+
                select {
                case <-tr.killCtx.Done():
                        break MAIN
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
+               case <-tr.shutdownCtx.Done():
                case <-tr.shutdownCtx.Done():
                        // TaskRunner was told to exit immediately
                        return

Then I ran the following:

# Start server
$ ./usr/bin/nomad agent -config server.conf
...
 
# Enable preemption
$ ./usr/bin/nomad operator scheduler set-config -preempt-batch-scheduler=true
Scheduler configuration updated!
 
 
# Start (patched) client
$ ./nomad-patched agent -config client-1.conf
...
 
# Submit low-priority job that uses port 10000
$ cat sleep-low-prio.nomad
job "low-prio" {
  datacenters = ["dc1"]
  type        = "batch"
  priority = 1
 
  group "test-group" {
    task "test-task" {
      driver = "raw_exec"
 
      config {
        command = "/usr/bin/sleep"
        args    = ["inf"]
      }
 
      kill_timeout = "10s"
    }
 
    network {
      port "test" {
        static = 10000
      }
    }
 
    reschedule {
      attempts  = 0
      unlimited = false
    }
  }
}
$ ./usr/bin/nomad job run sleep-low-prio.nomad
==> 2023-07-05T12:07:33+08:00: Monitoring evaluation "08e1dc0b"
    2023-07-05T12:07:33+08:00: Evaluation triggered by job "low-prio"
    2023-07-05T12:07:34+08:00: Allocation "471a4855" created: node "f72e6eda", group "test-group"
    2023-07-05T12:07:34+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-07-05T12:07:34+08:00: Evaluation "08e1dc0b" finished with status "complete"
 
# While it is sleeping, submit high-priority job that uses same port
$ cat sleep-high-prio.nomad
job "high-prio" {
  datacenters = ["dc1"]
  type        = "batch"
  priority = 100
 
  group "test-group" {
    task "test-task" {
      driver = "raw_exec"
 
      config {
        command = "/usr/bin/sleep"
        args    = ["inf"]
      }
 
      kill_timeout = "1m"
    }
 
    network {
      port "test" {
        static = 10000
      }
    }
 
    reschedule {
      attempts  = 0
      unlimited = false
    }
  }
}
 
$ ./usr/bin/nomad job run sleep-high-prio.nomad
==> 2023-07-05T12:07:39+08:00: Monitoring evaluation "44c29be6"
    2023-07-05T12:07:39+08:00: Evaluation triggered by job "high-prio"
    2023-07-05T12:07:40+08:00: Allocation "673f0b58" created: node "f72e6eda", group "test-group"
    2023-07-05T12:07:40+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-07-05T12:07:40+08:00: Evaluation "44c29be6" finished with status "complete"
 
# Once the client has received the high-priority job, signal the client. Note that at this point we are still running (or at least pretending to run) the pre-start hooks in the task runner for the low-priority job.
...
 
# Observe that it hangs
...

Here are some interesting bits of the client agent's logs:

    ...
    # Low-priority alloc is waiting before the select
    2023-07-05T12:07:33.609+0800 [INFO]  client.alloc_runner.task_runner: am waiting to simulate race: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa task=test-task
    ...
    # High-priority alloc arrives and waits on the low-priority alloc to terminate
    2023-07-05T12:07:39.044+0800 [DEBUG] client.alloc_watcher: waiting for previous alloc to terminate: alloc_id=673f0b58-e039-31be-e9bb-35d864c079ce previous_alloc=471a4855-155b-c5a8-ad31-8e84f58b7faa
    ...
    # Kill requested for low-priority alloc
    2023-07-05T12:07:39.044+0800 [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa task=test-task
    2023-07-05T12:07:39.044+0800 [TRACE] client.alloc_runner.task_runner: Kill event: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa task=test-task event_type=Killing event_reason=""
    2023-07-05T12:07:39.044+0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa task=test-task type=Killing msg="Sent interrupt. Waiting 10s before force killing" failed=false
    ...
    # Agent is signalled
^C==> Caught signal: interrupt
    2023-07-05T12:07:42.127+0800 [INFO]  agent: requesting shutdown
    2023-07-05T12:07:42.127+0800 [INFO]  client: shutting down
    # Shutdown requested for low-priority alloc
    2023-07-05T12:07:42.127+0800 [TRACE] client.alloc_runner: shutting down: alloc_id=673f0b58-e039-31be-e9bb-35d864c079ce
    2023-07-05T12:07:42.127+0800 [TRACE] client.alloc_runner.task_runner: shutting down: alloc_id=673f0b58-e039-31be-e9bb-35d864c079ce task=test-task
    ...
    # Finished sleeping. At this point the task has been asked both to be killed and to shut down, triggering the race
    2023-07-05T12:08:03.610+0800 [INFO]  client.alloc_runner.task_runner: am before select: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa task=test-task
    ...
    # Low-priority alloc is in its final iteration of handleTaskStateUpdates
    2023-07-05T12:08:03.610+0800 [TRACE] client.alloc_runner: handling task state update: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa done=true
    ...
    # Final update for the alloc is sent; note that the client_status is "running", which is non-terminal
    2023-07-05T12:08:03.610+0800 [TRACE] client.alloc_runner: sending updated alloc: alloc_id=471a4855-155b-c5a8-ad31-8e84f58b7faa client_status=running desired_status=""
    ...

I also managed to repro a similar race in the part that blocks waiting for the task to finish. Similarly, I injected a sleep to increase the probability of a race:

diff --git a/client/allocrunner/taskrunner/task_runner.go b/client/allocrunner/taskrunner/task_runner.go
index aae6a31..f14327a 100644
--- a/client/allocrunner/taskrunner/task_runner.go
+++ b/client/allocrunner/taskrunner/task_runner.go
@@ -630,6 +630,9 @@ MAIN:
                        if resultCh, err := handle.WaitCh(context.Background()); err != nil {
                                tr.logger.Error("wait task failed", "error", err)
                        } else {
+                               tr.logger.Info("Sleeping a bit to simulate prestart hooks taking a while to finish")
+                               time.Sleep(time.Second * 30)
+                               tr.logger.Info("Done sleeping")
                                select {
                                case <-tr.killCtx.Done():
                                        // We can go through the normal should restart check since

I also managed to repro this by updating a job, rather than by preemption.

Expected Result

The client should shut down in a timely manner.

Actual Result

The client blocks during shutdown, hanging indefinitely.

During this state, the client is still considered "up" by the server, and so the server continues to schedule new allocations onto the client. But because the client is stuck, it never actually tries to run them, leaving those allocations stuck in the pending state.

@tgross
Member

tgross commented Aug 2, 2023

This is great debugging @marvinchin, thanks! I suspect we can probably hit a similar problem with ephemeral data migrations. I'll mark this issue for roadmapping.

@tgross tgross added theme/client stage/accepted Confirmed, and intend to work on. No timeline committment though. labels Aug 2, 2023
@tgross tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024