Allocation not stopping on new job deployment #14321

Open
jorgemarey opened this issue Aug 25, 2022 · 2 comments
Labels: hcc/jira, stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/driver/docker, type/bug

@jorgemarey (Contributor)

Nomad version: 1.2.6

When making a new deployment, an allocation wouldn't stop, so the new allocation was left in the pending state and the deployment got stuck; we had to manually run docker rm -f to stop the allocation.

Here are the allocation events:

Aug 25, '22 07:56:41 +0200	Killed	Task successfully killed
Aug 25, '22 07:56:41 +0200	Terminated	Exit Code: 0
Aug 25, '22 07:45:19 +0200	Killing	Sent interrupt. Waiting 5s before force killing
Jul 15, '22 00:06:03 +0200	Started	Task started by client

Relevant logs:

Log Date: 25/08/2022 07:45:24:734
Message: Container 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb failed to exit within 5 seconds of signal 15 - using the force

Log Date: 25/08/2022 07:45:34:767
Message: Container 10a7fec1ed3b failed to exit within 10 seconds of kill - trying direct SIGKILL

Log Date: 25/08/2022 07:45:34:767
Message: stopped container
container_id: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb

At this point, after a few minutes, I ran:
$ docker rm -f <container-id>

Log Date: 25/08/2022 07:56:30:841
Message: Container 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb failed to exit within 3 seconds of signal 15 - using the force

Log Date: 25/08/2022 07:56:41:122
Message: Container 10a7fec1ed3b failed to exit within 10 seconds of kill - trying direct SIGKILL

Log Date: 25/08/2022 07:56:41:618
Message: failed to inspect container
container_id: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb
error: No such container: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb

Log Date: 25/08/2022 07:56:41:621
Message: container was removed out of band, will proceed with DestroyTask
container_id: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb
error: No such container: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb

I don't know if this is a Docker problem, but the container was killed correctly by docker rm -f while Nomad wasn't able to kill it.

I got a dump of the goroutines before running docker rm and noticed a couple that I think are relevant.

goroutine 146856643 [chan receive, 30035 minutes]:
github.com/fsouza/go-dockerclient.(*Client).hijack.func3()
	github.com/fsouza/[email protected]/client.go:862 +0x28
github.com/fsouza/go-dockerclient.waiterFunc.Wait(0x0)
	github.com/fsouza/[email protected]/client.go:725 +0x1a
github.com/fsouza/go-dockerclient.(*Client).StartExec(0x0, {0xc007535480, 0x0}, {{0x2cfa940, 0xc0006250a0}, {0x2cfa960, 0xc0006250b8}, {0x2cfa960, 0xc0006250c8}, 0x0, ...})
	github.com/fsouza/[email protected]/exec.go:105 +0x53
github.com/hashicorp/nomad/drivers/docker.(*Driver).ExecTaskStreaming(0xc000458280, {0x2d3aef8, 0xc00462ee00}, {0xc0069b7500, 0x3e}, 0xc000ae85a0)
	github.com/hashicorp/nomad/drivers/docker/driver.go:1624 +0x55b
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).ExecStreaming(0xc004321b40, {0x2d3aef8, 0xc00462ee00}, {0xc0013cbae0, 0x1, 0x1}, 0x2, {0x2d0b5b8, 0xc0096e3060})
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:88 +0x15a
github.com/hashicorp/nomad/client.(*Allocations).execImpl(0xc0000100a8, 0xc0009d1bc0, 0xc0083ea140, {0xc0071a3c80, 0x24})
	github.com/hashicorp/nomad/client/alloc_endpoint.go:265 +0xa23
github.com/hashicorp/nomad/client.(*Allocations).exec(0xc0000100a8, {0x7fa6cc14e350, 0xc0013cb9d0})
	github.com/hashicorp/nomad/client/alloc_endpoint.go:149 +0x23e
github.com/hashicorp/nomad/client.(*Client).handleStreamingConn(0xc0009b82c0, {0x2d67ac0, 0xc0013cb9d0})
	github.com/hashicorp/nomad/client/rpc.go:407 +0x5b9
github.com/hashicorp/nomad/client.(*Client).handleConn(0xc0009b82c0, {0x2d67ac0, 0xc0013cb9d0})
	github.com/hashicorp/nomad/client/rpc.go:339 +0x16f
created by github.com/hashicorp/nomad/client.(*Client).listenConn
	github.com/hashicorp/nomad/client/rpc.go:315 +0xa8


goroutine 115349747 [select, 8 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).handleKill(0xc0067d6580, 0xc004294360)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:963 +0x51b
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc0067d6580)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:585 +0xf9f
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:375 +0x71

Could an exec session have been left open there, blocking the container from stopping?

@tgross (Member) commented Aug 25, 2022

Hi @jorgemarey!

In your goroutine dump, we're blocking at the very end of (*TaskRunner) handleKill, at task_runner.go#L963-L968:

	select {
	case result := <-resultCh:
		return result
	case <-tr.shutdownCtx.Done():
		return nil
	}

The result channel that it's waiting on is owned by the task driver. It's a little unclear to me where all your logs are from; some of these look like Nomad log lines and others look like they might be from dockerd? I would be very curious to see if there were any goroutines in (*taskHandle) Kill in the driver's handle.go#L151-L221 code.
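To make the blocking condition concrete, here is a minimal, self-contained sketch (not Nomad code; all names here are hypothetical) of the pattern that select ends with: the goroutine can only return when the driver delivers a result on the channel or when the whole client shuts down. If the driver's kill never completes, neither case fires and the goroutine stays parked, which is consistent with the 8-minute select goroutine in your dump.

	package main

	import (
		"context"
		"fmt"
		"time"
	)

	type exitResult struct{ code int }

	// waitOnDriver mirrors the shape of the select above: it only returns when
	// the driver sends a result or the shutdown context is cancelled.
	func waitOnDriver(shutdownCtx context.Context, resultCh <-chan *exitResult) *exitResult {
		select {
		case result := <-resultCh:
			return result
		case <-shutdownCtx.Done():
			return nil
		}
	}

	func main() {
		shutdownCtx, shutdown := context.WithCancel(context.Background())

		// Simulate a driver whose kill never finishes: nothing is ever sent.
		resultCh := make(chan *exitResult)

		done := make(chan struct{})
		go func() {
			defer close(done)
			fmt.Println("result:", waitOnDriver(shutdownCtx, resultCh))
		}()

		select {
		case <-done:
		case <-time.After(2 * time.Second):
			// Still blocked: only a client shutdown (or a driver result) frees it.
			fmt.Println("still blocked after 2s")
			shutdown()
			<-done
		}
	}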

It does seem suspicious that the exec stream was left open like that, but I wasn't able to reproduce the behavior even with a badly behaved exec call.

I tried "holding open" the exec stream by having a shell open and that didn't reproduce, and I also tried the following job in hopes that trapping the signal on the exec stream's task would cause the problem, but that didn't reproduce either. I've tested on both Nomad 1.2.6 and the current HEAD of main.

job "example" {
  datacenters = ["dc1"]

  group "group" {

    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/var/www"]
      }

      template {
        data        = <<EOT
#!/bin/sh
trap 'echo ok' INT TERM
sleep 3600
EOT

        destination = "local/blocking-script.sh"
      }

      resources {
        cpu    = 128
        memory = 128
      }

    }
  }
}

That being said, I looked into the code and found that the driver's code is calling out to the Docker client with a context, at driver.go#L1623-L1627:

	startOpts := docker.StartExecOptions{
        // (snipped for clarity)
		Context:      ctx,
	}
	if err := client.StartExec(exec.ID, startOpts); err != nil {
		return nil, fmt.Errorf("failed to start exec: %v", err)
	}

I traced that context all the way back up to the client's alloc_endpoint.go#L257-L265, and what it's passing doesn't look right to me. I'd expect that context to be something we get from the task runner, not a generic background context. So that's something to look into, but maybe not relevant to the symptoms you're seeing.

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	h := ar.GetTaskExecHandler(req.Task)
	if h == nil {
		return helper.Int64ToPtr(404), fmt.Errorf("task %q is not running.", req.Task)
	}

	err = h(ctx, req.Cmd, req.Tty, newExecStream(decoder, encoder))
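
For illustration only, here is a minimal sketch (hypothetical names, not the Nomad implementation) of why that matters: a blocking call that honors its context, like StartExec does, will not return when the task is killed if the context is rooted only at context.Background(); it would return promptly if the context were derived from something tied to the task's lifecycle.

	package main

	import (
		"context"
		"fmt"
		"time"
	)

	// blockingExec stands in for a call that only returns when its context is
	// cancelled (or the remote side finishes, which in the reported case never
	// happens).
	func blockingExec(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}

	func main() {
		// Hypothetical per-task lifecycle context, cancelled when the task is killed.
		taskCtx, killTask := context.WithCancel(context.Background())

		// The pattern from the RPC handler: rooted at Background, cancelled only
		// when the handler itself returns.
		bgCtx, bgCancel := context.WithCancel(context.Background())
		defer bgCancel()

		go func() {
			time.Sleep(500 * time.Millisecond)
			killTask() // the task is killed
		}()

		done := make(chan error, 1)
		go func() { done <- blockingExec(bgCtx) }()

		select {
		case err := <-done:
			fmt.Println("exec returned:", err)
		case <-time.After(2 * time.Second):
			fmt.Println("exec rooted at Background is still blocked after the task was killed")
		}

		// The same call tied to the task context returns as soon as the task is killed.
		fmt.Println("exec tied to task context returned:", blockingExec(taskCtx))
	}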

@jorgemarey (Contributor, Author)

Yeah, sorry. The logs are from both Nomad and dockerd. There doesn't seem to be any goroutine in (*taskHandle) Kill. Regarding taskHandle, this is the only one I found (besides others from Stats and collectStats):

goroutine 180129549 [chan receive]:
github.com/fsouza/go-dockerclient.(*Client).hijack.func3()
	github.com/fsouza/[email protected]/client.go:862 +0x28
github.com/fsouza/go-dockerclient.waiterFunc.Wait(0x0)
	github.com/fsouza/[email protected]/client.go:725 +0x1a
github.com/fsouza/go-dockerclient.(*Client).StartExec(0x0, {0xc00847c280, 0x0}, {{0x0, 0x0}, {0x2cf3500, 0xc009a70090}, {0x2cf3500, 0xc009a700c0}, 0x0, ...})
	github.com/fsouza/[email protected]/exec.go:105 +0x53
github.com/hashicorp/nomad/drivers/docker.(*taskHandle).Exec(0xc003f6f830, {0x2d3af68, 0xc0024da3c0}, {0xc00535fb40, 0x19}, {0xc005a523d0, 0x2, 0xc00b0a8160})
	github.com/hashicorp/nomad/drivers/docker/handle.go:94 +0x3db
github.com/hashicorp/nomad/drivers/docker.(*Driver).ExecTask(0x21b5320, {0xc004607220, 0x200000003}, {0xc005a523c0, 0x3, 0x3}, 0x0)
	github.com/hashicorp/nomad/drivers/docker/driver.go:1559 +0x135
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).Exec(0xc00695d000, 0x0, {0xc00535fb40, 0x19}, {0xc007a5ae40, 0x2, 0x7de8ad})
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:62 +0x136
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*LazyHandle).Exec(0xc0054e0f58, 0x64121b, {0xc00535fb40, 0x19}, {0xc007a5ae40, 0x2, 0x2})
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/lazy_handle.go:118 +0xc5
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*contextExec).Exec.func1()
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/tasklet.go:47 +0x6c
created by github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*contextExec).Exec
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/tasklet.go:46 +0x1f1

In the "8 minute" time (since the allocation should have stopped) there's only 4 goroutines blocking

goroutine 115349414 [semacquire, 8 minutes]:
sync.runtime_Semacquire(0x0)
	runtime/sema.go:56 +0x25
sync.(*WaitGroup).Wait(0xc002865e30)
	sync/waitgroup.go:130 +0x71
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks(0xc001a6c480)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:646 +0x537
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).handleAllocUpdate(0xc001a6c480, 0xc0096da780)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:879 +0x385
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).handleAllocUpdates(0xc001a6c480)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:847 +0x2e
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).Run
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:307 +0x105

goroutine 115349747 [select, 8 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).handleKill(0xc0067d6580, 0xc004294360)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:963 +0x51b
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc0067d6580)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:585 +0xf9f
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:375 +0x71

goroutine 115349413 [select, 8 minutes]:
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).handleTaskStateUpdates(0xc001a6c480)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:492 +0x188
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).Run
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:304 +0xba

goroutine 180118406 [select, 8 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Kill(0xc0067d6580, {0x2d3af30, 0xc0001ac008}, 0xc0071aaf00)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/lifecycle.go:81 +0x1dc
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks.func1({0xc003066f30, 0x10}, 0xc001a6c480)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:635 +0x152
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:631 +0x525

In the first one there's a context.TODO() just calling Kill on the TaskRunner. Could that have something to do with it?

I'll take a look at that context you're telling me about in alloc_endpoint.go#L257-L265 and see if I can find something.
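
For what it's worth, here is a small caller-side sketch (hypothetical names, not Nomad code and not a proposed change) of the context.TODO() point: a Kill-style call invoked with context.TODO() carries no cancellation or deadline, so if the kill path blocks, the caller blocks with it, which matches the killTasks goroutine parked on its WaitGroup above.

	package main

	import (
		"context"
		"fmt"
		"time"
	)

	// kill stands in for a Kill-like call that blocks until the task reports it
	// stopped or the supplied context is done.
	func kill(ctx context.Context, taskStopped <-chan struct{}) error {
		select {
		case <-taskStopped:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}

	func main() {
		neverStops := make(chan struct{}) // simulate a container that won't die

		// With context.TODO() this call would block forever:
		//   kill(context.TODO(), neverStops)

		// With a cancellable or deadline-bound context, the caller gets control back.
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		defer cancel()
		fmt.Println("kill returned:", kill(ctx, neverStops))
	}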

@tgross removed their assignment Jul 11, 2023
@tgross added the stage/accepted label Jul 11, 2023
@tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024