Allocation not stopping on new job deployment #14321

Open
jorgemarey opened this issue Aug 25, 2022 · 2 comments
Labels: hcc/jira, stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/driver/docker, type/bug

@jorgemarey (Contributor)

Nomad version: 1.2.6

When making a new deployment, an allocation wouldn't stop, so the new allocation was left in the pending state and the deployment got stuck; we had to manually run docker rm -f to stop the allocation.

Here are the allocation events:

Aug 25, '22 07:56:41 +0200	Killed	Task successfully killed
Aug 25, '22 07:56:41 +0200	Terminated	Exit Code: 0
Aug 25, '22 07:45:19 +0200	Killing	Sent interrupt. Waiting 5s before force killing
Jul 15, '22 00:06:03 +0200	Started	Task started by client

Relevant logs:

Log Date: 25/08/2022 07:45:24:734
Message: Container 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb failed to exit within 5 seconds of signal 15 - using the force

Log Date: 25/08/2022 07:45:34:767
Message: Container 10a7fec1ed3b failed to exit within 10 seconds of kill - trying direct SIGKILL

Log Date: 25/08/2022 07:45:34:767
Message: stopped container
container_id: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb

At this point, after a few minutes, I ran:
$ docker rm -f <container-id>

Log Date: 25/08/2022 07:56:30:841
Message: Container 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb failed to exit within 3 seconds of signal 15 - using the force

Log Date: 25/08/2022 07:56:41:122
Message: Container 10a7fec1ed3b failed to exit within 10 seconds of kill - trying direct SIGKILL

Log Date: 25/08/2022 07:56:41:618
Message: failed to inspect container
container_id: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb
error: No such container: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb

Log Date: 25/08/2022 07:56:41:621
Message: container was removed out of band, will proceed with DestroyTask
container_id: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb
error: No such container: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb

I don't know if this is a Docker problem, but the container was killed correctly by docker rm -f while Nomad wasn't able to kill it.

I got a dump of the goroutines before running docker rm and noticed a couple that I think are relevant.

goroutine 146856643 [chan receive, 30035 minutes]:
github.com/fsouza/go-dockerclient.(*Client).hijack.func3()
	github.com/fsouza/[email protected]/client.go:862 +0x28
github.com/fsouza/go-dockerclient.waiterFunc.Wait(0x0)
	github.com/fsouza/[email protected]/client.go:725 +0x1a
github.com/fsouza/go-dockerclient.(*Client).StartExec(0x0, {0xc007535480, 0x0}, {{0x2cfa940, 0xc0006250a0}, {0x2cfa960, 0xc0006250b8}, {0x2cfa960, 0xc0006250c8}, 0x0, ...})
	github.com/fsouza/[email protected]/exec.go:105 +0x53
github.com/hashicorp/nomad/drivers/docker.(*Driver).ExecTaskStreaming(0xc000458280, {0x2d3aef8, 0xc00462ee00}, {0xc0069b7500, 0x3e}, 0xc000ae85a0)
	github.com/hashicorp/nomad/drivers/docker/driver.go:1624 +0x55b
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).ExecStreaming(0xc004321b40, {0x2d3aef8, 0xc00462ee00}, {0xc0013cbae0, 0x1, 0x1}, 0x2, {0x2d0b5b8, 0xc0096e3060})
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:88 +0x15a
github.com/hashicorp/nomad/client.(*Allocations).execImpl(0xc0000100a8, 0xc0009d1bc0, 0xc0083ea140, {0xc0071a3c80, 0x24})
	github.com/hashicorp/nomad/client/alloc_endpoint.go:265 +0xa23
github.com/hashicorp/nomad/client.(*Allocations).exec(0xc0000100a8, {0x7fa6cc14e350, 0xc0013cb9d0})
	github.com/hashicorp/nomad/client/alloc_endpoint.go:149 +0x23e
github.com/hashicorp/nomad/client.(*Client).handleStreamingConn(0xc0009b82c0, {0x2d67ac0, 0xc0013cb9d0})
	github.com/hashicorp/nomad/client/rpc.go:407 +0x5b9
github.com/hashicorp/nomad/client.(*Client).handleConn(0xc0009b82c0, {0x2d67ac0, 0xc0013cb9d0})
	github.com/hashicorp/nomad/client/rpc.go:339 +0x16f
created by github.com/hashicorp/nomad/client.(*Client).listenConn
	github.com/hashicorp/nomad/client/rpc.go:315 +0xa8


goroutine 115349747 [select, 8 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).handleKill(0xc0067d6580, 0xc004294360)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:963 +0x51b
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc0067d6580)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:585 +0xf9f
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:375 +0x71

Could an exec session have been left open there, blocking the container from stopping?

@tgross (Member) commented Aug 25, 2022

Hi @jorgemarey!

In your goroutine dump, we're blocking at the very end of (*TaskRunner) handleKill, at task_runner.go#L963-L968:

	select {
	case result := <-resultCh:
		return result
	case <-tr.shutdownCtx.Done():
		return nil
	}

The result channel that it's waiting on is owned by the task driver. It's a little unclear to me where all your logs are from; some of these look like Nomad log lines and others look like they might be from dockerd? I would be very curious to see if there were any goroutines in (*taskHandle) Kill in the driver's handle.go#L151-L221 code.
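To make the blocking condition concrete, here is a minimal, self-contained sketch (not Nomad code; all names here are hypothetical) of the pattern that select ends with: the goroutine can only return when the driver delivers a result on the channel or when the whole client shuts down. If the driver's kill never completes, neither case fires and the goroutine stays parked, which is consistent with the 8-minute select goroutine in your dump.

	package main

	import (
		"context"
		"fmt"
		"time"
	)

	type exitResult struct{ code int }

	// waitOnDriver mirrors the shape of the select above: it only returns when
	// the driver sends a result or the shutdown context is cancelled.
	func waitOnDriver(shutdownCtx context.Context, resultCh <-chan *exitResult) *exitResult {
		select {
		case result := <-resultCh:
			return result
		case <-shutdownCtx.Done():
			return nil
		}
	}

	func main() {
		shutdownCtx, shutdown := context.WithCancel(context.Background())

		// Simulate a driver whose kill never finishes: nothing is ever sent.
		resultCh := make(chan *exitResult)

		done := make(chan struct{})
		go func() {
			defer close(done)
			fmt.Println("result:", waitOnDriver(shutdownCtx, resultCh))
		}()

		select {
		case <-done:
		case <-time.After(2 * time.Second):
			// Still blocked: only a client shutdown (or a driver result) frees it.
			fmt.Println("still blocked after 2s")
			shutdown()
			<-done
		}
	}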

It does seem suspicious that the exec stream was left open like that, but I wasn't able to reproduce the behavior even with a badly behaved exec call.

I tried "holding open" the exec stream by having a shell open and that didn't reproduce, and I also tried the following job in hopes that trapping the signal on the exec stream's task would cause the problem, but that didn't reproduce either. I've tested on both Nomad 1.2.6 and the current HEAD of main.

job "example" {
  datacenters = ["dc1"]

  group "group" {

    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/var/www"]
      }

      template {
        data        = <<EOT
#!/bin/sh
trap 'echo ok' INT TERM
sleep 3600
EOT

        destination = "local/blocking-script.sh"
      }

      resources {
        cpu    = 128
        memory = 128
      }

    }
  }
}

That being said, I looked into the code and found that the driver's code is calling out to the Docker client with a context, at driver.go#L1623-L1627:

	startOpts := docker.StartExecOptions{
        // (snipped for clarity)
		Context:      ctx,
	}
	if err := client.StartExec(exec.ID, startOpts); err != nil {
		return nil, fmt.Errorf("failed to start exec: %v", err)
	}

I traced that context all the way back up to the client's alloc_endpoint.go#L257-L265, and what it's passing doesn't look right to me. I'd expect that context to be something we get from the task runner, not a generic background context. So that's something to look into, but maybe not relevant to the symptoms you're seeing.

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	h := ar.GetTaskExecHandler(req.Task)
	if h == nil {
		return helper.Int64ToPtr(404), fmt.Errorf("task %q is not running.", req.Task)
	}

	err = h(ctx, req.Cmd, req.Tty, newExecStream(decoder, encoder))
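
For illustration only, here is a minimal sketch (hypothetical names, not the Nomad implementation) of why that matters: a blocking call that honors its context, like StartExec does, will not return when the task is killed if the context is rooted only at context.Background(); it would return promptly if the context were derived from something tied to the task's lifecycle.

	package main

	import (
		"context"
		"fmt"
		"time"
	)

	// blockingExec stands in for a call that only returns when its context is
	// cancelled (or the remote side finishes, which in the reported case never
	// happens).
	func blockingExec(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}

	func main() {
		// Hypothetical per-task lifecycle context, cancelled when the task is killed.
		taskCtx, killTask := context.WithCancel(context.Background())

		// The pattern from the RPC handler: rooted at Background, cancelled only
		// when the handler itself returns.
		bgCtx, bgCancel := context.WithCancel(context.Background())
		defer bgCancel()

		go func() {
			time.Sleep(500 * time.Millisecond)
			killTask() // the task is killed
		}()

		done := make(chan error, 1)
		go func() { done <- blockingExec(bgCtx) }()

		select {
		case err := <-done:
			fmt.Println("exec returned:", err)
		case <-time.After(2 * time.Second):
			fmt.Println("exec rooted at Background is still blocked after the task was killed")
		}

		// The same call tied to the task context returns as soon as the task is killed.
		fmt.Println("exec tied to task context returned:", blockingExec(taskCtx))
	}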

@jorgemarey (Contributor, Author)

Yeah, sorry. The logs are from both Nomad and dockerd. There doesn't seem to be any goroutine in (*taskHandle) Kill. Regarding taskHandle, this is the only one I found (besides others from Stats and collectStats):

goroutine 180129549 [chan receive]:
github.com/fsouza/go-dockerclient.(*Client).hijack.func3()
	github.com/fsouza/[email protected]/client.go:862 +0x28
github.com/fsouza/go-dockerclient.waiterFunc.Wait(0x0)
	github.com/fsouza/[email protected]/client.go:725 +0x1a
github.com/fsouza/go-dockerclient.(*Client).StartExec(0x0, {0xc00847c280, 0x0}, {{0x0, 0x0}, {0x2cf3500, 0xc009a70090}, {0x2cf3500, 0xc009a700c0}, 0x0, ...})
	github.com/fsouza/[email protected]/exec.go:105 +0x53
github.com/hashicorp/nomad/drivers/docker.(*taskHandle).Exec(0xc003f6f830, {0x2d3af68, 0xc0024da3c0}, {0xc00535fb40, 0x19}, {0xc005a523d0, 0x2, 0xc00b0a8160})
	github.com/hashicorp/nomad/drivers/docker/handle.go:94 +0x3db
github.com/hashicorp/nomad/drivers/docker.(*Driver).ExecTask(0x21b5320, {0xc004607220, 0x200000003}, {0xc005a523c0, 0x3, 0x3}, 0x0)
	github.com/hashicorp/nomad/drivers/docker/driver.go:1559 +0x135
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).Exec(0xc00695d000, 0x0, {0xc00535fb40, 0x19}, {0xc007a5ae40, 0x2, 0x7de8ad})
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:62 +0x136
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*LazyHandle).Exec(0xc0054e0f58, 0x64121b, {0xc00535fb40, 0x19}, {0xc007a5ae40, 0x2, 0x2})
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/lazy_handle.go:118 +0xc5
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*contextExec).Exec.func1()
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/tasklet.go:47 +0x6c
created by github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*contextExec).Exec
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/tasklet.go:46 +0x1f1

In the "8 minute" time (since the allocation should have stopped) there's only 4 goroutines blocking

goroutine 115349414 [semacquire, 8 minutes]:
sync.runtime_Semacquire(0x0)
	runtime/sema.go:56 +0x25
sync.(*WaitGroup).Wait(0xc002865e30)
	sync/waitgroup.go:130 +0x71
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks(0xc001a6c480)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:646 +0x537
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).handleAllocUpdate(0xc001a6c480, 0xc0096da780)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:879 +0x385
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).handleAllocUpdates(0xc001a6c480)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:847 +0x2e
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).Run
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:307 +0x105

goroutine 115349747 [select, 8 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).handleKill(0xc0067d6580, 0xc004294360)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:963 +0x51b
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc0067d6580)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:585 +0xf9f
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:375 +0x71

goroutine 115349413 [select, 8 minutes]:
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).handleTaskStateUpdates(0xc001a6c480)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:492 +0x188
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).Run
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:304 +0xba

goroutine 180118406 [select, 8 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Kill(0xc0067d6580, {0x2d3af30, 0xc0001ac008}, 0xc0071aaf00)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/lifecycle.go:81 +0x1dc
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks.func1({0xc003066f30, 0x10}, 0xc001a6c480)
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:635 +0x152
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:631 +0x525

In the first one there's a context.TODO() just calling Kill on the TaskRunner. Could that have something to do with it?

I'll take a look at that context you're telling me about in alloc_endpoint.go#L257-L265 and see if I can find something.
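
For what it's worth, here is a small caller-side sketch (hypothetical names, not Nomad code and not a proposed change) of the context.TODO() point: a Kill-style call invoked with context.TODO() carries no cancellation or deadline, so if the kill path blocks, the caller blocks with it, which matches the killTasks goroutine parked on its WaitGroup above.

	package main

	import (
		"context"
		"fmt"
		"time"
	)

	// kill stands in for a Kill-like call that blocks until the task reports it
	// stopped or the supplied context is done.
	func kill(ctx context.Context, taskStopped <-chan struct{}) error {
		select {
		case <-taskStopped:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}

	func main() {
		neverStops := make(chan struct{}) // simulate a container that won't die

		// With context.TODO() this call would block forever:
		//   kill(context.TODO(), neverStops)

		// With a cancellable or deadline-bound context, the caller gets control back.
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		defer cancel()
		fmt.Println("kill returned:", kill(ctx, neverStops))
	}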

@tgross removed their assignment Jul 11, 2023
@tgross added the stage/accepted label Jul 11, 2023
@tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024