Allocation not stopping on new job deployment #14321
Hi @jorgemarey! In your goroutine dump, we're blocking at the very end of:

```go
select {
case result := <-resultCh:
    return result
case <-tr.shutdownCtx.Done():
    return nil
}
```

The result channel that it's waiting on is owned by the task driver.
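As a minimal, self-contained sketch (assumed names, not Nomad's actual code) of why that matters: if the driver-owned goroutine that is supposed to deliver on the result channel never sends, and the shutdown context is never cancelled, the waiting goroutine parks on this select indefinitely.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Owned by the "driver": nothing ever sends on it in this sketch, which is
	// the situation the goroutine dump appears to show.
	resultCh := make(chan string)

	shutdownCtx, cancel := context.WithCancel(context.Background())
	defer cancel()

	select {
	case result := <-resultCh:
		fmt.Println("got result:", result)
	case <-shutdownCtx.Done():
		fmt.Println("shut down")
	case <-time.After(2 * time.Second):
		// Added only so this sketch terminates; the real select has no timeout,
		// so the goroutine stays parked and shows up in the dump.
		fmt.Println("still blocked: no result and no shutdown")
	}
}
```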
It's a little unclear to me where all your logs are from; some of these look like Nomad log lines and others look like they might be from dockerd.

It does seem suspicious that the exec stream was left open like that, but I wasn't able to reproduce the behavior even with a badly behaved exec call. I tried "holding open" the exec stream by keeping a shell open, and that didn't reproduce it. I also tried the following job, hoping that trapping the signal in the exec stream's task would cause the problem, but that didn't reproduce it either. I've tested on both Nomad 1.2.6 and the current HEAD.

```hcl
job "example" {
datacenters = ["dc1"]
group "group" {
task "task" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-v", "-f", "-p", "8001", "-h", "/var/www"]
}
template {
data = <<EOT
#!/bin/sh
trap 'echo ok' INT TERM
sleep 3600
EOT
destination = "local/blocking-script.sh"
}
resources {
cpu = 128
memory = 128
}
}
}
}
```

That being said, I looked into the code and found that the driver's code is calling out to the docker client with a context:

```go
startOpts := docker.StartExecOptions{
    // (snipped for clarity)
    Context: ctx,
}

if err := client.StartExec(exec.ID, startOpts); err != nil {
    return nil, fmt.Errorf("failed to start exec: %v", err)
}
```

I traced that context all the way back up to the client's alloc exec endpoint (alloc_endpoint.go):

```go
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

h := ar.GetTaskExecHandler(req.Task)
if h == nil {
    return helper.Int64ToPtr(404), fmt.Errorf("task %q is not running.", req.Task)
}

err = h(ctx, req.Cmd, req.Tty, newExecStream(decoder, encoder))
```
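Put differently, here's a minimal, self-contained sketch (assumed names, not Nomad's actual code) of the relationship this implies: the exec call runs under a context derived from `context.Background()` that is only cancelled when the exec handler itself returns, so cancelling a separate task-shutdown context doesn't interrupt it.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeExec stands in for the docker StartExec call: it only returns when the
// context it was given is cancelled.
func fakeExec(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	// Context owned by the exec handler; it is only cancelled when the
	// handler returns (the `defer cancel()` in the snippet above).
	execCtx, execCancel := context.WithCancel(context.Background())
	defer execCancel()

	// Context standing in for the task being killed during a deployment.
	_, shutdownCancel := context.WithCancel(context.Background())

	done := make(chan error, 1)
	go func() { done <- fakeExec(execCtx) }()

	// "Kill" the task: the exec is unaffected because its context is not
	// derived from the shutdown context.
	shutdownCancel()

	select {
	case err := <-done:
		fmt.Println("exec returned:", err)
	case <-time.After(500 * time.Millisecond):
		fmt.Println("exec still running after the task was killed")
	}
}
```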
Yeah, sorry. Logs are both from nomad and dockerd. There doesn't seem to be any goroutine on (taskhandle).kill. Regarding taskhandle, this is the only one I found (besides others from stats and collectstats).

In the "8 minute" window (since the allocation should have stopped) there are only 4 goroutines blocking.

In the first one there's a

I'll take a look at that context you're telling me about in alloc_endpoint.go#L257-L265 and see if I can get to something.
Nomad version 1.2.6
When making a new deployment, an allocation wouldn't stop, so the new allocation was left in the pending state and the deployment was stuck; we had to manually run a `docker rm -f` to stop the allocation.

Here are the allocation events:
Relevant logs:
At this point, after a few minutes, I ran:

```sh
$ docker rm -f <container-id>
```
I don't know if this is a docker problem, but the container was killed correctly with `docker rm -f`, while nomad wasn't able to kill it.

I got a dump of the goroutines before the `docker rm` and noticed a couple that I think are relevant.
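For reference, this is one way such a dump can be captured; a small sketch assuming the client agent was started with `enable_debug = true` (which exposes Go's standard pprof endpoints on the agent's HTTP port, 4646 by default):

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns full, unaggregated stack traces for every goroutine.
	resp, err := http.Get("http://127.0.0.1:4646/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Write the dump to stdout so it can be saved or attached to the issue.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		panic(err)
	}
}
```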
Could an exec session be left there that is blocking the container from stopping?