task restart No response #20406

chenjpu · 2024-04-16T07:16:45Z

Nomad version

1.7.6 or main branch

Operating system and Environment details

CentOS Linux release 7.9.2009 (Core)
Docker Engine - 24.0.1

Issue

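After one successful restart, further attempts to restart the task got no response. The client logs each "Restart Signaled" event, but the container is never stopped: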

Apr 16 15:05:41 hgc-webserver-2 nomad[19547]: 2024-04-16T15:05:41.869+0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0ea8fd38-d837-9c01-c2dc-aa6f876ee517 task=app type="Restart Signaled" msg="User requested task to restart" failed=false
Apr 16 15:05:41 hgc-webserver-2 nomad[19547]: 2024-04-16T15:05:41.942+0800 [INFO]  client.driver_mgr.docker: stopped container: container_id=d071491064b92094bdbbd65ef626e7ae8ec5d460258327ca5bc01b714cfa41f8 driver=docker
Apr 16 15:05:41 hgc-webserver-2 nomad[19547]: 2024-04-16T15:05:41.947+0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0ea8fd38-d837-9c01-c2dc-aa6f876ee517 task=app type=Terminated msg="Exit Code: 2, Exit Message: \"Docker container exited with non-zero exit code: 2\"" failed=false
Apr 16 15:05:41 hgc-webserver-2 nomad[19547]: 2024-04-16T15:05:41.950+0800 [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/usr/bin/nomad id=20349
Apr 16 15:05:41 hgc-webserver-2 nomad[19547]: 2024-04-16T15:05:41.957+0800 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=0ea8fd38-d837-9c01-c2dc-aa6f876ee517 task=app reason="" delay=0s
Apr 16 15:05:41 hgc-webserver-2 nomad[19547]: 2024-04-16T15:05:41.957+0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0ea8fd38-d837-9c01-c2dc-aa6f876ee517 task=app type=Restarting msg="Task restarting in 0s" failed=false
Apr 16 15:05:41 hgc-webserver-2 nomad[19547]: 2024-04-16T15:05:41.998+0800 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=09e567496e3fb7f8f4cdfe96d0d607acb31d1bfe51b3e749498276befad141ae
Apr 16 15:05:42 hgc-webserver-2 nomad[19547]: 2024-04-16T15:05:42.097+0800 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=09e567496e3fb7f8f4cdfe96d0d607acb31d1bfe51b3e749498276befad141ae
Apr 16 15:05:42 hgc-webserver-2 nomad[19547]: 2024-04-16T15:05:42.129+0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0ea8fd38-d837-9c01-c2dc-aa6f876ee517 task=app type=Started msg="Task started by client" failed=false
Apr 16 15:05:52 hgc-webserver-2 nomad[19547]: 2024-04-16T15:05:52.722+0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0ea8fd38-d837-9c01-c2dc-aa6f876ee517 task=app type="Restart Signaled" msg="User requested task to restart" failed=false
Apr 16 15:08:20 hgc-webserver-2 nomad[19547]: 2024-04-16T15:08:20.374+0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0ea8fd38-d837-9c01-c2dc-aa6f876ee517 task=app type="Restart Signaled" msg="User requested task to restart" failed=false
Apr 16 15:13:08 hgc-webserver-2 nomad[19547]: 2024-04-16T15:13:08.429+0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0ea8fd38-d837-9c01-c2dc-aa6f876ee517 task=app type="Restart Signaled" msg="User requested task to restart" failed=false
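To spot this pattern in a longer client log, a small script can pair each "Restart Signaled" event with the "Terminated" event that should follow it. This is only a sketch: the function names are hypothetical, and it assumes the log lines keep the `Task event: ... type=...` layout shown above.

```python
import re

# Extracts the type=... value from a "Task event" line. The value is either
# double-quoted (e.g. type="Restart Signaled") or a bare token (type=Terminated).
EVENT_TYPE = re.compile(r'Task event: .*?\btype=("(?:[^"\\]|\\.)*"|\S+)')
# Matches the agent's own timestamp, e.g. 2024-04-16T15:05:41.869+0800.
TIMESTAMP = re.compile(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{4}')


def task_events(lines):
    """Yield (timestamp, event_type) for every 'Task event' log line."""
    for line in lines:
        match = EVENT_TYPE.search(line)
        if not match:
            continue
        ts = TIMESTAMP.search(line)
        yield (ts.group(0) if ts else "", match.group(1).strip('"'))


def unanswered_restarts(lines):
    """Return timestamps of 'Restart Signaled' events that are not followed
    by a 'Terminated' event before the next restart signal or end of log."""
    pending = None
    stuck = []
    for ts, kind in task_events(lines):
        if kind == "Restart Signaled":
            if pending is not None:
                stuck.append(pending)
            pending = ts
        elif kind == "Terminated":
            pending = None
    if pending is not None:
        stuck.append(pending)
    return stuck
```

Passing the lines of the log above to `unanswered_restarts` should flag the signals at 15:05:52, 15:08:20 and 15:13:08, while the first restart at 15:05:41 is treated as answered. The script assumes the log covers a single allocation and task; filter by `alloc_id` first if it does not.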

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

chenjpu · 2024-04-24T03:21:58Z

The service registration provider defaults to consul, but this cluster does not use Consul. When the provider is set to nomad, the problem does not appear.

chenjpu · 2024-04-24T03:33:33Z

Below is the service template configuration. Restarting the daprd task works fine; only the app task fails to respond to a restart, and only when provider is not set.

job "xxxxx" {
  datacenters = ["dc1"]
  type        = "service"
  group "service" {
    task "app" {
      driver = "docker"
      config {
        image   = "alpine:3.19"
        command = "local/app"
      }
      service {
        name         = "${NOMAD_JOB_NAME}"
        port         = "app"
        address_mode = "host"
        provider     = "nomad" // Correct configuration
        check {
          name     = "health check"
          type     = "tcp"
          port     = "app"
          interval = "12s"
          timeout  = "6s"
          check_restart {
            limit = 3
            grace = "10s"
          }
        }
        check {
          name      = "ready check"
          type      = "http"
          port      = "http"
          path      = "/v1.0/healthz"
          interval  = "12s"
          timeout   = "6s"
          on_update = "ignore"
        }
      }
      artifact {
        source = "..../app.tar.gz"
      }
    }


    task "daprd" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }
      driver = "docker"
      config {
        image   = "alpine:3.19"
        command = "local/daprd"
      }

      artifact {
        source = ".../daprd_min_linux_${attr.cpu.arch}.tar.gz"
      }

    }
  }
}

tgross · 2024-06-21T20:44:52Z

Hi @chenjpu! Apologies for the delay in responding to this. Let me verify I understand what you're saying here:

You don't have Consul in your environment.
If the service.provider field is unset, the workload runs but the app task will not restart as expected.
If service.provider = "nomad", the workload runs and the app task will restart as expected.

Is that right?

I would not have expected the workload to run at all with service.provider unset (defaulting to Consul) if there's no Consul in your environment. Nomad adds a constraint that requires Consul if you've got a Consul service in the jobspec.

chenjpu · 2024-06-22T00:13:45Z

provider = ""
It has been a while, but as I remember, the exception occurred when this setting was an empty string.

tgross · 2024-06-26T20:22:50Z

Hi @chenjpu!

I think what I'm not making clear is that if you had provider = "" in your jobspec without Consul available, the job would not start at all. See this example jobspec:

jobspec

job "example" {

  group "group" {

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    service {
      name     = "httpd-web"
      provider = ""
      port     = "www"
    }

    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }

      resources {
        cpu    = 50
        memory = 50
      }

    }
  }
}

I get a scheduling error like the following:

$ nomad job plan example.nomad.hcl
+ Job: "example"
+ Task Group: "group" (1 create)
  + Task: "task" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "group" (failed to place 1 allocation):
    * Constraint "${attr.consul.version} semver >= 1.8.0": 1 nodes excluded by filter

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 example.nomad.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

However, I suspect the provider is irrelevant here and that there's something else going on.

We emit the "User requested task to restart" event just before we actually try to restart the task (ref lifecycle.go#L81-L82), because it can take a while for the task to actually shut down. We wait for "prekill" behaviors and the task itself before we return any errors to the caller.

So there might be something that's blocking the shutdown here.

Next steps to debug:

Does the API request to restart the task return an error, or does it "hang" and not respond?
Can you provide debug-level or trace-level logs for the client node running that task? There's a lot more information that could tell us why the task isn't stopping there. You can redact these logs just to show what happens for that allocation and task.
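One way to answer the first question is to send the restart request with an explicit timeout and see whether it succeeds, fails, or never returns. The sketch below is not from this thread: it assumes the allocation restart endpoint `POST /v1/client/allocation/<alloc_id>/restart` with a `TaskName` payload and the `X-Nomad-Token` header, so check those against the HTTP API documentation for your Nomad version. The function names and the default timeout are hypothetical.

```python
import json
import socket
import urllib.error
import urllib.request


def build_restart_request(addr, alloc_id, task, token=None):
    """Build the restart request for one task of an allocation.

    The endpoint path and payload are assumptions; verify them against
    the Nomad HTTP API docs before relying on this.
    """
    url = f"{addr.rstrip('/')}/v1/client/allocation/{alloc_id}/restart"
    body = json.dumps({"TaskName": task}).encode()
    headers = {"Content-Type": "application/json"}
    if token:
        headers["X-Nomad-Token"] = token
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def probe_restart(req, timeout_s=30.0):
    """Send the request and classify the outcome as 'ok', 'error' or 'hang'."""
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return ("ok", f"HTTP {resp.status}")
    except urllib.error.HTTPError as err:
        # The agent answered, but with an error status.
        return ("error", f"HTTP {err.code}: {err.read().decode(errors='replace')}")
    except urllib.error.URLError as err:
        # Timeouts while connecting are wrapped in URLError.
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return ("hang", f"no response within {timeout_s}s")
        return ("error", str(err.reason))
    except (socket.timeout, TimeoutError):
        # Timeouts while waiting for the response body are raised directly.
        return ("hang", f"no response within {timeout_s}s")
```

For example, `probe_restart(build_restart_request("http://127.0.0.1:4646", alloc_id, "app"))` returning `("hang", ...)` would point at a blocked shutdown rather than a rejected request. Pick a timeout longer than the task's kill timeout so that a slow but healthy shutdown is not misread as a hang.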

chenjpu · 2024-06-27T00:24:25Z

Hi @tgross
I just set provider = "" again, and it does report an error that 1 allocation of service is unplaced. The problem occurred in a production environment, so I'm sorry that I can't reproduce the failing scenario.

tgross · 2024-06-27T13:31:10Z

@chenjpu ok I understand.

If this happens again, you can capture the logs of the running client with: nomad monitor -log-level=DEBUG -node-id=$node_id. It might also be helpful to capture the goroutine stack by making a request to the client agent's HTTP endpoint at /debug/pprof/goroutine?debug=2.
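A minimal sketch for saving that goroutine dump to a file, assuming the client agent's HTTP API is reachable at an address such as http://127.0.0.1:4646. The function names and the output path are hypothetical, and the endpoint may only be served when debugging is enabled on the agent, which is an assumption to check against your agent configuration.

```python
import urllib.parse
import urllib.request


def goroutine_dump_url(agent_addr):
    """Return the goroutine dump URL for a client agent address."""
    base = agent_addr.rstrip("/")
    query = urllib.parse.urlencode({"debug": 2})
    return f"{base}/debug/pprof/goroutine?{query}"


def save_goroutine_dump(agent_addr, out_path, timeout_s=30.0):
    """Fetch the goroutine stacks and write them to out_path.

    Returns the number of bytes written.
    """
    url = goroutine_dump_url(agent_addr)
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        data = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(data)
    return len(data)
```

Calling `save_goroutine_dump("http://127.0.0.1:4646", "goroutines.txt")` while the restart is stuck gives a snapshot that can be attached to the issue alongside the DEBUG logs.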

chenjpu · 2024-06-29T04:25:56Z

OK, I am happy to help with this problem

tgross · 2024-07-26T19:11:45Z

Doing a little issue cleanup. Going to close this out as unable to reproduce.

github-actions · 2024-12-20T02:15:50Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

chenjpu added the type/bug label Apr 16, 2024

tgross added the stage/waiting-reply and theme/service-discovery labels Jun 21, 2024

tgross self-assigned this Jun 21, 2024

tgross added the theme/restart/reschedule label Jun 21, 2024

tgross added this to Nomad - Community Issues Triage Jun 24, 2024

tgross moved this to Triaging in Nomad - Community Issues Triage Jun 24, 2024

tgross removed the stage/waiting-reply label Jun 24, 2024

tgross added the stage/waiting-reply label Jun 26, 2024

tgross closed this as not planned Jul 26, 2024

github-project-automation bot moved this from Triaging to Done in Nomad - Community Issues Triage Jul 26, 2024

github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2024
