1.4.2: Jobs killed due to nomadService templating failures #15115
Comments
Thanks for the report @iSchluff. I haven't been able to reproduce it yet. Would you happen to have any client and server logs from around the time this happened?
@iSchluff the team found the root cause for the […]. But now I'm curious about the job failing due to the template render error. By default, Nomad should retry rendering forever, which is configured by the […].
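For reference, a minimal sketch of where that retry behavior lives on the client side, assuming the `template`/`nomad_retry` block documented for Nomad 1.4+ agents (the block name and defaults here are an assumption, not taken from this thread):

```hcl
# Sketch of a Nomad client agent config, assuming the 1.4+ nomad_retry block.
# This controls retries for template functions that hit the Nomad API
# (nomadService, nomadVar); it is not copied from the reporter's setup.
client {
  template {
    nomad_retry {
      attempts    = 0       # assumed to mean "retry indefinitely" per the docs
      backoff     = "250ms" # initial backoff between retries
      max_backoff = "1m"    # cap on the backoff interval
    }
  }
}
```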
Hi @lgfa29, no, the client config is completely default.
The only thing that might be special is that it's a single server+client instance.
Ah, interesting. Do you have any logs from the client around the time of failure?
I think the relevant portion is this, which was logged when the allocation was terminated. This is a single instance with server+client, so the logs should all be interleaved.
Full log: https://p.rrbone.net/?be507af6af7b677d#5nZiq24EAXZeUAMPrWcKz2wvoD2Jr6k6JRqgAGAZXx6k
I have since started over on this instance with a new data directory, as it was basically unusable in that state. With a 1.4.2 instance it pretty much works as expected, so it is clearly related to the update procedure.
Thanks for the logs, I will dig deeper throughout the week. The problem with the upgrade was fixed in #15121 and will be released when 1.4.3 comes out. I'm now trying to understand why the template caused the task to fail; I expected it to retry forever.
Maybe related: I am also having the problem that dependent templates are not updated after Nomad services successfully restart, leaving me with broken configs. On service stop the template is correctly updated, but not after the service comes back. Also, for whatever reason, the permission errors are back, even though this was a fresh setup.
+1
Spec: […]
nomad alloc status after updating an env variable in the second group: […]
Thanks for the extra info @veloc1 and @iSchluff. Could you check the allocation details as returned from the […]? There should be a field called […]. Thanks!
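As a hedged illustration of that request (the endpoint below is the standard allocations API; the exact field being asked about was lost from this transcript):

```sh
# Sketch: fetch the full allocation payload from the Nomad HTTP API.
# Replace <alloc-id> with the allocation UUID from `nomad job status`.
curl -s "${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/allocation/<alloc-id>"

# Or via the CLI:
nomad alloc status -json <alloc-id>
```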
@lgfa29 Yep, […]. Also, attaching logs from the client: […]
Looks like, for some reason, the dependency watcher is missing the auth token for the first requests?
@lgfa29 Faced this in production today as well. I have a close-to-default client config as well (…).
@lgfa29 any updates on this? We have been facing the same issue for a while now. Recently we also migrated to Nomad variables, which also show up alongside […]. Now, from what we've figured, this happens when you deploy a new version over an existing deployment. It seems like a race condition with the events channel in template.go: https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/template/template.go#L350-L371
Let me know if I could be of any help.
Hi all, thanks for the extra info. I haven't been able to reproduce this yet, so no updates so far. I will try the deployment update suggested by @Thunderbottom next time I get a chance. Thanks for the tip!
I have been facing the same issue with Nomad 1.4.3 when using job templates with the […]. Example job spec:

job "whoami" {
region = "global"
datacenters = ["dc1"]
namespace = "default"
type = "service"
update {
max_parallel = 1
min_healthy_time = "0s"
}
group "whoami" {
scaling {
min = 1
max = 4
}
network {
port "http" {}
}
service {
name = "whoami"
tags = []
port = "http"
address_mode = "host"
check {
name = "whoami HTTP check"
type = "http"
path = "/health"
interval = "10s"
timeout = "2s"
}
}
task "whoami" {
driver = "docker"
user = "65534" # nobody on Ubuntu
kill_signal = "SIGTERM"
kill_timeout = "15s"
template {
destination = "local/config.yaml"
data = <<EOF
{{ with nomadVar "nomad/jobs/whoami" }}
mysql_username = "{{ .mysql_username }}"
mysql_password = "{{ .mysql_password }}"
{{ end }}
EOF
}
config {
image = "traefik/whoami:v1.8.7"
args = [
"--name=${NOMAD_ALLOC_NAME}",
"--port=${NOMAD_PORT_http}",
]
network_mode = "host"
}
resources {
cpu = 100
memory = 64
}
}
}
}

Note that I'm using the Workload Identity feature by putting Nomad variables at the path nomad/jobs/whoami (the path referenced by the template above). The issue can always be reproduced by scaling the task group count up or down in the Nomad UI. Events with type […]
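For anyone trying to reproduce this, a hedged sketch of how the variable read by the template above can be created, and the CLI equivalent of the scaling step done in the UI (all values are placeholders):

```sh
# Create the Nomad variable read by the template's nomadVar call (placeholder values).
nomad var put nomad/jobs/whoami mysql_username=whoami mysql_password=example-secret

# Scale the group up or down to trigger the behavior described above
# (the reporter did this through the Nomad UI).
nomad job scale whoami whoami 2
```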
Hi folks! 👋 I was just fixing up #16259 and was looking around for similar issues and found this one. I suspect this issue will be fixed by #16266. Note that although #16259 talks about canaries, as far as I can tell this will happen for any in-place update, which looks like it could describe the behaviors we're seeing in this issue (unless I've missed something).
As noted above, I'm going to mark this as closed by #16266. Enough has changed here that if anyone hits this again and is reading this, I'd ask that they open a new issue with the debug-level logs described above.
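If it helps, one hedged way to capture those debug-level logs before filing a new issue (standard Nomad commands, not specific to this report):

```sh
# Stream debug-level logs from the local agent while reproducing the failure.
nomad monitor -log-level=DEBUG

# Or collect a support bundle covering servers and clients for a few minutes.
nomad operator debug -duration=3m -log-level=DEBUG
```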
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Output from nomad version:
Nomad v1.4.2 (039d70e)
Operating system and Environment details
Linux nomad1 5.4.0-131-generic 147-Ubuntu
Issue
I have a setup where a job with multiple groups has various template cross-references using nomadService to dynamically fill in the address/port details.
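For readers unfamiliar with the pattern, a minimal sketch of such a cross-referencing template (the destination path and config key are placeholders, not the reporter's actual spec; the service name comes from the errors below):

```hcl
# Sketch of a template stanza in one group that points at a service
# registered by another group, via the nomadService function.
template {
  destination = "local/alertmanager.conf"
  data        = <<EOF
{{- range nomadService "alertmanager" }}
alertmanager_url = "http://{{ .Address }}:{{ .Port }}"
{{- end }}
EOF
}
```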
With 1.4.x I see jobs getting killed because of template failures like these, even though the referenced service is running fine:
alertmanager service: […]
promteams service: […]
Additionally sometimes a template will not update even though the nomad service changed.
Reproduction steps
Expected Result
Tasks should not be randomly killed, and templates should be properly updated when the service changes.
Actual Result