allow querying Nomad service discovery checks by service (for consul-template) #23317

Open
msirovy opened this issue Jun 13, 2024 · 3 comments

msirovy commented Jun 13, 2024

Nomad version

Tested on nomad versions 1.7.3, 1.8.0

Operating system and Environment details

Linux XXX 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
Docker version 26.0.0, build 2ae903e

Issue

We have a deployment using php-fpm with nginx, but I can reproduce the same results with this example deployment. When I update the deployment I see a short downtime (2-10 s). I expect Nomad to start the new version, wait until the health checks pass, and only then add the new deployment to the services, but the not-yet-working deployment becomes available sooner. This is a critical issue for us and we can't move to production before we solve it.

The strange thing is that when I use a deployment with only one job in the group, the update works without downtime...
I am not able to share my php-fpm+nginx deployment because it uses our private registry, but this example behaves the same way.

Reproduction steps

I have a small development Nomad cluster with 5 nodes (1 master, 4 nodes), using Nomad service discovery and an nginx ingress (similar to this: https://github.com/theztd/startup-infra-docker/blob/main/files/jobs/nginx-ingress/deploy.nomad)

  1. Run the deployment definition (from the job file section below)
  2. Run curl in a loop like this:
while true; do curl "https://hello.freelo.cz/v1/health/"; sleep 1; echo ""; done
  3. Run an updated version of the deployment (update for example )
  4. Try it at least 3 times and you will see downtime, inconsistently, in at least 2 of the runs.
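
The curl loop in the reproduction steps shows failures only by eye. As a minimal sketch (not part of the original report), the same probing can be done in Python so that each outage is recorded with a start and end time. The function names `probe`, `downtime_windows`, and `watch` are hypothetical, and treating any non-2xx answer or connection error as "down" is an assumption.

```python
import time
import urllib.request


def probe(url, timeout=2.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as res:
            return 200 <= res.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def downtime_windows(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    A window starts at the first failed sample and ends at the next
    successful one; an outage still open at the end gets end=None.
    """
    windows, start = [], None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def watch(url, duration=120.0, interval=1.0):
    """Poll the URL for `duration` seconds and return the outage windows."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.time(), probe(url)))
        time.sleep(interval)
    return downtime_windows(samples)
```

Calling `watch("https://hello.freelo.cz/v1/health/")` while the updated job is being submitted would return a list of outage windows whose timestamps can be compared with the event stream log.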

Expected Result

The update completes without downtime.

Actual Result

Short but noticeable downtime during the update.

Job file (if appropriate)

job "hello" {

  group "servers" {
    count = 2

    network {
      mode = "bridge"
      port "www" {
        to = 80
      }
      
      port "app" {
        to = 8080
      }
    }

    task "app" {
      driver = "docker"

      config {
        image      = "ghcr.io/theztd/troll:1.2.0"
        force_pull = true

        ports = ["app"]
      }

      service {
        provider = "nomad"
        port     = "www"

        check {
          name     = "${NOMAD_JOB_NAME} - alive"
          type     = "http"
          path     = "/v1/_healthz/"
          interval = "10s"
          timeout  = "2s"
        }

        tags = [
          "http=true",
          "http.url=hello.freelo.cz"
        ]
      }

      env {
        ADDRESS = ":8080"
        WAIT    = 100
      }
    }
    
    # Tasks are individual units of work that are run by Nomad.
    task "web" {
      # This particular task starts a simple web server within a Docker container
      driver = "docker"

      config {
        image   = "nginx:latest"
        ports   = ["www"]
        
        volumes = [
          "data/:/usr/share/nginx/html",
          "local/:/etc/nginx/conf.d/"
        ]
      }

      template {
        data        = <<-EOF
                      <h1>Hello, Nomad!</h1>
                      <ul>
                        <li>Version: <b>- 6 -</b></li>
                        <li>Currently running on port: {{env "NOMAD_PORT_www"}}</li>
                      </ul>
                      EOF
        destination = "data/index.html"
      }
      
      template {
        data        = <<-EOF
                      server {
                          listen 80 default;

                          # server_name _;

                          location / {
                              proxy_pass http://localhost:8080;
                              proxy_set_header Host $http_host;
                              proxy_set_header X-Real-IP $remote_addr;
                              proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                              proxy_set_header X-Forwarded-Proto $scheme;
                          }
                      }
                      EOF
        destination = "local/default.conf"
      }

      # Specify the maximum resources required to run the task
      resources {
        cpu    = 50
        memory = 64
      }
    }
  }
}

Nomad Server logs (if appropriate)

I captured the event stream with this small script:

#!/usr/bin/env python3

import time, requests, json, os
from datetime import datetime
from pprint import pprint

def stream(domain, token):
    with requests.get(domain, stream=True, headers={"X-Nomad-Token": token}) as res:
        if res.status_code != 200:
            print(f"Failed to connect: {res.status_code}")
            return

        print(f"{datetime.now().strftime('%H:%M.%S')}: Connected ({res.status_code})")

        buffer = ""
        for line in res.iter_lines():
            if line:
                buffer += line.decode('utf-8')
                event_data = ""
                try:
                    event_data = json.loads(buffer)
                    buffer = ""  # Reset buffer after successful parsing

                except json.JSONDecodeError as err:
                    print(err)
                    print("Unable to parse json input")
                    # Incomplete data, continue to buffer it
                    continue

                try:
                    for ev in event_data.get("Events", []):
                        print("=================================")
                        #pprint(ev)
                        print("{} -> Topic:{} Type:{} Name:{} Job:{} ({})".format(
                            datetime.now().strftime("%H:%M.%S"), ev["Topic"], ev.get("Type", "undefined"), ev.get("Name", "undefined"), ev.get("JobID", "undefined"), str(ev["Payload"])[:300]))

                except KeyError as err:
                    print(err)
                    print("Unable to find Events key in data.... SKIP")
                    continue


if __name__ == "__main__":
    stream(os.getenv("NOMAD_ADDR")+"/v1/event/stream", os.getenv("NOMAD_TOKEN"))
=================================
17:03.34 -> Topic:Job Type:JobRegistered Name:undefined Job:undefined ({'Job': {'Stop': False, 'Region': 'global', 'Namespace': 'default', 'ID': 'hello', 'ParentID': '', 'Name': 'hello', 'Type': 'service', 'Priority': 50, 'AllAtOnce': False, 'Datacenters': ['*'], 'NodePool': 'default', 'Constraints': None, 'Affinities': None, 'Spreads': None, 'TaskGroups': [{'Name': 'h)
=================================
17:03.34 -> Topic:Evaluation Type:JobRegistered Name:undefined Job:undefined ({'Evaluation': {'ID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Namespace': 'default', 'Priority': 50, 'Type': 'service', 'TriggeredBy': 'job-register', 'JobID': 'hello', 'JobModifyIndex': 780713, 'Status': 'pending', 'CreateIndex': 780713, 'ModifyIndex': 780713, 'CreateTime': 1718291014573443082, 'M)
=================================
17:03.34 -> Topic:Evaluation Type:PlanResult Name:undefined Job:undefined ({'Evaluation': {'ID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Namespace': 'default', 'Priority': 50, 'Type': 'service', 'TriggeredBy': 'job-register', 'JobID': 'hello', 'JobModifyIndex': 780713, 'Status': 'pending', 'CreateIndex': 780713, 'ModifyIndex': 780714, 'CreateTime': 1718291014573443082, 'M)
=================================
17:03.34 -> Topic:Allocation Type:PlanResult Name:undefined Job:undefined ({'Allocation': {'ID': 'd3223314-31a8-1daf-1fa3-77eb927db251', 'Namespace': 'default', 'EvalID': '253d9daf-3bdf-5835-7231-74f12e8b3a25', 'Name': 'hello.servers[1]', 'NodeID': 'd64f374e-51f3-8880-9a47-41a03979f3b5', 'NodeName': 'n2-dev', 'JobID': 'hello', 'TaskGroup': 'servers', 'Resources': {'CPU': 1)
=================================
17:03.34 -> Topic:Allocation Type:PlanResult Name:undefined Job:undefined ({'Allocation': {'ID': '1ae43b51-a15e-b46b-f3d4-adf44cae2022', 'Namespace': 'default', 'EvalID': 'fc58484b-68f2-4060-a072-47da5a83046d', 'Name': 'hello.servers[0]', 'NodeID': '47e6428a-3251-e294-ab9a-60dda10d8a9b', 'NodeName': 'n5-dev', 'JobID': 'hello', 'TaskGroup': 'servers', 'Resources': {'CPU': 1)
=================================
17:03.34 -> Topic:Allocation Type:PlanResult Name:undefined Job:undefined ({'Allocation': {'ID': 'df98bf40-d1a9-6823-ed3b-a6c2cf60a15b', 'Namespace': 'default', 'EvalID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Name': 'hello.hello[1]', 'NodeID': '7b53a6e3-ecca-fe73-092c-094ecbd58330', 'NodeName': 'n1-dev', 'JobID': 'hello', 'TaskGroup': 'hello', 'Resources': {'CPU': 150, )
=================================
17:03.34 -> Topic:Deployment Type:PlanResult Name:undefined Job:undefined ({'Deployment': {'ID': '773af71b-b377-9074-23b3-d9cd07573928', 'Namespace': 'default', 'JobID': 'hello', 'JobVersion': 23, 'JobModifyIndex': 780713, 'JobSpecModifyIndex': 780713, 'JobCreateIndex': 758849, 'IsMultiregion': False, 'TaskGroups': {'hello': {'AutoRevert': False, 'AutoPromote': False, 'Pro)
=================================
17:03.34 -> Topic:Allocation Type:PlanResult Name:undefined Job:undefined ({'Allocation': {'ID': '876bce51-b7fd-86c3-5838-c7be5b305a61', 'Namespace': 'default', 'EvalID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Name': 'hello.hello[0]', 'NodeID': '8315c70c-0587-bd77-b890-116bbab83439', 'NodeName': 'n4-dev', 'JobID': 'hello', 'TaskGroup': 'hello', 'Resources': {'CPU': 150, )
=================================
17:03.34 -> Topic:Evaluation Type:EvaluationUpdated Name:undefined Job:undefined ({'Evaluation': {'ID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Namespace': 'default', 'Priority': 50, 'Type': 'service', 'TriggeredBy': 'job-register', 'JobID': 'hello', 'JobModifyIndex': 780713, 'DeploymentID': '773af71b-b377-9074-23b3-d9cd07573928', 'Status': 'complete', 'QueuedAllocations': {'hel)
=================================
17:03.34 -> Topic:Service Type:ServiceDeregistration Name:undefined Job:undefined ({'Service': {'ID': '_nomad-task-d3223314-31a8-1daf-1fa3-77eb927db251-app-hello-servers-app-www', 'ServiceName': 'hello-servers-app', 'Namespace': 'default', 'NodeID': 'd64f374e-51f3-8880-9a47-41a03979f3b5', 'Datacenter': 'dc1', 'JobID': 'hello', 'AllocID': 'd3223314-31a8-1daf-1fa3-77eb927db251', 'Ta)
=================================
17:03.34 -> Topic:Service Type:ServiceDeregistration Name:undefined Job:undefined ({'Service': {'ID': '_nomad-task-1ae43b51-a15e-b46b-f3d4-adf44cae2022-app-hello-servers-app-www', 'ServiceName': 'hello-servers-app', 'Namespace': 'default', 'NodeID': '47e6428a-3251-e294-ab9a-60dda10d8a9b', 'Datacenter': 'dc1', 'JobID': 'hello', 'AllocID': '1ae43b51-a15e-b46b-f3d4-adf44cae2022', 'Ta)
=================================
17:03.34 -> Topic:Allocation Type:AllocationUpdated Name:undefined Job:undefined ({'Allocation': {'ID': '1ae43b51-a15e-b46b-f3d4-adf44cae2022', 'Namespace': 'default', 'EvalID': 'fc58484b-68f2-4060-a072-47da5a83046d', 'Name': 'hello.servers[0]', 'NodeID': '47e6428a-3251-e294-ab9a-60dda10d8a9b', 'NodeName': 'n5-dev', 'JobID': 'hello', 'TaskGroup': 'servers', 'Resources': {'CPU': 1)
=================================
17:03.34 -> Topic:Node Type:AllocationUpdated Name:undefined Job:undefined ({'Node': {'Attributes': {'os.signals': 'SIGTRAP,SIGTTIN,SIGQUIT,SIGSYS,SIGTSTP,SIGTTOU,SIGUSR1,SIGFPE,SIGSEGV,SIGILL,SIGINT,SIGALRM,SIGBUS,SIGIO,SIGWINCH,SIGPROF,SIGSTOP,SIGTERM,SIGUSR2,SIGABRT,SIGCONT,SIGPIPE,SIGXFSZ,SIGIOT,SIGKILL,SIGXCPU,SIGNULL,SIGHUP', 'cpu.reservablecores': '7', 'plugins.cni.v)
=================================
17:03.34 -> Topic:Allocation Type:AllocationUpdated Name:undefined Job:undefined ({'Allocation': {'ID': 'd3223314-31a8-1daf-1fa3-77eb927db251', 'Namespace': 'default', 'EvalID': '253d9daf-3bdf-5835-7231-74f12e8b3a25', 'Name': 'hello.servers[1]', 'NodeID': 'd64f374e-51f3-8880-9a47-41a03979f3b5', 'NodeName': 'n2-dev', 'JobID': 'hello', 'TaskGroup': 'servers', 'Resources': {'CPU': 1)
=================================
17:03.34 -> Topic:Node Type:AllocationUpdated Name:undefined Job:undefined ({'Node': {'Attributes': {'nomad.version': '1.8.0', 'plugins.cni.version.static': 'v1.1.1', 'unique.network.ip-address': '37.235.104.164', 'os.cgroups.version': '2', 'cpu.arch': 'amd64', 'plugins.cni.version.vrf': 'v1.1.1', 'nomad.bridge.hairpin_mode': 'false', 'plugins.cni.version.bridge': 'v1.1.1',)
=================================
17:03.35 -> Topic:Allocation Type:AllocationUpdated Name:undefined Job:undefined ({'Allocation': {'ID': 'df98bf40-d1a9-6823-ed3b-a6c2cf60a15b', 'Namespace': 'default', 'EvalID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Name': 'hello.hello[1]', 'NodeID': '7b53a6e3-ecca-fe73-092c-094ecbd58330', 'NodeName': 'n1-dev', 'JobID': 'hello', 'TaskGroup': 'hello', 'Resources': {'CPU': 150, )
=================================
17:03.35 -> Topic:Node Type:AllocationUpdated Name:undefined Job:undefined ({'Node': {'Attributes': {'plugins.cni.version.ipvlan': 'v1.1.1', 'driver.docker.version': '26.0.0', 'plugins.cni.version.firewall': 'v1.1.1', 'nomad.version': '1.8.0', 'cpu.frequency': '3408', 'numa.node0.cores': '0-4', 'unique.hostname': 'n1-dev.freelo.net', 'plugins.cni.version.tuning': 'v1.1.1', )
=================================

DOWN TIME STARTS

17:03.35 -> Topic:Service Type:ServiceRegistration Name:undefined Job:undefined ({'Service': {'ID': '_nomad-task-876bce51-b7fd-86c3-5838-c7be5b305a61-app-hello-hello-app-www', 'ServiceName': 'hello-hello-app', 'Namespace': 'default', 'NodeID': '8315c70c-0587-bd77-b890-116bbab83439', 'Datacenter': 'dc1', 'JobID': 'hello', 'AllocID': '876bce51-b7fd-86c3-5838-c7be5b305a61', 'Tags':)
=================================
17:03.35 -> Topic:Service Type:ServiceRegistration Name:undefined Job:undefined ({'Service': {'ID': '_nomad-task-df98bf40-d1a9-6823-ed3b-a6c2cf60a15b-app-hello-hello-app-www', 'ServiceName': 'hello-hello-app', 'Namespace': 'default', 'NodeID': '7b53a6e3-ecca-fe73-092c-094ecbd58330', 'Datacenter': 'dc1', 'JobID': 'hello', 'AllocID': 'df98bf40-d1a9-6823-ed3b-a6c2cf60a15b', 'Tags':)
=================================
17:03.35 -> Topic:Allocation Type:AllocationUpdated Name:undefined Job:undefined ({'Allocation': {'ID': 'df98bf40-d1a9-6823-ed3b-a6c2cf60a15b', 'Namespace': 'default', 'EvalID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Name': 'hello.hello[1]', 'NodeID': '7b53a6e3-ecca-fe73-092c-094ecbd58330', 'NodeName': 'n1-dev', 'JobID': 'hello', 'TaskGroup': 'hello', 'Resources': {'CPU': 150, )
=================================
17:03.35 -> Topic:Node Type:AllocationUpdated Name:undefined Job:undefined ({'Node': {'Attributes': {'plugins.cni.version.ptp': 'v1.1.1', 'driver.docker.bridge_ip': '172.17.0.1', 'driver.docker.runtimes': 'io.containerd.runc.v2,runc', 'unique.network.ip-address': '37.235.102.157', 'cpu.reservablecores': '5', 'plugins.cni.version.host-local': 'v1.1.1', 'plugins.cni.version.l)
=================================
17:03.36 -> Topic:Allocation Type:AllocationUpdated Name:undefined Job:undefined ({'Allocation': {'ID': '876bce51-b7fd-86c3-5838-c7be5b305a61', 'Namespace': 'default', 'EvalID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Name': 'hello.hello[0]', 'NodeID': '8315c70c-0587-bd77-b890-116bbab83439', 'NodeName': 'n4-dev', 'JobID': 'hello', 'TaskGroup': 'hello', 'Resources': {'CPU': 150, )
=================================
17:03.36 -> Topic:Node Type:AllocationUpdated Name:undefined Job:undefined ({'Node': {'Attributes': {'os.version': '12.5', 'nomad.revision': '28b82e4b2259fae5a62e2ed47395334bea5a24c4', 'cpu.usablecompute': '16540', 'plugins.cni.version.loopback': 'v1.1.1', 'plugins.cni.version.vlan': 'v1.1.1', 'cpu.numcores': '5', 'cpu.modelname': 'Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz', 'n)
=================================
17:03.42 -> Topic:Allocation Type:AllocationUpdated Name:undefined Job:undefined ({'Allocation': {'ID': 'df98bf40-d1a9-6823-ed3b-a6c2cf60a15b', 'Namespace': 'default', 'EvalID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Name': 'hello.hello[1]', 'NodeID': '7b53a6e3-ecca-fe73-092c-094ecbd58330', 'NodeName': 'n1-dev', 'JobID': 'hello', 'TaskGroup': 'hello', 'Resources': {'CPU': 150, )
=================================
17:03.42 -> Topic:Node Type:AllocationUpdated Name:undefined Job:undefined ({'Node': {'Attributes': {'cpu.usablecompute': '16540', 'plugins.cni.version.vlan': 'v1.1.1', 'plugins.cni.version.sbr': 'v1.1.1', 'driver.docker': '1', 'driver.docker.volumes.enabled': 'true', 'kernel.landlock': 'v2', 'driver.docker.os_type': 'linux', 'cpu.modelname': 'Intel(R) Xeon(R) E-2236 CPU @ )
=================================
17:03.42 -> Topic:Allocation Type:AllocationUpdated Name:undefined Job:undefined ({'Allocation': {'ID': '876bce51-b7fd-86c3-5838-c7be5b305a61', 'Namespace': 'default', 'EvalID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Name': 'hello.hello[0]', 'NodeID': '8315c70c-0587-bd77-b890-116bbab83439', 'NodeName': 'n4-dev', 'JobID': 'hello', 'TaskGroup': 'hello', 'Resources': {'CPU': 150, )
=================================
17:03.42 -> Topic:Node Type:AllocationUpdated Name:undefined Job:undefined ({'Node': {'Attributes': {'os.cgroups.version': '2', 'numa.node.count': '1', 'unique.storage.bytesfree': '71301427200', 'plugins.cni.version.ptp': 'v1.1.1', 'kernel.version': '6.1.0-18-amd64', 'memory.totalbytes': '16767406080', 'os.name': 'debian', 'os.signals': 'SIGALRM,SIGFPE,SIGTRAP,SIGTSTP,SIGHU)

NOW RUNNING

=================================
17:04.00 -> Topic:Deployment Type:AllocationUpdated Name:undefined Job:undefined ({'Deployment': {'ID': '773af71b-b377-9074-23b3-d9cd07573928', 'Namespace': 'default', 'JobID': 'hello', 'JobVersion': 23, 'JobModifyIndex': 780713, 'JobSpecModifyIndex': 780713, 'JobCreateIndex': 758849, 'IsMultiregion': False, 'TaskGroups': {'hello': {'AutoRevert': False, 'AutoPromote': False, 'Pro)
=================================
17:04.00 -> Topic:Allocation Type:AllocationUpdated Name:undefined Job:undefined ({'Allocation': {'ID': '876bce51-b7fd-86c3-5838-c7be5b305a61', 'Namespace': 'default', 'EvalID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Name': 'hello.hello[0]', 'NodeID': '8315c70c-0587-bd77-b890-116bbab83439', 'NodeName': 'n4-dev', 'JobID': 'hello', 'TaskGroup': 'hello', 'Resources': {'CPU': 150, )
=================================
17:04.00 -> Topic:Node Type:AllocationUpdated Name:undefined Job:undefined ({'Node': {'Attributes': {'plugins.cni.version.portmap': 'v1.1.1', 'numa.node0.cores': '0-4', 'plugins.cni.version.dhcp': 'v1.1.1', 'cpu.totalcompute': '17040', 'driver.docker.version': '25.0.3', 'driver.docker.os_type': 'linux', 'cpu.arch': 'amd64', 'nomad.service_discovery': 'true', 'kernel.arch': )
=================================
17:04.01 -> Topic:Deployment Type:AllocationUpdated Name:undefined Job:undefined ({'Deployment': {'ID': '773af71b-b377-9074-23b3-d9cd07573928', 'Namespace': 'default', 'JobID': 'hello', 'JobVersion': 23, 'JobModifyIndex': 780713, 'JobSpecModifyIndex': 780713, 'JobCreateIndex': 758849, 'IsMultiregion': False, 'TaskGroups': {'hello': {'AutoRevert': False, 'AutoPromote': False, 'Pro)
=================================
17:04.01 -> Topic:Allocation Type:AllocationUpdated Name:undefined Job:undefined ({'Allocation': {'ID': 'df98bf40-d1a9-6823-ed3b-a6c2cf60a15b', 'Namespace': 'default', 'EvalID': 'ac30e20a-3f0d-ab1d-a3a6-a0392b145f8c', 'Name': 'hello.hello[1]', 'NodeID': '7b53a6e3-ecca-fe73-092c-094ecbd58330', 'NodeName': 'n1-dev', 'JobID': 'hello', 'TaskGroup': 'hello', 'Resources': {'CPU': 150, )
=================================
17:04.01 -> Topic:Node Type:AllocationUpdated Name:undefined Job:undefined ({'Node': {'Attributes': {'plugins.cni.version.ptp': 'v1.1.1', 'driver.docker.bridge_ip': '172.17.0.1', 'driver.docker.runtimes': 'io.containerd.runc.v2,runc', 'unique.network.ip-address': '37.235.102.157', 'cpu.reservablecores': '5', 'plugins.cni.version.host-local': 'v1.1.1', 'plugins.cni.version.l)
=================================
17:04.01 -> Topic:Evaluation Type:AllocationUpdateDesiredStatus Name:undefined Job:undefined ({'Evaluation': {'ID': '0a32e63e-bc0a-7ca0-b1f2-f6ec34cf175b', 'Namespace': 'default', 'Priority': 50, 'Type': 'service', 'TriggeredBy': 'deployment-watcher', 'JobID': 'hello', 'DeploymentID': '773af71b-b377-9074-23b3-d9cd07573928', 'Status': 'pending', 'CreateIndex': 780729, 'ModifyIndex': 780729, ')
=================================
17:04.01 -> Topic:Deployment Type:PlanResult Name:undefined Job:undefined ({'Deployment': {'ID': '773af71b-b377-9074-23b3-d9cd07573928', 'Namespace': 'default', 'JobID': 'hello', 'JobVersion': 23, 'JobModifyIndex': 780713, 'JobSpecModifyIndex': 780713, 'JobCreateIndex': 758849, 'IsMultiregion': False, 'TaskGroups': {'hello': {'AutoRevert': False, 'AutoPromote': False, 'Pro)
=================================
17:04.01 -> Topic:Job Type:PlanResult Name:undefined Job:undefined ({'Job': {'Stop': False, 'Region': 'global', 'Namespace': 'default', 'ID': 'hello', 'ParentID': '', 'Name': 'hello', 'Type': 'service', 'Priority': 50, 'AllAtOnce': False, 'Datacenters': ['*'], 'NodePool': 'default', 'Constraints': None, 'Affinities': None, 'Spreads': None, 'TaskGroups': [{'Name': 'h)
=================================
17:04.01 -> Topic:Evaluation Type:PlanResult Name:undefined Job:undefined ({'Evaluation': {'ID': '0a32e63e-bc0a-7ca0-b1f2-f6ec34cf175b', 'Namespace': 'default', 'Priority': 50, 'Type': 'service', 'TriggeredBy': 'deployment-watcher', 'JobID': 'hello', 'DeploymentID': '773af71b-b377-9074-23b3-d9cd07573928', 'Status': 'pending', 'CreateIndex': 780729, 'ModifyIndex': 780730, ')
=================================
17:04.01 -> Topic:Allocation Type:PlanResult Name:undefined Job:undefined ({'Allocation': {'ID': 'd3223314-31a8-1daf-1fa3-77eb927db251', 'Namespace': 'default', 'EvalID': '253d9daf-3bdf-5835-7231-74f12e8b3a25', 'Name': 'hello.servers[1]', 'NodeID': 'd64f374e-51f3-8880-9a47-41a03979f3b5', 'NodeName': 'n2-dev', 'JobID': 'hello', 'TaskGroup': 'servers', 'Resources': {'CPU': 1)
=================================
17:04.01 -> Topic:Allocation Type:PlanResult Name:undefined Job:undefined ({'Allocation': {'ID': '1ae43b51-a15e-b46b-f3d4-adf44cae2022', 'Namespace': 'default', 'EvalID': 'fc58484b-68f2-4060-a072-47da5a83046d', 'Name': 'hello.servers[0]', 'NodeID': '47e6428a-3251-e294-ab9a-60dda10d8a9b', 'NodeName': 'n5-dev', 'JobID': 'hello', 'TaskGroup': 'servers', 'Resources': {'CPU': 1)
=================================
17:04.01 -> Topic:Evaluation Type:EvaluationUpdated Name:undefined Job:undefined ({'Evaluation': {'ID': '0a32e63e-bc0a-7ca0-b1f2-f6ec34cf175b', 'Namespace': 'default', 'Priority': 50, 'Type': 'service', 'TriggeredBy': 'deployment-watcher', 'JobID': 'hello', 'DeploymentID': '773af71b-b377-9074-23b3-d9cd07573928', 'Status': 'complete', 'QueuedAllocations': {'hello': 0}, 'SnapshotIn)

Nomad Client logs (if appropriate)

Member

tgross commented Jun 21, 2024

Hi @msirovy! You almost certainly are hitting this because your job doesn't have a task.shutdown_delay set. This value should be picked so that you overlap in a reasonable way with the update.min_healthy_time and with whatever tooling you're using to update the ingress configuration.

In #23326 we're discussing making omitting this field a warning if you're submitting service jobs with service blocks, because it's such a common problem.

Author

msirovy commented Jun 25, 2024

> Hi @msirovy! You almost certainly are hitting this because your job doesn't have a task.shutdown_delay set. This value should be picked so that you overlap in a reasonable way with the update.min_healthy_time and with whatever tooling you're using to update the ingress configuration.
>
> In #23326 we're discussing making omitting this field a warning if you're submitting service jobs with service blocks, because it's such a common problem.

Thanks for your reply. I've tried to tune these options, but without any positive result. In the meantime I've continued debugging and was able to isolate the problem a bit more.

  • It occurs only when my cluster nodes have been running without a restart for a longer time (around 1 hour). After restarting all nodes, the downtime during an update is unnoticeable (tested many times).
  • When I use Consul for service discovery, the problem is gone.

I am not able to debug it further, but if you have any recommendations I'll try them.

Member

tgross commented Jun 25, 2024

Thanks for that extra context @msirovy. I took another look through the Nomad Services feature and I suspect that the problem you're seeing is because the consul-template nomadService query returns all service instances, not just healthy ones. In fact, the Read Service API doesn't support filtering by health at all, and you need a second query to the Allocation Checks API to get that data. With Consul, by contrast, the service query returns only healthy service instances by default.

There's an architectural reason for this, which is that the Nomad server doesn't record Nomad service health checks. When you query them, the server sends a request to the client that has those allocations. This was identified as "Future Work" in the original design document for Nomad service checks (internal doc ref for Nomad engineers reading this):

One key feature missing from the initial implementation of Nomad service provider checks is a way to query service check status by service name. Such an API could serve two purposes that are of high priority, but not critical. The first is being able to expose the overall healthiness of a particular service in the UI. In the initial implementation, the only way to view the result of a check is in the context of a specific allocation running on its single Client. There is no way to get an aggregate understanding of the healthiness of a check across allocations / Clients.

We'd need to implement this in order to have the nomadService query in CT support the very reasonable use case you have. Sorry to say that I don't have a workaround for you today except to use Consul or to add configuration to Nginx to retry requests to the other backend instance during a deployment. I'll mark this for roadmapping.
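
The two-query approach described above (Read Service API, then Allocation Checks API per allocation) can be sketched roughly as follows. This is a minimal illustration, not an official client or the consul-template implementation. It assumes the endpoints `/v1/service/:service_name` and `/v1/client/allocation/:alloc_id/checks`, that each check result carries `Service` and `Status` fields with `"success"` meaning passing, and the default namespace. The helper names `fetch_json`, `healthy_instances`, and `query_healthy` are hypothetical.

```python
import json
import urllib.request


def fetch_json(base_url, path, token=None):
    """GET a Nomad HTTP API path and decode the JSON body."""
    req = urllib.request.Request(base_url.rstrip("/") + path)
    if token:
        req.add_header("X-Nomad-Token", token)
    with urllib.request.urlopen(req, timeout=5) as res:
        return json.load(res)


def healthy_instances(services, checks_by_alloc, service_name):
    """Keep only registrations whose checks for this service all pass.

    services:        list of registrations from the Read Service API
    checks_by_alloc: {AllocID: {check_id: check_result}} as returned by
                     the Allocation Checks API for each allocation
    """
    healthy = []
    for svc in services:
        checks = [
            c for c in checks_by_alloc.get(svc["AllocID"], {}).values()
            if c.get("Service") == service_name
        ]
        # Assumption: an instance with no check results yet is not healthy.
        if checks and all(c.get("Status") == "success" for c in checks):
            healthy.append((svc["Address"], svc["Port"]))
    return healthy


def query_healthy(base_url, service_name, token=None):
    """Run both queries against a live cluster and filter the result."""
    services = fetch_json(base_url, f"/v1/service/{service_name}", token)
    checks_by_alloc = {
        svc["AllocID"]: fetch_json(
            base_url, f"/v1/client/allocation/{svc['AllocID']}/checks", token
        )
        for svc in services
    }
    return healthy_instances(services, checks_by_alloc, service_name)
```

A tool that renders the ingress configuration could call `query_healthy(os.environ["NOMAD_ADDR"], "hello-servers-app", os.getenv("NOMAD_TOKEN"))` and write only the returned address/port pairs into the upstream list. Because each allocation needs its own checks request, this costs one extra round trip per instance, which is the overhead a by-service check API would remove.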

tgross changed the title from "Nomad service discovery register new allocation before is ready" to "Nomad service discovery checks can't be accessed by service (for templates)" on Jun 25, 2024
tgross changed the title from "Nomad service discovery checks can't be accessed by service (for templates)" to "allow querying Nomad service discovery checks by service (for consul-template)" on Jun 25, 2024
tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage on Jun 25, 2024
tgross removed their assignment on Jun 25, 2024
Projects
Status: Needs Roadmapping