Nomad Service Discovery unable to find service #16983

Open
vincenthuynh opened this issue Apr 25, 2023 · 11 comments


vincenthuynh commented Apr 25, 2023

Nomad version

Nomad v1.4.7

Operating system and Environment details

Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux

Issue

An allocation is unable to find a Nomad service even though the service exists. This seems to start happening on a client after an uptime of 2-3 days.

Reproduction steps

  • Task 1: register a service myservice using the Nomad provider
  • Task 2: use a template stanza and the nomadService function to reference the service registered in Task 1

The service can be listed:

$ nomad service list -namespace="*"
Service Name  Namespace  Tags
myservice     default    []
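
For completeness, the individual registrations behind that entry can also be inspected with the standard service info command, which shows each registered instance rather than just the service name:

$ nomad service info -namespace=default myservice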

Expected Result

The service is discovered consistently.

Actual Result

Task log:

Template | Missing: nomad.service(myservice)

Job file (if appropriate)

Task 1:

    service {
      provider = "nomad"
      name     = "myservice"
      port     = "redis"
    }

Task 2:

      template {
        data = <<EOH
{{range nomadService "myservice"}}
spring.redis.host: {{ .Address }}
spring.redis.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
      }

Nomad Client logs

2023-04-25T16:10:02.354Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 5 after "4s")
2023-04-25T16:10:06.355Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 6 after "8s")
2023-04-25T16:10:14.356Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 7 after "16s")
vincenthuynh (Author) commented:

The workaround is to restart the Nomad service/agent on the client node.
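
Assuming the agent on the client is managed by systemd under a unit named nomad (adjust to your setup), that is simply:

$ sudo systemctl restart nomad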

shoenig (Member) commented Apr 27, 2023

Hi @vincenthuynh, so far I haven't been able to reproduce what you're seeing - in my case the template is always rendered successfully once the upstream task is started and its service is registered. Before I dig in further, could you post a complete job file that experiences the issue? I want to make sure we're not missing something (e.g. using group vs. task services, etc.).

The test job file I've been using:

bug.hcl
job "bug" {

  group "group" {
    network {
      port "http" {
        to = 8080
      }
    }

    task "python" {
      driver = "raw_exec"

      config {
        command = "python3"
        args = ["-m", "http.server", "8080"]
      }

      service {
        provider = "nomad"
        name = "python"
        port = "http"
      }

      resources {
        cpu    = 10
        memory = 32
      }
    }

    task "client" {
      driver = "raw_exec"

      template {
        data = <<EOH
{{range nomadService "python"}}
blah.host: {{ .Address }}
blah.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
      }

      config {
        command = "sleep"
        args = ["infinity"]
      }
    }
  }
}

shoenig self-assigned this Apr 27, 2023
vincenthuynh (Author) commented:

Hi @shoenig,

We've noticed that it takes 2-3 days before it starts happening.

Here's another reproduction:

  • An old allocation was stopped and a new one was created, which happened to land on the same node (screenshot omitted).
  • The new allocation is unable to find the service (screenshot omitted).
  • Applying the workaround: simply restarting the Nomad agent on the client allows the task to discover the service again and start successfully.

Here's our job file:

myservice.hcl
job "myservice" {

  group "myservice" {
    network {
      mode = "bridge"
    }

    service {
      name = "myservice"
      port = "8080"
      tags = [
        "env=${var.env}",
        "version=${var.version}",
      ]
      connect {
        sidecar_service {}
      }
    }

    task "myservice" {
      driver = "docker"
      leader = true
      config {
        image = "gcr.io/myservice"
        work_dir = "/local"
      }

      vault {
        policies = ["myservice"]
        change_mode = "noop"
      }

      template {
        data = <<EOH
{{range nomadService "myservice-cache-redis"}}
spring.redis.host: {{ .Address }}
spring.redis.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
        change_mode = "noop"
      }
    }
  }
    
  group "myservice-cache" {
    network {
      mode = "bridge"
      port "redis" {
        to = 6379
      }
    }

    service {
      provider = "nomad"
      name = "myservice-cache-redis"
      port = "redis"
    }

    task "redis" {
      driver = "docker"
      config {
        image = "redis:6.2.0"
        args  = ["redis-server", "/local/redis.conf"]
      }
      template {
        data = <<-EOH
maxmemory 250mb
maxmemory-policy allkeys-lru
EOH
        destination = "/local/redis.conf"
        change_mode = "noop"
      }
      resources {
        cpu    = 100
        memory = 256
      }
    }
  }
}

Hope that helps. Thanks!

gulducat (Member) commented Jun 7, 2023

I encountered a similar issue caused by having NOMAD_ADDR set in the environment that the nomad agent was run in. That variable apparently carried through to the Nomad API client that consul-template uses and caused its API calls for the nomadService lookup to fail (in my case, for HTTP vs. HTTPS reasons).

My errors happened very consistently, so I think this is different from this case, but I wanted to mention it here for anyone else who finds this issue like I did. My solution was to ensure NOMAD_ADDR is not set in my nomad agent's environment.
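
A quick way to confirm this is to inspect the environment of the running agent process. This is only a sketch and assumes a single nomad process managed by systemd under a unit named nomad; adjust to your setup:

$ sudo cat /proc/$(pidof nomad)/environ | tr '\0' '\n' | grep NOMAD_ADDR

If that prints anything, remove NOMAD_ADDR from wherever the unit or wrapper script sets it and restart the agent.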

IamTheFij (Contributor) commented Jun 20, 2023

This is happening occasionally to me as well (Nomad 1.5.3). It doesn't seem to be consistent as to which service or which host the service disappears from.

To add another odd detail rather than just bumping: the service shows up in the UI, but it does not show up on any of the nodes via the CLI.

(screenshot omitted)

Restarting the allocation seems to resolve the issue and force Nomad to re-register the service.
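
For anyone else hitting this, that is just the standard alloc restart command (the allocation ID below is a placeholder):

$ nomad alloc restart <alloc-id>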

Unfortunately, this time it was my log aggregator that disappeared, so I don't have an easy way to pull logs from around the time of the issue. I'll try to grab them the next time it happens to a different service.

tfritzenwallner-private commented Nov 22, 2023

This issue still consistently happens for us every 2-3 days. I can observe exactly the same as @IamTheFij; however, we run Nomad 1.6.3.

mikedvinci90 commented:

I have observed the same issue on Nomad 1.7.3 through 1.7.7.

benbourner commented:

I observe the same problem in v1.8.1.

dmclf commented Jun 24, 2024

Seeing the same problem: services are clearly visible in the Nomad UI, but cannot be used by templating.

Nomad 1.7.7 (multi-region, multi-dc and ACL enabled)

(Consul-based service templating works fine and reliably, as opposed to Nomad-based service templating.)

benbourner commented Jun 27, 2024

Yet another example in Nomad 1.8.1; it's just happening randomly among my services. Because I have Traefik parsing the Nomad services, they just disappear from Traefik and are thus inaccessible. After running rock-solid for years, the Nomad deployments are now just unreliable... :(

The services are up, healthy and reachable on the given ports... (screenshot omitted)

But the service allocations have again disappeared, so Traefik no longer sees them and I can't access them via their proper URLs... (screenshot omitted)

dannyhpy commented Jun 27, 2024

I've seen this occurring frequently under poor network conditions where:

  1. the client misses a heartbeat to the server
  2. the server unregisters the services managed by this client in response
  3. the client responds to the next heartbeat
  4. but the server does not seem to register the services back

I don't know if this is intended, or whether it is the same issue people are having here.

As a workaround, I was restarting the Nomad agent on the client every 20 mins. (I didn't need HA)
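
A blunt sketch of that workaround as an /etc/cron.d entry, assuming a systemd-managed agent with a unit named nomad (only acceptable if you can tolerate the agent bouncing every 20 minutes):

*/20 * * * * root systemctl restart nomad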
