Nomad Service Discovery unable to find service #16983

Open
vincenthuynh opened this issue Apr 25, 2023 · 11 comments


vincenthuynh commented Apr 25, 2023

Nomad version

Nomad v1.4.7

Operating system and Environment details

Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux

Issue

An allocation is unable to find a Nomad service even though the service exists. This seems to start happening on a client after an uptime of 2-3 days.

Reproduction steps

  • Task 1: register a service myservice using the Nomad provider
  • Task 2: use a template stanza and the nomadService function to reference the service registered in Task 1

The service can be listed:

$ nomad service list -namespace="*"
Service Name  Namespace  Tags
myservice     default    []
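
For completeness, the individual registrations behind that entry can also be inspected with the standard service info command, which shows each registered instance rather than just the service name:

$ nomad service info -namespace=default myservice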

Expected Result

The service is discovered consistently.

Actual Result

Task log:

Template | Missing: nomad.service(myservice)

Job file (if appropriate)

Task 1:

    service {
      provider = "nomad"
      name     = "myservice"
      port     = "redis"
    }

Task 2:

      template {
        data = <<EOH
{{range nomadService "myservice"}}
spring.redis.host: {{ .Address }}
spring.redis.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
      }

Nomad Client logs

2023-04-25T16:10:02.354Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 5 after "4s")
2023-04-25T16:10:06.355Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 6 after "8s")
2023-04-25T16:10:14.356Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 7 after "16s")
vincenthuynh (Author) commented:

The workaround is to restart the Nomad service/agent on the client node.
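
Assuming the agent on the client is managed by systemd under a unit named nomad (adjust to your setup), that is simply:

$ sudo systemctl restart nomad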

shoenig (Member) commented Apr 27, 2023

Hi @vincenthuynh, so far I haven't been able to reproduce what you're seeing - in my case the template is always rendered successfully once the upstream task is started and its service is registered. Before I dig in further, could you post a complete job file that experiences the issue? I want to make sure we're not missing something (e.g. using group vs. task services, etc.).

The test job file I've been using:

bug.hcl
job "bug" {

  group "group" {
    network {
      port "http" {
        to = 8080
      }
    }

    task "python" {
      driver = "raw_exec"

      config {
        command = "python3"
        args = ["-m", "http.server", "8080"]
      }

      service {
        provider = "nomad"
        name = "python"
        port = "http"
      }

      resources {
        cpu    = 10
        memory = 32
      }
    }

    task "client" {
      driver = "raw_exec"

      template {
        data = <<EOH
{{range nomadService "python"}}
blah.host: {{ .Address }}
blah.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
      }

      config {
        command = "sleep"
        args = ["infinity"]
      }
    }
  }
}

shoenig self-assigned this Apr 27, 2023
vincenthuynh (Author) commented:

Hi @shoenig,

We've noticed that it takes 2-3 days before it starts happening.

Here's another reproduction:

  • An old allocation was stopped and a new one was created, which happened to land on the same node (screenshot omitted).
  • The new allocation is unable to find the service (screenshot omitted).
  • Applying the workaround: simply restarting the Nomad agent on the client allows the task to discover the service again and start successfully.

Here's our job file:

myservice.hcl
job "myservice" {

  group "myservice" {
    network {
      mode = "bridge"
    }

    service {
      name = "myservice"
      port = "8080"
      tags = [
        "env=${var.env}",
        "version=${var.version}",
      ]
      connect {
        sidecar_service {}
      }
    }

    task "myservice" {
      driver = "docker"
      leader = true
      config {
        image = "gcr.io/myservice"
        work_dir = "/local"
      }

      vault {
        policies = ["myservice"]
        change_mode = "noop"
      }

      template {
        data = <<EOH
{{range nomadService "myservice-cache-redis"}}
spring.redis.host: {{ .Address }}
spring.redis.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
        change_mode = "noop"
      }
    }
  }
    
  group "myservice-cache" {
    network {
      mode = "bridge"
      port "redis" {
        to = 6379
      }
    }

    service {
      provider = "nomad"
      name = "myservice-cache-redis"
      port = "redis"
    }

    task "redis" {
      driver = "docker"
      config {
        image = "redis:6.2.0"
        args  = ["redis-server", "/local/redis.conf"]
      }
      template {
        data = <<-EOH
maxmemory 250mb
maxmemory-policy allkeys-lru
EOH
        destination = "/local/redis.conf"
        change_mode = "noop"
      }
      resources {
        cpu    = 100
        memory = 256
      }
    }
  }
}

Hope that helps. Thanks!

gulducat (Member) commented Jun 7, 2023

I encountered a similar issue caused by having NOMAD_ADDR set in the environment that the nomad agent was run in. That variable apparently carried through to the Nomad API client that consul-template uses and caused its API calls for the nomadService lookup to fail (in my case, for HTTP vs. HTTPS reasons).

My errors happened very consistently, so I think this is different from this case, but I wanted to mention it here for anyone else who finds this issue like I did. My solution was to ensure NOMAD_ADDR is not set in my nomad agent's environment.
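
A quick way to confirm this is to inspect the environment of the running agent process. This is only a sketch and assumes a single nomad process managed by systemd under a unit named nomad; adjust to your setup:

$ sudo cat /proc/$(pidof nomad)/environ | tr '\0' '\n' | grep NOMAD_ADDR

If that prints anything, remove NOMAD_ADDR from wherever the unit or wrapper script sets it and restart the agent.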

IamTheFij (Contributor) commented Jun 20, 2023

This is happening occasionally to me as well (Nomad 1.5.3). It doesn't seem to be consistent as to which service or which host the service disappears from.

To add another odd detail rather than just bumping: the service shows up in the UI, but it does not show up on any of the nodes via the CLI.

(screenshot omitted)

Restarting the allocation seems to resolve the issue and force Nomad to re-register the service.
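
For anyone else hitting this, that is just the standard alloc restart command (the allocation ID below is a placeholder):

$ nomad alloc restart <alloc-id>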

Unfortunately, this time it was my log aggregator that disappeared, so I don't have an easy way to pull logs from around the time of the issue. I'll try to grab them the next time it happens to a different service.

tfritzenwallner-private commented Nov 22, 2023

This issue still consistently happens for us every 2-3 days. I can observe exactly the same as @IamTheFij; however, we run Nomad 1.6.3.

mikedvinci90 commented:

I have observed the same issue on Nomad 1.7.3 through 1.7.7.

benbourner commented:

I observe the same problem in v1.8.1.

dmclf commented Jun 24, 2024

Seeing the same problem: services are clearly visible in the Nomad UI, but cannot be used by templating.

Nomad 1.7.7 (multi-region, multi-dc and ACL enabled)

(Consul-based service templating works fine and reliably, as opposed to Nomad-based service templating.)

benbourner commented Jun 27, 2024

Yet another example in Nomad 1.8.1; it's just happening randomly among my services. Because I have Traefik parsing the Nomad services, they just disappear from Traefik and are thus inaccessible. After running rock-solid for years, the Nomad deployments are now just unreliable... :(

The services are up, healthy and reachable on the given ports... (screenshot omitted)

But the service allocations have again disappeared, so Traefik no longer sees them and I can't access them via their proper URLs... (screenshot omitted)

dannyhpy commented Jun 27, 2024

I've seen this occurring frequently under poor network conditions where:

  1. the client misses a heartbeat to the server
  2. the server unregisters the services managed by this client in response
  3. the client responds to the next heartbeat
  4. but the server does not seem to register the services back

I don't know if this is intended, or whether it is the same issue people are having here.

As a workaround, I was restarting the Nomad agent on the client every 20 mins. (I didn't need HA)
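
A blunt sketch of that workaround as an /etc/cron.d entry, assuming a systemd-managed agent with a unit named nomad (only acceptable if you can tolerate the agent bouncing every 20 minutes):

*/20 * * * * root systemctl restart nomad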
