
nomad does not register HTTP tag for server in Consul #23384

Closed
BrianHicks opened this issue Jun 19, 2024 · 4 comments

Nomad version

Nomad v1.8.0

Operating system and Environment details

NixOS 24.05 running on Hetzner cloud VMs.

Issue

When advertise.http is set, Nomad does not register an http tag with Consul. The rpc and serf tags are registered, though.

(This is blocking me from scraping job metrics with Prometheus.)
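
For context, this is roughly the kind of Prometheus scrape config that depends on the tag (a hypothetical sketch, not the exact config in use: it discovers Nomad servers through Consul and keeps only instances tagged http):

scrape_configs:
  - job_name: "nomad"
    metrics_path: "/v1/metrics"
    params:
      format: ["prometheus"]
    scheme: https
    consul_sd_configs:
      - server: "127.0.0.1:8501"
        scheme: https
        services: ["nomad"]
    relabel_configs:
      # Keep only Consul service instances carrying the (currently missing) http tag.
      - source_labels: ["__meta_consul_tags"]
        regex: ".*,http,.*"
        action: keep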

Reproduction steps

Run Nomad using this config:

{
  "acl": {
    "enabled": true
  },
  "advertise": {
    "http": "{{ GetInterfaceIP \"enp7s0\" }}",
    "rpc": "{{ GetInterfaceIP \"enp7s0\" }}",
    "serf": "{{ GetInterfaceIP \"enp7s0\" }}"
  },
  "consul": {
    "address": "127.0.0.1:8501",
    "ssl": true
  },
  "data_dir": "/var/lib/nomad",
  "datacenter": "us-east",
  "log_level": "TRACE",
  "ports": {
    "http": 4646,
    "rpc": 4647,
    "serf": 4648
  },
  "server": {
    "bootstrap_expect": 1,
    "enabled": true
  },
  "telemetry": {
    "collection_interval": "1s",
    "disable_hostname": true,
    "prometheus_metrics": true,
    "publish_allocation_metrics": true,
    "publish_node_metrics": true
  },
  "tls": {
    "ca_file": "[SNIP]",
    "cert_file": "[SNIP]",
    "http": true,
    "key_file": "[SNIP]",
    "rpc": true,
    "verify_https_client": false,
    "verify_server_hostname": true
  },
  "ui": {
    "enabled": true
  }
}

(Plus a side config I have not shared that sets consul.token.)
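
Such a side config would be just a small JSON file along these lines (a hypothetical sketch; the real token is redacted and different):

{
  "consul": {
    "token": "<consul ACL token>"
  }
}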

Expected Result

Nomad registers a nomad service with http, rpc, and serf tags.

Actual Result

Nomad only registers rpc and serf tags.
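
For reference, the registered tags can be checked directly against Consul (a sketch; the CA path is a placeholder, and CONSUL_HTTP_ADDR, CONSUL_CACERT, and CONSUL_HTTP_TOKEN are assumed to be set for the CLI):

# List every service Consul knows about, with its tags
consul catalog services -tags

# Or inspect the nomad service registration via the catalog API
curl --silent --cacert /path/to/consul-ca.pem \
  --header "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  https://127.0.0.1:8501/v1/catalog/service/nomad | jq '.[].ServiceTags'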

Nomad Server logs (if appropriate)

==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from /etc/nomad.json, /etc/nomad.d/consul-token.json
==> Starting Nomad agent...
==> Nomad agent configuration:
       Advertise Addrs: HTTP: 10.0.1.0:4646; RPC: 10.0.1.0:4647; Serf: 10.0.1.0:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: false
             Log Level: INFO
               Node Id: 60100119-2101-5fe3-1fc7-887d6a5dab36
                Region: global (DC: us-east)
                Server: true
               Version: 1.8.0
==> Nomad agent started! Log data will stream in below:
    2024-06-19T05:42:49.734Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2024-06-19T05:42:49.736Z [INFO]  nomad.raft: starting restore from snapshot: id=15-23927-1718755211021 last-index=23927 last-term=15 size-in-bytes=298159
    2024-06-19T05:42:49.760Z [INFO]  nomad.raft: snapshot restore progress: id=15-23927-1718755211021 last-index=23927 last-term=15 size-in-bytes=298159 read-bytes=298159 percent-complete="100.00%"
    2024-06-19T05:42:49.760Z [INFO]  nomad.raft: restored from snapshot: id=15-23927-1718755211021 last-index=23927 last-term=15 size-in-bytes=298159
    2024-06-19T05:42:49.770Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:35e115c2-34da-f3ba-8579-e8e122ba3dfd Address:10.0.1.0:4647}]"
    2024-06-19T05:42:49.771Z [INFO]  nomad: serf: EventMemberJoin: leader-red.global 10.0.1.0
    2024-06-19T05:42:49.771Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-19T05:42:49.771Z [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-19T05:42:49.773Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.1.0:4647 [Follower]" leader-address= leader-id=
    2024-06-19T05:42:49.774Z [WARN]  nomad: serf: Failed to re-join any previously known node
    2024-06-19T05:42:49.774Z [INFO]  nomad: adding server: server="leader-red.global (Addr: 10.0.1.0:4647) (DC: us-east)"
    2024-06-19T05:42:51.014Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2024-06-19T05:42:51.014Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.0.1.0:4647 [Candidate]" term=32
    2024-06-19T05:42:51.016Z [INFO]  nomad.raft: election won: term=32 tally=1
    2024-06-19T05:42:51.016Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.0.1.0:4647 [Leader]"
    2024-06-19T05:42:51.016Z [INFO]  nomad: cluster leadership acquired
    2024-06-19T05:42:51.055Z [INFO]  nomad: eval broker status modified: paused=false
    2024-06-19T05:42:51.055Z [INFO]  nomad: blocked evals status modified: paused=false
    2024-06-19T05:42:51.055Z [INFO]  nomad: revoking consul accessors after becoming leader: accessors=14
tgross (Member) commented Jun 21, 2024

Hi @BrianHicks! I wasn't able to reproduce what you're seeing on either 1.8.0 or the current tip of main. I also played around with HCL vs JSON configuration and wasn't able to see a difference there either. The weird thing about this is that we create and register those services all at the same time: agent.go#L961-L1009

If you were to run the server with log_level = "debug", you'd see a message during startup about syncing to Consul, like the one below. What does that look like for you?

2024-06-21T15:47:51.388-0400 [DEBUG] consul.sync: sync complete: registered_services=3 deregistered_services=0 registered_checks=3 deregistered_checks=0

Also, if you run the following command against one of the servers, what does the response body look like?

nomad operator api '/v1/agent/self' | jq '.config.Consuls'

BrianHicks (Author) commented:

How interesting! I don't see any such message when running in debug; here's the output:

==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from /etc/nomad.json, /run/agenix/nomad-consul-token.json
==> Starting Nomad agent...
==> Nomad agent configuration:
       Advertise Addrs: HTTP: 10.0.1.0:4646; RPC: 10.0.1.0:4647; Serf: 10.0.1.0:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: false
             Log Level: DEBUG
               Node Id: 2f248988-a9b9-265f-6f60-ff48eeb337d7
                Region: global (DC: us-east)
                Server: true
               Version: 1.8.0
==> Nomad agent started! Log data will stream in below:
    2024-06-22T00:25:24.879Z [DEBUG] nomad: issuer not set; OIDC Discovery endpoint for workload identities disabled
    2024-06-22T00:25:24.884Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2024-06-22T00:25:24.886Z [INFO]  nomad.raft: starting restore from snapshot: id=35-28895-1719014431549 last-index=28895 last-term=35 size-in-bytes=276574
    2024-06-22T00:25:24.910Z [INFO]  nomad.raft: snapshot restore progress: id=35-28895-1719014431549 last-index=28895 last-term=35 size-in-bytes=276574 read-bytes=276574 percent-complete="100.00%"
    2024-06-22T00:25:24.911Z [INFO]  nomad.raft: restored from snapshot: id=35-28895-1719014431549 last-index=28895 last-term=35 size-in-bytes=276574
    2024-06-22T00:25:24.911Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:35e115c2-34da-f3ba-8579-e8e122ba3dfd Address:10.0.1.0:4647}]"
    2024-06-22T00:25:24.912Z [INFO]  nomad: serf: EventMemberJoin: leader-red.global 10.0.1.0
    2024-06-22T00:25:24.912Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-22T00:25:24.912Z [DEBUG] nomad: started scheduling worker: id=4b6681a1-493b-d115-a9df-076f99145c65 index=1 of=2
    2024-06-22T00:25:24.912Z [DEBUG] nomad: started scheduling worker: id=22ce07f8-09a0-8b34-913b-af8585da9491 index=2 of=2
    2024-06-22T00:25:24.912Z [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-22T00:25:24.912Z [DEBUG] http: UI is enabled
    2024-06-22T00:25:24.913Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.1.0:4647 [Follower]" leader-address= leader-id=
    2024-06-22T00:25:24.914Z [WARN]  nomad: serf: Failed to re-join any previously known node
    2024-06-22T00:25:24.914Z [DEBUG] worker: running: worker_id=4b6681a1-493b-d115-a9df-076f99145c65
    2024-06-22T00:25:24.914Z [DEBUG] worker: running: worker_id=22ce07f8-09a0-8b34-913b-af8585da9491
    2024-06-22T00:25:24.914Z [INFO]  nomad: adding server: server="leader-red.global (Addr: 10.0.1.0:4647) (DC: us-east)"
    2024-06-22T00:25:24.914Z [DEBUG] nomad.keyring.replicator: starting encryption key replication
    2024-06-22T00:25:26.507Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2024-06-22T00:25:26.507Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.0.1.0:4647 [Candidate]" term=36
    2024-06-22T00:25:26.508Z [DEBUG] nomad.raft: voting for self: term=36 id=35e115c2-34da-f3ba-8579-e8e122ba3dfd
    2024-06-22T00:25:26.510Z [DEBUG] nomad.raft: calculated votes needed: needed=1 term=36
    2024-06-22T00:25:26.510Z [DEBUG] nomad.raft: vote granted: from=35e115c2-34da-f3ba-8579-e8e122ba3dfd term=36 tally=1
    2024-06-22T00:25:26.510Z [INFO]  nomad.raft: election won: term=36 tally=1
    2024-06-22T00:25:26.510Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.0.1.0:4647 [Leader]"
    2024-06-22T00:25:26.510Z [INFO]  nomad: cluster leadership acquired
    2024-06-22T00:25:26.518Z [INFO]  nomad: eval broker status modified: paused=false
    2024-06-22T00:25:26.518Z [INFO]  nomad: blocked evals status modified: paused=false
    2024-06-22T00:25:26.518Z [DEBUG] nomad.autopilot: autopilot is now running
    2024-06-22T00:25:26.518Z [DEBUG] nomad.autopilot: state update routine is now running
    2024-06-22T00:25:26.518Z [INFO]  nomad: revoking consul accessors after becoming leader: accessors=14

And here's the output of the command:

[
  {
    "Addr": "127.0.0.1:8501",
    "AllowUnauthenticated": true,
    "Auth": "",
    "AutoAdvertise": true,
    "CAFile": "",
    "CertFile": "",
    "ChecksUseAdvertise": false,
    "ClientAutoJoin": true,
    "ClientFailuresBeforeCritical": 0,
    "ClientFailuresBeforeWarning": 0,
    "ClientHTTPCheckName": "Nomad Client HTTP Check",
    "ClientServiceName": "nomad-client",
    "EnableSSL": true,
    "GRPCAddr": "",
    "GRPCCAFile": "",
    "KeyFile": "",
    "Name": "default",
    "Namespace": "",
    "ServerAutoJoin": true,
    "ServerFailuresBeforeCritical": 0,
    "ServerFailuresBeforeWarning": 0,
    "ServerHTTPCheckName": "Nomad Server HTTP Check",
    "ServerRPCCheckName": "Nomad Server RPC Check",
    "ServerSerfCheckName": "Nomad Server Serf Check",
    "ServerServiceName": "nomad",
    "ServiceIdentity": null,
    "ServiceIdentityAuthMethod": "nomad-workloads",
    "ShareSSL": null,
    "Tags": null,
    "TaskIdentity": null,
    "TaskIdentityAuthMethod": "nomad-workloads",
    "Timeout": 5000000000,
    "Token": "<redacted>",
    "VerifySSL": true
  }
]

tgross (Member) commented Jun 27, 2024

Thanks @BrianHicks. I see that your Consul configuration doesn't have a CAFile or CertFile, but that you're connecting to Consul on port 8501, which is Consul's default HTTPS port. Is there any chance it's just the wrong port, so Nomad can't reach Consul at all?

I wouldn't expect to see any tags in that case, of course, but maybe the local agent has a cached version floating around from an earlier config?
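
If TLS to Consul is intended here, the consul block would also need the certificate material, roughly like this (a sketch only; the paths are placeholders, and cert_file/key_file are only needed if Consul requires client certificates):

{
  "consul": {
    "address": "127.0.0.1:8501",
    "ssl": true,
    "ca_file": "/path/to/consul-agent-ca.pem",
    "cert_file": "/path/to/client-cert.pem",
    "key_file": "/path/to/client-key.pem",
    "verify_ssl": true
  }
}

A quick way to confirm something is actually listening with TLS on 8501 is curl --cacert /path/to/consul-agent-ca.pem https://127.0.0.1:8501/v1/status/leader.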

tgross (Member) commented Oct 23, 2024

We didn't hear back on this, so I'm going to close it out for now. If you have more information, we'll be happy to reopen.

tgross closed this as not planned on Oct 23, 2024