native service delete errors for old allocs after client restart #24461

Open
mr-karan opened this issue Nov 14, 2024 · 4 comments
Assignees
Labels
hcc/jira stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/service-discovery type/bug

Comments

@mr-karan
Contributor

Nomad version

1.7.7

Operating system and Environment details

  • Running in an AWS environment
  • Ubuntu 24.04

Issue

Service registration errors and task failures occurring during node registration.

Reproduction steps

  1. Node starts registration process
  2. Multiple service registration deletion attempts fail
  3. Template rendering issues occur for HAProxy peer service
  4. Sibling task failures cascade to other services

Expected Result

  • Clean node registration
  • Successful service registration management
  • Proper template rendering for HAProxy peer service
  • Successful task execution without cascading failures

Actual Result

Multiple cascading failures observed:

  1. Service registration errors:
     [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: service registration not found"
  2. Template failures:
     Missing: nomad.service(haproxy-peer)
  3. Task failures:
     Setup Failure: failed to setup alloc: pre-run hook "group_services" failed: no servers
  4. Forced termination:
     Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Canceled desc = grpc: the client connection is closing"

Nomad Client logs

Nov 14 08:11:51 [INFO]  agent: (runner) starting
Nov 14 08:11:51 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=172.31.2.217:4647
Nov 14 08:11:51 [INFO]  client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-d90c47ce-f4be-0fa3-e019-5d1b522e64a1-group-haproxy-default-haproxy-peer-haproxy-peer-net namespace=kite
Nov 14 08:12:01 [INFO]  client: node registration complete

Nomad Alloc Events Timeline

Nov 14, '24 08:10:36 - Terminated (Exit Code: 0)
Nov 14, '24 08:10:35 - Killing (Sent interrupt, 5s grace period)
Nov 14, '24 08:10:33 - Template Missing: nomad.service(haproxy-peer)
Nov 14, '24 08:10:30 - Sibling Task Failed (prepare-logging-setup)
Nov 14, '24 08:10:30 - Setup Failure (group_services hook failed)
Nov 14, '24 07:31:19 - Started

The primary issue appears to be related to service registration and template rendering failures, particularly affecting HAProxy peer services. This is causing cascading failures across dependent services and tasks.

@tgross
Member

tgross commented Nov 14, 2024

@mr-karan if the node hasn't registered yet, how is it running services? Is this a node that was running services and then restarted?

@tgross
Member

tgross commented Dec 11, 2024

@mr-karan we haven't heard back on this one in a while. It looks like you've got some running jobs and then you're rebooting the client agent, and then the allocations fail to restore? Are you rebooting the node? I tried reproducing but there really isn't enough to go on. I'm going to close this out for now as unreproducible, but if you have more info I'd be happy to reopen.

@tgross tgross closed this as completed Dec 11, 2024
@tgross tgross closed this as not planned Dec 11, 2024
@tgross
Member

tgross commented Dec 12, 2024

Reproduced!

jobspec
job "httpd" {

  group "web" {

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    service {
      name     = "httpd-web"
      provider = "nomad"
      port     = "www"
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      identity {
        env  = true
        file = true
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

Running on a cluster with a single client and a single server, I had been running that job for a while and updating it frequently while debugging other work. After running systemctl restart nomad, I see the following logs on the client:

2024-12-12T16:38:44.311-0500 [ERROR] client.rpc: error performing RPC to server: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.311-0500 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.311-0500 [INFO] client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-c1c2f114-5f48-bc51-3575-95764f088e89-group-web-httpd-web-www namespace=default
2024-12-12T16:38:44.339-0500 [ERROR] client.rpc: error performing RPC to server: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.340-0500 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.340-0500 [INFO] client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-1bec7c14-86a2-70e6-c018-654a05d02cc4-group-web-example-web-www namespace=default
2024-12-12T16:38:44.423-0500 [ERROR] client.rpc: error performing RPC to server: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.423-0500 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.424-0500 [INFO] client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-1e098779-8d6e-8840-372f-178d44ca90c6-group-web-httpd-web-www namespace=default

The allocation IDs here are all for the service registration of the old allocations, not the ones that exist currently. So the errors we get from the server make sense -- these should all be gone already. But why the client still thinks it has to delete them I don't know yet. Seems like a chunk of data is getting left behind in the client state store.
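To check this on another cluster, the allocation IDs can be pulled out of the client log and compared against the allocations that are actually live. The snippet below is a rough sketch, not part of Nomad: the service ID layout (`_nomad-task-<alloc-uuid>-group-...`) is inferred from the log lines above, and `stale_alloc_ids` and `live_alloc_ids` are hypothetical names.

```python
import re

# Layout inferred from the log lines in this issue (an assumption, not a
# documented contract): _nomad-task-<alloc-uuid>-group-<group>-<service>-<port>
SERVICE_ID_RE = re.compile(
    r"service_id=_nomad-task-"
    r"(?P<alloc>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-group-"
)


def stale_alloc_ids(log_lines, live_alloc_ids):
    """Return alloc IDs named in delete attempts that are not live.

    IDs are returned in order of first appearance, without duplicates.
    Lines that carry no service_id field are ignored.
    """
    stale = []
    for line in log_lines:
        match = SERVICE_ID_RE.search(line)
        if match is None:
            continue
        alloc_id = match.group("alloc")
        if alloc_id not in live_alloc_ids and alloc_id not in stale:
            stale.append(alloc_id)
    return stale
```

Feed it the client log lines plus the set of allocation IDs currently running on the node. If every ID it returns belongs to an allocation that was already replaced, the delete attempts match the behavior described here.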

Reopening and marking for roadmapping.

@tgross tgross reopened this Dec 12, 2024
@github-project-automation github-project-automation bot moved this from Done to Needs Triage in Nomad - Community Issues Triage Dec 12, 2024
@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline commitment though. and removed stage/waiting-reply labels Dec 12, 2024
@tgross tgross changed the title from "Service Registration Failures During Node Registration" to "native service delete errors for old allocs after client restart" Dec 12, 2024
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Dec 12, 2024
@tgross tgross removed their assignment Dec 12, 2024
@mismithhisler mismithhisler self-assigned this Dec 16, 2024
@mismithhisler
Member

Hi @mr-karan, I spent a bit of time looking at this and wanted to provide an update. The service registration error appears to be unrelated. It is caused by the allocation runner always running the postrun hooks on any non-GC'd allocation, just in case the Nomad client process terminated before all the postrun hooks were able to run. We will look into suppressing this log, as it's not really an error.
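One way to picture the suppression is as an idempotent delete: if the registration is already gone, the hook has nothing left to do, so "not found" is logged quietly instead of as an error. This is a minimal sketch in Python rather than Nomad's actual Go code, and every name in it (`ServiceNotFound`, `FakeRegistry`, `deregister`) is hypothetical.

```python
class ServiceNotFound(Exception):
    """Hypothetical stand-in for the server's 'service registration not found'."""


class FakeRegistry:
    """In-memory stand-in for the server-side registration store."""

    def __init__(self, service_ids):
        self.service_ids = set(service_ids)

    def delete_by_id(self, service_id):
        if service_id not in self.service_ids:
            raise ServiceNotFound(service_id)
        self.service_ids.remove(service_id)


def deregister(registry, service_id, log):
    """Delete a registration, treating an already-missing one as a no-op.

    Returns True if something was deleted, False if it was already gone.
    """
    try:
        registry.delete_by_id(service_id)
        return True
    except ServiceNotFound:
        # Expected when postrun hooks are re-run for an old allocation.
        log.append(("DEBUG", f"service registration already gone: {service_id}"))
        return False
```

Re-running `deregister` for the same ID after a restart then produces a debug-level entry and no failure, which is the outcome the postrun re-run is meant to tolerate.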

The interesting log, and what we believe is the issue here, is Setup Failure: failed to setup alloc: pre-run hook "group_services" failed: no servers. "no servers" would be returned here when the RPC for service registration was being made, but the client had no known Nomad servers. I'd like to identify the cause of this, but at the moment I don't believe we have enough context.
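As a rough illustration of that failure mode, and not Nomad's real implementation, an RPC helper that picks from a list of known servers fails immediately when the list is empty, before any network call is attempted. `NoServersError` and `rpc` are hypothetical names used only for this sketch.

```python
class NoServersError(Exception):
    """Hypothetical error raised when the client knows of no servers."""


def rpc(servers, method, call):
    """Try each known server in turn and return the first successful result.

    `call(addr, method)` performs the request. An empty server list fails
    up front with 'no servers', which is the situation described above.
    """
    if not servers:
        raise NoServersError("no servers")
    last_err = None
    for addr in servers:
        try:
            return call(addr, method)
        except ConnectionError as err:
            last_err = err  # remember the failure and try the next server
    raise last_err
```

In this model, a hook that runs before the client has discovered any server fails even though the servers themselves are healthy, which is why the timing of the restart relative to server discovery is the detail worth capturing in the logs.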

Could you provide us more logs or a way to reproduce this issue?
