native service delete errors for old allocs after client restart #24461
@mr-karan if the node hasn't registered yet, how is it running services? Is this a node that was running services and then restarted?
@mr-karan we haven't heard back on this one in a while. It looks like you've got some running jobs, you restart the client agent, and then the allocations fail to restore? Are you rebooting the node? I tried reproducing, but there really isn't enough to go on. I'm going to close this out for now as unreproducible, but if you have more info I'd be happy to reopen.
Reproduced! Job spec:

```hcl
job "httpd" {
  group "web" {
    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    service {
      name     = "httpd-web"
      provider = "nomad"
      port     = "www"
    }

    task "http" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      identity {
        env  = true
        file = true
      }

      resources {
        cpu    = 100
        memory = 100
      }
    }
  }
}
```
On a cluster with a single client and a single server, I ran that job for a while and updated it frequently while debugging other work. Then I see the following logs on the client after restarting it:

The allocation IDs here are all for the service registrations of the old allocations, not the ones that exist currently. So the errors we get from the server make sense: these registrations should all be gone already. But why the client still thinks it has to delete them isn't clear yet. It seems like a chunk of data is getting left behind in the client state store. Reopening and marking for roadmapping.
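To see which registrations are left behind, one way is to compare what the servers return for the service against the allocations that are actually running. The sketch below is not from the issue; it assumes Nomad's native service discovery HTTP API (`GET /v1/service/:service_name`), the allocations list endpoint (`GET /v1/allocations`), an agent reachable at `NOMAD_ADDR` (defaulting to `http://127.0.0.1:4646`), and the `httpd-web` service name from the job spec above.

```go
// stale_services.go: rough diagnostic sketch, not part of the issue.
// Lists Nomad-native service registrations for "httpd-web" and flags any
// whose AllocID no longer matches a running allocation.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// Field names mirror the JSON returned by the service and allocation
// list endpoints; unknown fields are simply ignored by the decoder.
type serviceReg struct {
	ID          string
	ServiceName string
	AllocID     string
	Address     string
	Port        int
}

type allocStub struct {
	ID           string
	ClientStatus string
}

func getJSON(addr, path string, out any) error {
	resp, err := http.Get(addr + path)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
	addr := os.Getenv("NOMAD_ADDR")
	if addr == "" {
		addr = "http://127.0.0.1:4646"
	}

	// Registrations currently stored for the service.
	var regs []serviceReg
	if err := getJSON(addr, "/v1/service/httpd-web", &regs); err != nil {
		panic(err)
	}

	// Allocations the servers know about, keyed by ID if still running.
	var allocs []allocStub
	if err := getJSON(addr, "/v1/allocations", &allocs); err != nil {
		panic(err)
	}
	live := map[string]bool{}
	for _, a := range allocs {
		if a.ClientStatus == "running" {
			live[a.ID] = true
		}
	}

	// Anything registered against an allocation that is no longer running
	// is a candidate for the leftover state described above.
	for _, r := range regs {
		if !live[r.AllocID] {
			fmt.Printf("stale registration %s (alloc %s) at %s:%d\n",
				r.ID, r.AllocID, r.Address, r.Port)
		}
	}
}
```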
Hi @mr-karan, I spent a bit of time looking at this and wanted to provide an update. The service registration error appears to be unrelated; it is caused by the allocation runner always running the … The interesting log, and what we believe is the issue here, is … Could you provide more logs or a way to reproduce this issue?
Nomad version
Operating system and Environment details
Issue
Service registration errors and task failures occurring during node registration.
Reproduction steps
Expected Result
Actual Result
Multiple cascading failures observed:
Nomad Client logs
Nomad Alloc Events Timeline
The primary issue appears to be service registration and template rendering failures, particularly affecting HAProxy peer services, which in turn cause cascading failures across dependent services and tasks.
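If stale registrations are what the HAProxy peer templates end up rendering, one possible stopgap while the underlying bug is investigated might be to remove the offending registrations by hand. The sketch below is a rough illustration, assuming Nomad's delete-service-registration endpoint (`DELETE /v1/service/:service_name/:service_id`); the name and registration ID would come from a listing like the one sketched earlier.

```go
// delete_stale.go: hypothetical cleanup sketch, not an official workaround.
// Deletes a single Nomad-native service registration by name and ID.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: delete_stale <service-name> <registration-id>")
		os.Exit(1)
	}
	addr := os.Getenv("NOMAD_ADDR")
	if addr == "" {
		addr = "http://127.0.0.1:4646"
	}

	url := fmt.Sprintf("%s/v1/service/%s/%s", addr, os.Args[1], os.Args[2])
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		panic(err)
	}
	// In ACL-secured clusters a token would also need to be supplied,
	// e.g. req.Header.Set("X-Nomad-Token", os.Getenv("NOMAD_TOKEN")).
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("delete status:", resp.Status)
}
```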