1.4.2: Jobs killed due to nomadService templating failures #15115
Comments
Thanks for the report @iSchluff. I haven't been able to reproduce it yet. Would you happen to have any client and server logs from around the time this happened?
@iSchluff the team found the root cause for the […]. But now I'm curious about the job failing due to the template render error. By default, Nomad should retry rendering forever, which is configured by the […].
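For reference, a minimal sketch of where that retry behavior lives on the client side, assuming the `template`/`nomad_retry` block documented for Nomad 1.4+ agents (the block name and defaults here are an assumption, not taken from this thread):

```hcl
# Sketch of a Nomad client agent config, assuming the 1.4+ nomad_retry block.
# This controls retries for template functions that hit the Nomad API
# (nomadService, nomadVar); it is not copied from the reporter's setup.
client {
  template {
    nomad_retry {
      attempts    = 0       # assumed to mean "retry indefinitely" per the docs
      backoff     = "250ms" # initial backoff between retries
      max_backoff = "1m"    # cap on the backoff interval
    }
  }
}
```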
Hi @lgfa29, no, the client config is completely default.
The only thing that might be special is that it's a single server+client instance.
Ah, interesting. Do you have any logs from the client around the time of failure?
I think the relevant portion is this, which was logged when the allocation was terminated. This is a single instance with server+client, so the logs should all be interleaved.
Full log: https://p.rrbone.net/?be507af6af7b677d#5nZiq24EAXZeUAMPrWcKz2wvoD2Jr6k6JRqgAGAZXx6k
I have since started over on this instance with a new data directory, as it was basically unusable in that state. With a 1.4.2 instance it pretty much works as expected, so it is clearly related to the update procedure.
Thanks for the logs, I will dig deeper throughout the week. The problem with the upgrade was fixed in #15121 and will be released when 1.4.3 comes out. I'm now trying to understand why the template caused the task to fail; I expected it to retry forever.
Maybe related: I am also having the problem that dependent templates are not updated after Nomad services successfully restart, leaving me with broken configs. On service stop the template is correctly updated, but not after the service comes back. Also, for whatever reason, the permission errors are back, even though this was a fresh setup.
+1
Spec: […]
nomad alloc status after updating an env variable in the second group: […]
Thanks for the extra info @veloc1 and @iSchluff. Could you check the allocation details as returned from the […]? There should be a field called […]. Thanks!
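As a hedged illustration of that request (the endpoint below is the standard allocations API; the exact field being asked about was lost from this transcript):

```sh
# Sketch: fetch the full allocation payload from the Nomad HTTP API.
# Replace <alloc-id> with the allocation UUID from `nomad job status`.
curl -s "${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/allocation/<alloc-id>"

# Or via the CLI:
nomad alloc status -json <alloc-id>
```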
@lgfa29 Yep, […]. Also, attaching logs from the client: […]
Looks like, for some reason, the dependency watcher is missing the auth token for the first requests?
@lgfa29 Faced this in production today as well. I have a close-to-default client config as well (…).
@lgfa29 any updates on this? We have been facing the same issue for a while now. Recently we also migrated to Nomad variables, which also show up alongside […]. Now, from what we've figured, this happens when you deploy a new version over an existing deployment. It seems like a race condition with the events channel in template.go: https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/template/template.go#L350-L371
Let me know if I could be of any help.
Hi all, thanks for the extra info. I haven't been able to reproduce this yet, so no updates so far. I will try the deployment update suggested by @Thunderbottom next time I get a chance. Thanks for the tip!
I have been facing the same issue with Nomad 1.4.3 when using job templates with the […]. Example job spec:

job "whoami" {
region = "global"
datacenters = ["dc1"]
namespace = "default"
type = "service"
update {
max_parallel = 1
min_healthy_time = "0s"
}
group "whoami" {
scaling {
min = 1
max = 4
}
network {
port "http" {}
}
service {
name = "whoami"
tags = []
port = "http"
address_mode = "host"
check {
name = "whoami HTTP check"
type = "http"
path = "/health"
interval = "10s"
timeout = "2s"
}
}
task "whoami" {
driver = "docker"
user = "65534" # nobody on Ubuntu
kill_signal = "SIGTERM"
kill_timeout = "15s"
template {
destination = "local/config.yaml"
data = <<EOF
{{ with nomadVar "nomad/jobs/whoami" }}
mysql_username = "{{ .mysql_username }}"
mysql_password = "{{ .mysql_password }}"
{{ end }}
EOF
}
config {
image = "traefik/whoami:v1.8.7"
args = [
"--name=${NOMAD_ALLOC_NAME}",
"--port=${NOMAD_PORT_http}",
]
network_mode = "host"
}
resources {
cpu = 100
memory = 64
}
}
}
}

Note that I'm using the Workload Identity feature by putting Nomad variables at the path nomad/jobs/whoami (the path referenced by the template above). The issue can always be reproduced by scaling the task group count up or down in the Nomad UI. Events with type […]
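For anyone trying to reproduce this, a hedged sketch of how the variable read by the template above can be created, and the CLI equivalent of the scaling step done in the UI (all values are placeholders):

```sh
# Create the Nomad variable read by the template's nomadVar call (placeholder values).
nomad var put nomad/jobs/whoami mysql_username=whoami mysql_password=example-secret

# Scale the group up or down to trigger the behavior described above
# (the reporter did this through the Nomad UI).
nomad job scale whoami whoami 2
```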
Hi folks! 👋 I was just fixing up #16259 and was looking around for similar issues and found this one. I suspect this issue will be fixed by #16266. Note that although #16259 talks about canaries, as far as I can tell this will happen for any in-place update, which looks like it could describe the behaviors we're seeing in this issue (unless I've missed something).
As noted above, I'm going to mark this as closed by #16266. Enough has changed here that if anyone hits this again and is reading this, I'd ask that they open a new issue with the debug-level logs described above.
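If it helps, one hedged way to capture those debug-level logs before filing a new issue (standard Nomad commands, not specific to this report):

```sh
# Stream debug-level logs from the local agent while reproducing the failure.
nomad monitor -log-level=DEBUG

# Or collect a support bundle covering servers and clients for a few minutes.
nomad operator debug -duration=3m -log-level=DEBUG
```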
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Output from nomad version:
Nomad v1.4.2 (039d70e)
Operating system and Environment details
Linux nomad1 5.4.0-131-generic 147-Ubuntu
Issue
I have a setup where a job with multiple groups has various template cross-references using nomadService to dynamically fill in the address/port details.
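For readers unfamiliar with the pattern, a minimal sketch of such a cross-referencing template (the destination path and config key are placeholders, not the reporter's actual spec; the service name comes from the errors below):

```hcl
# Sketch of a template stanza in one group that points at a service
# registered by another group, via the nomadService function.
template {
  destination = "local/alertmanager.conf"
  data        = <<EOF
{{- range nomadService "alertmanager" }}
alertmanager_url = "http://{{ .Address }}:{{ .Port }}"
{{- end }}
EOF
}
```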
With 1.4.x I see jobs getting killed because of template failures like these, even though the referenced service is running fine:
alertmanager service: […]
promteams service: […]
Additionally sometimes a template will not update even though the nomad service changed.
Reproduction steps
Expected Result
Tasks should not be randomly killed, and templates should be properly updated when the service changes.
Actual Result