health check fails during grace period #9431
Hi @kneufeld! Is your jobspec something you can share?
@tgross As a matter of fact it is... I'll email the jobspec.
Just wanted to circle back on this @kneufeld: I've been trying to reproduce the problem by building up a jobspec incrementally with all the features of the one you provided, and I haven't yet hit on the source of the issue. I see you have #9307 open as well, so hopefully folks in that thread will have more data for you.
Thank you for taking a look. This isn't a major issue, just something I felt should be documented. #9307 on the other hand is a major problem for us and will likely have us moving to Kubernetes.
I'm finding a similar thing happening with one of my jobs. I have a MySQL container running, and MySQL does a temporary startup/shutdown cycle before starting properly. Based on the docs, the grace parameter in the check_restart stanza should keep failing health checks from counting against the service during this window. But that's not happening: I'm seeing failed health check notices in Consul's log output even though the 60s grace period hasn't passed. Nomad version: v1.0.4 (9294f35). Here's a timeline of events, based on different data sources:
[timeline table omitted]
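For context, the grace behavior being relied on here corresponds to a check_restart stanza along these lines. This is a minimal sketch with illustrative names and values (the service name, port label, and timings are assumptions), not the reporter's actual jobspec:

service {
  name = "mysql"
  port = "db"  # assumes a port labeled "db" in the group network

  check {
    type     = "tcp"
    interval = "10s"
    timeout  = "2s"

    # Per the docs, failing checks during the first 60s after the task
    # starts should not count toward the restart limit.
    check_restart {
      limit = 3
      grace = "60s"
    }
  }
}

The report above is that failing-check notices show up well inside this window.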
We can confirm the same issue on Nomad v1.1.4.
The same issue here with 1.5.1. The grace period is not honored; the task is marked as unhealthy just a few seconds after Nomad reports "task started". Anything we can do to help with this? Job spec:
{
"ID": "application",
"Name": "application",
"Type": "service",
"Datacenters": [
"dev"
],
"Meta": {
"timestamp": "1681980389",
"version": "cb84602"
},
"Constraints": [
{
"LTarget": "${node.class}",
"RTarget": "build",
"Operand": "!="
}
],
"Update": {
"Stagger": 10000000000,
"MaxParallel": 1,
"HealthCheck": "checks",
"MinHealthyTime": 25000000000,
"HealthyDeadline": 100000000000,
"AutoRevert": true,
"AutoPromote": true,
"Canary": 1
},
"TaskGroups": [
{
"Name": "application",
"Count": 1,
"Networks": [
{
"Mode": "bridge",
"DynamicPorts": [
{
"Label": "healthcheck",
"Value": 0,
"To": -1,
"HostNetwork": "default"
},
{
"Label": "envoy-metrics",
"Value": 0,
"To": 9102,
"HostNetwork": "default"
}
]
}
],
"Services": [
{
"Name": "application",
"Port": 80,
"PortLabel": "80",
"AddressMode": "auto",
"Tags": [
"namespace=default",
"service=application",
"awesomejobid=e27853c7-8a3b-4332-9efc-8aa835775c93",
"external",
"mesh",
"prometheus_metrics_path=/metrics",
"prometheus_metrics_address=${attr.unique.network.ip-address}:${NOMAD_PORT_metrics}",
"prometheus_metrics=false"
],
"Checks": [
{
"Type": "http",
"Protocol": "http",
"Port": null,
"Path": "/api/health",
"AddressMode": "",
"Interval": 10000000000,
"Timeout": 2000000000,
"CheckRestart": {
"Limit": 2,
"Grace": 30000000000,
"IgnoreWarnings": true
},
"PortLabel": "healthcheck",
"Expose": true
}
],
"Connect": {
"Native": false,
"SidecarService": {
"Tags": [
"mesh",
"ingress",
"prometheus_metrics_address=${NOMAD_HOST_ADDR_envoy-metrics}",
"prometheus_metrics_path=/metrics"
],
"Proxy": {
"ExposeConfig": {
"Path": [
{
"Path": "/api/health",
"Protocol": "http",
"LocalPathPort": 80,
"ListenerPort": "healthcheck"
}
]
},
"Upstreams": [
{
"DestinationName": "backend-app",
"LocalBindPort": 13370
}
],
"Config": {
"envoy_prometheus_bind_addr": "0.0.0.0:9102",
"local_request_timeout_ms": 0
}
}
},
"SidecarTask": {
"ShutdownDelay": 10000000000,
"KillTimeout": 10000000000,
"Config": {
"labels": [
{
"environment": "dev",
"namespace": "default",
"service": "application-connect-proxy"
}
]
},
"Resources": {
"CPU": 50,
"MemoryMB": 96
}
}
}
}
],
"Tasks": [
{
"Name": "application",
"Vault": null,
"Driver": "docker",
"Config": {
"network_mode": null,
"image": "the-image:123abc",
"volumes": [
"/usr/local/share/ca-certificates/:/certs/"
],
"logging": [],
"labels": []
},
"Services": null,
"Env": {
"NAMESPACE": "default"
},
"Templates": [],
"Resources": {
"CPU": 100,
"MemoryMB": 256
},
"RestartPolicy": {
"Attempts": 10,
"Interval": 1800000000000,
"Delay": 15000000000,
"Mode": "fail"
},
"Artifacts": [],
"KillTimeout": 10000000000,
"ShutdownDelay": 10000000000
}
],
"EphemeralDisk": {
"SizeMB": 300
}
}
]
}
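For readability: durations in the API JSON above are in nanoseconds. The health check and its check_restart block translate to roughly this HCL (a hand translation of the values above, not generated output):

check {
  type     = "http"
  path     = "/api/health"
  port     = "healthcheck"
  expose   = true
  interval = "10s"  # 10000000000 ns
  timeout  = "2s"   # 2000000000 ns

  check_restart {
    limit           = 2
    grace           = "30s"  # 30000000000 ns
    ignore_warnings = true
  }
}

So the task should get 30 seconds of grace plus two failed checks 10 seconds apart before a restart, yet it is marked unhealthy within seconds of starting.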
Hi, this happened not on the first start but after the application had been running for some time: it failed, and during the restart it waited less than grace + check interval * limit. Looking at the code and the logs, I noticed one thing that I don't know whether it's intentional (in which case I haven't seen it documented) or a bug: in the case of a restart after the first boot, the grace interval includes the restart delay. So if the delay is 25s and the check_restart grace is 60s, the application gets a useful grace of only 35 seconds, and then the check_restart limits are applied (check interval * (limit - 1)).
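A minimal sketch of that arithmetic, under the assumption described above that the grace timer starts when the restart is scheduled rather than when the task actually starts (illustrative Go, not Nomad's implementation):

package main

import (
	"fmt"
	"time"
)

func main() {
	restartDelay := 25 * time.Second // restart.delay
	grace := 60 * time.Second        // check_restart.grace
	interval := 10 * time.Second     // check.interval
	limit := 2                       // check_restart.limit

	// If the grace timer starts before the restart delay has elapsed,
	// the delay consumes part of the grace window.
	effectiveGrace := grace - restartDelay // 35s instead of the full 60s

	// Once grace expires, roughly interval * (limit - 1) more must pass
	// before the failing check triggers a restart.
	untilRestart := effectiveGrace + interval*time.Duration(limit-1)

	fmt.Println("effective grace:", effectiveGrace)
	fmt.Println("time until restart decision:", untilRestart)
}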
Hey @jorgemarey, thanks for the sleuthing regarding the restart delay and grace. I think we'll probably want to change the behavior, but at least document it clearly if not. Added to our todo list, though not in the immediate todo column. It will realistically be at least a month before we can take this up, so if anybody wants to take a crack at making sure the grace period accounts for the delay, let us know; we'd accept a PR!
Hi @mikenomitch, I'll try to make a PR for this if I find the time.
Hi @mikenomitch, I was looking into this again. I saw that this only happens when the check is defined at the group level (if it's defined at the task level, the delay is applied correctly). I think the problem is that the group service hook is executed in the prerun stage, so the workload is registered with the checkRestarter before the restart has completed. Could the group_service_hook's RegisterWorkload be moved to the poststart phase, as it is in the service_hook? I don't know if this conflicts with other things, but it should solve this issue, at least the part I'm referring to: that check_restart.grace includes restart.delay.
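To illustrate the ordering difference being suggested: a sketch with made-up types (group_service_hook, service_hook, and RegisterWorkload are names from the comment above; everything else here is hypothetical, not Nomad's real code):

package main

import "fmt"

// Registrar stands in for whatever performs the Consul registration and
// starts the check_restart grace timer (hypothetical interface).
type Registrar interface {
	RegisterWorkload() error
}

type logRegistrar struct{ name string }

func (r logRegistrar) RegisterWorkload() error {
	fmt.Printf("%s: registered; grace timer starts now\n", r.name)
	return nil
}

// As described above, the group-level hook registers during prerun, which
// runs before the restart delay has elapsed, so grace starts too early.
type groupServiceHook struct{ consul Registrar }

func (h *groupServiceHook) Prerun() error {
	return h.consul.RegisterWorkload()
}

// The task-level hook registers in poststart, after the task is running,
// so the full grace period covers startup.
type taskServiceHook struct{ consul Registrar }

func (h *taskServiceHook) Poststart() error {
	return h.consul.RegisterWorkload()
}

func main() {
	group := &groupServiceHook{consul: logRegistrar{name: "group service"}}
	task := &taskServiceHook{consul: logRegistrar{name: "task service"}}
	_ = group.Prerun()   // fires before the restart delay has elapsed
	_ = task.Poststart() // fires after the task has actually started
}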
Nomad version
Nomad v1.0.0-beta3 (fcb32ef7ba037b2b83e7a0fd1c53c7410a2990db)
Operating system and Environment details
Ubuntu 18.04.5 LTS
Issue
During startup the task is occasionally killed for being unhealthy despite:
check_restart { grace = "30s" }
migrate { healthy_deadline = "10m" }
Reproduction steps
Only happens occasionally
Job file (if appropriate)
Complicated job with 5 task groups, one task per group, Connect proxies between them. Can email if desired.
Nomad Client logs (if appropriate)
From hashi-ui.
Nomad Server logs (if appropriate)
Can't find anything useful. Correlating hashi-ui's relative timestamps with journald's absolute timestamps is annoying. This is actually from Friday (3 days ago).