allow querying Nomad service discovery checks by service (for consul-template) #23317
Hi @msirovy! You almost certainly are hitting this because your job doesn't have an `update` block. In #23326 we're discussing making omitting this field a warning when submitting service jobs.
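For reference, here is a minimal sketch of a group-level `update` block that gates a rolling deployment on service checks; the group name, count, and timings are illustrative and not taken from the reporter's job:

```hcl
group "web" {
  count = 3

  # Without an update block, Nomad replaces allocations without waiting
  # for them to become healthy. This sketch gates the rollout on checks.
  update {
    max_parallel     = 1
    health_check     = "checks"  # consider an allocation healthy only once its checks pass
    min_healthy_time = "10s"
    healthy_deadline = "2m"
    auto_revert      = true
  }

  # ... task, service, and check definitions go here ...
}
```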
Thanks for your reply. I've tried to tune these options, but without any positive result. In the meantime I've continued debugging and I've been able to isolate the problem a bit more.
I am not able to debug it further, but if you have any recommendations I'll try them.
Thanks for that extra context @msirovy. I took another look through the Nomad Services feature and I suspect that the problem you're seeing is because the consul-template `nomadService` function has no visibility into health check status. There's an architectural reason for this, which is that the Nomad server doesn't record Nomad service health checks. When you query them, the server sends a request to the client that has those allocations. This was identified as "Future Work" in the original design document for Nomad service checks (internal doc ref for Nomad engineers reading this).
We'd need to implement this in order to have the checks queryable by service, so that consul-template could take them into account.
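For context, this is a sketch of the kind of ingress `template` stanza involved; the service name, upstream name, and file paths are illustrative. Because check status isn't exposed to consul-template, `nomadService` renders every registered instance, healthy or not:

```hcl
template {
  destination   = "local/upstreams.conf"
  change_mode   = "signal"
  change_signal = "SIGHUP"

  data = <<-EOT
    upstream demo_app {
      # nomadService returns all registered instances of the service;
      # there is currently no way to filter them by check status here.
      {{- range nomadService "demo-app" }}
      server {{ .Address }}:{{ .Port }};
      {{- end }}
    }
  EOT
}
```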
Nomad version
Tested on Nomad versions 1.7.3 and 1.8.0
Operating system and Environment details
Linux XXX 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
Docker version 26.0.0, build 2ae903e
Issue
We have a deployment using php-fpm with nginx, but I am able to reproduce the same results even with this example deployment. When I do an update of the deployment I can see a short downtime (2-10 s). I expect that Nomad will start the new version, wait until the health checks pass, and only then add the new deployment to the services, but the broken deployment becomes available sooner. This is a critical issue for us and we can't move to the production stage before we solve it.
The strange thing is that when I use a deployment where there is only one job in the group, the update works without downtime...
I am not able to share my php-fpm + nginx deployment with you because it uses our private registry, but this example behaves the same way.
Reproduction steps
I have a small development Nomad cluster with 5 nodes (1 master, 4 nodes), using Nomad service discovery and an nginx ingress (similar to https://github.com/theztd/startup-infra-docker/blob/main/files/jobs/nginx-ingress/deploy.nomad)
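For clarity, this is the general shape of a Nomad-native service registration with a health check that such a setup relies on; the service name, port label, and check path are illustrative, not copied from the actual job:

```hcl
service {
  provider = "nomad"   # register in Nomad's built-in service discovery
  name     = "demo-app"
  port     = "http"

  check {
    type     = "http"
    path     = "/health"
    interval = "5s"
    timeout  = "2s"
  }
}
```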
Expected Result
The update completes without downtime.
Actual Result
Short but noticeable downtime during the update.
Job file (if appropriate)
Nomad Server logs (if appropriate)
I captured the event stream with this small piece of code:
Nomad Client logs (if appropriate)