System job not restarting after client failure. #15069
Comments
Hi @blmhemu 👋 Thanks for the report. Do you happen to have more logs from around the time the issue happened, so we can get a better picture of what was going on with the connection between the client and the servers? Thanks!
Hey! I did not find any relevant logs from that time. But if I change the network mode to normal, this issue does not occur. I think using the bridge CNI plugin is causing this issue. Also note that there has been a client restart.
Do you have any logs available? It's kind of hard to investigate without more information 😅
This is one log I found.
Hi, just wanted to report that we are also seeing this exact problem, where some allocations of system jobs are not restarted on a node in certain cases (often related to the node having been disconnected from the cluster). For other people having issues with this, here is what we usually do when we discover this:
I.e. force Nomad to re-evaluate all system jobs in our cluster. It's not pretty, but it fixes the missing allocations without having to restart the jobs.
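The workaround described above (forcing a re-evaluation of every system job) could be scripted roughly as follows. The commenter's actual command did not survive in this thread, so this is only a sketch under assumptions: it talks to Nomad's HTTP API (`GET /v1/jobs` and `POST /v1/job/:job_id/evaluate`) on a local agent at the default address, with no ACL token, no TLS, and only the default namespace. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: local agent on the default HTTP address, no ACLs or TLS.
NOMAD_ADDR = "http://127.0.0.1:4646"


def system_job_ids(job_stubs):
    """Return the IDs of system jobs that have not been stopped."""
    return [
        job["ID"]
        for job in job_stubs
        if job.get("Type") == "system" and not job.get("Stop", False)
    ]


def force_evaluate_system_jobs(addr=NOMAD_ADDR):
    """Ask the servers to create a new evaluation for each system job."""
    with urllib.request.urlopen(f"{addr}/v1/jobs") as resp:
        job_stubs = json.load(resp)

    for job_id in system_job_ids(job_stubs):
        path = urllib.parse.quote(job_id, safe="")
        req = urllib.request.Request(
            f"{addr}/v1/job/{path}/evaluate", data=b"", method="POST"
        )
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
        print(f"{job_id}: created evaluation {result.get('EvalID')}")


if __name__ == "__main__":
    force_evaluate_system_jobs()
```

The same effect can likely be had from the CLI by running `nomad job eval <job_id>` once per system job; the script only automates the loop.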
I think I've started encountering the same issue. I was able to mitigate the issue by putting This is a wild guess, but in a quick restart (e.g. in a VM), if the client is restarted before the But the network namespace seems to be down, or the client doesn't attempt to re-create it. I can see this because, if no new allocations are scheduled, there won't be a "nomad" bridge interface until the client attempts to create a new allocation. The sleep forces the client to be considered down.
This feels like a duplicate of, or closely related to, #12023
Nomad version
1.4.2
Operating system and Environment details
Ubuntu arm64
Issue
If a client goes down, the system job allocation on that client is not restarted when the client comes back, unless this is done manually.
Also, the job status on the client shows 2 failed. It should be
1 failed 1 passed
because, as you can see below, there is one job running.
Reproduction steps
Run a system job. Take the client (or the whole cluster?) down. Bring the nodes back up. Check whether the system job has all of its allocations.
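The last step (checking whether the system job has all of its allocations) can be approximated with a small script. This is a sketch under assumptions, not part of the original report: it reads `GET /v1/nodes` and `GET /v1/job/:job_id/allocations` from a local agent at the default address without ACLs, uses a hypothetical job ID `my-system-job`, and assumes the job has no constraints that legitimately exclude some nodes.

```python
import json
import urllib.parse
import urllib.request

# Assumptions: local agent, default address, no ACLs; the job ID is hypothetical.
NOMAD_ADDR = "http://127.0.0.1:4646"
JOB_ID = "my-system-job"


def nodes_missing_allocation(node_stubs, alloc_stubs):
    """Return IDs of ready, eligible nodes with no running allocation."""
    covered = {
        alloc["NodeID"]
        for alloc in alloc_stubs
        if alloc.get("ClientStatus") == "running"
    }
    return [
        node["ID"]
        for node in node_stubs
        if node.get("Status") == "ready"
        and node.get("SchedulingEligibility") == "eligible"
        and node["ID"] not in covered
    ]


def fetch_json(url):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def check_job(job_id=JOB_ID, addr=NOMAD_ADDR):
    """Print every node on which the system job should be running but is not."""
    nodes = fetch_json(f"{addr}/v1/nodes")
    path = urllib.parse.quote(job_id, safe="")
    allocs = fetch_json(f"{addr}/v1/job/{path}/allocations")
    for node_id in nodes_missing_allocation(nodes, allocs):
        print(f"no running allocation of {job_id} on node {node_id}")


if __name__ == "__main__":
    check_job()
```

An empty output would suggest every ready node has a running allocation; any printed node ID is a candidate for the behaviour described in this issue.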
Expected Result
All allocations present.
Actual Result
Not all allocations present.
Job file (if appropriate)
Same as #14932
Nomad Server logs (if appropriate)
Could see the alloc was killed due to
Nomad Client logs (if appropriate)