
System job not restarting after client failure. #15069

Open
blmhemu opened this issue Oct 28, 2022 · 7 comments
blmhemu commented Oct 28, 2022

Nomad version

1.4.2

Operating system and Environment details

Ubuntu arm64

Issue

If the client goes down, the system job on that client is not restarted unless this is done manually.

[Screenshot: 2022-10-28 at 1:51 PM]

The job status on the client also shows 2 failed. It should be 1 failed and 1 running, because, as you can see below, one allocation is running.

Reproduction steps

Run a system job. Take the client (or the whole cluster?) down. Bring the nodes back up. Check whether the system job has all of its allocations.

Expected Result

All allocations present.

Actual Result

Not all allocations present.

Job file (if appropriate)

Same as #14932

Nomad Server logs (if appropriate)

I could see the alloc was killed due to:

Template failed: nomad.var.get(nomad/jobs/caddy/caddy/[email protected]): Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)

Nomad Client logs (if appropriate)

lgfa29 (Contributor) commented Nov 4, 2022

Hi @blmhemu 👋

Thanks for the report. Do you happen to have more logs from around the time the issue happened, so we can get a better picture of what was going on with the connection between the client and the servers?

Thanks!

blmhemu (Author) commented Nov 5, 2022

Hey! I did not find any relevant logs from that time, but if I change the network mode to normal, this issue does not occur. I think using the bridge CNI plugin is causing it. Also note that there was a client restart.

lgfa29 (Contributor) commented Nov 7, 2022

Do you have any logs available? It's kind of hard to investigate without more information 😅

blmhemu (Author) commented Nov 13, 2022

2022-10-18T15:04:08Z  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.65.37 has been allocated to c31f3174-7c95-6c1e-d782-ea00579f84c3, duplicate allocation is not allowed

This is one log I found.

ostkrok commented Jan 17, 2024

Hi,

Just wanted to report that we are also seeing this exact problem, where some allocations of system jobs are not restarted on a node in certain cases (often when the node has been disconnected from the cluster). I'll try to dig up some logs and attach them here later today.

For other people having issues with this, here is what we usually do when we discover this:

nomad status | grep system | cut -f 1 -d " " | xargs -L1 nomad job eval

That is, force Nomad to re-evaluate all system jobs in our cluster. It's not pretty, but it fixes the missing allocations without having to restart the jobs.
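For what it's worth, a slightly more defensive variant of the one-liner above selects on the Type column instead of grepping anywhere in the line, so job names that merely contain the word "system" are not matched. This assumes `nomad status` prints a header row followed by ID / Type / Priority / Status columns, which may vary by Nomad version; check against your own output first:

```shell
# Re-evaluate every system job so missing allocations get rescheduled.
# NR > 1 skips the header row; $2 is the Type column in the assumed layout.
nomad status \
  | awk 'NR > 1 && $2 == "system" {print $1}' \
  | xargs -L1 nomad job eval
```

The `xargs -L1` runs `nomad job eval` once per job ID, same as the original pipeline.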

p1u3o commented Jan 26, 2024

I think I've started encountering the same issue. I was able to mitigate it by putting ExecStartPre=/bin/sleep 90 in nomad.service, although the command above also works.
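If anyone wants to try this, a sketch of that mitigation as a systemd drop-in (rather than editing the packaged unit directly) might look like the following; the drop-in path is illustrative and the 90-second value is from this comment, so adjust both to taste:

```ini
# /etc/systemd/system/nomad.service.d/delay-start.conf
# Delays Nomad client startup so the servers mark the node down
# and re-evaluate its system jobs before the client rejoins.
[Service]
ExecStartPre=/bin/sleep 90
```

Run `systemctl daemon-reload` after adding the drop-in for it to take effect.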

This is a wild guess, but on a quick restart (e.g. in a VM), if the client comes back before the heartbeat_grace period is reached, Nomad doesn't seem to consider the client down and attempts to resume the allocations.

But the network namespaces seem to be gone, and Nomad doesn't attempt to re-create them. I can see this because, if no new allocations are scheduled, there is no "nomad" bridge interface until Nomad creates a new allocation.

The sleep forces the client to be considered down.
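For reference, heartbeat_grace is a server-side agent setting. A minimal illustration of where it lives follows; the value shown is Nomad's documented default, not a tuning recommendation:

```hcl
# Nomad server agent configuration (illustrative, not a recommendation).
# heartbeat_grace is the additional window beyond the heartbeat TTL
# before a client is marked down; a fast in-VM restart can finish
# inside this window, which matches the behavior described above.
server {
  enabled         = true
  heartbeat_grace = "10s" # Nomad default
}
```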

@mwild1
Copy link

mwild1 commented Feb 21, 2024

This feels like a duplicate of, or closely related to, #12023

@tgross tgross added the stage/needs-verification Issue needs verifying it still exists label Jun 24, 2024
@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024