
System job not restarting after client failure. #15069

Open
blmhemu opened this issue Oct 28, 2022 · 7 comments
blmhemu commented Oct 28, 2022

Nomad version

1.4.2

Operating system and Environment details

Ubuntu arm64

Issue

If the client goes down, the system job on that client is not restarted unless this is done manually.

[Screenshot: 2022-10-28 at 1:51 PM]

The job status on the client also shows 2 failed. It should be 1 failed and 1 running, because, as you can see below, one allocation is running.

Reproduction steps

Run a system job. Take the client (or the whole cluster?) down. Bring the nodes back up. Check whether the system job has all of its allocations.

Expected Result

All allocations present.

Actual Result

Not all allocations present.

Job file (if appropriate)

Same as #14932

Nomad Server logs (if appropriate)

I could see the alloc was killed due to:

Template failed: nomad.var.get(nomad/jobs/caddy/caddy/[email protected]): Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)

Nomad Client logs (if appropriate)

lgfa29 (Contributor) commented Nov 4, 2022

Hi @blmhemu 👋

Thanks for the report. Do you happen to have more logs from around the time the issue happened, so we can get a better picture of what was going on with the connection between the client and the servers?

Thanks!

blmhemu (Author) commented Nov 5, 2022

Hey! I did not find any relevant logs from that time, but if I change the network mode to normal, this issue does not occur. I think using the bridge CNI plugin is causing it. Also note that there was a client restart.

lgfa29 (Contributor) commented Nov 7, 2022

Do you have any logs available? It's kind of hard to investigate without more information 😅

blmhemu (Author) commented Nov 13, 2022

2022-10-18T15:04:08Z  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.65.37 has been allocated to c31f3174-7c95-6c1e-d782-ea00579f84c3, duplicate allocation is not allowed

This is one log I found.

ostkrok commented Jan 17, 2024

Hi,

Just wanted to report that we are also seeing this exact problem, where some allocations of system jobs are not restarted on a node in certain cases (often when the node has been disconnected from the cluster). I'll try to dig up some logs and attach them here later today.

For other people having issues with this, here is what we usually do when we discover this:

nomad status | grep system | cut -f 1 -d " " | xargs -L1 nomad job eval

That is, force Nomad to re-evaluate all system jobs in our cluster. It's not pretty, but it fixes the missing allocations without having to restart the jobs.
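For what it's worth, a slightly more defensive variant of the one-liner above selects on the Type column instead of grepping anywhere in the line, so job names that merely contain the word "system" are not matched. This assumes `nomad status` prints a header row followed by ID / Type / Priority / Status columns, which may vary by Nomad version; check against your own output first:

```shell
# Re-evaluate every system job so missing allocations get rescheduled.
# NR > 1 skips the header row; $2 is the Type column in the assumed layout.
nomad status \
  | awk 'NR > 1 && $2 == "system" {print $1}' \
  | xargs -L1 nomad job eval
```

The `xargs -L1` runs `nomad job eval` once per job ID, same as the original pipeline.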

p1u3o commented Jan 26, 2024

I think I've started encountering the same issue. I was able to mitigate it by putting ExecStartPre=/bin/sleep 90 in nomad.service, although the command above also works.
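If anyone wants to try this, a sketch of that mitigation as a systemd drop-in (rather than editing the packaged unit directly) might look like the following; the drop-in path is illustrative and the 90-second value is from this comment, so adjust both to taste:

```ini
# /etc/systemd/system/nomad.service.d/delay-start.conf
# Delays Nomad client startup so the servers mark the node down
# and re-evaluate its system jobs before the client rejoins.
[Service]
ExecStartPre=/bin/sleep 90
```

Run `systemctl daemon-reload` after adding the drop-in for it to take effect.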

This is a wild guess, but on a quick restart (e.g. in a VM), if the client comes back before the heartbeat_grace period is reached, Nomad doesn't seem to consider the client down and attempts to resume the allocations.

But the network namespaces seem to be gone, and Nomad doesn't attempt to re-create them. I can see this because, if no new allocations are scheduled, there is no "nomad" bridge interface until Nomad creates a new allocation.

The sleep forces the client to be considered down.
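For reference, heartbeat_grace is a server-side agent setting. A minimal illustration of where it lives follows; the value shown is Nomad's documented default, not a tuning recommendation:

```hcl
# Nomad server agent configuration (illustrative, not a recommendation).
# heartbeat_grace is the additional window beyond the heartbeat TTL
# before a client is marked down; a fast in-VM restart can finish
# inside this window, which matches the behavior described above.
server {
  enabled         = true
  heartbeat_grace = "10s" # Nomad default
}
```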

@mwild1
Copy link

mwild1 commented Feb 21, 2024

This feels like a duplicate of, or closely related to, #12023

@tgross tgross added the stage/needs-verification Issue needs verifying it still exists label Jun 24, 2024
@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024