Docker containers managed by Nomad in bridge network mode are brought back up with broken networks. #19962
Comments
Ah, I failed to mention that the reproduction was done with the default configuration that ships with Nomad, so I don't think it's something weird in there breaking things.
I have this issue; it seems to be caused by the Docker/Nomad service being offline for less than a certain timeout. I worked around it by adding a sleep to the nomad service file that is longer than that timeout. The nomad cluster I use utilises fast-booting lightweight VMs (under 10s), and thus nearly always hits this.
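For reference, a minimal sketch of that workaround as a systemd drop-in for the nomad unit (the 10-second delay is an assumption; tune it to whatever interval applies on your hosts):

```ini
# /etc/systemd/system/nomad.service.d/delay-start.conf
# Hypothetical drop-in: hold Nomad back long enough for Docker to settle.
[Service]
ExecStartPre=/bin/sleep 10
```

Apply it with `systemctl daemon-reload && systemctl restart nomad`.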
Maybe #19886 would help when merged.
Crosslinking #15086 for visibility.
Hi @Jess3Jane and thanks for raising this issue with a great reproduction. I was able to reproduce this locally and have included details below for future readers. I'll add this to our backlog.
Host networking, Docker processes, and the health check endpoint after the initial start:
Task events show restart of the Docker processes:
The health check no longer responds.
The Nomad client host machine (I only had this test job running on my cluster) no longer has a virtual interface configured:
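(The command output did not survive in this copy; the checks above can be re-run with commands roughly like the following, where interface, allocation, and address names are placeholders.)

```sh
# Host-side virtual interfaces Nomad creates for bridge-mode allocations
ip link show type veth
ip link show nomad

# Containers the Docker driver brought back after the daemon restart
docker ps

# Health check endpoint of the test job (placeholder address/port)
curl -v http://<alloc-ip>:<port>/
```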
Not sure whether this is really related, but I have a similar issue together with CNI where port forwarding didn't work after all services were restarted (note: I masked the first two octets of the destination IP address):
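(The rule listing itself was lost in this copy. For anyone comparing, the DNAT rules that the CNI portmap plugin programs can be listed with something like the command below; the chain name is the portmap plugin's default and may differ in other setups.)

```sh
# Show the port-forwarding rules CNI has installed in the NAT table
iptables -t nat -L CNI-HOSTPORT-DNAT -n -v --line-numbers
```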
Seems like a race condition to me. In this case I would expect the job to fail and maybe retry later.
Apologies for closing this; I think GitHub did something silly with automation.
I don't need to restart Docker for this to occur. I'm not sure WHAT is triggering the change, but under bridge networking my allocations are started with just a loopback interface.
Cross-linking #24292 because I think these are probably interrelated due to Docker's management of its own namespace files.
Nomad version
Though we are hitting it in v1.7.2 as well
Operating system and Environment details
We have hit this on multiple machines with slightly different versions, though all are Ubuntu 22.04. These are the details of a completely fresh Digital Ocean instance I used to reproduce the bug.
Issue
We have noticed that when we restart the Docker daemon on our machines, every Nomad job on the client is brought back up with a busted network. To be more specific, it is brought up with no network. For example, my test container before restarting Docker has the following networks:
and after restarting the daemon, is brought back up with just loopback:
This happens with every container, including the Nomad init container. Docker restarts the containers (as expected), the veths get recreated (as expected), but the containers now lack any interfaces other than loopback (unexpected).
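(The interface listings above were trimmed from this copy; they can be gathered with something along these lines, where the container and allocation IDs are placeholders.)

```sh
# Interfaces as seen from inside the test container
docker exec <container-id> ip addr

# Or via Nomad, against the allocation's network namespace
nomad alloc exec <alloc-id> ip addr
```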
Things that might be notable: the `nomad` interface changes from `<BROADCAST,MULTICAST,UP,LOWER_UP>` to `<NO-CARRIER,BROADCAST,MULTICAST,UP>`, and on machines with `systemd-networkd`, its logs complain about the veths losing carrier.
Reproduction steps
1. Install `docker-ce` as per their docs (I used Docker's apt registry to install it).
2. Download https://github.com/containernetworking/plugins/releases/download/v1.0.0/cni-plugins-linux-amd64-v1.0.0.tgz and extract it into `/opt/cni/bin` (see the shell sketch after this list).
3. `systemctl start docker`
4. `systemctl start nomad`
5. Run the test job (job file below).
6. `systemctl restart docker`
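Step 2 above, spelled out as shell using the same release URL and path given in the report:

```sh
# Fetch the CNI reference plugins and unpack them where Nomad expects to find them
curl -L -o /tmp/cni-plugins.tgz \
  https://github.com/containernetworking/plugins/releases/download/v1.0.0/cni-plugins-linux-amd64-v1.0.0.tgz
mkdir -p /opt/cni/bin
tar -xzf /tmp/cni-plugins.tgz -C /opt/cni/bin
```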
Expected Result
The ip/port combo that the job binds should be `curl`-able. It is before Docker is restarted.
Actual Result
If you curl the ip/port combo it will complain about having no route to host:
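(The exact output was not preserved here; roughly, the failure looks like this, with a placeholder address and port.)

```sh
$ curl http://<alloc-ip>:<port>/
curl: (7) Failed to connect to <alloc-ip> port <port>: No route to host
```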
This makes sense, as executing `ip addr` from within the container will now reveal that the container has lost its bridge-network veth.
Job file (if appropriate)
We've noticed this happen with every job but the job file I used for the reproduction is:
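(The original job file did not survive in this copy of the issue. A minimal job in the same spirit, bridge networking with a single Docker task exposing one HTTP port and a health check, would look something like the sketch below; the job name, image, port, and check settings are illustrative, not the reporter's actual values.)

```hcl
job "bridge-test" {
  datacenters = ["dc1"]

  group "web" {
    network {
      # Bridge mode is what exhibits the broken-network behaviour described above
      mode = "bridge"
      port "http" {
        to = 80
      }
    }

    service {
      name     = "bridge-test"
      port     = "http"
      provider = "nomad"

      check {
        type     = "http"
        path     = "/"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:alpine"
        ports = ["http"]
      }
    }
  }
}
```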
The toy instance I used for reproduction has a broken journal so sadly I have no logs from that to provide. If reproduction turns out to be an issue I'd be happy to send over some logs from one of our actual failing instances but I have a hunch this won't be that hard to reproduce.