Network access becomes unavailable #490
For me, the issue occurs immediately, even right after synchronizing the environment.
It seems the reason is that the nameserver is not reachable for DNS resolution. We might either statically add these addresses to the route configuration or determine them dynamically by parsing the output of the respective tool.
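As a minimal sketch of the dynamic variant (assuming the effective nameservers can be read from an `/etc/resolv.conf`-style file, which is not guaranteed with systemd-resolved; function names are illustrative):

```shell
# List the nameservers from a resolv.conf-style file.
nameservers() { awk '/^nameserver/ { print $2 }' "$1"; }

# Probe each nameserver with a single ping to see whether it is routable.
# Prints one "reachable:"/"unreachable:" line per configured server.
check_nameservers() {
  nameservers "$1" | while read -r ns; do
    if ping -c 1 -W 2 "$ns" > /dev/null 2>&1; then
      echo "reachable: $ns"
    else
      echo "unreachable: $ns"
    fi
  done
}
```
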
We discussed this issue and found it most surprising that, in some cases, the containers can resolve domain names even though the nameservers should not be routable.
Changing this option, we see that nothing changes. However, when digging deeper into the container configuration, we see that the container option is not applied because we are using a CNI network. Therefore, we had to configure DNS via the Nomad allocation configuration (which we define through Poseidon). With these changes, we are now again able to access the internet with network-enabled runners.
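For illustration, Nomad allows setting DNS per task group in the jobspec's `network` block (a sketch only; the network name and addresses below are placeholders, not our actual configuration, which Poseidon fills in):

```hcl
network {
  mode = "cni/secure-bridge"   # hypothetical CNI network name

  dns {
    # Placeholder resolvers and search domain.
    servers  = ["10.0.0.53"]
    searches = ["example.internal"]
  }
}
```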
Currently, we are in the process of enabling full IPv6 connectivity (between our internal hosts, but also from containers to the internet). As part of this setup, we might also need to configure our secure bridge to work with IPv6 (while excluding internal resources, probably by excluding the respective internal address range).
Our latest changes work well and ensure we always have the desired DNS settings 💪
This discovery might well be linked to hashicorp/nomad#19962, which already describes the issue. I would assume (without any confirmation yet) that this happened to us, too: when there is a new Docker release, we install it, usually requiring a Docker service restart. As a consequence, the network loss could occur. I haven't fully checked the linked issue for a reasonable workaround, but I am afraid the issue has not been solved completely yet.
Good finding 💪 I'm glad to learn about this issue after all the times we wondered whether we were seeing the same problem 😄 The reasoning described in the issue seems plausible: when using CNI, Nomad handles the network interfaces instead of Docker. When we restart Docker, Nomad recreates the containers, but not the CNI network interfaces on the host. The containers are then no longer able to establish network access. As Nomad currently prioritizes this issue, it might be solved upstream in the future. In the meantime, we could create a check in our Nomad Agent Ansible playbook.
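Such a check could, for example, verify from a throwaway container that DNS still resolves, and let an Ansible handler restart the Nomad agent on failure (a sketch with assumed names; `busybox`, `example.com`, and the restart command are placeholders):

```shell
# Sketch: probe DNS resolution from inside a fresh container; a failure
# suggests the CNI interfaces were lost, e.g. after a Docker restart.
check_container_dns() {
  if docker run --rm busybox nslookup example.com > /dev/null 2>&1; then
    echo "network ok"
  else
    echo "network broken"
    # An Ansible handler could restart the Nomad agent here, e.g.:
    # systemctl restart nomad
  fi
}
```
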
We continuously improve the service, so any change for better reliability is warmly welcomed 👍
Yes, I would also continue with an intermediate solution of our own. In chats with my colleagues today, we discovered another potential solution: systemd. The idea would be to link the Docker and Nomad units, since this would automatically resolve the issue (at least the occurrences caused by Docker restarting). We could give it a try and observe the behavior. For overwriting a systemd unit, one can just add a drop-in config (manually, or via our Ansible setup).
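As an illustrative sketch (file path assumed, not taken from our actual setup): a drop-in for the Nomad unit could propagate Docker restarts to Nomad, so the CNI interfaces are recreated together with the containers.

```ini
# /etc/systemd/system/nomad.service.d/override.conf (assumed path)
[Unit]
# PartOf= propagates stop/restart of docker.service to this unit;
# After= keeps the start ordering sensible.
PartOf=docker.service
After=docker.service
```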
Thank you for this other solution! It is less complicated and more reliable. Let's go with the systemd approach.
Awesome, sounds great! I've merged (and deployed) the corresponding PR, and thus will close this issue for now.
An execution environment providing network access at first seems to work correctly. However, after some time (and some unknown events), the environment loses network access. The allocation itself is still running on Nomad, but unfortunately without any possibility to reach the internet.
Within a bash container, you can test network access through:
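For example (a sketch; `example.com` is just a placeholder hostname, and separate probes help distinguish DNS failures from plain routing failures):

```shell
# Each probe prints "ok: <cmd>" or "failed: <cmd>" instead of aborting,
# so the individual failure mode is visible.
probe() { if "$@" > /dev/null 2>&1; then echo "ok: $1"; else echo "failed: $1"; fi; }

probe nslookup example.com                      # DNS resolution only
probe ping -c 1 -W 2 1.1.1.1                    # IP routing, no DNS involved
probe wget -q -O /dev/null https://example.com  # full HTTP round trip
```
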
So far, we don't know when the error occurs. However, resynchronizing the environment from CodeOcean fixes the issue.