Network access becomes unavailable #490
For me, the issue occurs immediately, even right after synchronizing the environment.
It seems the reason is that the nameserver is not reachable for DNS resolution. We might either statically add these addresses to the route configuration or determine them dynamically by parsing the output of the respective tool.
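As a minimal sketch of the dynamic variant (assuming the effective nameservers can be read from an `/etc/resolv.conf`-style file, which is not guaranteed with systemd-resolved; function names are illustrative):

```shell
# List the nameservers from a resolv.conf-style file.
nameservers() { awk '/^nameserver/ { print $2 }' "$1"; }

# Probe each nameserver with a single ping to see whether it is routable.
# Prints one "reachable:"/"unreachable:" line per configured server.
check_nameservers() {
  nameservers "$1" | while read -r ns; do
    if ping -c 1 -W 2 "$ns" > /dev/null 2>&1; then
      echo "reachable: $ns"
    else
      echo "unreachable: $ns"
    fi
  done
}
```
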
We discussed this issue and found it most surprising that, in some cases, the containers can resolve domain names even though the nameservers should not be routable.
Changing this option, we see that nothing changes. However, when digging deeper into the container configuration, we see that the container option is not applied because we are using a CNI network. Therefore, we had to configure DNS via the Nomad allocation configuration (which we define through Poseidon). With these changes, we are now again able to access the internet with network-enabled runners.
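For illustration, Nomad allows setting DNS per task group in the jobspec's `network` block (a sketch only; the network name and addresses below are placeholders, not our actual configuration, which Poseidon fills in):

```hcl
network {
  mode = "cni/secure-bridge"   # hypothetical CNI network name

  dns {
    # Placeholder resolvers and search domain.
    servers  = ["10.0.0.53"]
    searches = ["example.internal"]
  }
}
```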
Currently, we are in the process of enabling full IPv6 connectivity (between our internal hosts, but also from containers to the internet). As part of this setup, we might also need to configure our secure bridge to work with IPv6 (while excluding internal resources, probably by excluding the respective internal address range).
Our latest changes work well and ensure we always have the desired DNS settings 💪
This discovery might well be linked to hashicorp/nomad#19962, which already describes the issue. I would assume (without any confirmation yet) that this happened to us, too: when there is a new Docker release, we install it, usually requiring a Docker service restart. As a consequence, the network loss could occur. I haven't fully checked the linked issue for a reasonable workaround, but I am afraid the issue has not been solved completely yet.
Good finding 💪 I'm glad to learn about this issue after all the times we wondered whether we were seeing the same problem 😄 The reasoning described in the issue seems plausible: when using CNI, Nomad handles the network interfaces instead of Docker. When we restart Docker, Nomad recreates the containers, but not the CNI network interfaces on the host. The containers are then no longer able to establish network access. As Nomad currently prioritizes this issue, it might be solved upstream in the future. In the meantime, we could create a check in our Nomad Agent Ansible playbook.
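Such a check could, for example, verify from a throwaway container that DNS still resolves, and let an Ansible handler restart the Nomad agent on failure (a sketch with assumed names; `busybox`, `example.com`, and the restart command are placeholders):

```shell
# Sketch: probe DNS resolution from inside a fresh container; a failure
# suggests the CNI interfaces were lost, e.g. after a Docker restart.
check_container_dns() {
  if docker run --rm busybox nslookup example.com > /dev/null 2>&1; then
    echo "network ok"
  else
    echo "network broken"
    # An Ansible handler could restart the Nomad agent here, e.g.:
    # systemctl restart nomad
  fi
}
```
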
We continuously improve the service, so any change for better reliability is warmly welcomed 👍
Yes, I would also continue with an intermediate solution of our own. In chats with my colleagues today, we discovered another potential solution: systemd. The idea would be to link the Docker and Nomad units, since this would automatically resolve the issue (at least the occurrences caused by Docker restarting). We could give it a try and observe the behavior. For overwriting a systemd unit, one can just add a drop-in config (manually, or via our Ansible setup).
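As an illustrative sketch (file path assumed, not taken from our actual setup): a drop-in for the Nomad unit could propagate Docker restarts to Nomad, so the CNI interfaces are recreated together with the containers.

```ini
# /etc/systemd/system/nomad.service.d/override.conf (assumed path)
[Unit]
# PartOf= propagates stop/restart of docker.service to this unit;
# After= keeps the start ordering sensible.
PartOf=docker.service
After=docker.service
```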
Thank you for this other solution! It is less complicated and more reliable. Let's go with the systemd approach.
Awesome, sounds great! I've merged (and deployed) the corresponding PR, and thus will close this issue for now.
An execution environment providing network access at first seems to work correctly. However, after some time (and some unknown events), the environment loses network access. The allocation itself is still running on Nomad, but unfortunately without any possibility to reach the internet.
Within a bash container, you can test network access through:
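For example (a sketch; `example.com` is just a placeholder hostname, and separate probes help distinguish DNS failures from plain routing failures):

```shell
# Each probe prints "ok: <cmd>" or "failed: <cmd>" instead of aborting,
# so the individual failure mode is visible.
probe() { if "$@" > /dev/null 2>&1; then echo "ok: $1"; else echo "failed: $1"; fi; }

probe nslookup example.com                      # DNS resolution only
probe ping -c 1 -W 2 1.1.1.1                    # IP routing, no DNS involved
probe wget -q -O /dev/null https://example.com  # full HTTP round trip
```
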
So far, we don't know when the error occurs. However, resynchronizing the environment from CodeOcean fixes the issue.