Network access becomes unavailable #490

Closed
MrSerth opened this issue Nov 1, 2023 · 9 comments · Fixed by #566
Assignees
Labels
bug Something isn't working deployment Everything related to our production environment

Comments

Member

MrSerth commented Nov 1, 2023

An execution environment providing network access initially seems to work correctly. However, after some time (and some unknown events), the environment loses its network access. The allocation itself is still running on Nomad, but without any way to reach the internet.

Within a bash container, you can test network access through:

curl api.ipify.org

We don't yet know when the error occurs. However, resynchronizing the environment from CodeOcean fixes the issue.
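A failing `curl api.ipify.org` does not say whether name resolution or the connection itself is broken. The following is only a minimal sketch of how the two cases could be told apart from inside a container, assuming Python is available there; the function name `diagnose` and its return labels are made up for illustration.

```python
import socket


def diagnose(host="api.ipify.org", port=80, timeout=3.0,
             resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Return "dns-failure", "connect-failure", or "ok" (hypothetical labels)."""
    try:
        # Step 1: can the configured nameservers resolve the host name?
        resolve(host, port)
    except socket.gaierror:
        return "dns-failure"
    try:
        # Step 2: can we open a TCP connection to the resolved host?
        connect((host, port), timeout=timeout).close()
    except OSError:
        return "connect-failure"
    return "ok"
```

Calling `diagnose()` inside an affected container would then hint at whether to look at the DNS setup or at the routes.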

[Screenshot, 2023-11-01, 18:52:43]

@MrSerth MrSerth added bug Something isn't working deployment Everything related to our production environment labels Nov 1, 2023
Contributor

mpass99 commented Mar 26, 2024

For me, the issue occurs immediately, even right after synchronizing the environment.
We can narrow the error down to cni/secure-bridge, as replacing cni/secure-bridge with bridge in the network mode definition resolves the issue.
More precisely, internet access also works with the cni/secure-bridge network mode when the specified routes are replaced with just a wildcard: { "dst": "0.0.0.0/0" }. I will continue investigating tomorrow why the route configuration breaks internet access.

Contributor

mpass99 commented Mar 27, 2024

The cause seems to be that the nameservers are not reachable for DNS resolution.
When Nomad starts a container, it registers three nameservers from our local network (10.224.x.x).
Since these address ranges are not defined in the cni/secure-bridge.conflist, the containers cannot reach the nameservers and therefore cannot resolve domain names.

We might either statically add these addresses to the route configuration or parse the output of resolvectl to determine the DNS servers dynamically.
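The dynamic option could look roughly like the sketch below. It assumes the text comes from `resolvectl dns` and has the `Link N (iface): addr addr` shape shown in `SAMPLE_OUTPUT`; the sample addresses and the function names are hypothetical, and the `{"dst": ...}` shape simply mirrors the wildcard route mentioned earlier.

```python
import ipaddress
import json

# Hypothetical sample of `resolvectl dns` output; the addresses are made up.
SAMPLE_OUTPUT = """\
Global:
Link 2 (eth0): 10.224.0.10 10.224.0.11
Link 3 (docker0):
"""


def dns_servers(resolvectl_output):
    """Collect the unique, non-link-local DNS server addresses, in order."""
    servers = []
    for line in resolvectl_output.splitlines():
        # Everything after the first colon is a space-separated address list.
        _, _, rest = line.partition(":")
        for token in rest.split():
            try:
                address = ipaddress.ip_address(token)
            except ValueError:
                continue  # skip anything that is not a plain IP address
            if address.is_link_local or address in servers:
                continue
            servers.append(address)
    return servers


def routes_for(servers):
    """Build one host route per DNS server (/32 for IPv4, /128 for IPv6)."""
    return [{"dst": f"{a}/{a.max_prefixlen}"} for a in servers]


print(json.dumps(routes_for(dns_servers(SAMPLE_OUTPUT)), indent=2))
```

The resulting entries would still have to be merged into the routes of the conflist by whatever provisions that file.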

Contributor

mpass99 commented Mar 28, 2024

We discussed this issue and found it most surprising that, in some cases, the containers can resolve domain names even though the nameservers should not be routable.
However, we agreed that the underlying issue is not that the containers cannot reach our nameservers, but that the containers are not using 8.8.8.8, the nameserver configured via the Docker daemon.
We assumed that this might be caused by the introduction of the DNS option in the CNI secure bridge release.

Details

Changing this option has no effect. However, when digging deeper into the container configuration, we see that the container option ResolvConfPath differs from the default (value: /opt/nomad/data/alloc/xyz/default-task/resolv.conf).
With this hint pointing at Nomad, we read the documentation more carefully again and notice

dns (DNSConfig: nil) - Sets the DNS configuration for the allocations. By default all task drivers will inherit DNS configuration from the client host. DNS configuration is only supported on Linux clients at this time. Note that if you are using a mode="cni/*", these values will override any DNS configuration the CNI plugins return.

that, because we are using a cni/* mode, the Nomad DNS configuration always overrides other configurations.

Therefore, we had to set the DNS configuration in the Nomad allocation configuration (which we define with Poseidon). With these changes, network-enabled runners are able to access the internet again.
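For illustration only, the network part of a job in Nomad's JSON format might then look like the fragment built below. This is a sketch, not the actual Poseidon change: the field names (`Networks`, `Mode`, `DNS`, `Servers`) are assumed to match Nomad's JSON job format, and the group name is hypothetical.

```python
import json


def network_with_dns(mode="cni/secure-bridge", servers=("8.8.8.8",)):
    """Network entry with an explicit DNS block (field names are assumptions)."""
    return {"Mode": mode, "DNS": {"Servers": list(servers)}}


# Hypothetical task group name; only the Networks entry matters here.
job_fragment = {
    "TaskGroups": [
        {"Name": "example-group", "Networks": [network_with_dns()]},
    ]
}

print(json.dumps(job_fragment, indent=2))
```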

@mpass99 mpass99 self-assigned this Mar 28, 2024
Member Author

MrSerth commented Mar 29, 2024

Currently, we are in the process of enabling full IPv6 connectivity (between our internal hosts, but also from containers to the internet). As part of this setup, we might also need to configure our secure bridge to work with IPv6 (while excluding internal resources, probably by excluding the delegated /64 prefix).

Member Author

MrSerth commented Apr 3, 2024

Our latest changes work well and ensure we always have the desired DNS settings 💪
Unfortunately, however, they do not completely prevent allocations from losing their network. I was just able to reproduce the issue:

  1. Create a network-enabled execution environment
  2. Execute a network command, e.g., curl api.ipify.org
  3. Restart the Docker service on the respective Nomad host: sudo systemctl restart docker
  4. Try running the same network command again; it will fail.

This discovery might well be linked to hashicorp/nomad#19962, which already describes the issue. I would assume (without any confirmation yet) that this happened to us, too: when a new Docker release is available, we install it, which usually requires restarting the Docker service. As a consequence, the network could be lost.

I haven't fully checked whether the linked issue offers a reasonable workaround, but I am afraid the problem has not been completely solved yet.

@MrSerth MrSerth reopened this Apr 3, 2024
Contributor

mpass99 commented Apr 15, 2024

Good finding 💪 I'm glad I got to know this issue after the times we wondered if we are seeing the same issues 😄

The reasoning described in the issue seems plausible: when using CNI, Nomad handles the network (interfaces) instead of Docker. When we restart Docker, Nomad recreates the containers, but not the CNI network interfaces (on the host). The containers are then no longer able to establish network access.

As Nomad currently prioritizes this issue, it might be solved in the future. In the meantime we could create a check in our Nomad Agent Ansible playbook:

  1. Check if containers are running
  2. For each running container, check if the NetworkMode is none
  3. For one running, network-enabled container, check if curl api.ipify.org succeeds
  4. If not successful, restart the Nomad service
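The four steps above could be sketched as follows. This is only a rough outline under several assumptions: a container counts as network-enabled when its `NetworkMode` is anything other than `none`, `curl` exists inside the runner image, and the systemd unit is called `nomad`. The helper names are hypothetical, and the command runner is injectable so the logic can be tried without touching a real host.

```python
import subprocess


def sh(*args):
    """Run a command and return (exit code, stripped stdout)."""
    proc = subprocess.run(args, capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


def network_enabled_containers(run=sh):
    """Steps 1 and 2: running containers whose NetworkMode is not 'none'."""
    _, out = run("docker", "ps", "--quiet")
    enabled = []
    for container_id in out.split():
        _, mode = run("docker", "inspect", "--format",
                      "{{.HostConfig.NetworkMode}}", container_id)
        if mode != "none":  # assumption: 'none' marks runners without network
            enabled.append(container_id)
    return enabled


def check_and_heal(run=sh):
    """Steps 3 and 4: probe one container, restart Nomad if the probe fails."""
    enabled = network_enabled_containers(run)
    if not enabled:
        return "nothing-to-check"
    code, _ = run("docker", "exec", enabled[0], "curl", "--silent",
                  "--fail", "--max-time", "5", "api.ipify.org")
    if code == 0:
        return "healthy"
    run("systemctl", "restart", "nomad")  # assumed unit name
    return "restarted-nomad"
```

In an Ansible playbook, the same logic would more likely be expressed as tasks; the Python form is only meant to make the control flow explicit.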

Member Author

MrSerth commented Apr 16, 2024

> Good finding 💪 I'm glad I got to know this issue after the times we wondered if we are seeing the same issues 😄

We are continuously improving the service, so any change that increases reliability is warmly welcome 👍

> As Nomad currently prioritizes this issue, it might be solved in the future. In the meantime we could create a check in our Nomad Agent Ansible playbook: [...]

Yes, I would also go ahead with an interim solution of our own. In chats with my colleagues today, we discovered another potential solution: systemd. PartOf= was proposed, but maybe another option such as BindsTo= or Requires= works, too. Here is a comparison with several tests and a table that might be useful (so that we don't need to repeat that).

The idea would be to link Docker and Nomad, since this would automatically resolve the issue (at least the occurrences caused by Docker restarting). We could give it a try and observe the behavior. To override a systemd unit file, one can add a drop-in config (either manually by executing sudo systemctl edit foo.service or by placing the new settings in /etc/systemd/system/foo.service.d/override.conf).

Contributor

mpass99 commented Apr 25, 2024

Thank you for this alternative solution! It is less complicated and more reliable. Let's go with PartOf, as it restarts Nomad when Docker restarts (like Requires) but, unlike Requires, does not couple the starting of the two services.
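As a rough sketch of what such a drop-in could contain, the snippet below renders and writes an `override.conf` with a `PartOf=` line. The unit names `nomad.service` and `docker.service` and the helper names are assumptions for illustration; the PR that was eventually merged is not shown in this thread and may differ.

```python
from pathlib import Path


def render_dropin(depends_on="docker.service"):
    """Drop-in content that propagates stop/restart of `depends_on` to the unit."""
    return f"[Unit]\nPartOf={depends_on}\n"


def write_dropin(unit="nomad.service", root=Path("/etc/systemd/system"),
                 depends_on="docker.service"):
    """Write <root>/<unit>.d/override.conf and return its path."""
    dropin_dir = Path(root) / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / "override.conf"
    path.write_text(render_dropin(depends_on))
    return path
```

After placing the file, systemd only picks up the change once `sudo systemctl daemon-reload` has been run.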

Member Author

MrSerth commented Apr 26, 2024

Awesome, sounds great! I've merged (and deployed) the corresponding PR, and thus will close this issue for now.

@MrSerth MrSerth closed this as completed Apr 26, 2024