
Define Linux Network Devices #1271

Open

aojea wants to merge 1 commit into main from network-devices

Conversation

@aojea commented Nov 7, 2024

The proposed "netdevices" field provides a declarative way to specify which host network devices should be moved into a container's network namespace.

This approach is similar to the existing "devices" field used for block devices, but uses a dictionary keyed by the interface name instead.

The proposed scheme is based on the existing representation of a network device by `struct net_device`:
https://docs.kernel.org/networking/netdevices.html

This proposal focuses solely on moving existing network devices into the container namespace. It does not cover the complexities of network configuration or network interface creation, emphasizing the separation of device management and network configuration.

Fixes: #1239
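For illustration, a minimal `config.json` fragment using the proposed field might look like this (a sketch only: the interface names are made up, and the exact shape is whatever the spec ends up defining; here the host device `enp2s0f0` would be moved into the container's network namespace and renamed to `eth1`):

```json
{
  "linux": {
    "netDevices": {
      "enp2s0f0": {
        "name": "eth1"
      }
    }
  }
}
```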

@aojea (Author) commented Nov 7, 2024

/assign @samuelkarp

@AkihiroSuda (Member)

@aojea aojea force-pushed the network-devices branch 2 times, most recently from 51e5104 to 3a666eb Compare November 12, 2024 12:26
@aojea (Author) commented Nov 12, 2024

> https://github.com/opencontainers/runtime-spec/blob/main/features.md should be updated too

Updated and addressed the comments.

@aojea (Author) commented Nov 12, 2024

Action item for @aojea: document the cleanup and destruction of the network interfaces.
@samuelkarp (Member)

From the in-person discussion today:

  • Net device lifecycle should follow the network namespace lifecycle
  • @aojea will follow up to determine whether any cleanup actions need to be taken by the OCI runtime on a container being deleted
  • @kad was concerned about restarts and error handling
  • Should we prohibit the new netdev addition to an existing netns? IOW only allow this for containers where a new netns is created? What about containers where the root netns is used?

config-linux.md (outdated diff under review):

> This schema focuses solely on moving existing network devices identified by name into the container namespace. It does not cover the complexities of network device creation or network configuration, such as IP address assignment, routing, and DNS setup.
>
> **`netDevices`** (object, OPTIONAL) set of network devices that MUST be available in the container. The runtime is responsible for providing these devices; the underlying mechanism is implementation-defined.
Member:

This spec says "MUST", but I think it can't be done in a rootless container because the rootless container doesn't have CAP_NET_ADMIN, right?

Member:

I'm not sure we should take care of the rootless container.

Member:

Could be an error in the case of a rootless container, if the runtime is not able to satisfy the MUST condition.

Member:

> Could be an error in the case of a rootless container, if the runtime is not able to satisfy the MUST condition.

+1, but it'd be better to clarify it in the spec.

Author:

Added more explanations about the runtime and network device lifecycle and the runtime checks, PTAL.

@aojea (Author) commented Nov 19, 2024

> From the in-person discussion today:
>
>   • Net device lifecycle should follow the network namespace lifecycle
>   • @aojea will follow up to determine whether any cleanup actions need to be taken by the OCI runtime on a container being deleted
>   • @kad was concerned about restarts and error handling
>   • Should we prohibit the new netdev addition to an existing netns? IOW only allow this for containers where a new netns is created? What about containers where the root netns is used?

Pushed a new commit addressing those comments; the changelog is:

  • the network namespace lifecycle will move migratable network devices and destroy virtual devices; the runtime MAY decide to do cleanup actions
  • the runtime MUST check that the container has enough privileges and an associated network namespace, and fail if the check fails
  • removed the Mask field and use the Address field with CIDR notation (IP/Prefix) to deal with both IPv4 and IPv6 addresses. Only one IP is allowed to be specified, on purpose, to simplify the operations and reduce risks
  • added a HardwareAddress field for use cases that require setting a specific MAC or InfiniBand address (a sketch using these fields appears below)
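A hypothetical snippet showing how an entry with these fields might look at this stage of the proposal (the field names, casing, and values are illustrative only, not normative):

```json
{
  "linux": {
    "netDevices": {
      "eth0": {
        "name": "ctr0",
        "address": "192.168.1.10/24",
        "hardwareAddress": "aa:bb:cc:dd:ee:ff"
      }
    }
  }
}
```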

@AkihiroSuda requested a review from kolyshkin, January 29, 2025
@kolyshkin (Contributor) left a comment:

Left some more comments; to me it seems this won't be ready for 1.2.1 (or 1.3.0) if we want to cut a release soon.

@kolyshkin (Contributor)

> The key distinction here is between initialization and configuration. My proposal focuses on providing the bare minimum initialization required for basic network functionality within the container. This includes essential properties like the device name, MAC address, IP address, netmask, and MTU. These are the fundamental attributes that are necessary for the interface to be operational at all.
>
> The benefits I see with this initial setup in the runtime are that it reduces the complexity of integration and avoids potential race conditions where the application might try to use the network before it's properly initialized. It also allows reducing container size and complexity if we don't need to include scripts or logic to perform this basic network setup every time.
>
> I like to think of this as: the runtime provides a basic "network plumbing" framework, while the application running inside the container is free to customize and configure the network according to its specific needs using tools like ip.

To me it seems that a simple StartContainer hook (which runs in the container before the container's initial process is started) that calls ip (or, depending on the distro, ifconfig, etc.) with all the needed parameters should be sufficient for configuring any network device. Sure, that requires a working ip (or ifconfig, etc.) binary inside the container.

In case it's not possible to have the ip binary in the container, this can be achieved via a CreateRuntime hook and ip on the host.
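As a rough sketch of that approach (the hook path, script, and arguments below are hypothetical, not something defined by this PR): a `createRuntime` hook entry in `config.json` can point at a host-side script that reads the container state from stdin, resolves the container's network namespace path, and runs the host's `ip` inside that namespace (e.g. via `nsenter`):

```json
{
  "hooks": {
    "createRuntime": [
      {
        "path": "/usr/local/bin/netdev-setup.sh",
        "args": ["netdev-setup.sh", "eth1", "10.2.3.4/25"],
        "timeout": 30
      }
    ]
  }
}
```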

Or, it could be done on the host by higher-level orchestration tools just before container creation. Provided you can configure the interface while it's down, everything can be pre-configured, and the runtime's tasks would be limited to:

  • moving a network device into container's netns;
  • optional renaming;
  • bringing it up.

I'm also worried about feature creep here. The list of properties needed to be set will grow over time, needing more and more from the runtime implementation, making it even more bloated. Say, what about protocols other than IPv4 and IPv6? Various devices that need more specific properties? Static routing entries? The list can go on and on.

Having said that, I can agree that a way to move a network device to a container would be nice to have. But I strongly believe that configuring its properties is a bit out of scope for container runtime.

@aojea (Author) commented Jan 31, 2025

> Having said that, I can agree that a way to move a network device to a container would be nice to have. But I strongly believe that configuring its properties is a bit out of scope for container runtime.

I'm completely fine with that, and that is what I want and need for Kubernetes. The bootstrapping configuration options are very nice to have but not essential, and as you mentioned the scope was creeping during the review. So my question is whether all of you would be fine with just removing that part and focusing only on the "moving interfaces" part. @kad, you were very vocal about IP configuration, is this acceptable?

@aojea (Author) commented Jan 31, 2025

> Or, it could be done on the host by higher-level orchestration tools just before container creation. Provided you can configure the interface while it's down, everything can be pre-configured, and the runtime's tasks would be limited to:
>
>   • moving a network device into container's netns;
>   • optional renaming;
>   • bringing it up.
>
> I'm also worried about feature creep here. The list of properties needed to be set will grow over time, needing more and more from the runtime implementation, making it even more bloated. Say, what about protocols other than IPv4 and IPv6? Various devices that need more specific properties? Static routing entries? The list can go on and on.

@kolyshkin see the last commit, which removes all the network configuration options and leaves only the network interface plumbing in the proposal.

I think this is a good next step for everyone involved in the review, since it unblocks a lot of scenarios that today we need to work around with out-of-band mechanisms and a lot of red tape ... and it also doesn't block a future expansion in case someone wants to pursue the network initialization side. For Kubernetes, the feature in its current form is really beneficial, so I'd appreciate it if we can still consider it for the next release.

@kad (Contributor) commented Jan 31, 2025

> I'm completely fine with that, and that is what I want and need for Kubernetes. The bootstrapping configuration options are very nice to have but not essential, and as you mentioned the scope was creeping during the review. So my question is whether all of you would be fine with just removing that part and focusing only on the "moving interfaces" part. @kad, you were very vocal about IP configuration, is this acceptable?

We have a case where simply assigning a list of addresses is going to be helpful. Removing the dependency on an external hook would be very nice to have.

Speaking of hooks, I need to remind that hooks usually lead to contamination of the host OS (in addition to increasing the complexity of deploying hooks whose sole purpose would be to call ip addr add in a particular namespace) or of the container image. I understand the concern about "feature creep", but for things that are easily available via a few netlink API calls, I don't see any big reasons why it would be too complex to implement in the runtime. Especially if we start thinking outside of the runc/crun implementations, some other runtimes might implement "namespaces" differently (VMs, User-Mode Linux, etc.), so "moving interfaces" might have some specifics. I think the original proposal of having a minimal list of properties to configure was a good enough middle ground.

@kolyshkin (Contributor)

@aojea can you squash the commits? Otherwise LGTM, and this is definitely something we can build upon.

@aojea (Author) commented Feb 4, 2025

> @aojea can you squash the commits? Otherwise LGTM, and this is definitely something we can build upon.

@kolyshkin squashed, thanks

@kolyshkin (Contributor) left a comment:

LGTM except for one nit.

The proposed "netdevices" field provides a declarative way to
specify which host network devices should be moved into a container's
network namespace.

This approach is similar to the existing "devices" field used for block
devices, but uses a dictionary keyed by the interface name instead.

The proposed scheme is based on the existing representation of network
device by the `struct net_device`
https://docs.kernel.org/networking/netdevices.html.

This proposal focuses solely on moving existing network devices into
the container namespace. It does not cover the complexities of
network configuration or network interface creation, emphasizing the
separation of device management and network configuration.

Signed-off-by: Antonio Ojea <[email protected]>
@tianon (Member) commented Feb 5, 2025

> for things that are easily available via a few netlink API calls, I don't see any big reasons why it would be too complex to implement in the runtime

The problem in my mind is where do you draw the line? At what point does "a few netlink api calls" become "a few too many" and now it's no longer acceptable?

I agree with Kir that we should keep the scope really limited here so we don't accidentally stray too far into network management in the low level runtime, which is something we've traditionally left as a higher level concern for a lot of good reasons.

@kolyshkin (Contributor)

One more comment -- why is the data structure a map? A map makes sense when we want to quickly find an element based on a key. Is it needed here?

Perhaps a simple array is easier.

@aojea (Author) commented Feb 8, 2025

> One more comment -- why is the data structure a map? A map makes sense when we want to quickly find an element based on a key. Is it needed here?

Interface names are unique per namespace; a map guarantees this, avoiding user fat-finger problems or misconfiguration. Otherwise we would need to validate that the names in the array are unique.
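To make the trade-off concrete (the field names below are illustrative, not from the spec): with the map form, the host interface name is the key, so a second entry for the same device cannot be expressed as a distinct element:

```json
{
  "linux": {
    "netDevices": {
      "eth0": { "name": "ctr0" }
    }
  }
}
```

A hypothetical array form, by contrast, would need explicit validation to reject a duplicate like this:

```json
{
  "linux": {
    "netDevices": [
      { "name": "eth0", "ctName": "ctr0" },
      { "name": "eth0", "ctName": "ctr1" }
    ]
  }
}
```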

@kolyshkin (Contributor)

> Interface names are unique per namespace; a map guarantees this, avoiding user fat-finger problems or misconfiguration. Otherwise we would need to validate that the names in the array are unique.

I don't think the uniqueness requirement justifies using a map, and I don't think we should be concerned much about uniqueness. In case a network device is listed twice, the second entry will fail with "no such device", because the first one will already have been moved.

Also, we have other similar fields (Devices, Namespaces, Mounts, etc.) as arrays. I see that time namespace offsets are implemented via a map (and I think that was a mistake, for similar reasons -- an array would work just fine).

Having said that, this is not a deal breaker. Also, I'm sure other maintainers may have different opinions. WDYT @opencontainers/runtime-spec-maintainers ?

@pfl commented Feb 11, 2025

> I agree with Kir that we should keep the scope really limited here so we don't accidentally stray too far into network management in the low level runtime, which is something we've traditionally left as a higher level concern for a lot of good reasons.

With the IP configuration part now removed, am I correct that the "easy" solution for configuring IP networking for additional interfaces is still a custom runtime hook? In our case the containers are static and cannot be amended with IP configuration tools or the necessary networking capabilities, and I doubt we would be able to inject another container into the pod without causing a stir. IP (and, yes, route) configuration would have been a good and standard solution for simple secondary network needs.

@kad (Contributor) commented Feb 11, 2025

> for things that are easily available via a few netlink API calls, I don't see any big reasons why it would be too complex to implement in the runtime
>
> The problem in my mind is where do you draw the line? At what point does "a few netlink api calls" become "a few too many" and now it's no longer acceptable?

My "border line" is about the features and fields provided by the kernel netlink API. If something can be done with it, I don't see a big difference between the call "move interface xyz to namespace abc" vs. "set property foo to value bar on interface xyz in namespace abc". Or, for what it's worth, "put value 123 into cgroup foobar".
Things like DHCP that require more than a plain kernel API call are clearly outside of the boundaries of what can be implemented in a runtime.

"linux": {
"netDevices": {
"eth0": {
"name": 23
Contributor:

what is 23 here?

Author:

It's an integer to cause a bad config, since the schema expects a string here.

@kolyshkin (Contributor)

> for things that are easily available via a few netlink API calls, I don't see any big reasons why it would be too complex to implement in the runtime
>
> The problem in my mind is where do you draw the line? At what point does "a few netlink api calls" become "a few too many" and now it's no longer acceptable?
>
> My "border line" is about the features and fields provided by the kernel netlink API. If something can be done with it, I don't see a big difference between the call "move interface xyz to namespace abc" vs. "set property foo to value bar on interface xyz in namespace abc". Or, for what it's worth, "put value 123 into cgroup foobar". Things like DHCP that require more than a plain kernel API call are clearly outside of the boundaries of what can be implemented in a runtime.

To me, this sounds like a runtime should implement most of what ip(8) is capable of. Which is a lot.

If that's the case, here's a crazy idea.

Considering that ip(8) is a de facto standard on Linux, perhaps such configuration can be achieved by supplying a sequence of ip command arguments in runtime-spec. A runtime can then use the ip binary on the host to get to the needed configuration.

Something like

"linux": {
        "netDevices": [
            "name": "enp34s0u2u1u2",
            "ct_name": "eth0",
            "config": [
                "ip addr add 10.2.3.4/25 dev $IF",
                "ip link set $IF mtu 1350",
                "ip route add 10.10.0.0/16 via 10.2.3.99 dev $IF proto static metric 100"
                "ip route add default dev $IF"
            ]
        ]
    }

This will result in the runtime running the host's ip binary in the container's network namespace (after moving the device into the container's namespace and giving it a new name). It can even be a single ip invocation if -b (batch mode) is supported.

The upsides are:

  • this uses a well known, de facto standard "domain language" for network-related configuration;
  • this allows for a quite complex configuration to be performed (limited to whatever ip understands);
  • this seems trivial to implement for any runtime;
  • this doesn't impose any specific container image requirements (no need for any extra binaries etc.);
  • this should be secure (as ip is running inside a netns);
  • some advanced runtimes can implement a subset of this internally (i.e. without the call to ip);

The biggest downside, I guess, is the scope is not well defined, as ip is evolving.

@aojea (Author) commented Feb 12, 2025

I don't like the idea of exec'ing commands, especially in Go implementations, where goroutines and namespaces are problematic.
I think that once we have just the network interfaces in the runtimes, we'll get much better feedback and signal from the implementations to evaluate the next steps and to know whether something else is needed, and what exactly.

I still think that preventing failures, rather than detecting them at runtime, is desirable. You can argue that you should not add duplicates, but bugs exist, and if somebody accidentally adds a duplicate interface name it will cause a problem in production that we could have avoided simply by not allowing it to be defined.

@kad (Contributor) commented Feb 14, 2025

> Considering that ip(8) is a de facto standard on Linux, perhaps such configuration can be achieved by supplying a sequence of ip command arguments in runtime-spec. A runtime can then use the ip binary on the host to get to the needed configuration.

I believe that this is a really bad idea, compared to having a limited set of config options implemented. exec() is always slower and less reliable than calling native kernel APIs. It opens the door to all sorts of injection vulnerabilities and other potential misuses and scope creep, compared to a strict API subset. And, as Antonio mentioned, there is proper error handling to consider.

To provide more insight into our use case, here is a snippet of the code that we want to get rid of, currently implemented as an OCI runtime hook (btw, it already includes an unreliable call to ip): https://github.com/HabanaAI/habana-container-runtime/blob/main/cmd/habana-container-cli/netdevices.go#L131-L182


Successfully merging this pull request may close these issues:

  • Proposal: Network Devices (#1239)