
KNI [Kubernetes Networking Interface] Initial Draft KEP #4477

Closed · wants to merge 36 commits
Commits
2bede53
init of kni kep
MikeZappa87 Jan 11, 2024
14eeea2
update issue number
MikeZappa87 Jan 17, 2024
eae3341
WIP: KNI KEP
MikeZappa87 Jan 24, 2024
00209db
chore: remove shaneutt as approver
shaneutt Jan 25, 2024
9e9ef49
chore: add title to KEP
shaneutt Jan 25, 2024
eae3f0c
chore: first draft of a motivation section
shaneutt Jan 25, 2024
68738dd
Merge pull request #1 from shaneutt/kni-kep
MikeZappa87 Jan 26, 2024
9f215c6
Merge branch 'kubernetes:master' into KNI-KEP
MikeZappa87 Jan 26, 2024
217f1c3
change ordering of goals
MikeZappa87 Jan 26, 2024
64eca47
update goals and summary
MikeZappa87 Jan 27, 2024
664c2e0
update goals/non goals and notes
MikeZappa87 Jan 27, 2024
17a0fa6
Update keps/sig-network/4410-k8s-network-interface/README.md
MikeZappa87 Jan 30, 2024
d547f62
update with shane comments
MikeZappa87 Jan 30, 2024
f957158
Merge pull request #3 from MikeZappa87/zappa/v2
MikeZappa87 Jan 30, 2024
abc4210
add create network
MikeZappa87 Jan 30, 2024
cefc7c9
chore: cleanup template text and blank space
shaneutt Jan 30, 2024
a6e3c30
Merge pull request #4 from shaneutt/shaneutt/kni-cleanup-template
MikeZappa87 Jan 30, 2024
1f05981
support vm/kata
MikeZappa87 Jan 30, 2024
e770486
docs: another pass at the kni kep goals
shaneutt Jan 30, 2024
855d5e7
Merge pull request #5 from shaneutt/shaneutt/kni-goals-2
MikeZappa87 Jan 31, 2024
8a33b31
docs: add goal about Pod network ns APIs
shaneutt Feb 1, 2024
0c3fb89
docs: add a user story for network ns goals to KNI KEP
shaneutt Feb 1, 2024
325bbfc
Merge pull request #6 from shaneutt/patch-1
MikeZappa87 Feb 1, 2024
17baf99
update motivation
MikeZappa87 Feb 2, 2024
1bfd49b
Update keps/sig-network/4410-k8s-network-interface/kep.yaml
MikeZappa87 Feb 3, 2024
49e5614
update kep goals per discussions
MikeZappa87 Feb 7, 2024
34d21b7
update kep goals per discussions
MikeZappa87 Feb 7, 2024
d6d9a5c
update kep goals per discussions
MikeZappa87 Feb 7, 2024
82af8a0
Merge pull request #7 from MikeZappa87/update-kep
MikeZappa87 Feb 7, 2024
3177ee4
Update keps/sig-network/4410-k8s-network-interface/README.md
MikeZappa87 Feb 12, 2024
61281b5
Update keps/sig-network/4410-k8s-network-interface/README.md
MikeZappa87 Feb 14, 2024
1c3107b
update kep and temp remove user stories
MikeZappa87 Feb 15, 2024
2081e13
update goals
MikeZappa87 Feb 15, 2024
cd3f4b2
update goal
MikeZappa87 Feb 15, 2024
9d2ee29
docs: add options for KNI controllers to KNI KEP
shaneutt Feb 21, 2024
30d4804
Merge pull request #10 from shaneutt/shaneutt/kni-kep-alternatives-co…
MikeZappa87 Feb 22, 2024
128 changes: 128 additions & 0 deletions keps/sig-network/4410-k8s-network-interface/README.md
@@ -0,0 +1,128 @@
# KEP-4410: Kubernetes Networking reImagined

> **NOTE**: for the initial PR we've removed a lot of the templated text and
> aimed to keep this first iteration small and easier to consume. We are only
> focusing on the "What" and "Why" (e.g. motivation, goals, user stories) for
> this iteration so that we can build consensus on those first before we add
> any of the "How".

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories (Optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Story 2](#story-2)
<!-- /toc -->

## Summary

KNI, or Kubernetes Networking Interface, is an effort to take a second look at Kubernetes networking, evaluate the pain points, and identify what can be improved. At its core, KNI will be a foundational network API specific to Kubernetes that provides the flexibility and extensibility to solve both basic and the most advanced and opinionated networking use cases.

## Motivation

Kubernetes networking spans multiple layers of complexity, which has created several challenges and areas for improvement. These challenges include deployment of CNI plugins, troubleshooting networking issues, and development of new functionality.
Contributor: (linewrap! it's too hard to comment on lines that are too long)

Member: Yes, please. The current formatting has too many discussion points per line; it's difficult to follow and comment on.


Currently, networking happens in three layers of the stack: Kubernetes itself, by means of kube-proxy or another controller-based solution; the container runtime, with network namespace creation and CNI plugins; and the OCI runtime, which does additional network setup for kernel-isolated pods. All of this communication happens through APIs that are not network specific, which makes it hard for a reader of the code to determine where 'networking' is happening. Having networking in several layers also complicates troubleshooting, since one needs to check several areas, and some of them (such as the CNI execution logs) cannot be inspected via kubectl logs. The effort grows when multiple uncoordinated processes are making changes to the same resource: the network namespace of either the root or the pod. KNI aims to reduce this complexity by consolidating networking into a single layer and providing a uniform process for both namespaced and kernel-isolated pods through a gRPC API. Leveraging gRPC will allow users to migrate away from the execution model that CNI currently relies on.
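
To make the single-layer idea more concrete, below is a minimal sketch, written as a Go interface rather than the eventual protobuf definition, of what a consolidated KNI surface could look like. The names (`NetworkRuntime`, `AttachNetwork`, `DetachNetwork`, `QueryPodNetwork`) and the message fields are illustrative assumptions for discussion, not the proposed API.

```go
// Package kni sketches a hypothetical consolidated network runtime surface.
package kni

import "context"

// PodNetworkRequest identifies the pod whose networking should be set up or
// torn down, plus the network namespace the runtime created for it.
type PodNetworkRequest struct {
	PodName      string
	PodNamespace string
	PodUID       string
	NetNSPath    string
	Annotations  map[string]string
}

// Interface describes a single network interface attached to the pod.
type Interface struct {
	Name string
	IPs  []string // IPv4 and/or IPv6 addresses
}

// PodNetworkStatus reports what the network runtime configured for the pod.
type PodNetworkStatus struct {
	Interfaces []Interface
}

// NetworkRuntime is the node-local service a network plugin could implement
// instead of shipping CNI binaries on disk.
type NetworkRuntime interface {
	// AttachNetwork sets up pod networking before any containers start.
	AttachNetwork(ctx context.Context, req *PodNetworkRequest) (*PodNetworkStatus, error)
	// DetachNetwork tears down pod networking and releases IPAM allocations.
	DetachNetwork(ctx context.Context, req *PodNetworkRequest) error
	// QueryPodNetwork returns the current network state for a pod.
	QueryPodNetwork(ctx context.Context, req *PodNetworkRequest) (*PodNetworkStatus, error)
}
```

In such a model, a network plugin such as flannel, calico, or cilium would serve this interface from a long-running gRPC server rather than being exec'd as a CNI binary.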

Reviewer: What is meant by namespaced AND kernel-isolated pods? Isn't that the same thing?

Author (MikeZappa87): What I was trying to do here is separate the virtualized OCI runtimes from the non-virtualized OCI runtimes, i.e. kata vs. runc. However, we just got off a call with KubeVirt where you could use either. Both will leverage network namespaces; the virtualized cases have additional kernel isolation.

@bleggett (Feb 22, 2024): If both will use (at least) network namespacing, it's probably less confusing to just say that (and provide more detail later in the doc for other cases that use additional isolation, if need be).

Member: Nothing in this document outlines how adding a gRPC API will reduce layers versus CNI. Using gRPC will not inherently reduce layers versus RPC via exec. How specifically does KNI intend to reduce layers?


The next challenge is the deployment of the CNI plugins that provide the network setup and teardown for the pod. The idiomatic way to deploy a workload in Kubernetes is that everything should be in a pod; however, the current approach leaves files in the host filesystem, such as CNI binaries and CNI configuration. These files are usually downloaded via an init container of the network plugin pod after it is bound to a node, which increases the time for the pod to reach a running state. Since all existing K8s network plugins run as daemonsets, we will take this approach as well, with all the dependencies packaged into the container image, thus adopting a well-known approach. An added benefit is that network pod startup will be much faster, since nothing should need to be downloaded.
@bleggett (Feb 22, 2024): A key part of what current CNI provides is a framework for different vendors to independently add extensions (CNI plugins) to a node, and have guarantees about when in the lifecycle of pod creation those are invoked, and what context they have from other plugins.

There may be multiple daemonsets from multiple vendors installing multiple CNI plugins on the same node, or the node may come from a cloud provider with a CNI already installed and some other vendor might want to chain another extension onto that - any model we adopt must reckon with this as a first-class concern.

That, for me, is critical to retain for this to be a 1-1 replacement for the existing CNI - we can probably do something simpler than the current model of "kubelet talks to CRI, CRI impl reads shared node config with plugin list, serially execs the plugins as arbitrary privileged binaries", as well.

At the very least, moving that "list of extensions + CNI config" to an etcd-managed kube resource would fix currently non-fixable TOCTOU bugs we have in Istio, for instance: istio/istio#46848

At a minimum, it's important for me that any CNI/KNI extension support meets the following basic criteria:

- I am able to validate that KNI is "ready" on a node.
- I am able to subscribe or register an extension with KNI from an in-cluster daemonset, and have guarantees that TOCTOU errors will not silently unregister my extension.
- I have a guarantee that if the above things are true, my extension will be invoked for every pod creation after that point, and that if my extension (or any other extension) fails during invocation, the pod will not be scheduled on the node.
- I am able to get things like interfaces, netns paths, and assigned IPs for that pod, as context for my invocation.

Ideally, as a "this is a good enough replacement" test, I would want to see current CNI plugins from a few different projects implemented as KNI extensions and installed/managed on the same node. If we can do that, we are effectively proving this can be a well-scoped replacement for the status quo.

Reviewer:

> These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

This is confusing me a bit - typically the CNI config is something workload pods are not aware of at all, via init containers or otherwise - only the CRI implementation handles them?

Author (MikeZappa87): The pod I was referring to was the network plugin daemonset pods (flannel, calico, ...). I can try to clean this up to be more clear.

Reviewer: Ah ok, thanks - "node agent daemonset" or "privileged node-bound pod" or whatever. Something general and consistent, since I don't think it has to be an init container specifically, in the Pod sense.

Reviewer: Please see also #4477 (comment), and the follow-up comments.

Member:

> Since all existing K8s network plugins are running as daemonsets we will take this approach as well [...]

This is definitely not true...? CNI plugin binaries can be pre-installed on the host and run entirely on the host without a daemonset, and there are major cluster operators working this way. Through CRI, even the plumbing node IPAM can be done without any pod.

@bleggett (Feb 22, 2024): Yep, some environments bake the nodes and don't let you change them (and the corollary of that is you can't use other CNIs or meshes). Some do.

Either way, the current gate is "you need node privilege to install CNI plugins" - some environments have K8s policies that allow pods to assume node-level privileges, some do not and prebake everything. That k8s-based gate should be maintained for KNI, so either option works, depending on operator/environment need.

I don't think there's a world in which we can effectively separate "able to add new networking extensions" from "requires privileged access to the node". That's worth considering though.

If extensions are driven by K8s resources, it makes policy enforcement via operator-managed admission hooks potentially a bit easier, I guess.

Member:

> Either way, the current gate is "you need node privilege to install CNI plugins" [...]

I would expect that to be true, given the position they fill in the stack (!)

Author (MikeZappa87): How you deploy the CNI plugins can be opinionated. However, the important piece is that you no longer need to have any files in the host filesystem, such as CNI binaries or CNI configuration files.

Member:

> However, the piece that is important is that you no longer need to have any files in the host filesystem such as CNI binaries or CNI configuration files.

But why is that important? Taken to an extreme... you cannot run Kubernetes without files on the host filesystem anyhow, and configuring the pod network is an incredibly privileged place to operate.

@bleggett (Feb 22, 2024): It's important because allowing privileged pods to run on a node (where "privileged" necessarily means "can mutate node state", networking or otherwise) is an operator choice today, and it seems wrong to take that choice away from operators.

This is how, for instance, it is possible to install Cilium in AWS today. Or Calico in GKE. Or OpenShift, etc.

Anyone who doesn't want to allow privileged pods on a node can already choose to leverage existing K8s APIs to preclude that as a matter of operational policy - it's rather orthogonal to CNI.


Another area that KNI will improve is 'network readiness'. Currently the container runtime provides both network and runtime readiness via the Status CRI RPC, and it considers the network 'ready' based on the presence of CNI network configuration in the host file system. The more recent CNI specification does include a STATUS verb, but it is still bound by the current limitations: files on disk and the execution model. KNI will instead provide a readiness RPC that implementations can serve and that the kubelet will call via gRPC.
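
As a rough illustration of readiness driven by an RPC rather than by files on disk, the sketch below shows how a kubelet-side check might consume such an RPC. The `NetworkReady` method name and its return shape are assumptions, not a proposed API.

```go
// Illustrative sketch only: readiness asked of the network runtime over gRPC
// instead of inferred from CNI config files on disk.
package kni

import "context"

// ReadinessClient is the minimal surface the kubelet would need.
type ReadinessClient interface {
	// NetworkReady returns whether the node network is ready and a reason.
	NetworkReady(ctx context.Context) (ready bool, reason string, err error)
}

// NodeNetworkReady reports node network readiness, replacing the
// "CNI config file exists on disk" heuristic used today.
func NodeNetworkReady(ctx context.Context, c ReadinessClient) (bool, string) {
	ready, reason, err := c.NetworkReady(ctx)
	if err != nil {
		// Treat an unreachable network runtime as not ready.
		return false, "network runtime unreachable: " + err.Error()
	}
	return ready, reason
}
```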

KNI also aims to help the community and other proposals in the Kubernetes ecosystem by providing the necessary information via the gRPC service. KNI should be the API that answers “what networks are available on this node” so that another effort can make the kube-scheduler aware of networks. We should also expose IPAM status: a common issue is that IPAM runs out of assignable IP addresses, and pods can no longer be scheduled on that node until someone intervenes. We should provide visibility into this so that we can indicate “no more pods”, since setting the node to NotReady would evict the healthy pods. While a future iteration of KNI could propose changes to the kube-scheduler, that is not part of our initial work; instead we should assist other efforts such as DRA and device plugins by providing the information they need.
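
As a purely illustrative sketch, the snippet below shows the kind of IPAM summary a KNI runtime could report so the node can advertise "no more pods" without being marked NotReady; the type and field names are assumptions.

```go
// Sketch of an IPAM status report a KNI runtime could expose to the kubelet.
package kni

// IPAMStatus summarizes pod address availability on the node.
type IPAMStatus struct {
	TotalIPs     int // addresses available in the pod range(s) assigned to this node
	AllocatedIPs int // addresses currently handed out to pods
}

// Exhausted reports whether new pods can still receive an address; a true
// value would translate into a "no more pods" signal rather than NotReady.
func (s IPAMStatus) Exhausted() bool {
	return s.AllocatedIPs >= s.TotalIPs
}
```
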
Member:

> We should also provide IPAM status as a common issue, is that the IPAM runs out of assignable IP addresses and pods are no longer able to be scheduled on that node until intervention.

Typically this is handled by coordinating the assigned IP range and the max-pods setting in the kubelet. Cluster operators already have the tools to prevent this issue; why would you allow a node to be configured to have more pods than IPs?


The community may ask for more features, since we are taking a bold approach to reimagining Kubernetes networking by reducing the number of layers involved. We should prioritize feature parity with the current CNI model and then capture future work. KNI aims to be the foundational network API specific to Kubernetes; it should make troubleshooting easier, make deployment more friendly, and enable faster innovation while reducing the need to make changes to core Kubernetes.
Member:

> The community may ask for more features, as we are taking a bold approach to reimagining Kubernetes networking by reducing the amount of layers involved in networking.

Reducing the layers... by adding a new gRPC service? Why isn't this CRI? (An existing Kubernetes-specific gRPC service for Pods, which includes some networking-related bits today...)

Author (MikeZappa87): Adding this to the CRI-API is a design detail. The only piece of information relevant to networking is the PodIP coming back in PodSandboxStatus. This talks about eliminating the network setup/teardown and netns creation from the container/OCI runtimes.

Member:

> This talks about eliminating the network setup/teardown and netns creation from the container/oci runtimes.

Has this been a significant obstacle for implementers? Examples? What blocks a CNI revision from handling netns/teardown?


### Goals

- Design a cool looking t-shirt
- Provide an RPC for the attachment and detachment of interface[s] for a Pod
Member: This seems like a significant departure from current expectations around pod networking.

Author (MikeZappa87): Can you clarify this? We already do something similar with CNI ADD/DEL with an execution model. This is leveraging gRPC to communicate with the gRPC server, which would be flannel, calico, or cilium.

- Provide an RPC for querying Pod network information (interfaces, network namespace path, IP addresses, routes, ...)
- Provide an RPC to prevent additional scheduling of pods if IPAM is out of IP addresses, without evicting running pods
Member: Surely this is just reporting status up through CRI?

Author (MikeZappa87): Last place I mention this: the decision around CRI or a new API is a design detail.

- Provide an RPC to indicate network readiness for the node (no more CNI network configuration files in the host file system)
- Provide an RPC that gives the user the ability to query what networks are on a node
- Consolidate K8s networking to a single layer without involving the container/OCI runtimes
- KNI should provide the RPCs required to establish feature parity with current CNI [ADD, DEL]
- Provide documentation, examples, troubleshooting guides, and FAQs for KNI
- Decouple the Pod and Node network setup
- Provide garbage collection to ensure no resources created during pod setup (such as Linux bridges, eBPF programs, or
  allocated IP addresses) are left behind after pod deletion
- Improve the current IP handling for pods (PodIP) to handle multiple IP addresses and
  a field to identify the IP address family (IPv4 vs IPv6)

Member: Which is not possible in CNI because ...?
- Provide backwards compatibility for the existing CNI approach and a migration path to fully adopt KNI
- Provide a uniform approach for network setup/teardown for both virtualized (kata) and non-virtualized (runc)
  runtimes, including KubeVirt. This could eliminate the high- and low-level runtimes from the networking path
- Provide a reference implementation of the KNI network runtime
- Provide the ability to have all the dependencies packaged in the container image (no more CNI binaries in the host file system)
  - No more downloading CNI binaries via initContainers or mounting /etc/cni/net.d or /opt/cni/bin
Member: ... Downloading a container image to run the gRPC service is better how? Currently a cluster operator can just pre-populate these instead of relying on a daemonset; it's pretty straightforward and results in very fast startup.

@bleggett (Feb 22, 2024): That's valid, I think - there needs to be a story for supporting pre-baked/pre-provisioned "locked down" nodes. That could be as simple as shipping node images with KNI as a system service, or shipping a node with a preloaded image - but it warrants discussion.

Author (MikeZappa87, Feb 22, 2024): This doesn't make too much sense, since calico, flannel, and cilium are already running as daemonset pods, and since they would be implementing the KNI gRPC service they would no longer need CNI binaries on disk. Are you looking for a migration path here? This becomes a problem solved with KNI.

@bleggett (Mar 4, 2024): There are 2 (somewhat edge-case but still valid) operator questions that can be explicitly answered here.

  1. I am an operator and I want to ship nodes with a specific CNI impl and not allow anyone downstream to change that by deploying privileged pods - how?

  2. I am an operator, and I want to set up airgapped (or at least "fully preloaded") clusters that don't require dynamically fetching container image bytes on node spinup to become functional for workload deployment.

Both of these can be (and are today) solved with existing K8S primitives and patterns, with or without KNI, so I don't think KNI makes this materially more difficult.

The KNI solution to (1) would be the same as CNI - don't allow privileged pods on your custom nodes. KNI might give you a bit more flexibility here by allowing things like admission webhooks for KNI config, while still allowing privileged pods, etc., which are not possible with the current out-of-K8S CNI config model.

The KNI solution to (2) would be the same solution you would employ today to ship any node-required daemonset you didn't want to pull on every node provision (yep, like cilium/flannel/calico/whatever) -> preload the images on the node, or run them as system services and not privileged containers.

- Provide the ability to use native K8s resources for configuration, such as ConfigMaps, instead of configuration files in the host file system
- Eliminate the need to exec binaries, replacing it with gRPC
- Make troubleshooting easier by having network runtime logs accessible via kubectl logs
- Improve network pod startup time

Reviewer: TODO: Add a goal of having the pod object available at the network runtime.

Author (MikeZappa87): @dougbtv I am drafting an update, so I might be able to get this in. Do you have specific items you want off the Pod spec? Metadata (name, namespace, labels, annotations, ...)?

Reviewer: Metadata nails it, thanks. At least, I'm most interested in getting everything you listed. Potentially someone might want more?

@BlaineEXE (Feb 12, 2024): I'd like to request CIDRs as an available piece of metadata if possible. That would be great for legacy applications (e.g., Ceph) that use CIDR configurations as config values.

Author (MikeZappa87): @BlaineEXE this sounds reasonable. I notice you are in Colorado; I am in the Boulder area. Does the application need the pod CIDR? You might be able to infer this via the Pod IP.

Member:

> I can offer up our use case (attaching pods to raw devices, potentially with macvlan) as something that seems worthwhile for KNI to consider.

That is exactly what we need, these experiences, and this one is something I've identified in multiple places: attach netdevices to pods. So I feel this is a strong use case... What I also see is that these interfaces are used as "external" networks that are only relevant to the app running on the specific pod, so I don't feel that the IPs from these interfaces should be represented in the Kubernetes topology...

Member: @BlaineEXE I'm still trying to fully understand your use case. Based on your comments, it seems you need some prior work setting up the infrastructure and the VLANs. Who configures these VLANs? Are these VLANs configured on all these hosts? ...

Reviewer: Certainly someone must configure the additional hardware. In practice, an admin must add a separate switch (or create a VLAN on an existing switch) that connects to a different interface on the host systems. So if eth0 underpins the k8s pod network, eth1 may be the result of the additional interface, unused by k8s itself.

Our current deployment strategy leverages Multus and NetworkAttachmentDefinitions to connect storage (Ceph) pods to eth1. We recommend CNI=macvlan (lower latency than bridge) with IPAM=whereabouts (ease of use), but users can practically use whatever CNI/IPAM they like.

This does work, but because Multus is such a complex feature to understand, users often seem lost trying to configure an already-complex storage system with NADs. In addition, there is developer complexity, and there are friction points -- like not being able to get a Service with a static IP on a Multus network.

@aojea (Feb 19, 2024):

> like not being able to get a Service with a static IP on a Multus network.

What do you mean by a Multus network? Some entity that is connected to the additional interface? I.e., if eth1 is connected to an external VLAN, some compute or host on that network?

Member: @BlaineEXE @dougbtv I want to understand this use case well; I don't quite get what the expectations are and what problem we are trying to solve. Kubernetes definitively cannot manage the infra (connecting switches or creating VLANs); there are other projects that cover that area... but the part about a Service with a static IP on a Multus network is the one I want to understand. The docs are not clear (https://github.com/rook/rook/blob/master/design/ceph/multus-network.md) - does it mean making services available outside of the cluster?

### Non-Goals

1. Any changes to the kube-scheduler
2. Any specific implementation other than the reference implementation. However, we should ensure the KNI API is flexible enough to support other implementations

## Proposal

This KEP proposes to design and implement the KNI API and make the necessary changes to the CRI API and container runtimes. The scope should be kept to a minimum, and we should target feature parity with the current CNI model.
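
To illustrate the intended sequencing (and the constraint listed under Notes/Constraints/Caveats below, that the pod interface is set up and healthy before any containers start), the sketch below orders sandbox creation, network attachment, and container start. The interface and function names are assumptions rather than proposed APIs.

```go
// Sketch of pod setup handing off to a network runtime before containers start.
package kni

import (
	"context"
	"fmt"
)

// sandboxRuntime stands in for the container runtime calls the kubelet makes today.
type sandboxRuntime interface {
	CreateSandbox(ctx context.Context, podUID string) (netnsPath string, err error)
	StartContainers(ctx context.Context, podUID string) error
}

// networkAttacher stands in for the KNI attach RPC (name assumed).
type networkAttacher interface {
	AttachNetwork(ctx context.Context, podUID, netnsPath string) error
}

// SetUpPod creates the sandbox, attaches networking, and only then starts the
// pod's containers, so the interface exists before ephemeral, init, or regular
// containers run.
func SetUpPod(ctx context.Context, rt sandboxRuntime, net networkAttacher, podUID string) error {
	netns, err := rt.CreateSandbox(ctx, podUID)
	if err != nil {
		return fmt.Errorf("create sandbox: %w", err)
	}
	if err := net.AttachNetwork(ctx, podUID, netns); err != nil {
		return fmt.Errorf("attach network: %w", err)
	}
	return rt.StartContainers(ctx, podUID)
}
```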

### User Stories

We are constantly adding user stories; please join the community sync to discuss them.
Member:

> We are constantly adding these user stories [...]

Where?


### Notes/Constraints/Caveats

#### Constraints

1. Guarantee the pod interface is set up and in a healthy state before containers are started (ephemeral, init, regular)

#### Notes

Additional Information/Diagrams: https://docs.google.com/document/d/1Gz7iNtJNMI-zKJhaOcI3aflPCx3etJ01JMxzbtvruKk/edit?usp=sharing
Member: Anything relevant to the KEP should be in-lined and committed to the repo.


Changes to the pod specification will require hard evidence.

The specifics of "Network Readiness" are an implementation detail. We need to provide this RPC to the user.

Since the network runtime can run separately from the container runtime, everything can be packaged into a pod without needing binaries on disk. This allows the CNI plugins to be isolated in the pod, and the pod never needs to mount /opt/cni/bin or /etc/cni/net.d. This potentially offers more control over execution. Keep in mind that CNI is the implementation; when it is used, chaining is still available.
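
As one possible shape for this (an assumption, not a prescribed deployment), a network runtime shipped as a daemonset pod could serve the KNI gRPC service over a unix socket shared with the kubelet via a hostPath mount. The socket path and the commented-out service registration below are placeholders.

```go
// Sketch of a network runtime pod exposing a gRPC endpoint over a unix socket.
package main

import (
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
)

// Assumed socket path, shared with the kubelet via a hostPath volume.
const socketPath = "/var/run/kni/kni.sock"

func main() {
	// Remove a stale socket left behind by a previous pod instance.
	_ = os.Remove(socketPath)

	lis, err := net.Listen("unix", socketPath)
	if err != nil {
		log.Fatalf("listen on %s: %v", socketPath, err)
	}

	srv := grpc.NewServer()
	// Register the generated KNI service here once it exists, e.g.:
	// kniv1.RegisterNetworkRuntimeServer(srv, &myRuntime{}) // hypothetical stub

	log.Printf("KNI network runtime listening on %s", socketPath)
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```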

## Ongoing Considerations
Member: There doesn't seem to be any discussion of how we might improve CNI or CRI instead, and why that isn't sufficient versus this entirely new API and RPC service.

Author (MikeZappa87): This could live in the CRI-API as multiple services; no one has indicated that we must use a new API. However, CNI 2.0 being closer to K8s has been talked about for years now. This is that effort.

@bleggett (Mar 4, 2024): I do agree we could probably be more clear up front in the KEP about why the current CNI model (slurping out-of-k8s config files from well-known paths, TOCTOU errors, telling CRI implementations via out-of-band mechanisms to exec random binaries on the node by-proxy) is something we could explicitly improve on with KNI, and that KNI is basically "CNI 2.0" - it is the proposed improvement to current CNI.


### KNI Implementations PULL instead of PUSH?

The original KNI POC provides a gRPC API for callbacks which (in the POC) are
added to the Kubelet during `Pod` tasks to call out to the KNI implementation to
get `Pod` networking configured. This is pretty straightforward, and the
initial POC actually showed very good performance characteristics, but it has
a couple of potential downsides:

1. the synchronous nature of callbacks makes it harder to avoid deadlocks
2. in some extremely heavy use cases with lots of `Pods` rapidly deploying and
tearing down, this could be a potential scalability bottleneck.

Additionally, we intend to create Kubernetes APIs for networks and their
configurations, which means that the Kubelet and other components would operate
as something of a middleman, consuming Kubernetes APIs via watch mechanisms,
converting them to gRPC calls, and then being responsible for the status,
etc.

As such, we've been actively considering whether it might make sense for at
least some of the functionality in KNI (such as domain/namespace
creation/deletion, and interface attachment/detachment) to be done by the KNI
implementation directly via the KNI Kubernetes APIs.

For a simplified example, a KNI implementation might watch `Pods` and wait for
the kubelet to reach a state (via the status) where it indicates it is ready to
hand off for network setup. The KNI implementation does its work and then
updates the `Pod` to indicate that network setup is complete, after which the
`Pod`'s containers are created.
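
A rough sketch of this pull model is shown below, assuming the KNI implementation watches `Pods` itself and reacts when the kubelet signals readiness for network setup via a Pod condition; the condition type name and the hand-off mechanism are invented here for illustration.

```go
// Sketch of a "pull" KNI implementation watching Pods for a hand-off signal.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// AwaitingNetwork is an assumed Pod condition type the kubelet would set once
// the sandbox exists and networking can be attached.
const AwaitingNetwork corev1.PodConditionType = "network.k8s.io/AwaitingSetup"

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	w, err := client.CoreV1().Pods("").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, c := range pod.Status.Conditions {
			if c.Type == AwaitingNetwork && c.Status == corev1.ConditionTrue {
				// Attach interfaces here, then update the Pod status so its
				// containers can be created (status update omitted for brevity).
				log.Printf("would set up networking for %s/%s", pod.Namespace, pod.Name)
			}
		}
	}
}
```

In practice an informer with resync would be preferable to a raw watch, but the shape of the hand-off is the same.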

There are downsides to this approach as well, one in particular being that it
makes the provision of hookpoints for the KNI a lot more complicated for the
implementations. For now we've added this to our Ongoing Considerations section
as something to come back to, discuss, and review.
48 changes: 48 additions & 0 deletions keps/sig-network/4410-k8s-network-interface/kep.yaml
@@ -0,0 +1,48 @@
title: k8s-network-interface
kep-number: 4410
authors:
- "@mikezappa87"
- "@shaneutt"
owning-sig: sig-network
participating-sigs:
- sig-network
Member: Surely at least SIG Node should be participating (there's no way this doesn't affect kubelet, CRI)...? I would also tag Cluster Lifecycle at least as FYI/advisory, since cluster lifecycle folks will know about and have suggestions re: node readiness and cluster configuration.

status: provisional
creation-date: 2024-01-11
reviewers:
- "@aojea"
- "@danwinship"
- "@thockin"
approvers:

see-also:
- "/keps/sig-aaa/1234-we-heard-you-like-keps"
Member: Delete these, or update with relevant KEPs? Same below for replaces.

- "/keps/sig-bbb/2345-everyone-gets-a-kep"
replaces:
- "/keps/sig-ccc/3456-replaced-kep"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.30"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.31"
beta: "v1.32"
stable: "v1.33"
Member: This seems unlikely, without even defining an API yet?


# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: kni
components:
- kubelet
- cri-api
disable-supported: true

# The following PRR answers are required at beta release
metrics:
- my_feature_metric