KNI [Kubernetes Networking Interface] Initial Draft KEP #4477
(linewrap! it's too hard to comment on lines that are too long)
Yes, please. The current formatting has too many discussion points per line, it's difficult to follow and comment on.
What is meant by "namespaced AND kernel-isolated pods"? Isn't that the same thing?
What I was trying to do here is separate the virtualized OCI runtimes from the non-virtualized ones, i.e. Kata vs. runc. However, we just got off a call with KubeVirt, where you could use either. Both will leverage network namespaces, but the virtualized cases have the additional kernel isolation.
If both will use (at least) network namespacing, it's probably less confusing to just say that (and provide more detail later in the doc for other cases that use additional isolation, if need be).
Nothing in this document outlines how adding a gRPC API will reduce layers versus CNI.
Using gRPC will not inherently reduce layers versus RPC via exec.
How specifically does KNI intend to reduce layers?
A key part of what current CNI provides is a framework for different vendors to independently add extensions (CNI plugins) to a node, and have guarantees about when in the lifecycle of pod creation those are invoked, and what context they have from other plugins.
There may be multiple daemonsets from multiple vendors installing multiple CNI plugins on the same node, or the node may come from a cloud provider with a CNI already installed that some other vendor wants to chain another extension onto - any model we adopt must reckon with this as a first-class concern.
That, for me, is critical to retain for this to be a 1-1 replacement to the existing CNI - we can probably do something simpler than the current model of "kubelet talks to CRI, CRI impl reads shared node config with plugin list, serially execs the plugins as arbitrary privileged binaries", as well.
At the very least, moving that "list of extensions + CNI config" to an etcd-managed kube resource would fix currently non-fixable TOCTOU bugs we have in Istio, for instance: istio/istio#46848
At a minimum, it's important for me that any CNI/KNI extension support meets the following basic criteria:
Ideally as a "this is a good enough replacement" test, I would want to see current CNI plugins from a few different projects implemented as KNI extensions, and installed/managed on the same node. If we can do that, we are effectively proving this can be a well-scoped replacement for the status quo.
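To make the "etcd-managed kube resource holding the list of extensions + config" idea above concrete, here is a rough, purely hypothetical sketch of what such a cluster-scoped API type could look like in Go. Nothing in the KEP defines this yet; the type and field names (NodeNetworkConfig, NetworkExtension, Endpoint, Config) are illustrative assumptions only.

```go
// Hypothetical sketch only: a cluster-scoped resource that replaces the
// on-disk CNI config/plugin list with an API object, so admission webhooks
// can validate it and the TOCTOU window of file-based config goes away.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NodeNetworkConfig lists the network extensions to invoke for pods on a
// node (or set of nodes), in order. Placeholder name and fields.
type NodeNetworkConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec NodeNetworkConfigSpec `json:"spec"`
}

type NodeNetworkConfigSpec struct {
	// NodeSelector picks which nodes this config applies to.
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`

	// Extensions are invoked in order during pod network setup,
	// analogous to today's CNI plugin chain.
	Extensions []NetworkExtension `json:"extensions"`
}

type NetworkExtension struct {
	// Name identifies the extension (e.g. the vendor plugin).
	Name string `json:"name"`

	// Endpoint is the gRPC endpoint (e.g. a unix socket) the runtime
	// dials instead of exec'ing a binary from the host filesystem.
	Endpoint string `json:"endpoint"`

	// Config is opaque, vendor-specific configuration, replacing the
	// per-plugin JSON stanza in today's CNI conflist.
	Config string `json:"config,omitempty"`
}
```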
This is confusing me a bit - typically the CNI config is something workload pods are not aware of at all, via init containers or otherwise - only the CRI implementation handles them?
The pods I was referring to were the network plugin daemonset pods (flannel, calico, ...). I can try to clean this up to be clearer.
Ah ok thanks, yeah - "node agent daemonset" or "privileged node-bound pod" or whatever. Something general and consistent, since I don't think it has to be an init container specifically, in the Pod sense.
Please see also #4477 (comment), and the followup comments
This is definitely not true ...?
CNI plugin binaries can be pre-installed to the host and run entirely on the host without a daemonset and there are major cluster operators working this way.
Through CRI, even the plumbing node IPAM can be done without any pod.
Yep, some environments bake the nodes and don't let you change them (and the corollary of that is you can't use other CNIs or meshes). Some do.
Either way, the current gate is "you need node privilege to install CNI plugins" - some envs have K8S policies that allow pods to assume node-level privileges, some do not and prebake everything. That k8s-based gate should be maintained for KNI, so either option works, depending on operator/environment need.
I don't think there's a world in which we can effectively separate "able to add new networking extensions" from "require privileged access to the node". That's worth considering tho.
If extensions are driven by K8S resources it makes policy enforcement via operator-managed admission hooks potentially a bit easier, I guess.
I would expect that to be true, given the position they fill in the stack (!)
How you deploy the CNI plugins can be opinionated. However, the important piece is that you no longer need any files on the host filesystem, such as CNI binaries or CNI configuration files.
But why is that important? I mean taken to an extreme ... You cannot run Kubernetes without files on the host filesystem anyhow, and configuring the pod network is an incredibly privileged place to operate.
It's important because allowing privileged pods to run on a node (where privileged necessarily means "can mutate node state", networking or otherwise) is an operator choice today, and it seems wrong to take that choice away from operators.
This is how, for instance, it is possible to install Cilium in AWS today. Or Calico in GKE. Or Openshift, etc etc.
Anyone that doesn't want to allow privileged pods on a node can already choose to leverage existing K8S APIs to preclude that as a matter of operational policy - it's rather orthogonal to CNI.
Typically this is handled by coordinating the assigned IP range and the max-pods setting in the kubelet; for example, a /24 pod CIDR per node (~254 usable addresses) comfortably covers the default max of 110 pods. Cluster operators already have the tools to prevent this issue - why would you allow a node to be configured to have more pods than IPs?
Reducing the layers ... by adding a new GRPC service? Why isn't this CRI? (An existing Kubernetes-Specific gRPC service for Pods, which includes some networking related bits today ...)
Adding this to the CRI-API is a design detail. The only piece of networking-relevant information today is the pod IP coming back in PodSandboxStatus. This talks about eliminating the network setup/teardown and netns creation from the container/OCI runtimes.
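For context on how little networking information CRI surfaces today, here is a minimal sketch of reading the pod IP(s) from PodSandboxStatus over the CRI gRPC API. The containerd socket path and the sandbox ID are illustrative assumptions, and error handling is trimmed.

```go
// Sketch: reading the only networking data CRI exposes today - the pod IPs
// in PodSandboxStatus. Socket path and sandbox ID are illustrative.
package main

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	resp, err := client.PodSandboxStatus(context.Background(),
		&runtimeapi.PodSandboxStatusRequest{PodSandboxId: "<sandbox-id>"})
	if err != nil {
		log.Fatal(err)
	}

	// Primary IP plus any additional IPs (e.g. dual-stack).
	net := resp.GetStatus().GetNetwork()
	fmt.Println("pod IP:", net.GetIp())
	for _, ip := range net.GetAdditionalIps() {
		fmt.Println("additional IP:", ip.GetIp())
	}
}
```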
Has this been a significant obstacle for implementers? Examples?
What blocks a CNI revision from handling netns/teardown?
This seems like a significant departure from current expectations around pod networking.
Can you clarify this? We already do something similar with CNI ADD/DEL in an execution model. This is leveraging gRPC to communicate with a gRPC server that would be flannel, calico, or cilium.
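As a purely illustrative sketch of the analogy being drawn here (the KEP has not defined the actual API; every name below is a hypothetical placeholder), the CNI ADD/DEL pair could map onto a gRPC service implemented by the network plugin daemonset:

```go
// Hypothetical sketch only - the KNI KEP has not defined this API.
// It simply restates CNI ADD/DEL as gRPC calls served by the network
// plugin (flannel, calico, cilium, ...) instead of binaries exec'd
// from the host filesystem.
package kni

import "context"

// AttachNetworkRequest mirrors what a CNI ADD receives today.
type AttachNetworkRequest struct {
	PodSandboxId string
	NetnsPath    string // netns created by the runtime (or by the network runtime)
}

// AttachNetworkResponse reports the result, analogous to a CNI result type.
type AttachNetworkResponse struct {
	IPs []string // addresses assigned to the pod's interface(s)
}

// DetachNetworkRequest mirrors a CNI DEL.
type DetachNetworkRequest struct {
	PodSandboxId string
	NetnsPath    string
}

// NetworkRuntimeService is the interface a KNI gRPC server would expose;
// the container runtime dials it over a local socket during pod setup
// and teardown.
type NetworkRuntimeService interface {
	AttachNetwork(ctx context.Context, req *AttachNetworkRequest) (*AttachNetworkResponse, error)
	DetachNetwork(ctx context.Context, req *DetachNetworkRequest) error
}
```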
Surely this is just reporting status up through CRI?
As I've mentioned elsewhere, the decision around using CRI or a new API is a design detail.
Which is not possible in CNI because ...?
... Downloading a container image to run the gRPC service is better how?
Currently a cluster operator can just pre-populate these instead of relying on a daemonset, it's pretty straightforward and results in very fast startup.
That's valid, I think - there needs to be a story for supporting pre-baked/pre-provisioned "locked down" nodes.
That could be as simple as shipping node images with KNI as a system service, or shipping a node with a preloaded image - but warrants discussion.
This doesn't make too much sense to me, since calico, flannel, and cilium are already running as daemonset pods. They would be implementing the KNI gRPC service and would no longer need CNI binaries on disk. Are you looking for a migration path here? This becomes a problem solved by KNI.
There are two (somewhat edge-case but still valid) operator questions that can be explicitly answered here.
1. I am an operator and I want to ship nodes with a specific CNI impl and not allow anyone downstream to change that by deploying privileged pods - how?
2. I am an operator, and I want to set up airgapped (or at least "fully preloaded") clusters that don't require dynamically fetching container image bytes on node spinup to become functional for workload deployment.
Both of these can be (and are today) solved with existing K8S primitives and patterns, with or without KNI, so I don't think KNI makes this materially more difficult.
The KNI solution to (1) would be the same as CNI - don't allow privileged pods on your custom nodes. KNI might give you a bit more flexibility here by allowing things like admission webhooks for KNI config, while still allowing privileged pods, etc that are not possible with the current out-of-K8S CNI config model.
The KNI solution to (2) would be the same solution you would employ today to ship any node-required daemonset you didn't want to pull on every node provision (yep, like cilium/flannel/calico/whatever) -> preload the images on the node, or run them as system services and not privileged containers.
TODO: Add goal of having the pod object available at network runtime
@dougbtv I am drafting an update. So I might be able to get this. Do you have specific items you want off the Pod spec? Metadata (name, namespace, labels, annotations, ... )
Metadata nails it, thanks. At least I'm most interested in getting all you listed. Potentially someone might want more?
I'd like to request CIDRs as an available piece of metadata if possible. That would be great for legacy applications (e.g., Ceph) that use CIDR configurations as config values.
@BlaineEXE this sounds reasonable. I notice you are in Colorado; I am in the Boulder area. Does the application need the pod CIDR? You might be able to infer this via the Pod IP.
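Pulling this sub-thread together, a hypothetical shape for the pod context handed to the network runtime might look like the sketch below, which could ride along on the attach request sketched earlier. None of these field names are defined by the KEP; the CIDR fields reflect the request above and are assumptions.

```go
// Hypothetical sketch of the pod context a KNI attach call could carry.
// Field names are illustrative only.
package kni

// PodContext is the metadata the runtime already knows about the pod and
// could pass to the network runtime at attach time.
type PodContext struct {
	Name        string
	Namespace   string
	Labels      map[string]string
	Annotations map[string]string

	// CIDR information requested above for legacy applications (e.g. Ceph)
	// that take CIDRs as configuration values rather than inferring them
	// from the pod IP.
	PodCIDRs     []string // the node's pod CIDR range(s)
	ServiceCIDRs []string
}
```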
That is exactly what we need - these experiences. This one is something I've identified in multiple places (attach netdevices to pods), so I feel this is a strong use case ... What I also see is that these interfaces are used as "external" networks that are only relevant to the app running on the specific pod, so I don't feel that the IPs from these interfaces should be represented in the Kubernetes topology ...
@BlaineEXE I'm still trying to fully understand your use case. Based on your comments it seems you need some prior setup work on the infrastructure and the VLANs:
who configures these VLANs?
are these VLANs configured on all these hosts?
...
Certainly someone must configure the additional hardware. In practice, an admin must add a separate switch (or create a VLAN on an existing switch) that connects to a different interface on the host systems. So if eth0 underpins the k8s pod network, eth1 may be the additional interface, unused by k8s itself.
Our current deployment strategy leverages Multus and NetworkAttachmentDefinitions to connect storage (Ceph) pods to eth1. We recommend CNI=macvlan (lower latency than bridge) with IPAM=whereabouts (ease of use), but users can practically use whatever CNI/IPAM they like.
This does work, but because Multus is such a complex feature to understand, users often seem lost trying to configure an already-complex storage system with NADs. In addition, there is developer complexity, and there are friction points -- like not being able to get a Service with a static IP on a Multus network.
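For readers unfamiliar with the setup being described, a minimal sketch of the kind of CNI config such a NetworkAttachmentDefinition carries: macvlan on the secondary interface with whereabouts IPAM. It is expressed here as a Go constant for illustration; the interface name and CIDR are example assumptions, not recommended values.

```go
// Illustrative only: the shape of the Multus NetworkAttachmentDefinition
// CNI config described above - macvlan on a secondary interface (eth1)
// with whereabouts IPAM. Interface name and CIDR are example values.
package main

import "fmt"

const cephPublicNAD = `{
  "cniVersion": "0.3.1",
  "type": "macvlan",
  "master": "eth1",
  "mode": "bridge",
  "ipam": {
    "type": "whereabouts",
    "range": "192.168.100.0/24"
  }
}`

func main() {
	// In a real cluster this JSON goes into the spec.config field of a
	// k8s.cni.cncf.io/v1 NetworkAttachmentDefinition, and pods reference
	// it via the k8s.v1.cni.cncf.io/networks annotation.
	fmt.Println(cephPublicNAD)
}
```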
what do you mean by a multus network? some entity that is connected to the additional interface?
i.e , if eth1 is connected to an external vlan, some compute or host on that network?
@BlaineEXE @dougbtv I want to understand this use case well; I don't quite get what the expectations are here and what problem we are trying to solve. Kubernetes definitely cannot manage the infra (connecting switches or creating VLANs); there are other projects that cover that area ...
But the part about a Service with a static IP on a Multus network is the one I want to understand; the docs are not clear (https://github.com/rook/rook/blob/master/design/ceph/multus-network.md). Does it mean making services available outside of the cluster?
Where?
Anything relevant to the KEP should be in-lined and committed to the repo.
There doesn't seem to be any discussion of how we might improve CNI or CRI instead, and why that isn't sufficient versus this entirely new API and RPC service.
This could live in the CRI-API as multiple services; no one has said we must use a new API. However, a "CNI 2.0" that is closer to K8s has been talked about for years now. This is that effort.
I do agree we could probably be more clear up front in the KEP about why the current CNI model (slurping out-of-k8s config files from well-known paths, TOCTOU errors, telling CRI implementations via out-of-band mechanisms to exec random binaries on the node by-proxy) is something we could explicitly improve on with KNI, and that KNI is basically "CNI 2.0" - it is the proposed improvement to current CNI.
Surely at least SIG Node should be participating (there's no way this doesn't affect kubelet or CRI)...?
I would also tag Cluster Lifecycle at least as FYI / advisory since cluster lifecycle folks will know about and have suggestions re: node readiness and cluster configuration.
Delete these, or update with relevant KEPs? Same below for "replaces".
This seems unlikely, without even defining an API yet?