[Standard] Stabilize node distribution standard #639

Open
4 tasks
cah-hbaum opened this issue Jun 17, 2024 · 32 comments · May be fixed by #806
Assignees: cah-hbaum
Labels: Container, SCS is continuously built and tested, SCS is standardized, SCS-VP10, standards

Comments

@cah-hbaum
Contributor

cah-hbaum commented Jun 17, 2024

Follow-up for #524
The goal is to set the Node distribution standard to Stable once all discussion topics have been debated and decided, and the necessary changes derived from these discussions have been integrated into the standard and its test.

The following topics need to be discussed:

@cah-hbaum cah-hbaum added the Container, standards, SCS is continuously built and tested, SCS is standardized, and SCS-VP10 labels Jun 17, 2024
@cah-hbaum cah-hbaum self-assigned this Jun 17, 2024
@cah-hbaum cah-hbaum moved this from Backlog to Blocked / On hold in Sovereign Cloud Stack Jun 17, 2024
@cah-hbaum cah-hbaum moved this from Blocked / On hold to Doing in Sovereign Cloud Stack Jun 17, 2024
@cah-hbaum
Contributor Author

cah-hbaum commented Jun 24, 2024

Topic 1: How is node distribution handled on installations with shared control-plane nodes?

e.g. Kamaji, Gardener, etc.

This question was answered in the Container Call on 2024-06-27:

  • Standard case Kamaji: dedicated control-plane components with a shared etcd, everything hosted in K8s (no dedicated nodes); etcd is deployed with anti-affinity (kube-scheduler tries to spread it across nodes). The relation of the nodes to each other is unknown to K8s.
  • Gardener:

For example, regiocloud supports the Node Failure Tolerance case but not the Zone Failure Tolerance case.
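To illustrate the anti-affinity mentioned above, an etcd workload hosted in the management cluster could carry a pod anti-affinity along these lines (a minimal sketch of a pod template fragment, not what Kamaji actually generates; the app label is illustrative):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: etcd                           # illustrative label
        topologyKey: kubernetes.io/hostname     # spread replicas across nodes

kube-scheduler then tries to place the etcd replicas on different nodes, but, as noted, it knows nothing about how those nodes relate to physical hosts.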

@martinmo martinmo mentioned this issue Jun 24, 2024
29 tasks
@cah-hbaum
Contributor Author

cah-hbaum commented Jun 25, 2024

Topic 2: Differentiation between Node distribution and things like High Availability, Redundancy, etc.

I think that to discuss this topic correctly, most of the wording/concepts need to be established first. I'm going to try to find multiple (ideally different) sources for the various terms and link them here.


High Availability

The main goal of HA is to avoid downtime, which is the period of time when a system, service, application, cloud service, or feature is either unavailable or not functioning properly. (https://www.f5.com/glossary/high-availability)
High availability means that an IT system, component, or application can operate at a high level, continuously, without intervention, for a given time period. ... (https://www.cisco.com/c/en/us/solutions/hybrid-work/what-is-high-availability.html)
High availability means that we eliminate single points of failure so that should one of those components go down, the application or system can continue running as intended. In other words, there will be minimal system downtime — or, in a perfect world, zero downtime — as a result of that failure. (https://www.mongodb.com/resources/basics/high-availability)

So things termed High Availability generally try to avoid downtime of their services, with the goal of zero downtime, which is usually not achievable. This can also be seen in this section: ... In fact, this concept is often expressed using a standard known as "five nines," meaning that 99.999% of the time, systems work as expected. This is the (ambitious) desired availability standard that most of us are aiming for. ... (https://www.mongodb.com/resources/basics/high-availability).
To achieve these goals, services, hardware, or networks are usually provided in a redundant setup, which allows automatic failover if instances go down.


Redundancy
In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system... (https://en.wikipedia.org/wiki/Redundancy_(engineering))
In cloud computing, redundancy refers to the duplication of certain components or functions of a system with the intention of increasing its reliability and availability. (https://www.economize.cloud/glossary/redundancy)

HINT: WILL BE CONTINUED LATER

@martinmo martinmo changed the title [Standard] Follow-Up Node distribution standard [Standard] Stabilize node distribution standard Jun 27, 2024
@martinmo
Member

I brought this issue up in today's Team Container Call and edited the above sections accordingly. As part of #649 we will also get access to Gardener and soon Kamaji clusters.

One thing I want to make you aware of @cah-hbaum: in the call, it was pointed out that the term shared control-plane isn't correct. The control plane isn't shared; instead, the control-plane nodes are shared, and thus we should always say shared control-plane node.

(I edited above texts accordingly as well to refer to shared control-plane nodes.)

@joshmue
Contributor

joshmue commented Aug 8, 2024

Another potential problem with the topology.scs.community/host-id label:

The concept of using the "host-id" may not play nice with VM live migrations.

I do not have any operational experience with e.g. OpenStack live migrations (who triggers them, when, ...?), but I guess that any provider-initiated live migration (which might be standard practice within zones, I guess) would invalidate any scheduling decision that Kubernetes made based on the "host-id" label. As Kubernetes does not re-evaluate scheduling decisions, pods may end up on the same host anyway (if the label even gets updated). That in turn may be worked around by using the Kubernetes descheduler project.
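To make that concrete, a workload relying on this label would carry something like the following anti-affinity fragment (a hypothetical sketch; the app label is illustrative):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-ha-app                        # illustrative
      topologyKey: topology.scs.community/host-id

A live migration changes the physical host but not necessarily the label value, so the placement this rule once produced may silently stop reflecting reality.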

If I did not miss anything, I guess there are roughly the following options:

  • Rule out live migrations
  • Remove the "host-id" label requirement
  • Specify how certain scenarios should play out in the standard (e.g. requiring the descheduler)

@mbuechse
Contributor

@joshmue Thanks for bringing this to our attention!

So let me try to get this straight.

  • We want environments to use some kind of anti-affinity for their control-plane nodes.
  • We need some kind of transparency so we can check for compliance.
  • Our idea with the host-id label somehow doesn't play well with live migrations.

I think I still don't quite understand what happens in case of a live migration. I assume that the control-plane nodes are running on virtual machines managed by OpenStack, and such a virtual machine could be migrated "live". But would k8s even notice anything about that? What would the process look like?

I think I also don't quite understand how the node distribution is implemented. I suppose two levels of anti-affinity would be required:

  1. for the VMs to be scheduled on different hosts
  2. for the control-plane nodes (or, rather, pods?) to be scheduled on different VMs

How does the host-id label play into this process?

@joshmue
Contributor

joshmue commented Aug 19, 2024

Still, "I do not have any operational experience with e. g. Openstack live migrations", but AFAIK:

I assume that the control-plane nodes are running on virtual machines managed by OpenStack, and such a virtual machine could be migrated "live".

Yes (not only control-plane nodes, though).

But would k8s even notice anything about that? What would the process look like?

Exactly that is the problem: Kubernetes would not (per-se) notice anything about that and the process would be undefined.

How does the host-id label play into this process?

Generally, not well: relying on it for pod scheduling (instead of, e.g., topology.kubernetes.io/zone) may undermine the whole point of anti-affinity for HA - if live migrations do happen as I imagine them.
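For contrast, a workload that spreads over zones would typically rely on the upstream label, whose value is expected not to change over a node's lifetime, e.g. via a topology spread constraint (a sketch; the app label is illustrative):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone      # upstream, stable topology label
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-ha-app                            # illustrative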

@piobig2871

Please keep in mind that I am researching this topic from scratch, but after some digging I was able to find some useful information here: https://trilio.io/kubernetes-disaster-recovery/kubernetes-on-openstack/.

I will ask some questions once I know a little bit more about the topic.

@piobig2871

piobig2871 commented Nov 5, 2024

@joshmue Thanks for bringing this to our attention!

So let me try to get this straight.

* We want environments to use some kind of anti-affinity for their control-plane nodes.

* We need some kind of transparency so we can check for compliance.

* Our idea with the `host-id` label somehow doesn't play well with live migrations.

I think I still don't quite understand what happens in case of a live migration. I assume that the control-plane nodes are running on virtual machines managed by OpenStack, and such a virtual machine could be migrated "live". But would k8s even notice anything about that? What would the process look like?

I think I also don't quite understand how the node distribution is implemented. I suppose two levels of anti-affinity would be required:

1. for the VMs to be scheduled on different hosts

2. for the control-plane nodes (or, rather, pods?) to be scheduled on different VMs

Placing at most one node on each physical machine helps the cluster tolerate the failure of a single machine. Anti-affinity policies separate key components across different nodes, which in effect reduces the impact of individual failures. Also, around such distribution requirements, certain checks have to be put in place to ensure that the node distribution standard is met.

While the host-id label is helpful for distinguishing physical hosts, it can pose difficulties during live migrations, especially because it is not updated when a node is relocated.

Instead of a host-based label, we could use a label that designates a 'logical group' or 'cluster zone' as a software construct, which can be kept consistent across migrations within the respective cluster. Such a label would be less strict and would cope better with live migration and eviction.

  1. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
  2. https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
  3. https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/

That is what the theory says, at least.

EDIT: I have found a problem with what I have written here, because I did not take into consideration that our K8s runs on OpenStack instances backed by separate hardware nodes.

@piobig2871

Topic 1

I have been able to install a tenant control plane using Kamaji, but there are several steps that have to be done before that can happen.

  1. Create a kind cluster with kind create cluster --name kamaji
  2. Install cert-manager:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install cert-manager bitnami/cert-manager \
    --namespace certmanager-system \
    --create-namespace \
    --set "installCRDs=true"
  3. Install MetalLB:
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.7/config/manifests/metallb-native.yaml

This installation is performed using a manifest; I am leaving a link here to get acquainted with the documentation.
  4. Now we have to create an IP address pool, which is required to get real IPs. Since we are running on kind, I needed to extract the gateway IP of the kind network I am running on:

GW_IP=$(docker network inspect -f '{{range .IPAM.Config}}{{.Gateway}}{{end}}' kind)
NET_IP=$(echo ${GW_IP} | sed -E 's|^([0-9]+\.[0-9]+)\..*$|\1|g')
  5. Now we can create the kind-ip-pool by applying this script:
cat <<EOF | sed -E "s|172.19|${NET_IP}|g" | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: kind-ip-pool
  namespace: metallb-system
spec:
  addresses:
  - 172.19.255.200-172.19.255.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: empty
  namespace: metallb-system
EOF
  6. After this initial setup I was able to install Kamaji:
helm repo add clastix https://clastix.github.io/charts
helm upgrade --install kamaji clastix/kamaji --namespace kamaji-system --create-namespace --set 'resources=null'
  7. Create the tenant control plane with kubectl apply -f https://raw.githubusercontent.com/clastix/kamaji/master/config/samples/kamaji_v1alpha1_tenantcontrolplane.yaml
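If everything worked, the TenantControlPlane resource should eventually report a ready status, which can be checked from the management cluster (assuming the Kamaji CRDs were installed by the chart):

kubectl get tenantcontrolplanes --all-namespaces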

@garloff
Member

garloff commented Nov 11, 2024

So let me try to get this straight.

* We want environments to use some kind of anti-affinity for their control-plane nodes.

* We need some kind of transparency so we can check for compliance.

* Our idea with the `host-id` label somehow doesn't play well with live migrations.

I think I still don't quite understand what happens in case of a live migration. I assume that the control-plane nodes are running on virtual machines managed by OpenStack, and such a virtual machine could be migrated "live". But would k8s even notice anything about that? What would the process look like?

I think we are trying to solve for a special case here. Live migrations don't happen all that often.
In a standard OpenStack setup, you would achieve control-plane node VMs not ending up on the same hypervisor host by anti-affinity rules. The good news is that those same rules are evaluated by the scheduler (placement service) when choosing a new host on live migration. So unless something really strange happens, the guarantees after live migration are the same as they were before. The host-id labels would then be wrong, which is somewhat ugly, but they would still correctly indicate that we're not on the same hypervisor host.
Now, what could happen is that the initial node distribution ended up on different hypervisor hosts just by coincidence (and not systematically by anti-affinity), so live migration could change that. In that case, statistics would make this setup break in the initial placement sooner or later as well, so this would not go undetected.

I plead for ignoring live migration.
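For reference, such anti-affinity rules are typically expressed as an OpenStack server group that the control-plane VMs are booted into (a sketch; names and the omitted flags are illustrative):

# server group whose members must land on different hypervisor hosts
openstack server group create --policy anti-affinity k8s-control-plane
# boot each control-plane VM into that group (flavor/image/network flags omitted)
openstack server create --hint group=<server-group-uuid> ... control-plane-0

With the soft-anti-affinity policy the scheduler merely prefers distinct hosts instead of requiring them.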

I think I also don't quite understand how the node distribution is implemented. I suppose two levels of anti-affinity would be required:

1. for the VMs to be scheduled on different hosts

2. for the control-plane nodes (or, rather, pods?) to be scheduled on different VMs

The control plane node is a VM. So there is only one dimension.

@joshmue
Contributor

joshmue commented Nov 12, 2024

In a standard OpenStack setup, you would achieve control-plane node VMs not ending up on the same hypervisor host by anti-affinity rules.

When speaking about usual workload K8s nodes, this would mean that there can only be either...

  • active OpenStack anti-affinities; the maximum number of K8s nodes is limited by the number of hypervisors
  • inactive OpenStack anti-affinities; K8s is left making potentially wrong scheduling choices based on potentially outdated "host-id" labels

Right?

@piobig2871

In a standard OpenStack setup, you would achieve control-plane node VMs not ending up on the same hypervisor host by anti-affinity rules.

When speaking about usual workload K8s nodes, this would mean that there can only be either...

* active Openstack anti-affinities; Maximum number of K8s nodes is limited by number of hypervisors

With active anti-affinity, no two Kubernetes nodes can run on a single hypervisor, so yes, that's right. Once you reach this limit, you either need to add more hypervisors or re-evaluate the anti-affinity rule.

* inactive Openstack anti-affinities; K8s is left making potentially wrong scheduling choices based on potentially outdated "host-id" labels

Kubernetes may make scheduling decisions based on outdated host-id labels if any VMs are migrated. This could lead to clusters where multiple nodes end up on the same hypervisor, potentially creating unexpected single points of failure.

@joshmue
Contributor

joshmue commented Nov 15, 2024

👍

What I'm trying to get at is this:

If there are active OpenStack anti-affinities, there is no use-case for a "host-id" node label to begin with. If node and hypervisor have a 1:1 (or 0:1) relationship, K8s pod anti-affinities can just target kubernetes.io/hostname.

If there are no OpenStack anti-affinities, ...

Kubernetes may make scheduling decisions based on outdated host-id labels if any VMs are migrated. This could lead to clusters where multiple nodes end up on the same hypervisor, potentially creating unexpected single points of failure.

In conclusion, given live migrations may happen occasionally, I do not see any use case for this label.

@mbuechse
Contributor

The labels are not meant to influence scheduling in any way. They are meant to make scheduling transparent to the end user.

@joshmue
Contributor

joshmue commented Nov 15, 2024

The labels are not meant to influence scheduling in any way. They are meant to make scheduling transparent to the end user.

Sure?

The current standard says:

Worker node distribution MUST be indicated to the user through some kind of labeling in order to enable (anti)-affinity for workloads over "failure zones".

...and then goes on to describe topology.kubernetes.io/zone, topology.kubernetes.io/region and topology.scs.community/host-id in this context.

This concerns K8s scheduling, not OpenStack scheduling, of course.

@mbuechse
Contributor

The relevant point (and the one that describes the labels) is

To provide metadata about the node distribution, which also enables testing of this standard, providers MUST label their K8s nodes with the labels listed below.

@mbuechse
Contributor

I must know, because I worked with Hannes on that, and we added this mostly because we needed the labels for the compliance test.

@joshmue
Contributor

joshmue commented Nov 15, 2024

So the standard basically is intended to say:

  • We REQUIRE some sort of labeling in order to enable (anti)-affinity for workloads over "failure zones". We will not standardize them, though.
  • On an unrelated note, we REQUIRE labels which are usually used for anti-affinity (the ones defined by upstream, anyway), but they should not be used for anti-affinity.

?

@mbuechse
Contributor

I'm not competent to speak on scheduling. It's quite possible that these labels are ALSO used for scheduling. In the case of region and availability zone, this is probably true. The question is: how does host-id play into this?

@joshmue
Contributor

joshmue commented Nov 15, 2024

I think that I see where you're coming from, having a focus on compliance testing. Do you see my point that it requires a great deal of imagination to interpret the standard as it was intended from a general POV?

Question is: how does host-id play into this?

I do not think it should, because of the reasons above. I also guess compliance tests should reflect the requirements of a standard, and AFAIU the standard does not forbid placing multiple nodes on the same host. If a CSP considers a host to be a "failure zone", they could also put the host-id into topology.kubernetes.io/zone - and then also have problems with live migration and the K8s recommendation of...

It should be safe to assume that topology labels do not change. Even though labels are strictly mutable, consumers of them can assume that a given node is not going to be moved between zones without being destroyed and recreated.

https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone

@mbuechse
Contributor

Standard says:

The control plane nodes MUST be distributed over multiple physical machines.

So we need to be able to validate that.

It also says

At least one control plane instance MUST be run in each "failure zone"

But you could have only one failure zone. Then, still, the control plane nodes must be distributed over multiple physical hosts.

The host-id field is not necessarily meant for scheduling (particularly for the control plane, where the user cannot schedule anything, right)?

Does that make sense?
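For what it's worth, validating that boils down to counting distinct host-id values on the control-plane nodes; a sketch of such a check, assuming the label from the standard and the usual control-plane role label:

kubectl get nodes -l node-role.kubernetes.io/control-plane \
  -o jsonpath="{.items[*].metadata.labels['topology\.scs\.community/host-id']}" \
  | tr ' ' '\n' | sort -u | wc -l
# more than one distinct value means the nodes are not all on the same host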

@mbuechse
Contributor

BTW, I'm open to improving the wording to avoid any misunderstanding here. At this point, though, we first have to agree on what's reasonable at all.

@joshmue
Contributor

joshmue commented Nov 15, 2024

The control plane nodes MUST be distributed over multiple physical machines.

Did not see that, actually!

Still, let's go through some cases of what "failure zone" may mean:

  • zone equals one of many co-located buildings
  • zone equals one of many rooms within a building
  • zone equals one of many racks within a room
  • zone equals one of many machines within a rack

If topology.kubernetes.io/zone is defined as any of these things, it can be used to test the standard, and the above requirement is satisfied (in a world where a single VM is always local to one hypervisor at any point in time).

Theoretically, one may define "failure zone" as something like:

  • zone equals one of many isolation groups within a machine

But the standard already implicitly says that the smallest imaginable unit is a single machine.

Zones could be set from things like single machines or racks up to whole datacenters or even regions

EDIT: But yes, introducing this specific requirement may be a bit confusing, given that the other wording refers to logical failure zones. And compliance with it may only be checked by having a "host-id" label with some strict definition - or (better) by defining that a topology.kubernetes.io/zone must be at least a physical machine.

@joshmue
Contributor

joshmue commented Nov 15, 2024

But you could have only one failure zone.

I see that this is not explicitly forbidden in the standard, but all the text hints towards it being forbidden, so I assumed it is:

It is therefore necessary for important data or services to not be present just on one failure zone

At least one control plane instance MUST be run in each "failure zone"

Since some providers only have small environments to work with and therefore couldn't comply with this standard, it will be treated as a RECOMMENDED standard, where providers can OPT OUT.

@piobig2871

Theoretically, one may define "failure zone" as something like:

* zone equals one of many isolation groups within a machine

Like a network?

But you could have only one failure zone.

I see that this is not explicitly forbidden in the standard, but all the texts hints towards it being forbidden, so I assumed it:

It is therefore necessary for important data or services to not be present just on one failure zone

I have thought about it as: we have one failure zone with one control plane, and the workers may be spread over different machines, physical or virtual.

At least one control plane instance MUST be run in each "failure zone"

As mentioned here.

@mbuechse
Contributor

Well. It seems that the concepts of failure zone and physical host are a bit at odds.

From the Kubernetes POV two physical hosts within the same failure zone seem to be considered not much better than just one host. In other words, they just don't care that much about hosts. Failure zones can be defined by the CSP in any way they deem appropriate, so smaller CSPs could indeed say each host is a failure zone or each rack is a failure zone. It would probably be better to have multiple zones that are just hosts or racks than to have only one zone. Therefore, we could mandate to have multiple zones and then drop the whole part about the physical hosts (including the host-id label). Is that what you mean?

If that's all true, then I'm wondering why the hosts have been introduced in the first place. There must have been discussions about that in Team Container with intelligent and experienced people involved.

@joshmue
Contributor

joshmue commented Nov 18, 2024

Theoretically, one may define "failure zone" as something like:

* zone equals one of many isolation groups within a machine

Like a network?

I just wanted to give an example of a theoretically viable, yet hypothetical runtime unit within a single machine.

Well. It seems that the concepts of failure zone and physical host are a bit at odds.

From the Kubernetes POV two physical hosts within the same failure zone seem to be considered not much better than just one host. In other words, they just don't care that much about hosts. Failure zones can be defined by the CSP in any way they deem appropriate, so smaller CSPs could indeed say each host is a failure zone or each rack is a failure zone. It would probably be better to have multiple zones that are just hosts or racks than to have only one zone. Therefore, we could mandate to have multiple zones and then drop the whole part about the physical hosts (including the host-id label). Is that what you mean?

Yes. CSPs with hosts as failure zones would still have problems with live migrations and the assumption that topology labels do not change, but by removing the "host-id" requirement, this problem would be exclusive to such small/tiny providers.

On another note, the recommendation here...

At least one control plane instance MUST be run in each "failure zone", more are RECOMMENDED in each "failure zone" to provide fault-tolerance for each zone.

does not seem to take etcd quorum and/or etcd scaling sweet spots into account ( https://etcd.io/docs/v3.5/faq/ ). But it does not strictly mandate questionable design choices (only slightly hints at them), so I will not go into too much detail, here.

@piobig2871

Well. It seems that the concepts of failure zone and physical host are a bit at odds.

From the Kubernetes POV two physical hosts within the same failure zone seem to be considered not much better than just one host. In other words, they just don't care that much about hosts. Failure zones can be defined by the CSP in any way they deem appropriate, so smaller CSPs could indeed say each host is a failure zone or each rack is a failure zone. It would probably be better to have multiple zones that are just hosts or racks than to have only one zone. Therefore, we could mandate to have multiple zones and then drop the whole part about the physical hosts (including the host-id label). Is that what you mean?

If that's all true, then I'm wondering why the hosts have been introduced in the first place. There must have been discussions about that in Team Container with intelligent and experienced people involved.

You raised an important point about the potential misalignment between the concepts of failure zones and physical hosts. AFAIU from Kubernetes' perspective, failure zones are abstract constructs defined to ensure redundancy and fault isolation. The actual granularity of these zones (e.g., a rack, a data center, or even an individual physical host) depends on the cloud service provider's (CSP's) design.

Kubernetes treats all nodes within a failure zone as equally vulnerable because the assumption is that a failure impacting one could potentially affect all others in the same zone. This approach is why zones matter more than individual hosts when scheduling workloads. For smaller CSPs, defining each host or rack as its own failure zone might be a practical approach to increase redundancy, especially when physical resources are limited. It aligns with your suggestion to mandate multiple zones while dropping specific focus on physical hosts.

At least one control plane instance MUST be run in each "failure zone", more are RECOMMENDED in each "failure zone" to provide fault-tolerance for each zone.

does not seem to take etcd quorum and/or etcd scaling sweet spots into account ( https://etcd.io/docs/v3.5/faq/ ). But it does not strictly mandate questionable design choices (only slightly hints at them), so I will not go into too much detail, here.

Etcd’s own documentation highlights the challenges of maintaining quorum and scalability in distributed systems, particularly as the cluster size increases beyond the optimal sweet spot of 3-5 nodes.

Right now I am wondering what alternative strategies could be employed to balance the need for fault tolerance across failure zones while adhering to etcd's quorum and scaling best practices.

@garloff
Member

garloff commented Nov 18, 2024

Well. It seems that the concepts of failure zone and physical host are a bit at odds.

From the Kubernetes POV two physical hosts within the same failure zone seem to be considered not much better than just one host. In other words, they just don't care that much about hosts. Failure zones can be defined by the CSP in any way they deem appropriate, so smaller CSPs could indeed say each host is a failure zone or each rack is a failure zone. It would probably be better to have multiple zones that are just hosts or racks than to have only one zone. Therefore, we could mandate to have multiple zones and then drop the whole part about the physical hosts (including the host-id label). Is that what you mean?

If that's all true, then I'm wondering why the hosts have been introduced in the first place. There must have been discussions about that in Team Container with intelligent and experienced people involved.

We have an availability zone standard (0121); you probably know it better than I do.
Many providers do not have several AZs, either because they are too small or because they use shared-nothing architectures with several regions rather than several AZs.

I would strongly discourage disconnecting the notion of infra-layer availability zones from "failure zones" in Kubernetes now. That would be a recipe for confusion.

Single hosts can fail for a variety of reasons, e.g. broken RAM, a broken PSU, a broken network port, or even just a regular maintenance operation (hypervisor or firmware upgrade). In a data center, these events happen much more often than the outage of a complete room/zone/AZ. We want to avoid one host taking down several control-plane nodes in the cluster; that is the whole point of having several nodes in the first place. Yes, multi-AZ is nicer, but that is a luxury we don't always have. Having multiple physical hosts is much better than not. If we cannot succeed with an upstream host-id label, we will have a difficult time testing this from within the cluster. We can still easily test this if we have access to the IaaS layer that hosts the cluster, of course. Not ideal, but no reason to drop the requirement, IMVHO.
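To illustrate the IaaS-layer check: with admin access to the underlying OpenStack, one can compare the hypervisors of the control-plane VMs directly (a sketch; server names are illustrative, and the hypervisor attribute is normally only visible to admins):

openstack server show control-plane-0 -f value -c OS-EXT-SRV-ATTR:hypervisor_hostname
openstack server show control-plane-1 -f value -c OS-EXT-SRV-ATTR:hypervisor_hostname
openstack server show control-plane-2 -f value -c OS-EXT-SRV-ATTR:hypervisor_hostname
# all three values should differ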

@piobig2871

Single hosts can fail for a variety of reasons, e.g. broken RAM or broken PSU or broken network port or even just a regular maintenance operation (hypervisor or firmware upgrade). In a data center, these events happen much more often than the outage of a complete room/zone/AZ. We want to avoid one host to take down several control plane nodes in the cluster, that is the whole point of having several nodes in the first place. Yes, multi-AZ is nicer, but that is a luxury that we don't always have. Having multiple physical hosts is much better than not. If we can not succeed with an upstream host-id label, we have a difficult time to test this from within the cluster. We can still easily test this if we have access to the IaaS layer that hosts the cluster, of course. Not ideal, but no reason to drop the requirement, IMVHO.

With that comment, can we assume that the node distribution and high availability topics will be separated for the purposes of the standard? Would separate standards be clearer than creating corner cases?

@piobig2871

piobig2871 commented Nov 19, 2024

Also, I have found that the standard k8s-node-anti-affinity already contains a note regarding high availability, but it still does not define how those machines have to be connected to each other.

In a productive environment, the control plane usually runs across multiple machines and
a cluster usually contains multiple worker nodes in order to provide fault-tolerance and
high availability.

That is why I have created a separate scs-0219-v1-high-availability standard on my branch, in draft mode, for discussion.

@piobig2871

piobig2871 commented Nov 22, 2024

Points established in the Container Call (21.11.2024):

  • drop any mention of this label (host-id) from the standard - if we remove the host-id label, we have to start looking at the infrastructure during updates in order to be able to test this; if the configuration uses OpenStack, we can test it there, otherwise the provider has to implement the check that the control-plane nodes are on different hosts

  • in the meantime: just demand an assertion by the CSP?

  • The test script cannot do its job currently due to the missing host-id label; the language needs to be reworded to make sense before stabilization

    ToDo: Check whether we just stick to v1 then or whether we can do a quick fix for v2 (@piobig2871, @garloff)

  • More generic K8s HA work (@piobig2871): issue standards/#639, PR standards/#806

    Needs renumbering (0219->0220)
    Good work that should continue
    Will possibly supersede 0214 some day

  • External etcd (or even redis) is possible (and does not contradict any of our standards)

Projects
Status: Doing
6 participants