
Secondary etcd-only nodes do not reconnect to apiserver after outage if joined against an etcd-only node #11311

Open
brandond (Member) opened this issue Nov 13, 2024 · 0 comments
Labels: kind/bug (Something isn't working)

Environmental Info:
K3s Version:
v1.30.6+k3s1, but earlier releases are affected as well, probably going back to the May/June releases when the load-balancer code was last reworked.

Node(s) CPU architecture, OS, and Version:
n/a

Cluster Configuration:
Minimally: 2 etcd-only nodes, 1 control-plane-only node. Any split-role cluster with more than one etcd-only node is affected.

Do NOT use a fixed registration endpoint; register all nodes against the first etcd-only node.
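
For concreteness, a minimal sketch of this setup (the hostnames node-1/node-2/node-3 and the shared token are placeholders; the flags are k3s's standard split-role options):

    # node-1: first etcd-only node, initializes the cluster
    k3s server --cluster-init --token <token> \
      --disable-apiserver --disable-controller-manager --disable-scheduler

    # node-2: second etcd-only node, joined against node-1
    # (note: NOT a fixed registration endpoint)
    k3s server --server https://node-1:6443 --token <token> \
      --disable-apiserver --disable-controller-manager --disable-scheduler

    # node-3: control-plane-only node, also joined against node-1
    k3s server --server https://node-1:6443 --token <token> --disable-etcd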

Describe the bug:
When all control-plane nodes are unavailable, the apiserver load-balancer on the secondary etcd-only nodes falls back to the default server, which is the initial etcd-only node. When the control-plane nodes come back, the load-balancer does not close its connections to the default server. This leaves the kubelet and internal controllers connected to the etcd-only node, continuously logging "apiserver disabled" errors:

Nov 13 01:56:40 systemd-node-2 k3s[874]: E1113 01:56:40.186329     874 kubelet_node_status.go:544] "Error updating node status, will retry" err="error getting node \"systemd-node-2\": apiserver disabled"
Nov 13 01:56:40 systemd-node-2 k3s[874]: W1113 01:56:40.218182     874 reflector.go:547] k8s.io/[email protected]/tools/cache/reflector.go:232: failed to list *v1.Node: apiserver disabled
Nov 13 01:56:40 systemd-node-2 k3s[874]: E1113 01:56:40.218211     874 reflector.go:150] k8s.io/[email protected]/tools/cache/reflector.go:232: Failed to watch *v1.Node: failed to list *v1.Node: apiserver disabled
Nov 13 01:56:40 systemd-node-2 k3s[874]: E1113 01:56:40.742852     874 leaderelection.go:347] error retrieving resource lock kube-system/k3s-cloud-controller-manager: apiserver disabled
Nov 13 01:56:41 systemd-node-2 k3s[874]: E1113 01:56:41.411435     874 webhook.go:154] Failed to make webhook authenticator request: apiserver disabled
Nov 13 01:56:41 systemd-node-2 k3s[874]: E1113 01:56:41.411485     874 server.go:304] "Unable to authenticate the request due to an error" err="apiserver disabled"
Nov 13 01:56:43 systemd-node-2 k3s[874]: E1113 01:56:43.621369     874 controller.go:145] "Failed to ensure lease exists, will retry" err="apiserver disabled" interval="7s"
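
One way to confirm the stuck state from the second etcd-only node (a diagnostic sketch; the exact load-balancer log wording and the state-file path below are from memory and may differ between releases):

    # the agent load-balancer logs which server it has settled on
    journalctl -u k3s | grep -i 'load balancer'

    # the load-balancer state file (path may vary by release) should
    # still show the initial etcd-only node as the active server address
    cat /var/lib/rancher/k3s/agent/etc/k3s-agent-load-balancer.json

    # established connections on 6443 should point at the initial
    # etcd-only node rather than the recovered control-plane node
    ss -tnp | grep 6443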

Steps To Reproduce:

  1. Start an etcd-only node
  2. Start a second etcd-only node, joined to the first
  3. Start a control-plane-only node, joined to the first etcd-only node
  4. Once the cluster is up, restart k3s on the control-plane-only node
  5. Note that after a short period the second etcd-only node goes NotReady
  6. Restart k3s on either etcd-only node and note that all nodes become Ready again
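
Roughly, steps 4-6 as commands (node names are placeholders; run kubectl from any machine with a kubeconfig for the cluster):

    # step 4: restart the control-plane-only node
    ssh node-3 systemctl restart k3s

    # step 5: watch the second etcd-only node go NotReady after a short period
    kubectl get nodes -w

    # step 6: restarting k3s on either etcd-only node recovers it
    ssh node-2 systemctl restart k3s
    kubectl get nodes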

Expected behavior:
All nodes reconnect to the apiserver after the outage.

Actual behavior:
Secondary etcd-only nodes fail over to the default server (an etcd-only node with no apiserver) and get stuck there.

Additional context / logs:
cc @ShylajaDevadiga
