
Secondary etcd-only nodes do not reconnect to apiserver after outage if joined against an etcd-only node #11311

Open
brandond (Member) opened this issue Nov 13, 2024 · 0 comments
Labels: kind/bug (Something isn't working)

Environmental Info:
K3s Version:
v1.30.6+k3s1, but earlier releases are affected as well, probably going back to the May/June releases when the load-balancer code was last reworked.

Node(s) CPU architecture, OS, and Version:
n/a

Cluster Configuration:
Minimally: 2 etcd-only nodes, 1 control-plane-only node. Any split-role cluster with more than one etcd-only node is affected.

Do NOT use a fixed registration endpoint; register all nodes against the first etcd-only node.
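
For concreteness, a minimal sketch of this setup (the hostnames node-1/node-2/node-3 and the shared token are placeholders; the flags are k3s's standard split-role options):

    # node-1: first etcd-only node, initializes the cluster
    k3s server --cluster-init --token <token> \
      --disable-apiserver --disable-controller-manager --disable-scheduler

    # node-2: second etcd-only node, joined against node-1
    # (note: NOT a fixed registration endpoint)
    k3s server --server https://node-1:6443 --token <token> \
      --disable-apiserver --disable-controller-manager --disable-scheduler

    # node-3: control-plane-only node, also joined against node-1
    k3s server --server https://node-1:6443 --token <token> --disable-etcd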

Describe the bug:
When all control-plane nodes are unavailable, the apiserver load-balancer on the secondary etcd-only nodes falls back to the default server, which is the initial etcd-only node. When the control-plane nodes come back, the load-balancer does not close its connections to the default server. This leaves the kubelet and internal controllers connected to the etcd-only node, continuously logging "apiserver disabled" errors:

Nov 13 01:56:40 systemd-node-2 k3s[874]: E1113 01:56:40.186329     874 kubelet_node_status.go:544] "Error updating node status, will retry" err="error getting node \"systemd-node-2\": apiserver disabled"
Nov 13 01:56:40 systemd-node-2 k3s[874]: W1113 01:56:40.218182     874 reflector.go:547] k8s.io/[email protected]/tools/cache/reflector.go:232: failed to list *v1.Node: apiserver disabled
Nov 13 01:56:40 systemd-node-2 k3s[874]: E1113 01:56:40.218211     874 reflector.go:150] k8s.io/[email protected]/tools/cache/reflector.go:232: Failed to watch *v1.Node: failed to list *v1.Node: apiserver disabled
Nov 13 01:56:40 systemd-node-2 k3s[874]: E1113 01:56:40.742852     874 leaderelection.go:347] error retrieving resource lock kube-system/k3s-cloud-controller-manager: apiserver disabled
Nov 13 01:56:41 systemd-node-2 k3s[874]: E1113 01:56:41.411435     874 webhook.go:154] Failed to make webhook authenticator request: apiserver disabled
Nov 13 01:56:41 systemd-node-2 k3s[874]: E1113 01:56:41.411485     874 server.go:304] "Unable to authenticate the request due to an error" err="apiserver disabled"
Nov 13 01:56:43 systemd-node-2 k3s[874]: E1113 01:56:43.621369     874 controller.go:145] "Failed to ensure lease exists, will retry" err="apiserver disabled" interval="7s"
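
One way to confirm the stuck state from the second etcd-only node (a diagnostic sketch; the exact load-balancer log wording and the state-file path below are from memory and may differ between releases):

    # the agent load-balancer logs which server it has settled on
    journalctl -u k3s | grep -i 'load balancer'

    # the load-balancer state file (path may vary by release) should
    # still show the initial etcd-only node as the active server address
    cat /var/lib/rancher/k3s/agent/etc/k3s-agent-load-balancer.json

    # established connections on 6443 should point at the initial
    # etcd-only node rather than the recovered control-plane node
    ss -tnp | grep 6443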

Steps To Reproduce:

  1. Start an etcd-only node
  2. Start a second etcd-only node, joined to the first
  3. Start a control-plane-only node, joined to the first etcd-only node
  4. Once the cluster is up, restart k3s on the control-plane-only node
  5. Note that after a short period the second etcd-only node goes NotReady
  6. Restart k3s on either etcd-only node and note that all nodes become Ready again
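
Roughly, steps 4-6 as commands (node names are placeholders; run kubectl from any machine with a kubeconfig for the cluster):

    # step 4: restart the control-plane-only node
    ssh node-3 systemctl restart k3s

    # step 5: watch the second etcd-only node go NotReady after a short period
    kubectl get nodes -w

    # step 6: restarting k3s on either etcd-only node recovers it
    ssh node-2 systemctl restart k3s
    kubectl get nodes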

Expected behavior:
All nodes reconnect to the apiserver after the outage.

Actual behavior:
Secondary etcd-only nodes fail over to the default server (an etcd-only node with no apiserver) and get stuck there.

Additional context / logs:
cc @ShylajaDevadiga
