Rancher new cluster node registration failing #2053
From the logs when creating the cluster in Rancher: (log output omitted)
Contents of the cluster's mcc bundle Chart.yaml value: (omitted)
The downstream cluster we're seeing the issue with is … (name omitted); debug log output from the … (omitted)
I don't know golang at all but I've been digging around trying to find out if there's something in our clusters that's wrong. tag: release/v0.8.1+security1, so at this point it's using the … Can't find … (details omitted)
I've done some local hacking of the code to add some logging and changed the … This is the output when creating a new cluster called …: (output omitted)
I hacked the fleet code to work around the "chart requires kubeVersion: >= 1.23.0-0" issue, created a new fleet-controller container, and updated the fleet-controller deployment to run my hacked container on our dev cluster, and it's made no difference to the problem of the machine-plan Secret not being populated with data. That seems unrelated. The issue remains that the nodes … (output omitted) and (output omitted)
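In case it's useful to anyone else hitting this, here's roughly how we check whether the machine-plan Secrets have been given any data. This is only a sketch: the fleet-default namespace and the machine-plan name suffix match our setup and might differ in yours.

```python
# Minimal sketch: list plan Secrets in the management cluster and report how
# much (base64-encoded) data each one holds. Assumes the downstream cluster's
# objects live in the fleet-default namespace and that plan Secrets are named
# "<machine>-machine-plan"; adjust both for your environment.
from kubernetes import client, config

config.load_kube_config()  # run against the Rancher local (management) cluster
v1 = client.CoreV1Api()

for secret in v1.list_namespaced_secret("fleet-default").items:
    if secret.metadata.name.endswith("machine-plan"):
        size = sum(len(v) for v in (secret.data or {}).values())
        print(f"{secret.metadata.name}: {size} bytes")
```

An empty data field here corresponds to the "0 bytes" machine-plan described elsewhere in this thread.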
I'm having the exact same issue. Is there any workaround to get past it? Or maybe a specific version to use?
I've not found a workaround; sometimes a registration works, but mostly it's stuck on the empty machine-plan for us.
@rgomez-eng a long shot, but: are you registering the node(s) with all three roles, or have the problematic nodes got a sub-set of the roles? As it implies the node(s) with one of those roles isn't registered.
We're not able to reliably re-create the issue and don't have the time to investigate further.
We're still seeing this issue: (log output omitted)
We use Ansible to register the nodes (pull the registration command from the Rancher API and run it on each node). The first set of nodes to get registered are the ones with control and etcd roles. After they're registered we register the worker nodes. Rancher won't, by design, populate the machine-plans for the nodes until at least one node of each type is registered. I tried manually registering 3 nodes that had all three roles but still see the machine-plan having 0 bytes.
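For context, the Ansible step that pulls the registration command boils down to something like the sketch below. It is not our actual playbook: the v3 clusterregistrationtokens endpoint and the nodeCommand field are what we rely on, and the URL, token and cluster ID are placeholders, so verify the field names against your Rancher version.

```python
# Rough sketch of fetching the node registration command from the Rancher API.
# URL, API token and cluster ID are placeholders; field names may vary between
# Rancher versions, so treat this as an illustration rather than a reference.
import requests

RANCHER_URL = "https://rancher.example.com"   # placeholder
API_TOKEN = "token-xxxxx:yyyyy"               # placeholder bearer token
CLUSTER_ID = "c-m-abcde"                      # management ID of the new cluster

resp = requests.get(
    f"{RANCHER_URL}/v3/clusterregistrationtokens",
    params={"clusterId": CLUSTER_ID},
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
resp.raise_for_status()
token = resp.json()["data"][0]

# Role flags are appended per node, e.g. "--etcd --controlplane" or "--worker"
print(token["nodeCommand"] + " --worker")
```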
Does it still happen with Rancher 2.8.1?
I've just upgraded to 2.8.2 and wanted to re-create the node in a single-node k3s cluster.
Provisioning log: (output omitted)
We have a similar issue caused by an empty machine-plan for the new nodes in a new cluster. A workaround that helped was this: (steps omitted)
Suddenly the plan for master-1 was populated and the cluster bootstrapping started. We absolutely have no idea why the hell this works...
We might have found a cause in our env: we upped the timeout on the load balancer we have in front of the nodes running Rancher, as we thought it was probably killing the rancher-system-agent's websocket watch.
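In case anyone wants to check the same theory in their environment, a crude way to see whether a load balancer is dropping idle connections is to hold one open through it and time how long it survives. This is only a sketch: the hostname is a placeholder, and it times a raw idle TLS connection rather than the actual rancher-system-agent websocket.

```python
# Crude idle-timeout probe: open a TLS connection through the load balancer,
# stay idle, and report how long it takes for the peer to close it.
# The hostname is a placeholder; interrupt with Ctrl-C if it never closes.
import socket
import ssl
import time

RANCHER_HOST = "rancher.example.com"  # placeholder

ctx = ssl.create_default_context()
with socket.create_connection((RANCHER_HOST, 443)) as raw:
    with ctx.wrap_socket(raw, server_hostname=RANCHER_HOST) as tls:
        start = time.time()
        tls.settimeout(5)
        while True:
            try:
                if not tls.recv(1):  # empty read: peer closed the connection
                    break
            except socket.timeout:
                continue             # still open, keep waiting
            except OSError:
                break                # connection reset by the LB
print(f"connection closed after ~{time.time() - start:.0f}s idle")
```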
@kfehrenbach your workaround sorted this issue out for me, thanks. Did you ever find a more permanent solution?
I'm also having the issue on Rancher v2.8.5.
Still getting this issue; the workaround has not helped so far. Upgraded to Rancher v2.9.3 and it seems to be exactly the same: one control node registers OK, triggering the other control nodes to try, but none of the worker nodes do (registered 10s after the control ones).
I tried the workaround of registering a single control and a worker quickly in succession, but they both stay stuck on "Waiting for node ref". I have managed to hack around this by registering a worker then a control node in quick succession, then joining all the other nodes I need.
What exactly do you mean by "in quick succession"? I have the same issue, also with Rancher v2.8.5 and v1.28.15+rke2r1. The only machine-plan secret that is being filled is the one from the control plane node(s).
We register at least one worker and one control node within a second of each other; we use Ansible to do this, so it's not a manual step for us.
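Roughly, that step looks like the sketch below. It is illustrative only: the hostnames and registration commands are placeholders (we actually drive it from Ansible rather than Python), but it shows what "within a second of each other" means in practice.

```python
# Sketch: run the registration command on a control-plane node and a worker
# node concurrently over SSH, so both join within roughly the same second.
# Hostnames and commands are placeholders; the real command comes from the
# Rancher UI/API with the appropriate role flags appended.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = {
    "control-1": "<registration command> --etcd --controlplane",
    "worker-1": "<registration command> --worker",
}

def register(host: str, command: str) -> subprocess.CompletedProcess:
    # BatchMode avoids hanging on a password prompt when run unattended
    return subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True,
        text=True,
    )

with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    futures = {pool.submit(register, h, c): h for h, c in NODES.items()}
    for future, host in futures.items():
        print(host, "exit code:", future.result().returncode)
```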
I tried that too. I also installed the nodes using Ansible, so they join in the same second, basically. We don't have any load balancer in front of Rancher, so nothing we could do there. Still, the machine-plan secret stays empty for all worker nodes :(
Is there an existing issue for this?
Current Behavior
When trying to register a new node with a new downstream RKE2 cluster in Rancher 2.7.9 (also 2.7.5) we see the node's plan Secret is never populated, so the rancher-system-agent endlessly polls for a plan. If we re-deploy the fleet-agent Deployment prior to creating the new downstream cluster definition in Rancher, we can occasionally register nodes. We have to re-deploy fleet-agent each time we need to create a new cluster, though this does not consistently work around the issue. If the registration fails or we need to re-create the cluster, we wipe the nodes, delete the cluster from Rancher and repeat the steps above.
From the fleet-controller logs when creating the downstream cluster named "test": (log output omitted)
The workaround of restarting the fleet-agent is not consistent; sometimes repeated manual loops of create cluster, register, delete cluster work. Registration of nodes to k3s clusters appears to work, though I've not tested that as much.
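For anyone reproducing this, one way to do the re-deploy described above is a rollout-restart of the fleet-agent Deployment. A minimal sketch follows; the cattle-fleet-local-system namespace is an assumption from our install, and fleet-agent may run elsewhere (e.g. cattle-fleet-system) in yours.

```python
# Sketch of re-deploying fleet-agent by patching a restart annotation onto the
# Deployment's pod template, which is what `kubectl rollout restart` does.
# Namespace and Deployment name are assumptions; check your own install.
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {"spec": {"template": {"metadata": {"annotations": {
    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
}}}}}
apps.patch_namespaced_deployment(
    name="fleet-agent", namespace="cattle-fleet-local-system", body=patch
)
```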
Expected Behavior
We can register nodes to newly created downstream clusters.
Steps To Reproduce
Environment
Logs
Logs from fleet-agent after a restart followed by a failed node registration: (omitted)
Logs from fleet-agent after a restart, create new cluster, and successful registration: (omitted)
Anything else?
Ref rancher/rancher#43901 specifically rancher/rancher#43901 (comment)