
Rancher new cluster node registration failing #2053

Open
P-n-I opened this issue Jan 9, 2024 · 24 comments

P-n-I commented Jan 9, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When trying to register a new node with a new downstream RKE2 cluster in Rancher 2.7.9 (also 2.7.5), we see the node's plan Secret is never populated, so the rancher-system-agent endlessly polls for a plan.

If we re-deploy the fleet-agent Deployment prior to creating the new downstream cluster definition in Rancher, we can occasionally register nodes.

We have to re-deploy fleet-agent each time we need to create a new cluster, though this does not consistently work around the issue.

  • re-deploy fleet-agent Deployment on the Rancher cluster (k -n cattle-fleet-local-system rollout restart deployment fleet-agent)
  • create new downstream cluster definition
  • register node(s) to cluster

If the registration fails or we need to re-create the cluster, we wipe the nodes, delete the cluster from Rancher, and repeat the steps above.
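
For anyone scripting that restart step, a minimal client-go sketch (illustrative only; it assumes the default kubeconfig and mirrors what `kubectl rollout restart` does by bumping the pod-template annotation):

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (~/.kube/config); adjust as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// `kubectl rollout restart` just patches this annotation on the pod template,
	// which triggers a fresh rollout of the fleet-agent Deployment.
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339))

	_, err = cs.AppsV1().Deployments("cattle-fleet-local-system").Patch(
		context.TODO(), "fleet-agent", types.StrategicMergePatchType,
		[]byte(patch), metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("restarted deployment cattle-fleet-local-system/fleet-agent")
}
```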

From the fleet-controller logs when creating the downstream cluster named "test":

2024-01-09T14:14:27.714641430Z time="2024-01-09T14:14:27Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-test-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"

The workaround of restarting the fleet-agent is not consistent; sometimes it takes repeated manual loops of create cluster, register, delete cluster before it works.

Registration of nodes to k3s clusters appears to work, though I've not tested that as much.

Expected Behavior

We can register nodes to newly created downstream clusters.

Steps To Reproduce

  • create new rke2 cluster
  • run registration command on cluster bootstrap node

Environment

- Architecture: x86_64
- Fleet Version: 1.7.1 and 1.8.1
- Cluster:
  - Provider: rke2
  - Options:
  - Kubernetes Version: v1.26.11+rke2r1

Logs

Logs from fleet-agent after a restart followed by a failed node registration:

I0109 14:34:16.884697       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-local-system/fleet-agent-lock...
2024-01-09T14:34:20.761215643Z I0109 14:34:20.760567       1 leaderelection.go:258] successfully acquired lease cattle-fleet-local-system/fleet-agent-lock
2024-01-09T14:34:21.514842587Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
2024-01-09T14:34:21.515239711Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=Secret controller"
2024-01-09T14:34:21.515651076Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=Node controller"
2024-01-09T14:34:21.515921289Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
2024-01-09T14:34:22.245467409Z E0109 14:34:22.245355       1 memcache.go:206] couldn't get resource list for management.cattle.io/v3: 
time="2024-01-09T14:34:22Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=BundleDeployment controller"
time="2024-01-09T14:34:22Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:22Z" level=info msg="getting history for release fleet-agent-local"
time="2024-01-09T14:34:22Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:23Z" level=info msg="Deleting orphan bundle ID rke2, release kube-system/rke2-canal"
time="2024-01-09T14:34:24Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:25Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"

Logs from fleet-agent after a restart, create new cluster and successful registration:

I0109 14:37:40.958163       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-local-system/fleet-agent-lock...
2024-01-09T14:37:44.767848536Z I0109 14:37:44.767654       1 leaderelection.go:258] successfully acquired lease cattle-fleet-local-system/fleet-agent-lock
2024-01-09T14:37:45.799901278Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
2024-01-09T14:37:45.799938559Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=Secret controller"
2024-01-09T14:37:45.799944609Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=Node controller"
2024-01-09T14:37:45.799949489Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
E0109 14:37:45.966607       1 memcache.go:206] couldn't get resource list for management.cattle.io/v3: 
2024-01-09T14:37:45.991817525Z time="2024-01-09T14:37:45Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=BundleDeployment controller"
2024-01-09T14:37:45.992046547Z time="2024-01-09T14:37:45Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:46.002690980Z time="2024-01-09T14:37:46Z" level=info msg="getting history for release fleet-agent-local"
2024-01-09T14:37:46.255440243Z time="2024-01-09T14:37:46Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:47.041131051Z time="2024-01-09T14:37:47Z" level=info msg="Deleting orphan bundle ID rke2, release kube-system/rke2-canal"
2024-01-09T14:37:48.276516222Z time="2024-01-09T14:37:48Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:48.527326573Z time="2024-01-09T14:37:48Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"

Anything else?

Ref rancher/rancher#43901 specifically rancher/rancher#43901 (comment)


P-n-I commented Jan 10, 2024

From logs when creating the cluster in Rancher:
fleet-agent

W0110 08:31:07.744207       1 reflector.go:442] pkg/mod/github.com/rancher/[email protected]/tools/cache/reflector.go:167: watch of *v1alpha1.BundleDeployment ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 7; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

fleet-controller

time="2024-01-10T08:31:02Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-dev-sandbox-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"


P-n-I commented Jan 10, 2024

Contents of the cluster's mcc bundle Chart.yaml:

annotations:
  catalog.cattle.io/certified: rancher
  catalog.cattle.io/hidden: "true"
  catalog.cattle.io/kube-version: '>= 1.23.0-0 < 1.27.0-0'
  catalog.cattle.io/namespace: cattle-system
  catalog.cattle.io/os: linux
  catalog.cattle.io/permits-os: linux,windows
  catalog.cattle.io/rancher-version: '>= 2.7.0-0 < 2.8.0-0'
  catalog.cattle.io/release-name: system-upgrade-controller
apiVersion: v1
appVersion: v0.11.0
description: General purpose controller to make system level updates to nodes.
home: https://github.com/rancher/system-charts/blob/dev-v2.7/charts/rancher-k3s-upgrader
kubeVersion: '>= 1.23.0-0'
name: system-upgrade-controller
sources:
- https://github.com/rancher/system-charts/blob/dev-v2.7/charts/rancher-k3s-upgrader
version: 102.1.0+up0.5.0

The downstream cluster we're seeing the issue with is on v1.26.11+rke2r1.

P-n-I changed the title from "Rancher cluster node registration requires fleet-agent restart" to "Rancher new cluster node registration failing" on Jan 11, 2024

P-n-I commented Jan 11, 2024

Debug log output from the fleet-controller when creating a new downstream cluster:


time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:09Z" level=debug msg="shorten bundle name test-managed-system-agent to test-managed-system-agent"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent' took 32.236433ms"
time="2024-01-11T11:08:09Z" level=debug msg="OnPurgeOrphaned for bundle 'test-managed-system-agent' change, checking if gitrepo still exists"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent' took 183.411µs"
time="2024-01-11T11:08:09Z" level=debug msg="OnPurgeOrphaned for bundle 'test-managed-system-agent' change, checking if gitrepo still exists"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:10Z" level=debug msg="shorten bundle name mcc-test-managed-system-upgrade-controller to mcc-test-managed-system-upgrade-controller"
time="2024-01-11T11:08:10Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-test-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller' took 5.27411ms"
time="2024-01-11T11:08:10Z" level=debug msg="OnPurgeOrphaned for bundle 'mcc-test-managed-system-upgrade-controller' change, checking if gitrepo still exists"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller' took 289.752µs"
time="2024-01-11T11:08:10Z" level=debug msg="OnPurgeOrphaned for bundle 'mcc-test-managed-system-upgrade-controller' change, checking if gitrepo still exists"


P-n-I commented Jan 11, 2024

I don't know Go at all, but I've been digging around trying to find out if there's something in our clusters that's wrong.

tag: release/v0.8.1+security1

controller.OnBundleChange
controller.setResourceKey
helmdeployer.Template (sets Helm defaults including useGlobalCfg: true and globalCfg.Capabilities to chartutil.DefaultCapabilities)
Helm.Deploy
Helm.install
Helm.getCfg (if useGlobalCfg return globalCfg)

So at this point it's using the globalCfg, which has the default (v1.20.0) as the kubeVersion in Capabilities; therefore Helm.install doesn't execute cfg.RESTClientGetter.ToRESTMapper().

I can't find useGlobalCfg being set anywhere other than to true in Template, so I think it's left unset (the bool default, false) when called via agent.manager.
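
A minimal Go sketch of that conclusion, assuming the upstream Helm v3 SDK rather than Fleet's actual wiring: chartutil.DefaultCapabilities pins the simulated cluster at v1.20.0, so the chart's ">= 1.23.0-0" constraint fails no matter what the downstream cluster runs:

```go
package main

import (
	"fmt"

	"helm.sh/helm/v3/pkg/chartutil"
)

func main() {
	// kubeVersion constraint from the system-upgrade-controller Chart.yaml above.
	constraint := ">= 1.23.0-0"

	// What the global/client-only Helm config falls back to when no cluster is consulted.
	simulated := chartutil.DefaultCapabilities.KubeVersion.String() // "v1.20.0"

	// The same compatibility check that produces the
	// "chart requires kubeVersion ... incompatible with Kubernetes v1.20.0" log line.
	fmt.Println(simulated, "compatible:", chartutil.IsCompatibleRange(constraint, simulated)) // false
	fmt.Println("v1.26.11 compatible:", chartutil.IsCompatibleRange(constraint, "v1.26.11"))  // true
}
```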


P-n-I commented Jan 12, 2024

I've done some local hacking of the code to add some logging and changed the fleet-controller Deployment to use our in-house hacked version.

This is the output when creating a new cluster called bobbins:

time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange bobbins-managed-system-agent"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange bobbins-managed-system-agent matchedTargets 0"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller matchedTargets 0"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller calling setResourceKey"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 in setResourceKey mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template with useGlobalCg : true"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template patched with useGlobalCg : true"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template calling Deploy"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.Deploy"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.install for bundle mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.install cfg kubeversion v1.20.0"
time="2024-01-12T15:22:34Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-bobbins-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"


P-n-I commented Jan 17, 2024

I hacked the fleet code to work around the "chart requires kubeVersion: >= 1.23.0-0" issue, built a new fleet-controller container, and updated the fleet-controller deployment on our dev cluster to run my hacked version; it made no difference to the problem of the machine-plan Secret not being populated with data.

That unrelated kubeVersion issue relates to the bundle mcc-<cluster>-managed-system-upgrade-controller.

The issue remains that the node's custom-<id>-machine-plan Secret doesn't get populated, so rancher-system-agent endlessly polls Rancher.


P-n-I commented Jan 17, 2024

rancher-system-agent output with CATTLE_AGENT_LOGLEVEL=debug

Jan 17 15:34:23 packer systemd[1]: Started Rancher System Agent.
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=debug msg="Instantiated new image utility with imagesDir: /var/lib/rancher/agent/images, imageCredentialProviderConfig: /var/lib/rancher/credentialprovider/config.yaml, imageCredentialProviderBinDir: /var/lib/rancher/credentialprovider/bin, agentRegistriesFile: /etc/rancher/agent/registries.yaml"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Starting remote watch of plans"
Jan 17 15:34:27 packer rancher-system-agent[18569]: E0117 15:34:27.619141   18569 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Jan 17 15:34:27 packer rancher-system-agent[18569]: time="2024-01-17T15:34:27Z" level=info msg="Starting /v1, Kind=Secret controller"
Jan 17 15:34:27 packer rancher-system-agent[18569]: time="2024-01-17T15:34:27Z" level=debug msg="[K8s] Processing secret custom-aede8c2b641f-machine-plan in namespace fleet-default at generation 0 with resource version 48393246"

and

k -n fleet-default get secret custom-aede8c2b641f-machine-plan
NAME                               TYPE                         DATA   AGE
custom-aede8c2b641f-machine-plan   rke.cattle.io/machine-plan   0      101s
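
The same check done programmatically, as a small client-go sketch (illustrative only; default kubeconfig assumed), listing the machine-plan Secrets in fleet-default and flagging the ones that are still empty:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	secrets, err := cs.CoreV1().Secrets("fleet-default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, s := range secrets.Items {
		// A populated plan has data; registration is stuck while DATA stays at 0.
		if s.Type == "rke.cattle.io/machine-plan" && len(s.Data) == 0 {
			fmt.Printf("%s has an empty plan\n", s.Name)
		}
	}
}
```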

@rgomez-eng

I'm having the exact same issue. Is there any workaround to get past this? Or maybe a specific version to use?


P-n-I commented Jan 19, 2024

I've not found a workaround; sometimes a registration works, but mostly it's stuck on the empty machine-plan for us.


P-n-I commented Jan 26, 2024

@rgomez-eng a long shot, but are you registering the node(s) with all three roles, or do the problematic nodes have a subset of the etcd, controlplane and worker roles?
Check the logs from the Rancher pods for occurrences of

[INFO] [planner] rkecluster fleet-default/<CLUSTER NAME>: waiting for at least one control plane, etcd, and worker node to be registered

That implies a node with one of those roles isn't registered.
Until each of the three roles is fulfilled by at least one registered node, the cluster is not considered 'sane' and no node plan is delivered, so the rancher-system-agent endlessly polls for the plan Secret.


P-n-I commented Jan 29, 2024

We're not able to reliably re-create the issue and don't have the time to investigate further.
Sometimes it just takes a while for the plan Secret to populate, even though we have 6 nodes (3 etcd/control, 3 worker) wanting to join.


P-n-I commented Feb 15, 2024

We're still seeing this issue:

fleet-default                                    custom-2a6a4339a879-machine-plan                           rke.cattle.io/machine-plan            0	 14m
fleet-default                                    custom-638e801c6183-machine-plan                           rke.cattle.io/machine-plan            0	 14m
fleet-default                                    custom-60117bd68fdd-machine-plan                           rke.cattle.io/machine-plan            0	 15m
fleet-default                                    custom-4623c9642380-machine-plan                           rke.cattle.io/machine-plan            0	 16m
fleet-default                                    custom-6aa4c775aee0-machine-plan                           rke.cattle.io/machine-plan            0	 16m
fleet-default                                    custom-a8dd42fd6fcc-machine-plan                           rke.cattle.io/machine-plan            0	 16m


P-n-I reopened this Feb 15, 2024
github-project-automation bot moved this from 🆕 New to 📋 Backlog in Fleet Feb 15, 2024

P-n-I commented Feb 15, 2024

We use Ansible to register the nodes (we pull the registration command from the Rancher API and run it on each node).

The first set of nodes to get registered are the ones with the control and etcd roles. After they're registered, we register the worker nodes.

Rancher won't, by design, populate the machine-plans for the nodes until at least one node of each role is registered.

I tried manually registering 3 nodes that had all three roles but still see the machine-plan Secrets with 0 bytes of data.


kkaempf commented Feb 19, 2024

Does it still happen with Rancher 2.8.1?


P-n-I commented Feb 19, 2024

I've just upgraded to 2.8.2 and wanted to re-create the node in a single-node k3s cluster.
I deleted the node from the downstream cluster while on 2.7.9.
Got notified of your comment, so I upgraded to 2.8.2.
Joining the node to the cluster is still stuck on an empty machine-plan Secret.

fleet-default                                    custom-624a3f13e536-machine-plan                           rke.cattle.io/machine-plan            0      8m32s

Provisioning log:

[INFO ] waiting for infrastructure ready
[INFO ] waiting for at least one control plane, etcd, and worker node to be registered
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for bootstrap etcd to be available
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for agent to check in and apply initial plan
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-scheduler
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] non-ready bootstrap machine(s) custom-624a3f13e536 and join url to be available on bootstrap node
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] marking control plane as initialized and ready
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for plan to be applied
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: Node condition Ready is False., waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for plan to be applied
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for kubelet to update
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.


P-n-I commented Mar 7, 2024

Rancher 2.8.2: created a new k3s cluster and registered one node to it. Destroyed the cluster and re-created it with the same name, ran k3s-uninstall on the node and joined it to the new cluster, and we see what looks like the old node from the first cluster attempt in the UI:
[screenshot]
Note the age of the cluster and the working node versus the age of the node in error.

manno moved this from 📋 Backlog to To Triage in Fleet Mar 13, 2024
manno moved this from To Triage to 🆕 New in Fleet Mar 20, 2024

kfehrenbach commented Mar 26, 2024

We have a similar issue caused by an empty machine-plan for the new nodes in a new cluster. A workaround that helped was this:

  1. Run the command for joining the 1st master (don't wait; move straight to the second step).
  2. Run the command for joining the 1st worker. You will see the 1st master change its status from WaitingNodeRef.
  3. Run the command on the 2nd and 3rd masters. After that, the cattle-cluster-agent pods will come up and Worker1 will change its status from WaitingNodeRef.
  4. Join the other workers.

Suddenly the plan for master-1 was populated and the cluster bootstrapping started. We absolutely have no idea why the hell this works...
Note: fleet-agent is at version 0.8.1


P-n-I commented Jul 10, 2024

We might have found a cause in our env: we upped the timeout on the load balancer we have in front of the nodes running Rancher, as we thought it was probably killing the rancher-system-agent's websocket watch.

@walker-tom

@kfehrenbach your workaround sorted this issue out for me, thanks. Did you ever find a more permanent solution?


vonhutrong commented Dec 24, 2024

I'm also having this issue: the custom-xyz-machine-plan secret is empty.
The workaround above didn't help, maybe because of an older fleet-agent: v0.7.1.

Rancher v2.8.5
RKE2 v1.28.15+rke2r1


P-n-I commented Jan 8, 2025

Still getting this issue; the workaround hasn't helped so far.
Rancher v2.8.5
rancher/fleet-agent:v0.9.5
RKE2: v1.28.15+rke2r1

Upgraded to Rancher v2.9.3 and it seems to be exactly the same: one control node registers OK, triggering the other control nodes to try, but none of the worker nodes do (they were registered 10s after the control ones).
Worker node machine plans:

k -n fleet-default get secret | grep machine-plan | sort -k 3 | head -n 10
custom-2214fa96e9a5-machine-plan                                      rke.cattle.io/machine-plan                    0      9m35s
custom-5c1309321ed5-machine-plan                                      rke.cattle.io/machine-plan                    0      9m35s
custom-72951cae746b-machine-plan                                      rke.cattle.io/machine-plan                    0      9m35s

I tried the workaround (register a single control node and a worker in quick succession), but they both stay stuck on "Waiting for node ref".

I have managed to hack around this by registering a worker and then a control node in quick succession, then joining all the other nodes I need.


maxnitze commented Jan 8, 2025

What exactly do you mean by "in quick succession"? I have the same issue, also with Rancher v2.8.5 and v1.28.15+rke2r1. The only machine-plan secret that is being filled is the one from the control plane node(s).


P-n-I commented Jan 9, 2025

We register at least one worker and one control node within a second of each other; we use Ansible to do this, so it's not a manual step for us.
We have found that raising the idle timeout on the load balancer we have in front of the Rancher cluster has fixed the issue for us; a higher idle timeout seems to prevent the rancher-system-agent websocket from being closed.


maxnitze commented Jan 9, 2025

I tried that too. I also installed the nodes using Ansible, so they join in the same second, basically.

We don't have any load balancer in front of Rancher, so there's nothing we could do there. Still, the machine-plan secret stays empty for all worker nodes :(
