
Rancher new cluster node registration failing #2053

Open
P-n-I opened this issue Jan 9, 2024 · 24 comments

P-n-I commented Jan 9, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When trying to register a new node with a new downstream RKE2 cluster in Rancher 2.7.9 (also 2.7.5), we see the node's plan Secret is never populated, so the rancher-system-agent endlessly polls for a plan.

If we re-deploy the fleet-agent Deployment prior to creating the new downstream cluster definition in Rancher, we can occasionally register nodes.

We have to re-deploy fleet-agent each time we need to create a new cluster, though this does not consistently work around the issue.

  • re-deploy fleet-agent Deployment on the Rancher cluster (k -n cattle-fleet-local-system rollout restart deployment fleet-agent)
  • create new downstream cluster definition
  • register node(s) to cluster

If the registration fails or we need to re-create the cluster, we wipe the nodes, delete the cluster from Rancher, and repeat the steps above.
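
For anyone scripting that restart step, a minimal client-go sketch (illustrative only; it assumes the default kubeconfig and mirrors what `kubectl rollout restart` does by bumping the pod-template annotation):

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (~/.kube/config); adjust as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// `kubectl rollout restart` just patches this annotation on the pod template,
	// which triggers a fresh rollout of the fleet-agent Deployment.
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339))

	_, err = cs.AppsV1().Deployments("cattle-fleet-local-system").Patch(
		context.TODO(), "fleet-agent", types.StrategicMergePatchType,
		[]byte(patch), metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("restarted deployment cattle-fleet-local-system/fleet-agent")
}
```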

From the fleet-controller logs when creating the downstream cluster named "test":

2024-01-09T14:14:27.714641430Z time="2024-01-09T14:14:27Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-test-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"

The workaround of restarting the fleet-agent is not consistent; sometimes it takes repeated manual loops of create cluster, register, delete cluster before it works.

Registration of nodes to k3s clusters appears to work, though I've not tested that as much.

Expected Behavior

We can register nodes to newly created downstream clusters.

Steps To Reproduce

  • create new rke2 cluster
  • run registration command on cluster bootstrap node

Environment

- Architecture: x86_64
- Fleet Version: 1.7.1 and 1.8.1
- Cluster:
  - Provider: rke2
  - Options:
  - Kubernetes Version: v1.26.11+rke2r1

Logs

Logs from fleet-agent after a restart followed by a failed node registration:

I0109 14:34:16.884697       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-local-system/fleet-agent-lock...
2024-01-09T14:34:20.761215643Z I0109 14:34:20.760567       1 leaderelection.go:258] successfully acquired lease cattle-fleet-local-system/fleet-agent-lock
2024-01-09T14:34:21.514842587Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
2024-01-09T14:34:21.515239711Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=Secret controller"
2024-01-09T14:34:21.515651076Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=Node controller"
2024-01-09T14:34:21.515921289Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
2024-01-09T14:34:22.245467409Z E0109 14:34:22.245355       1 memcache.go:206] couldn't get resource list for management.cattle.io/v3: 
time="2024-01-09T14:34:22Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=BundleDeployment controller"
time="2024-01-09T14:34:22Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:22Z" level=info msg="getting history for release fleet-agent-local"
time="2024-01-09T14:34:22Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:23Z" level=info msg="Deleting orphan bundle ID rke2, release kube-system/rke2-canal"
time="2024-01-09T14:34:24Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:25Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"

Logs from fleet-agent after a restart, create new cluster and successful registration:

I0109 14:37:40.958163       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-local-system/fleet-agent-lock...
2024-01-09T14:37:44.767848536Z I0109 14:37:44.767654       1 leaderelection.go:258] successfully acquired lease cattle-fleet-local-system/fleet-agent-lock
2024-01-09T14:37:45.799901278Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
2024-01-09T14:37:45.799938559Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=Secret controller"
2024-01-09T14:37:45.799944609Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=Node controller"
2024-01-09T14:37:45.799949489Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
E0109 14:37:45.966607       1 memcache.go:206] couldn't get resource list for management.cattle.io/v3: 
2024-01-09T14:37:45.991817525Z time="2024-01-09T14:37:45Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=BundleDeployment controller"
2024-01-09T14:37:45.992046547Z time="2024-01-09T14:37:45Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:46.002690980Z time="2024-01-09T14:37:46Z" level=info msg="getting history for release fleet-agent-local"
2024-01-09T14:37:46.255440243Z time="2024-01-09T14:37:46Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:47.041131051Z time="2024-01-09T14:37:47Z" level=info msg="Deleting orphan bundle ID rke2, release kube-system/rke2-canal"
2024-01-09T14:37:48.276516222Z time="2024-01-09T14:37:48Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:48.527326573Z time="2024-01-09T14:37:48Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"

Anything else?

Ref rancher/rancher#43901 specifically rancher/rancher#43901 (comment)


P-n-I commented Jan 10, 2024

From logs when creating the cluster in Rancher:
fleet-agent

W0110 08:31:07.744207       1 reflector.go:442] pkg/mod/github.com/rancher/[email protected]/tools/cache/reflector.go:167: watch of *v1alpha1.BundleDeployment ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 7; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

fleet-controller

time="2024-01-10T08:31:02Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-dev-sandbox-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"


P-n-I commented Jan 10, 2024

Contents of the cluster's mcc bundle Chart.yaml:

annotations:
  catalog.cattle.io/certified: rancher
  catalog.cattle.io/hidden: "true"
  catalog.cattle.io/kube-version: '>= 1.23.0-0 < 1.27.0-0'
  catalog.cattle.io/namespace: cattle-system
  catalog.cattle.io/os: linux
  catalog.cattle.io/permits-os: linux,windows
  catalog.cattle.io/rancher-version: '>= 2.7.0-0 < 2.8.0-0'
  catalog.cattle.io/release-name: system-upgrade-controller
apiVersion: v1
appVersion: v0.11.0
description: General purpose controller to make system level updates to nodes.
home: https://github.com/rancher/system-charts/blob/dev-v2.7/charts/rancher-k3s-upgrader
kubeVersion: '>= 1.23.0-0'
name: system-upgrade-controller
sources:
- https://github.com/rancher/system-charts/blob/dev-v2.7/charts/rancher-k3s-upgrader
version: 102.1.0+up0.5.0

The downstream cluster we're seeing the issue with is on v1.26.11+rke2r1.

P-n-I changed the title from "Rancher cluster node registration requires fleet-agent restart" to "Rancher new cluster node registration failing" on Jan 11, 2024

P-n-I commented Jan 11, 2024

Debug log output from the fleet-controller when creating a new downstream cluster:


time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:09Z" level=debug msg="shorten bundle name test-managed-system-agent to test-managed-system-agent"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent' took 32.236433ms"
time="2024-01-11T11:08:09Z" level=debug msg="OnPurgeOrphaned for bundle 'test-managed-system-agent' change, checking if gitrepo still exists"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent' took 183.411µs"
time="2024-01-11T11:08:09Z" level=debug msg="OnPurgeOrphaned for bundle 'test-managed-system-agent' change, checking if gitrepo still exists"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:10Z" level=debug msg="shorten bundle name mcc-test-managed-system-upgrade-controller to mcc-test-managed-system-upgrade-controller"
time="2024-01-11T11:08:10Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-test-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller' took 5.27411ms"
time="2024-01-11T11:08:10Z" level=debug msg="OnPurgeOrphaned for bundle 'mcc-test-managed-system-upgrade-controller' change, checking if gitrepo still exists"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller' took 289.752µs"
time="2024-01-11T11:08:10Z" level=debug msg="OnPurgeOrphaned for bundle 'mcc-test-managed-system-upgrade-controller' change, checking if gitrepo still exists"


P-n-I commented Jan 11, 2024

I don't know Go at all, but I've been digging around trying to find out if there's something in our clusters that's wrong.

tag: release/v0.8.1+security1

controller.OnBundleChange
controller.setResourceKey
helmdeployer.Template (sets Helm defaults including useGlobalCfg: true and globalCfg.Capabilities to chartutil.DefaultCapabilities)
Helm.Deploy
Helm.install
Helm.getCfg (if useGlobalCfg return globalCfg)

So at this point it's using the globalCfg, which has the default (v1.20.0) as the kubeVersion in Capabilities; therefore Helm.install doesn't execute cfg.RESTClientGetter.ToRESTMapper().

I can't find useGlobalCfg being set anywhere other than to true in Template, so I think it's left unset (the bool default, false) when called via agent.manager.
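
A minimal Go sketch of that conclusion, assuming the upstream Helm v3 SDK rather than Fleet's actual wiring: chartutil.DefaultCapabilities pins the simulated cluster at v1.20.0, so the chart's ">= 1.23.0-0" constraint fails no matter what the downstream cluster runs:

```go
package main

import (
	"fmt"

	"helm.sh/helm/v3/pkg/chartutil"
)

func main() {
	// kubeVersion constraint from the system-upgrade-controller Chart.yaml above.
	constraint := ">= 1.23.0-0"

	// What the global/client-only Helm config falls back to when no cluster is consulted.
	simulated := chartutil.DefaultCapabilities.KubeVersion.String() // "v1.20.0"

	// The same compatibility check that produces the
	// "chart requires kubeVersion ... incompatible with Kubernetes v1.20.0" log line.
	fmt.Println(simulated, "compatible:", chartutil.IsCompatibleRange(constraint, simulated)) // false
	fmt.Println("v1.26.11 compatible:", chartutil.IsCompatibleRange(constraint, "v1.26.11"))  // true
}
```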


P-n-I commented Jan 12, 2024

I've done some local hacking of the code to add some logging and changed the fleet-controller Deployment to use our in-house hacked version.

This is the output when creating a new cluster called bobbins:

time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange bobbins-managed-system-agent"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange bobbins-managed-system-agent matchedTargets 0"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller matchedTargets 0"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller calling setResourceKey"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 in setResourceKey mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template with useGlobalCg : true"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template patched with useGlobalCg : true"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template calling Deploy"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.Deploy"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.install for bundle mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.install cfg kubeversion v1.20.0"
time="2024-01-12T15:22:34Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-bobbins-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"


P-n-I commented Jan 17, 2024

I hacked the fleet code to work around the "chart requires kubeVersion: >= 1.23.0-0" issue, built a new fleet-controller container, and updated the fleet-controller deployment on our dev cluster to run my hacked version; it made no difference to the problem of the machine-plan Secret not being populated with data.

That unrelated kubeVersion issue relates to the bundle mcc-<cluster>-managed-system-upgrade-controller.

The issue remains that the node's custom-<id>-machine-plan Secret doesn't get populated, so rancher-system-agent endlessly polls Rancher.


P-n-I commented Jan 17, 2024

rancher-system-agent output with CATTLE_AGENT_LOGLEVEL=debug

Jan 17 15:34:23 packer systemd[1]: Started Rancher System Agent.
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=debug msg="Instantiated new image utility with imagesDir: /var/lib/rancher/agent/images, imageCredentialProviderConfig: /var/lib/rancher/credentialprovider/config.yaml, imageCredentialProviderBinDir: /var/lib/rancher/credentialprovider/bin, agentRegistriesFile: /etc/rancher/agent/registries.yaml"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Starting remote watch of plans"
Jan 17 15:34:27 packer rancher-system-agent[18569]: E0117 15:34:27.619141   18569 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Jan 17 15:34:27 packer rancher-system-agent[18569]: time="2024-01-17T15:34:27Z" level=info msg="Starting /v1, Kind=Secret controller"
Jan 17 15:34:27 packer rancher-system-agent[18569]: time="2024-01-17T15:34:27Z" level=debug msg="[K8s] Processing secret custom-aede8c2b641f-machine-plan in namespace fleet-default at generation 0 with resource version 48393246"

and

k -n fleet-default get secret custom-aede8c2b641f-machine-plan
NAME                               TYPE                         DATA   AGE
custom-aede8c2b641f-machine-plan   rke.cattle.io/machine-plan   0      101s
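
The same check done programmatically, as a small client-go sketch (illustrative only; default kubeconfig assumed), listing the machine-plan Secrets in fleet-default and flagging the ones that are still empty:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	secrets, err := cs.CoreV1().Secrets("fleet-default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, s := range secrets.Items {
		// A populated plan has data; registration is stuck while DATA stays at 0.
		if s.Type == "rke.cattle.io/machine-plan" && len(s.Data) == 0 {
			fmt.Printf("%s has an empty plan\n", s.Name)
		}
	}
}
```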

@rgomez-eng

I'm having the exact same issue. Is there any workaround to get past this? Or maybe a specific version to use?


P-n-I commented Jan 19, 2024

I've not found a workaround; sometimes a registration works, but mostly it's stuck on the empty machine-plan for us.


P-n-I commented Jan 26, 2024

@rgomez-eng a long shot, but are you registering the node(s) with all three roles, or do the problematic nodes have a subset of the etcd, controlplane and worker roles?
Check the logs from the Rancher pods for occurrences of

[INFO] [planner] rkecluster fleet-default/<CLUSTER NAME>: waiting for at least one control plane, etcd, and worker node to be registered

That implies a node with one of those roles isn't registered.
Until each of the three roles is fulfilled by at least one registered node, the cluster is not considered 'sane' and no node plan is delivered, so the rancher-system-agent endlessly polls for the plan Secret.


P-n-I commented Jan 29, 2024

We're not able to reliably re-create the issue and don't have the time to investigate further.
Sometimes it just takes a while for the plan Secret to populate, even though we have 6 nodes (3 etcd/control, 3 worker) wanting to join.


P-n-I commented Feb 15, 2024

We're still seeing this issue:

fleet-default                                    custom-2a6a4339a879-machine-plan                           rke.cattle.io/machine-plan            0	 14m
fleet-default                                    custom-638e801c6183-machine-plan                           rke.cattle.io/machine-plan            0	 14m
fleet-default                                    custom-60117bd68fdd-machine-plan                           rke.cattle.io/machine-plan            0	 15m
fleet-default                                    custom-4623c9642380-machine-plan                           rke.cattle.io/machine-plan            0	 16m
fleet-default                                    custom-6aa4c775aee0-machine-plan                           rke.cattle.io/machine-plan            0	 16m
fleet-default                                    custom-a8dd42fd6fcc-machine-plan                           rke.cattle.io/machine-plan            0	 16m


P-n-I reopened this Feb 15, 2024
github-project-automation bot moved this from 🆕 New to 📋 Backlog in Fleet Feb 15, 2024

P-n-I commented Feb 15, 2024

We use Ansible to register the nodes (we pull the registration command from the Rancher API and run it on each node).

The first set of nodes to get registered are the ones with the control and etcd roles. After they're registered, we register the worker nodes.

Rancher won't, by design, populate the machine-plans for the nodes until at least one node of each role is registered.

I tried manually registering 3 nodes that had all three roles but still see the machine-plan Secrets with 0 bytes of data.


kkaempf commented Feb 19, 2024

Does it still happen with Rancher 2.8.1?


P-n-I commented Feb 19, 2024

I've just upgraded to 2.8.2 and wanted to re-create the node in a single-node k3s cluster.
I deleted the node from the downstream cluster while on 2.7.9.
Got notified of your comment, so I upgraded to 2.8.2.
Joining the node to the cluster is still stuck on an empty machine-plan Secret.

fleet-default                                    custom-624a3f13e536-machine-plan                           rke.cattle.io/machine-plan            0      8m32s

Provisioning log:

[INFO ] waiting for infrastructure ready
[INFO ] waiting for at least one control plane, etcd, and worker node to be registered
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for bootstrap etcd to be available
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for agent to check in and apply initial plan
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-scheduler
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] non-ready bootstrap machine(s) custom-624a3f13e536 and join url to be available on bootstrap node
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] marking control plane as initialized and ready
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for plan to be applied
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: Node condition Ready is False., waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for plan to be applied
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for kubelet to update
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.


P-n-I commented Mar 7, 2024

Rancher 2.8.2: created a new k3s cluster and registered one node to it. Destroyed the cluster and re-created it with the same name, ran k3s-uninstall on the node and joined it to the new cluster, and we see what looks like the old node from the first cluster attempt in the UI:
[screenshot]
Note the age of the cluster and the working node versus the age of the node in error.

manno moved this from 📋 Backlog to To Triage in Fleet Mar 13, 2024
manno moved this from To Triage to 🆕 New in Fleet Mar 20, 2024

kfehrenbach commented Mar 26, 2024

We have a similar issue caused by an empty machine-plan for the new nodes in a new cluster. A workaround that helped was this:

  1. Run the command for joining the 1st master (don't wait; move straight to the second step).
  2. Run the command for joining the 1st worker. You will see the 1st master change its status from WaitingNodeRef.
  3. Run the command on the 2nd and 3rd masters. After that, the cattle-cluster-agent pods will come up and Worker1 will change its status from WaitingNodeRef.
  4. Join the other workers.

Suddenly the plan for master-1 was populated and the cluster bootstrapping started. We absolutely have no idea why the hell this works...
Note: fleet-agent is at version 0.8.1


P-n-I commented Jul 10, 2024

We might have found a cause in our env: we upped the timeout on the load balancer we have in front of the nodes running Rancher, as we thought it was probably killing the rancher-system-agent's websocket watch.

@walker-tom

@kfehrenbach your workaround sorted this issue out for me, thanks. Did you ever find a more permanent solution?


vonhutrong commented Dec 24, 2024

I'm also having this issue: the custom-xyz-machine-plan secret is empty.
The workaround above didn't help, maybe because of an older fleet-agent: v0.7.1.

Rancher v2.8.5
RKE2 v1.28.15+rke2r1


P-n-I commented Jan 8, 2025

Still getting this issue; the workaround hasn't helped so far.
Rancher v2.8.5
rancher/fleet-agent:v0.9.5
RKE2: v1.28.15+rke2r1

Upgraded to Rancher v2.9.3 and it seems to be exactly the same: one control node registers OK, triggering the other control nodes to try, but none of the worker nodes do (they were registered 10s after the control ones).
Worker node machine plans:

k -n fleet-default get secret | grep machine-plan | sort -k 3 | head -n 10
custom-2214fa96e9a5-machine-plan                                      rke.cattle.io/machine-plan                    0      9m35s
custom-5c1309321ed5-machine-plan                                      rke.cattle.io/machine-plan                    0      9m35s
custom-72951cae746b-machine-plan                                      rke.cattle.io/machine-plan                    0      9m35s

I tried the workaround (register a single control node and a worker in quick succession), but they both stay stuck on "Waiting for node ref".

I have managed to hack around this by registering a worker and then a control node in quick succession, then joining all the other nodes I need.


maxnitze commented Jan 8, 2025

What exactly do you mean by "in quick succession"? I have the same issue, also with Rancher v2.8.5 and v1.28.15+rke2r1. The only machine-plan secret that is being filled is the one from the control plane node(s).


P-n-I commented Jan 9, 2025

We register at least one worker and one control node within a second of each other; we use Ansible to do this, so it's not a manual step for us.
We have found that raising the idle timeout on the load balancer we have in front of the Rancher cluster has fixed the issue for us; a higher idle timeout seems to prevent the rancher-system-agent websocket from being closed.


maxnitze commented Jan 9, 2025

I tried that too. I also installed the nodes using Ansible, so they join in the same second, basically.

We don't have any load balancer in front of Rancher, so there's nothing we could do there. Still, the machine-plan secret stays empty for all worker nodes :(
