
Flooding of machine state reconciles during condition message field change #2463

Open
mikejoh opened this issue Mar 5, 2025 · 0 comments · May be fixed by #2464
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


mikejoh commented Mar 5, 2025

/kind bug

What steps did you take and what happened:
Hello!

First of all, thanks for all the hard work put into this provider!

I've been playing around with CAPI + CAPO + DevStack quite a bit lately. I've got everything running locally on my laptop (on top of libvirt + kind).

Today, after I restarted my DevStack VM, I noticed that when the CAPO controller started up it logged a flood of messages that looked like this (--v=2):

I0305 20:42:08.217999       1 openstackmachine_controller.go:361] "Reconciling Machine" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="c7af5672-ef14-4e54-a65a-dc2384272f35" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01"
I0305 20:42:08.291317       1 openstackmachine_controller.go:455] "Waiting for instance to become ACTIVE" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="c7af5672-ef14-4e54-a65a-dc2384272f35" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01" id="5d357a3e-be75-4ed0-99c5-e03dc283d795" status="SHUTOFF"
I0305 20:42:08.314153       1 openstackmachine_controller.go:361] "Reconciling Machine" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="214a54a0-7959-49f5-8c36-7c3b556ca83d" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01"
I0305 20:42:08.411959       1 openstackmachine_controller.go:455] "Waiting for instance to become ACTIVE" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="214a54a0-7959-49f5-8c36-7c3b556ca83d" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01" id="5d357a3e-be75-4ed0-99c5-e03dc283d795" status="SHUTOFF"
I0305 20:42:08.429494       1 openstackmachine_controller.go:361] "Reconciling Machine" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="527802f6-4066-4e48-91a1-dbcfca2a0598" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01"
I0305 20:42:08.492560       1 openstackmachine_controller.go:455] "Waiting for instance to become ACTIVE" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="527802f6-4066-4e48-91a1-dbcfca2a0598" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01" id="5d357a3e-be75-4ed0-99c5-e03dc283d795" status="SHUTOFF"
I0305 20:42:08.512663       1 openstackmachine_controller.go:361] "Reconciling Machine" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="5e0aef1f-249a-4517-a3cb-37c3950f65d8" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01"

I measured the log rate at ~13/s. The Nova compute API on the DevStack side was being hit at the same rate, logging this:

Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-91fcfda4-b4e2-4aa0-a902-c13d915ac5c4 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}
Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-568f1596-3b9c-4346-acc4-f7aff64684b1 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}
Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-34d8ba23-dc2e-4c61-92ca-1af2e3459b28 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}
Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-8217ec12-bdde-49da-8305-f1694882c004 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}
Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-b7e57d26-638d-49ab-82f5-892202ee3d33 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}
Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-dc1d4323-8ac1-426d-a7e2-c6293a97f661 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}

At this point the state of the instances and load balancer (Octavia) in DevStack were:

  • Load balancer in Error state
  • Amphorae instance reporting SHUTOFF
  • My single node k8s cluster instance also reporting SHUTOFF

Not the best state 😏, and probably a corner case, but it was still enough to make the OpenStack machine controller behave a bit off.

When I fetched the OpenStackMachine object in a watch loop from the CAPI + CAPO (kind) cluster, I observed that these two fields changed constantly, at the same rate as the logs were emitted:

  • resourceVersion
  • The message field in the InstanceReady condition

That message field was being set to:

message: 'Instance state is not handled: 0xc001129610'

The problem seems to be that we're writing the pointer's memory address to the message field here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/main/controllers/openstackmachine_controller.go#L456. When that is set in the condition, the resourceVersion changes and a reconcile is retriggered immediately (AFAICT).

If I dereference that instance state variable instead, it works as expected (SHUTOFF is written instead) and the reconcile flooding stops.

The OpenStack server controller does something similar, but reads the state off the object instead, which yields the correct value and does not trigger flooding: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/main/controllers/openstackserver_controller.go#L390
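The mechanism can be reproduced with a minimal, self-contained Go sketch. This is not CAPO's actual code; it uses a plain `*string` in place of the provider's instance-state type, and `conditionMessage`/`conditionMessageFixed` are hypothetical names for illustration. Formatting the pointer with `%v` prints a fresh memory address on every reconcile, so the condition message never stabilizes; dereferencing first yields the stable value.

```go
package main

import "fmt"

// conditionMessage mimics the buggy pattern: formatting a *string with %v
// yields the pointer's memory address (e.g. 0xc000010250), which differs on
// every allocation, so the condition message changes on every reconcile.
func conditionMessage(state *string) string {
	return fmt.Sprintf("Instance state is not handled: %v", state)
}

// conditionMessageFixed dereferences the pointer first, so the message holds
// the actual state ("SHUTOFF") and stays stable across reconciles.
func conditionMessageFixed(state *string) string {
	return fmt.Sprintf("Instance state is not handled: %v", *state)
}

func main() {
	state := "SHUTOFF"
	fmt.Println(conditionMessage(&state))      // prints a memory address, e.g. "...: 0xc000010250"
	fmt.Println(conditionMessageFixed(&state)) // prints "Instance state is not handled: SHUTOFF"
}
```

Because the address-bearing message differs from the previously stored condition, every status patch bumps resourceVersion, the controller's own watch sees the change, and the reconcile loop feeds itself.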

What did you expect to happen:

The OpenStackMachine controller should not write the pointer address to a status condition field; doing so causes a flood of reconcile attempts and logs.

Anything else you would like to add:

OpenStackMachine `status`

Here we can see the full status of the object in CAPO:

...
status:
  addresses:
  - address: 10.6.0.177
    type: InternalIP
  - address: k8s-devstack01-control-plane-hmjss
    type: InternalDNS
  conditions:
  - lastTransitionTime: "2025-03-02T15:07:10Z"
    message: 'Reconciling load balancer member failed: Expected HTTP response code
      [200 204 300] when accessing [GET http://192.168.11.20/load-balancer/v2.0/lbaas/pools?name=k8s-clusterapi-cluster-default-k8s-devstack01-kubeapi-6443],
      but got 500 instead: Internal Server Error'
    reason: LoadBalancerMemberError
    severity: Error
    status: "False"
    type: Ready
  - lastTransitionTime: "2025-03-02T15:07:10Z"
    message: 'Reconciling load balancer member failed: Expected HTTP response code
      [200 204 300] when accessing [GET http://192.168.11.20/load-balancer/v2.0/lbaas/pools?name=k8s-clusterapi-cluster-default-k8s-devstack01-kubeapi-6443],
      but got 500 instead: Internal Server Error'
    reason: LoadBalancerMemberError
    severity: Error
    status: "False"
    type: APIServerIngressReadyCondition
  - lastTransitionTime: "2025-03-05T21:03:48Z"
    message: 'Instance state is not handled: 0xc00026ed90'
    reason: InstanceNotReady
    status: Unknown
    type: InstanceReady
  instanceID: 5d357a3e-be75-4ed0-99c5-e03dc283d795
  ready: true

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): fc2c57a
  • Cluster-API version: v1.9.4
  • OpenStack version: 2024.1
  • Minikube/KIND version: v0.26.0
  • Kubernetes version (use kubectl version): v1.32.0
  • OS (e.g. from /etc/os-release): Arch Linux
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 5, 2025