
Flooding of machine state reconciles during condition message field change #2463

Open
mikejoh opened this issue Mar 5, 2025 · 0 comments · May be fixed by #2464
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


mikejoh commented Mar 5, 2025

/kind bug

What steps did you take and what happened:
Hello!

First of all, thanks for all the hard work put into this provider!

I've been playing around with CAPI + CAPO + DevStack quite a bit lately. I've got everything running locally on my laptop (on top of libvirt + kind).

Today, after I restarted my DevStack VM, I noticed that when the CAPO controller started up it logged a flood of messages that looked like this (--v=2):

I0305 20:42:08.217999       1 openstackmachine_controller.go:361] "Reconciling Machine" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="c7af5672-ef14-4e54-a65a-dc2384272f35" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01"
I0305 20:42:08.291317       1 openstackmachine_controller.go:455] "Waiting for instance to become ACTIVE" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="c7af5672-ef14-4e54-a65a-dc2384272f35" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01" id="5d357a3e-be75-4ed0-99c5-e03dc283d795" status="SHUTOFF"
I0305 20:42:08.314153       1 openstackmachine_controller.go:361] "Reconciling Machine" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="214a54a0-7959-49f5-8c36-7c3b556ca83d" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01"
I0305 20:42:08.411959       1 openstackmachine_controller.go:455] "Waiting for instance to become ACTIVE" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="214a54a0-7959-49f5-8c36-7c3b556ca83d" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01" id="5d357a3e-be75-4ed0-99c5-e03dc283d795" status="SHUTOFF"
I0305 20:42:08.429494       1 openstackmachine_controller.go:361] "Reconciling Machine" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="527802f6-4066-4e48-91a1-dbcfca2a0598" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01"
I0305 20:42:08.492560       1 openstackmachine_controller.go:455] "Waiting for instance to become ACTIVE" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="527802f6-4066-4e48-91a1-dbcfca2a0598" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01" id="5d357a3e-be75-4ed0-99c5-e03dc283d795" status="SHUTOFF"
I0305 20:42:08.512663       1 openstackmachine_controller.go:361] "Reconciling Machine" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/k8s-devstack01-control-plane-hmjss" namespace="default" name="k8s-devstack01-control-plane-hmjss" reconcileID="5e0aef1f-249a-4517-a3cb-37c3950f65d8" openStackMachine="k8s-devstack01-control-plane-hmjss" machine="k8s-devstack01-control-plane-hmjss" cluster="k8s-devstack01" openStackCluster="k8s-devstack01"

I measured the log rate at ~13/s. The Nova compute API on the DevStack side was being hit at the same rate, logging this:

Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-91fcfda4-b4e2-4aa0-a902-c13d915ac5c4 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}
Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-568f1596-3b9c-4346-acc4-f7aff64684b1 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}
Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-34d8ba23-dc2e-4c61-92ca-1af2e3459b28 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}
Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-8217ec12-bdde-49da-8305-f1694882c004 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}
Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-b7e57d26-638d-49ab-82f5-892202ee3d33 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}
Mar 05 20:45:53 devstack01 [email protected][1267]: DEBUG nova.compute.api [None req-dc1d4323-8ac1-426d-a7e2-c6293a97f661 admin admin] [instance: 5d357a3e-be75-4ed0-99c5-e03dc283d795] Fetching instance by UUID {{(pid=1267) get /opt/stack/nova/nova/compute/api.py:2981}}

At this point the state of the instances and load balancer (Octavia) in DevStack were:

  • Load balancer in Error state
  • Amphorae instance reporting SHUTOFF
  • My single node k8s cluster instance also reporting SHUTOFF

Not the best state 😏, and probably a corner case, but it was still enough to make the OpenStack machine controller behave a bit off.

When I fetched the OpenStackMachine object in a watch loop from the CAPI + CAPO (kind) cluster, I observed that these two fields changed constantly, at the same rate as the logs were emitted:

  • resourceVersion
  • The message field in the InstanceReady condition

That message field was being set to:

message: 'Instance state is not handled: 0xc001129610'

The problem seems to be that we're writing the pointer's memory address to the message field here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/main/controllers/openstackmachine_controller.go#L456. When that is set in the condition, the resourceVersion changes and a reconcile is retriggered immediately (AFAICT).

If I dereference that instance state variable instead, it works as expected (SHUTOFF is written instead) and the reconcile flooding stops.

The OpenStack server controller does something similar, but reads the state off the object instead, which yields the correct value and does not trigger flooding: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/main/controllers/openstackserver_controller.go#L390
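The mechanism can be reproduced with a minimal, self-contained Go sketch. This is not CAPO's actual code; it uses a plain `*string` in place of the provider's instance-state type, and `conditionMessage`/`conditionMessageFixed` are hypothetical names for illustration. Formatting the pointer with `%v` prints a fresh memory address on every reconcile, so the condition message never stabilizes; dereferencing first yields the stable value.

```go
package main

import "fmt"

// conditionMessage mimics the buggy pattern: formatting a *string with %v
// yields the pointer's memory address (e.g. 0xc000010250), which differs on
// every allocation, so the condition message changes on every reconcile.
func conditionMessage(state *string) string {
	return fmt.Sprintf("Instance state is not handled: %v", state)
}

// conditionMessageFixed dereferences the pointer first, so the message holds
// the actual state ("SHUTOFF") and stays stable across reconciles.
func conditionMessageFixed(state *string) string {
	return fmt.Sprintf("Instance state is not handled: %v", *state)
}

func main() {
	state := "SHUTOFF"
	fmt.Println(conditionMessage(&state))      // prints a memory address, e.g. "...: 0xc000010250"
	fmt.Println(conditionMessageFixed(&state)) // prints "Instance state is not handled: SHUTOFF"
}
```

Because the address-bearing message differs from the previously stored condition, every status patch bumps resourceVersion, the controller's own watch sees the change, and the reconcile loop feeds itself.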

What did you expect to happen:

The OpenStackMachine controller should not write the pointer address to a status condition field; doing so causes a flood of reconcile attempts and logs.

Anything else you would like to add:

OpenStackMachine `status`

Here we can see the full status of the object in CAPO:

...
status:
  addresses:
  - address: 10.6.0.177
    type: InternalIP
  - address: k8s-devstack01-control-plane-hmjss
    type: InternalDNS
  conditions:
  - lastTransitionTime: "2025-03-02T15:07:10Z"
    message: 'Reconciling load balancer member failed: Expected HTTP response code
      [200 204 300] when accessing [GET http://192.168.11.20/load-balancer/v2.0/lbaas/pools?name=k8s-clusterapi-cluster-default-k8s-devstack01-kubeapi-6443],
      but got 500 instead: Internal Server Error'
    reason: LoadBalancerMemberError
    severity: Error
    status: "False"
    type: Ready
  - lastTransitionTime: "2025-03-02T15:07:10Z"
    message: 'Reconciling load balancer member failed: Expected HTTP response code
      [200 204 300] when accessing [GET http://192.168.11.20/load-balancer/v2.0/lbaas/pools?name=k8s-clusterapi-cluster-default-k8s-devstack01-kubeapi-6443],
      but got 500 instead: Internal Server Error'
    reason: LoadBalancerMemberError
    severity: Error
    status: "False"
    type: APIServerIngressReadyCondition
  - lastTransitionTime: "2025-03-05T21:03:48Z"
    message: 'Instance state is not handled: 0xc00026ed90'
    reason: InstanceNotReady
    status: Unknown
    type: InstanceReady
  instanceID: 5d357a3e-be75-4ed0-99c5-e03dc283d795
  ready: true

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): fc2c57a
  • Cluster-API version: v1.9.4
  • OpenStack version: 2024.1
  • Minikube/KIND version: v0.26.0
  • Kubernetes version (use kubectl version): v1.32.0
  • OS (e.g. from /etc/os-release): Arch Linux
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 5, 2025