Clusterctl upgrade flake #11610
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Here's the source code for where this load balancer gets provisioned in v0.3.25: and now on main for comparison: https://github.com/cahillsf/cluster-api/blob/fb2cc16480a62c1b43c0a1e3a7f48c328e283012/test/infrastructure/docker/internal/docker/manager.go#L110
/area e2e-testing
This commit may be relevant; it's when the newer method of load balancer provisioning was added: 88b3662
I think the fix was f726a2e: instead of trying to grab a free port itself, it lets Docker choose the port automatically by passing through 0. Maybe there's a smart way we can work around that.
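For illustration, here's a minimal Go sketch of that approach (not the actual CAPD code; the image tag, container name, and error handling are placeholders, and it assumes a recent docker Go SDK): publish the container port with host port "0" so Docker assigns a free host port atomically at container start, then read the assigned port back from inspect:

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
	"github.com/docker/go-connections/nat"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	apiPort := nat.Port("6443/tcp")
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{
			Image:        "kindest/haproxy", // illustrative image, not the pinned CAPD tag
			ExposedPorts: nat.PortSet{apiPort: struct{}{}},
		},
		&container.HostConfig{
			// HostPort "0" delegates port selection to Docker/the kernel,
			// avoiding the race of picking a "free" port up front.
			PortBindings: nat.PortMap{apiPort: {{HostIP: "0.0.0.0", HostPort: "0"}}},
		},
		nil, nil, "lb-sketch") // hypothetical container name
	if err != nil {
		panic(err)
	}
	if err := cli.ContainerStart(ctx, resp.ID, container.StartOptions{}); err != nil {
		panic(err)
	}

	// The host port is only known once the container is running.
	info, err := cli.ContainerInspect(ctx, resp.ID)
	if err != nil {
		panic(err)
	}
	fmt.Println("assigned host port:", info.NetworkSettings.Ports[apiPort][0].HostPort)
}
```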
Which jobs are flaking?
periodic-cluster-api-e2e-release-1-9
periodic-cluster-api-e2e-mink8s-release-1-8
periodic-cluster-api-e2e-latestk8s-main
Which tests are flaking?
When testing clusterctl upgrades (v0.4=>v1.6=>current) Should create a management cluster and then upgrade all the providers
When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers
Since when has it been flaking?
Looks like it may have been flaking for a while... failures from:
and can see similar patterns in the timeline back in august/september: https://storage.googleapis.com/k8s-triage/index.html?date=2024-09-04&text=Timed%20out%20waiting%20for%20Cluster&job=.*periodic-cluster-api-e2e.*&test=.*clusterctl%20upgrades
Testgrid link
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-latestk8s-main/1870177294711001088
Reason for failure (if possible)
It seems the issue is that the Docker controller fails to start the workload cluster's load balancer container due to a port conflict when provisioning with v0.3 and v0.4 of clusterctl, which causes the DockerCluster to get stuck:
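For context, the conflict likely comes from reserving a "free" port up front and handing it to Docker afterward. A minimal sketch of that racy pattern (illustrative only, not the verbatim v0.3/v0.4 code):

```go
package main

import (
	"fmt"
	"net"
)

// getFreePort mimics the older approach: listen on an ephemeral port,
// record the number, and release it. Between Close() and the container
// actually binding the port, another process can grab it, which is the
// kind of conflict this flake reports, especially with several e2e
// clusters provisioning in parallel on one host.
func getFreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0") // kernel picks a free port
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := getFreePort()
	if err != nil {
		panic(err)
	}
	// By the time the container starts, this port may already be taken,
	// and the load balancer fails to come up.
	fmt.Printf("docker run -p %d:6443 kindest/haproxy\n", port)
}
```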
Anything else we need to know?
No response
Label(s) to be applied
/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.