Disk partition is too low on vSphere clusters in wallaby workload clusters #3777

Closed

njuettner opened this issue Nov 22, 2024 · 4 comments

Labels: kind/bug, provider/vsphere (Related to a VMware vSphere based on-premises solution)

njuettner (Member) commented Nov 22, 2024

I got paged because the disk space on the root partition was too low.
If the customer runs Java applications, there is already a very high chance that this might kill nodes.

From talking with @vxav, it should already have been increased to 64GB.

Could you please investigate why this hasn't been adjusted?

Affected cluster: wallaby/plant-cassino-dev

plant-cassino-dev-worker-x76tj-7hnn4 ~ # df -h
Filesystem                                                                    Size  Used Avail Use% Mounted on
devtmpfs                                                                      4.0M     0  4.0M   0% /dev
tmpfs                                                                         7.9G     0  7.9G   0% /dev/shm
tmpfs                                                                         3.2G   18M  3.2G   1% /run
/dev/sda9                                                                      17G   15G  1.6G  91% /
sysext                                                                        7.9G   12K  7.9G   1% /usr

Additional information on that:

https://kubernetes.slack.com/archives/CKFGK3SSD/p1622112033033700?thread_ts=1622108128.030700&cid=CKFGK3SSD
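
A quick way to compare what the CAPV templates are configured with against what the node actually sees (a sketch only: the .spec.template.spec.diskGiB path assumes CAPV v1beta1, and the org namespace is taken from the controller logs further down):

# configured disk size per template
kubectl -n org-plant-cassino get vspheremachinetemplates \
  -o custom-columns=NAME:.metadata.name,DISK_GIB:.spec.template.spec.diskGiB,CREATED:.metadata.creationTimestamp

# on the node itself: physical disk vs. root filesystem
lsblk /dev/sda
df -h /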

vxav commented Nov 23, 2024

On this worker I see the full disk is partitioned:

plant-cassino-dev-worker-2cmt9-k4w2w ~ # lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
loop3     7:3    0  1.5M  1 loop
loop4     7:4    0   68M  1 loop
loop5     7:5    0 39.3M  1 loop
sda       8:0    0   64G  0 disk    <<<<<<<<<<<<<<<<
|-sda1    8:1    0  128M  0 part  /boot
|-sda2    8:2    0    2M  0 part
|-sda3    8:3    0    1G  0 part
| `-usr 254:0    0 1016M  1 crypt /usr
|-sda4    8:4    0    1G  0 part
|-sda6    8:6    0  128M  0 part  /oem
|-sda7    8:7    0   64M  0 part
`-sda9    8:9    0 61.7G  0 part  /var/lib/kubelet/pods/218a38ff-7b8b-46cb-a805-48e2c2607dc4/volume-subpaths/hubble-ui-nginx-conf/frontend/0
                                  /
sdb       8:16   0  300M  0 disk  /var/lib/kubelet/pods/3d9560dc-84d2-499c-a724-502f46475662/volumes/kubernetes.io~csi/pvc-2959bf8e-1986-4e25-913d-56f58d95679e/mount
                                  /var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/2211dda774de736397749bf6132a951066e3ee90a3a728d1b8ab1aa80366168b/globalmount
sr0      11:0    1 1024M  0 rom

On the one Nick mentions, it is not:

plant-cassino-dev-worker-x76tj-7hnn4 ~ # lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
loop3     7:3    0   68M  1 loop
loop4     7:4    0 39.3M  1 loop
loop5     7:5    0  1.5M  1 loop
sda       8:0    0   20G  0 disk    <<<<<<<<<<<<<<<<
|-sda1    8:1    0  128M  0 part  /boot
|-sda2    8:2    0    2M  0 part
|-sda3    8:3    0    1G  0 part
| `-usr 254:0    0 1016M  1 crypt /usr
|-sda4    8:4    0    1G  0 part
|-sda6    8:6    0  128M  0 part  /oem
|-sda7    8:7    0   64M  0 part
`-sda9    8:9    0 17.7G  0 part  /var/lib/kubelet/pods/f5709e9c-adf5-4858-9924-429fca75f9dc/volume-subpaths/config/opensearch/10
                                  /var/lib/kubelet/pods/80130db6-1a5e-4046-a5c6-701cc8a3b502/volume-subpaths/base-config/postgres-1/4
                                  /var/lib/kubelet/pods/80130db6-1a5e-4046-a5c6-701cc8a3b502/volume-subpaths/custom-config/postgres-1/3
                                  /var/lib/kubelet/pods/80130db6-1a5e-4046-a5c6-701cc8a3b502/volume-subpaths/custom-config/postgres-1/2
                                  /var/lib/kubelet/pods/80130db6-1a5e-4046-a5c6-701cc8a3b502/volume-subpaths/base-config/postgres-1/1
                                  /var/lib/kubelet/pods/d6d521d2-db18-45f8-a62a-8022ac0ac687/volume-subpaths/dashboards-config/dashboards/0
                                  /var/lib/kubelet/pods/42bb8125-9f7d-4a1b-8b36-01043dcef6c4/volume-subpaths/config-file/ui/0
                                  /

Disk size was increased on October 10th: https://github.com/WEPA-digital/gitops-mc-wallaby/commit/3b3d6f8cd6d4cedb49dded3dc443357a81400636

The problematic node is 54 days old, same as its vsphereMachineTemplate.

I observe that the machineDeployment is in the ScalingDown state.

There is a vsphereMachineTemplate which is 43 days old (which matches the timeline of that disk size change commit).

NAME                                    CLUSTER              NODENAME                                PROVIDERID                                       PHASE     AGE     VERSION
plant-cassino-dev-worker-2cmt9-fdzxh    plant-cassino-dev    plant-cassino-dev-worker-2cmt9-fdzxh    vsphere://42396d54-b28a-3d56-7130-ae4dc8125b0d   Running   43d     v1.27.14
plant-cassino-dev-worker-2cmt9-k4w2w    plant-cassino-dev    plant-cassino-dev-worker-2cmt9-k4w2w    vsphere://4239d785-9432-f771-ce19-14c07405f074   Running   2d20h   v1.27.14
plant-cassino-dev-worker-x76tj-7hnn4    plant-cassino-dev    plant-cassino-dev-worker-x76tj-7hnn4    vsphere://4239e4cc-bf5d-5940-6291-565fae6298bd   Running   54d     v1.27.14
plant-cassino-dev-worker-x76tj-kf9lh    plant-cassino-dev    plant-cassino-dev-worker-x76tj-kf9lh    vsphere://4239fd8a-4d48-d51e-3660-fc7bfee707d2   Running   54d     v1.27.14
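
To see which template each MachineSet is actually pointing at (a rough sketch; field paths assume CAPI v1beta1):

kubectl -n org-plant-cassino get machinesets \
  -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas,TEMPLATE:.spec.template.spec.infrastructureRef.name
kubectl -n org-plant-cassino get vspheremachinetemplates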

The CAPI controller has a lot of errors like:

E1123 03:34:44.267413       1 controller.go:329] "Reconciler error" err="failed to update Machines: failed to update InfrastructureMachine org-plant-cassino/plant-cassino-dev-worker-a45d26c0-pkhvp: failed to update org-plant-cassino/plant-cassino-dev-worker-a45d26c0-pkhvp: failed to apply VSphereMachine org-plant-cassino/plant-cassino-dev-worker-a45d26c0-pkhvp: Internal error occurred: failed calling webhook \"default.vspheremachine.infrastructure.cluster.x-k8s.io\": failed to call webhook: Post \"https://capv-webhook-service.giantswarm.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine?timeout=10s\": dial tcp 172.31.241.248:443: connect: operation not permitted" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="org-plant-cassino/plant-cassino-dev-worker-2cmt9" namespace="org-plant-cassino" name="plant-cassino-dev-worker-2cmt9" reconcileID="529950fa-4980-477c-adc2-e159e29b5288"

Followed by

E1123 11:44:20.011137       1 controller.go:329] "Reconciler error" err="failed to retrieve VSphereMachineTemplate external object \"org-plant-cassino\"/\"plant-cassino-dev-worker-ad331fa6\": VSphereMachineTemplate.infrastructure.cluster.x-k8s.io \"plant-cassino-dev-worker-ad331fa6\" not found" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="org-plant-cassino/plant-cassino-dev-worker-x76tj" namespace="org-plant-cassino" name="plant-cassino-dev-worker-x76tj" reconcileID="fc5e03fc-9d92-4419-8af9-89cb84708b3d"

So it looks like this node rolling got stuck for some reason.
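
The "connect: operation not permitted" part points at the webhook call being blocked on the network side rather than the webhook itself being down. A few checks that would narrow it down (a sketch; the namespace comes from the webhook URL above, the pod label and policy names are assumptions):

kubectl -n giantswarm get svc,endpoints capv-webhook-service     # does the webhook service have ready endpoints?
kubectl -n giantswarm get pods | grep -i capv                    # is the CAPV controller/webhook pod healthy?
kubectl get networkpolicies -A | grep -iE 'capi|capv'            # anything that could block the caller?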

vxav commented Nov 23, 2024

Deleting the machine objects from the old machine deployment.

Machines are gone, but there are now only 2 machines and the machine deployment still shows wrong values.

NAME                        CLUSTER              REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE         AGE     VERSION
plant-cassino-dev-worker    plant-cassino-dev    5          5       2         0             ScalingDown   82d     v1.27.14

This is because the old machineset still exists (the 5 replicas come from 3+2, as it turns out).

> kg machineset
NAME                              CLUSTER              REPLICAS   READY   AVAILABLE   AGE     VERSION
plant-cassino-dev-worker-2cmt9    plant-cassino-dev    2          2       2           43d     v1.27.14
plant-cassino-dev-worker-x76tj    plant-cassino-dev    3          3       3           75d     v1.27.14

Restarting the CAPI controller didn't help.

I manually deleted the machine set, and the machine deployment picked up the first machine set only; it is now scaling up the extra 2 machines.

I also manually cleaned up the old vspheremachinetemplate (roughly the steps sketched below).

Not sure if this ☝ is perfectly clean, but I couldn't think of anything else.
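
Roughly what that manual cleanup amounts to (machine and machineset names are from the listings above; the old template name is a placeholder since it isn't shown here, and ownerReferences should be double-checked before deleting anything by hand):

kubectl -n org-plant-cassino delete machine plant-cassino-dev-worker-x76tj-7hnn4 plant-cassino-dev-worker-x76tj-kf9lh
kubectl -n org-plant-cassino delete machineset plant-cassino-dev-worker-x76tj
kubectl -n org-plant-cassino delete vspheremachinetemplate <old-template-name>   # placeholder, not the real name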

vxav commented Nov 23, 2024

The CAPI controller no longer logs issues, so I think we are OK-ish for now.
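
A quick way to confirm that (the deployment name and namespace are assumptions about this management cluster):

kubectl -n giantswarm logs deploy/capi-controller-manager --since=1h | grep -c "Reconciler error"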

@vxav vxav closed this as completed Nov 25, 2024