Disk partition is too low on vSphere clusters in wallaby workload clusters #3777

Closed

njuettner opened this issue Nov 22, 2024 · 4 comments

Labels: kind/bug, provider/vsphere (Related to a VMware vSphere based on-premises solution)

njuettner (Member) commented Nov 22, 2024

I got paged because the disk space on the root partition was too low.
If the customer runs Java applications, there is already a very high chance that this might kill nodes.

From talking with @vxav, it should already have been increased to 64GB.

Could you please investigate why this hasn't been adjusted?

Affected cluster: wallaby/plant-cassino-dev

plant-cassino-dev-worker-x76tj-7hnn4 ~ # df -h
Filesystem                                                                    Size  Used Avail Use% Mounted on
devtmpfs                                                                      4.0M     0  4.0M   0% /dev
tmpfs                                                                         7.9G     0  7.9G   0% /dev/shm
tmpfs                                                                         3.2G   18M  3.2G   1% /run
/dev/sda9                                                                      17G   15G  1.6G  91% /
sysext                                                                        7.9G   12K  7.9G   1% /usr

Additional information on that:

https://kubernetes.slack.com/archives/CKFGK3SSD/p1622112033033700?thread_ts=1622108128.030700&cid=CKFGK3SSD
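
A quick way to compare what the CAPV templates are configured with against what the node actually sees (a sketch only: the .spec.template.spec.diskGiB path assumes CAPV v1beta1, and the org namespace is taken from the controller logs further down):

# configured disk size per template
kubectl -n org-plant-cassino get vspheremachinetemplates \
  -o custom-columns=NAME:.metadata.name,DISK_GIB:.spec.template.spec.diskGiB,CREATED:.metadata.creationTimestamp

# on the node itself: physical disk vs. root filesystem
lsblk /dev/sda
df -h /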

vxav commented Nov 23, 2024

On this worker I see the full disk is partitioned:

plant-cassino-dev-worker-2cmt9-k4w2w ~ # lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
loop3     7:3    0  1.5M  1 loop
loop4     7:4    0   68M  1 loop
loop5     7:5    0 39.3M  1 loop
sda       8:0    0   64G  0 disk    <<<<<<<<<<<<<<<<
|-sda1    8:1    0  128M  0 part  /boot
|-sda2    8:2    0    2M  0 part
|-sda3    8:3    0    1G  0 part
| `-usr 254:0    0 1016M  1 crypt /usr
|-sda4    8:4    0    1G  0 part
|-sda6    8:6    0  128M  0 part  /oem
|-sda7    8:7    0   64M  0 part
`-sda9    8:9    0 61.7G  0 part  /var/lib/kubelet/pods/218a38ff-7b8b-46cb-a805-48e2c2607dc4/volume-subpaths/hubble-ui-nginx-conf/frontend/0
                                  /
sdb       8:16   0  300M  0 disk  /var/lib/kubelet/pods/3d9560dc-84d2-499c-a724-502f46475662/volumes/kubernetes.io~csi/pvc-2959bf8e-1986-4e25-913d-56f58d95679e/mount
                                  /var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/2211dda774de736397749bf6132a951066e3ee90a3a728d1b8ab1aa80366168b/globalmount
sr0      11:0    1 1024M  0 rom

On the one Nick mentions, it is not:

plant-cassino-dev-worker-x76tj-7hnn4 ~ # lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
loop3     7:3    0   68M  1 loop
loop4     7:4    0 39.3M  1 loop
loop5     7:5    0  1.5M  1 loop
sda       8:0    0   20G  0 disk    <<<<<<<<<<<<<<<<
|-sda1    8:1    0  128M  0 part  /boot
|-sda2    8:2    0    2M  0 part
|-sda3    8:3    0    1G  0 part
| `-usr 254:0    0 1016M  1 crypt /usr
|-sda4    8:4    0    1G  0 part
|-sda6    8:6    0  128M  0 part  /oem
|-sda7    8:7    0   64M  0 part
`-sda9    8:9    0 17.7G  0 part  /var/lib/kubelet/pods/f5709e9c-adf5-4858-9924-429fca75f9dc/volume-subpaths/config/opensearch/10
                                  /var/lib/kubelet/pods/80130db6-1a5e-4046-a5c6-701cc8a3b502/volume-subpaths/base-config/postgres-1/4
                                  /var/lib/kubelet/pods/80130db6-1a5e-4046-a5c6-701cc8a3b502/volume-subpaths/custom-config/postgres-1/3
                                  /var/lib/kubelet/pods/80130db6-1a5e-4046-a5c6-701cc8a3b502/volume-subpaths/custom-config/postgres-1/2
                                  /var/lib/kubelet/pods/80130db6-1a5e-4046-a5c6-701cc8a3b502/volume-subpaths/base-config/postgres-1/1
                                  /var/lib/kubelet/pods/d6d521d2-db18-45f8-a62a-8022ac0ac687/volume-subpaths/dashboards-config/dashboards/0
                                  /var/lib/kubelet/pods/42bb8125-9f7d-4a1b-8b36-01043dcef6c4/volume-subpaths/config-file/ui/0
                                  /

Disk size was increased on October 10th: https://github.com/WEPA-digital/gitops-mc-wallaby/commit/3b3d6f8cd6d4cedb49dded3dc443357a81400636

The problematic node is 54 days old, same as its vsphereMachineTemplate.

I observe that the machineDeployment is in the ScalingDown state.

There is a vsphereMachineTemplate which is 43 days old (which matches the timeline of that disk size change commit).

NAME                                    CLUSTER              NODENAME                                PROVIDERID                                       PHASE     AGE     VERSION
plant-cassino-dev-worker-2cmt9-fdzxh    plant-cassino-dev    plant-cassino-dev-worker-2cmt9-fdzxh    vsphere://42396d54-b28a-3d56-7130-ae4dc8125b0d   Running   43d     v1.27.14
plant-cassino-dev-worker-2cmt9-k4w2w    plant-cassino-dev    plant-cassino-dev-worker-2cmt9-k4w2w    vsphere://4239d785-9432-f771-ce19-14c07405f074   Running   2d20h   v1.27.14
plant-cassino-dev-worker-x76tj-7hnn4    plant-cassino-dev    plant-cassino-dev-worker-x76tj-7hnn4    vsphere://4239e4cc-bf5d-5940-6291-565fae6298bd   Running   54d     v1.27.14
plant-cassino-dev-worker-x76tj-kf9lh    plant-cassino-dev    plant-cassino-dev-worker-x76tj-kf9lh    vsphere://4239fd8a-4d48-d51e-3660-fc7bfee707d2   Running   54d     v1.27.14
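
To see which template each MachineSet is actually pointing at (a rough sketch; field paths assume CAPI v1beta1):

kubectl -n org-plant-cassino get machinesets \
  -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas,TEMPLATE:.spec.template.spec.infrastructureRef.name
kubectl -n org-plant-cassino get vspheremachinetemplates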

The CAPI controller has a lot of errors like:

E1123 03:34:44.267413       1 controller.go:329] "Reconciler error" err="failed to update Machines: failed to update InfrastructureMachine org-plant-cassino/plant-cassino-dev-worker-a45d26c0-pkhvp: failed to update org-plant-cassino/plant-cassino-dev-worker-a45d26c0-pkhvp: failed to apply VSphereMachine org-plant-cassino/plant-cassino-dev-worker-a45d26c0-pkhvp: Internal error occurred: failed calling webhook \"default.vspheremachine.infrastructure.cluster.x-k8s.io\": failed to call webhook: Post \"https://capv-webhook-service.giantswarm.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine?timeout=10s\": dial tcp 172.31.241.248:443: connect: operation not permitted" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="org-plant-cassino/plant-cassino-dev-worker-2cmt9" namespace="org-plant-cassino" name="plant-cassino-dev-worker-2cmt9" reconcileID="529950fa-4980-477c-adc2-e159e29b5288"

Followed by

E1123 11:44:20.011137       1 controller.go:329] "Reconciler error" err="failed to retrieve VSphereMachineTemplate external object \"org-plant-cassino\"/\"plant-cassino-dev-worker-ad331fa6\": VSphereMachineTemplate.infrastructure.cluster.x-k8s.io \"plant-cassino-dev-worker-ad331fa6\" not found" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="org-plant-cassino/plant-cassino-dev-worker-x76tj" namespace="org-plant-cassino" name="plant-cassino-dev-worker-x76tj" reconcileID="fc5e03fc-9d92-4419-8af9-89cb84708b3d"

So it looks like this node rolling got stuck for some reason.
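
The "connect: operation not permitted" part points at the webhook call being blocked on the network side rather than the webhook itself being down. A few checks that would narrow it down (a sketch; the namespace comes from the webhook URL above, the pod label and policy names are assumptions):

kubectl -n giantswarm get svc,endpoints capv-webhook-service     # does the webhook service have ready endpoints?
kubectl -n giantswarm get pods | grep -i capv                    # is the CAPV controller/webhook pod healthy?
kubectl get networkpolicies -A | grep -iE 'capi|capv'            # anything that could block the caller?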

vxav commented Nov 23, 2024

Deleting the machine objects from the old machine deployment.

Machines are gone, but there are now only 2 machines and the machine deployment still shows wrong values.

NAME                        CLUSTER              REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE         AGE     VERSION
plant-cassino-dev-worker    plant-cassino-dev    5          5       2         0             ScalingDown   82d     v1.27.14

This is because the old machineset still exists (the 5 replicas come from 3+2, as it turns out).

> kg machineset
NAME                              CLUSTER              REPLICAS   READY   AVAILABLE   AGE     VERSION
plant-cassino-dev-worker-2cmt9    plant-cassino-dev    2          2       2           43d     v1.27.14
plant-cassino-dev-worker-x76tj    plant-cassino-dev    3          3       3           75d     v1.27.14

Restarting the CAPI controller didn't help.

I manually deleted the machine set, and the machine deployment picked up the first machine set only; it is now scaling up the extra 2 machines.

I also manually cleaned up the old vspheremachinetemplate (roughly the steps sketched below).

Not sure if this ☝ is perfectly clean, but I couldn't think of anything else.
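
Roughly what that manual cleanup amounts to (machine and machineset names are from the listings above; the old template name is a placeholder since it isn't shown here, and ownerReferences should be double-checked before deleting anything by hand):

kubectl -n org-plant-cassino delete machine plant-cassino-dev-worker-x76tj-7hnn4 plant-cassino-dev-worker-x76tj-kf9lh
kubectl -n org-plant-cassino delete machineset plant-cassino-dev-worker-x76tj
kubectl -n org-plant-cassino delete vspheremachinetemplate <old-template-name>   # placeholder, not the real name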

vxav commented Nov 23, 2024

The CAPI controller no longer logs issues, so I think we are OK-ish for now.
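
A quick way to confirm that (the deployment name and namespace are assumptions about this management cluster):

kubectl -n giantswarm logs deploy/capi-controller-manager --since=1h | grep -c "Reconciler error"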

@vxav vxav closed this as completed Nov 25, 2024