
[BUG] PV may be deleted when change cluster cpu/memory/storage fails due to quota limit #5375

Closed · ldming opened this issue Oct 10, 2023 · 0 comments · Fixed by #5398 or #5403
Labels: bug, kind/bug (Something isn't working), severity/critical (Blocking or critical issues)

ldming (Collaborator) commented Oct 10, 2023

Describe the bug

The cluster's PV may be deleted when changing the cluster CPU/memory/storage fails due to the quota limit.

To Reproduce

Version:

$ kbcli version
Kubernetes: v1.27.3-gke.100
KubeBlocks: 0.6.3-beta.3
kbcli: 0.7.0-alpha.20
1. Create a namespace
kubectl create ns ns-ukltji
2. Create a resource quota
kubectl apply -f -<<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-ns-ukltji
  namespace: ns-ukltji
spec:
  hard:
    limits.cpu: "1.5"
    limits.ephemeral-storage: 10Gi
    limits.memory: 1.5Gi
    requests.storage: 20Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: range-ns-ukltji
  namespace: ns-ukltji
spec:
  limits:
  - default:
      cpu: 100m
      memory: 100Mi
    type: Container
EOF
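Optionally, verify that both objects were admitted before moving on:

# Confirm the quota and limit range exist and show the expected hard limits.
kubectl -n ns-ukltji get resourcequota quota-ns-ukltji
kubectl -n ns-ukltji get limitrange range-ns-ukltji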
3. Create a MongoDB cluster with one replica
kubectl create -f -<<EOF
apiVersion: apps.kubeblocks.io/v1alpha1
kind: Cluster
metadata:
  labels:
    clusterdefinition.kubeblocks.io/name: mongodb
    clusterversion.kubeblocks.io/name: mongodb-5.0
  generateName: mongo-
  namespace: ns-ukltji
spec:
  affinity:
    nodeLabels: {}
    podAntiAffinity: Preferred
    tenancy: SharedNode
    topologyKeys: []
  clusterDefinitionRef: mongodb
  clusterVersionRef: mongodb-5.0
  componentSpecs:
  - componentDefRef: mongodb
    monitor: true
    name: mongodb
    replicas: 1
    resources:
      limits:
        cpu: 1000m
        memory: 1024Mi
      requests:
        cpu: 100m
        memory: 102Mi
    serviceAccountName: dbname
    volumeClaimTemplates:
    - name: data
      spec:
        storageClassName: standard-rwo
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
  terminationPolicy: WipeOut
  tolerations: []
EOF
4. Wait for the cluster to reach Running, then check the cluster status, PV, and PVC
kubectl -n ns-ukltji get cluster,pvc
kubectl get pv | grep ns-ukltji
5. Edit the cluster YAML: set cpu to 2, memory to 2Gi, and storage to 6Gi (an equivalent patch sketch follows the command below)
kubectl -n ns-ukltji edit cluster mongo-bkrcw
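For a non-interactive reproduction, a JSON patch like the following should be equivalent. This is only a sketch: it assumes the componentSpecs and volumeClaimTemplates order from the manifest in step 3, i.e. index 0 is the mongodb component and its data template.

# Sketch: same change as the interactive edit above.
# Index 0 assumes the single-component manifest from step 3.
kubectl -n ns-ukltji patch cluster mongo-bkrcw --type json -p '[
  {"op": "replace", "path": "/spec/componentSpecs/0/resources/limits/cpu", "value": "2"},
  {"op": "replace", "path": "/spec/componentSpecs/0/resources/limits/memory", "value": "2Gi"},
  {"op": "replace", "path": "/spec/componentSpecs/0/volumeClaimTemplates/0/spec/resources/requests/storage", "value": "6Gi"}
]'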

Check the cluster, sts, PVC, and PV; they should look as follows:

$ kubectl -n ns-ukltji get sts           
NAME                  READY   AGE
mongo-bkrcw-mongodb   0/1     6m27s

$ kubectl -n ns-ukltji describe sts mongo-bkrcw-mongodb
...
Events:
  Type     Reason            Age                 From                    Message
  ----     ------            ----                ----                    -------
  Normal   SuccessfulCreate  6m36s               statefulset-controller  create Claim data-mongo-bkrcw-mongodb-0 Pod mongo-bkrcw-mongodb-0 in StatefulSet mongo-bkrcw-mongodb success
  Normal   SuccessfulCreate  6m36s               statefulset-controller  create Pod mongo-bkrcw-mongodb-0 in StatefulSet mongo-bkrcw-mongodb successful
  Warning  FailedCreate      10s (x12 over 20s)  statefulset-controller  create Pod mongo-bkrcw-mongodb-0 in StatefulSet mongo-bkrcw-mongodb failed error: pods "mongo-bkrcw-mongodb-0" is forbidden: exceeded quota: quota-ns-ukltji, requested: limits.cpu=2,limits.memory=2Gi, used: limits.cpu=0,limits.memory=0, limited: limits.cpu=1500m,limits.memory=1536Mi

The PV will be resized to 6Gi, and the PVC condition reads "Waiting for user to (re-)start a pod to finish file system resize of volume on node."
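One way to observe this intermediate state is to read the PVC condition and the PV capacity directly:

# The PVC condition should report FileSystemResizePending while the
# PV's capacity already shows the new size.
kubectl -n ns-ukltji get pvc data-mongo-bkrcw-mongodb-0 -o jsonpath='{.status.conditions[*].type}: {.status.conditions[*].message}{"\n"}'
kubectl get pv $(kubectl -n ns-ukltji get pvc data-mongo-bkrcw-mongodb-0 -o jsonpath='{.spec.volumeName}') -o jsonpath='{.spec.capacity.storage}{"\n"}'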

6. Create an OpsRequest that sets storage to 7Gi
kbcli -n ns-ukltji cluster volume-expand mongo-bkrcw --storage 7Gi --components mongodb --volume-claim-templates data
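Under the hood kbcli submits an OpsRequest; a hand-written equivalent would look roughly like this (a sketch against the apps.kubeblocks.io/v1alpha1 OpsRequest schema; the generated name is illustrative):

kubectl create -f -<<EOF
apiVersion: apps.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
  generateName: mongo-bkrcw-volume-expand-
  namespace: ns-ukltji
spec:
  clusterRef: mongo-bkrcw
  type: VolumeExpansion
  volumeExpansion:
  - componentName: mongodb
    volumeClaimTemplates:
    - name: data
      storage: 7Gi
EOF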

The PV will be resized to 7Gi and the PVC's spec now requests 7Gi, but status.capacity.storage is still 5Gi, and the condition message is again "Waiting for user to (re-)start a pod to finish file system resize of volume on node.", as follows:

$ kubectl -n ns-ukltji get pvc
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-mongo-bkrcw-mongodb-0   Bound    pvc-77b6416f-d067-437a-9f0e-1b1d5d1b6a23   5Gi        RWO            standard-rwo   11m

$ kubectl -n ns-ukltji get pvc data-mongo-bkrcw-mongodb-0 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
    volume.kubernetes.io/selected-node: gke-yjtest-default-pool-c51609d3-jp0w
    volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
  creationTimestamp: "2023-10-10T07:58:24Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/instance: mongo-bkrcw
    app.kubernetes.io/managed-by: kubeblocks
    app.kubernetes.io/name: mongodb
    apps.kubeblocks.io/component-name: mongodb
    apps.kubeblocks.io/vct-name: data
    kubeblocks.io/volume-type: data
  name: data-mongo-bkrcw-mongodb-0
  namespace: ns-ukltji
  ownerReferences:
  - apiVersion: apps.kubeblocks.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Cluster
    name: mongo-bkrcw
    uid: db5d4060-11e4-4578-bd94-474112862ae1
  resourceVersion: "1320720"
  uid: 77b6416f-d067-437a-9f0e-1b1d5d1b6a23
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 7Gi
  storageClassName: standard-rwo
  volumeMode: Filesystem
  volumeName: pvc-77b6416f-d067-437a-9f0e-1b1d5d1b6a23
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 5Gi
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-10-10T08:05:46Z"
    message: Waiting for user to (re-)start a pod to finish file system resize of
      volume on node.
    status: "True"
    type: FileSystemResizePending
  phase: Bound

$ kubectl -n ns-ukltji describe pvc data-mongo-bkrcw-mongodb-0
Name:          data-mongo-bkrcw-mongodb-0
Namespace:     ns-ukltji
StorageClass:  standard-rwo
Status:        Bound
Volume:        pvc-77b6416f-d067-437a-9f0e-1b1d5d1b6a23
Labels:        app.kubernetes.io/instance=mongo-bkrcw
               app.kubernetes.io/managed-by=kubeblocks
               app.kubernetes.io/name=mongodb
               apps.kubeblocks.io/component-name=mongodb
               apps.kubeblocks.io/vct-name=data
               kubeblocks.io/volume-type=data
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
               volume.kubernetes.io/selected-node: gke-yjtest-default-pool-c51609d3-jp0w
               volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      5Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       <none>
Conditions:
  Type                      Status  LastProbeTime                     LastTransitionTime                Reason  Message
  ----                      ------  -----------------                 ------------------                ------  -------
  FileSystemResizePending   True    Mon, 01 Jan 0001 00:00:00 +0000   Tue, 10 Oct 2023 16:05:46 +0800           Waiting for user to (re-)start a pod to finish file system resize of volume on node.
Events:
  Type     Reason                    Age    From                                                                                              Message
  ----     ------                    ----   ----                                                                                              -------
  Normal   WaitForFirstConsumer      12m    persistentvolume-controller                                                                       waiting for first consumer to be created before binding
  Normal   Provisioning              12m    pd.csi.storage.gke.io_gke-98a758e07af3462e91e0-9db8-9246-vm_27d54d86-ae64-4ee6-aed9-34daf51b64e4  External provisioner is provisioning volume for claim "ns-ukltji/data-mongo-bkrcw-mongodb-0"
  Normal   ExternalProvisioning      12m    persistentvolume-controller                                                                       waiting for a volume to be created, either by external provisioner "pd.csi.storage.gke.io" or manually created by system administrator
  Normal   ProvisioningSucceeded     12m    pd.csi.storage.gke.io_gke-98a758e07af3462e91e0-9db8-9246-vm_27d54d86-ae64-4ee6-aed9-34daf51b64e4  Successfully provisioned volume pvc-77b6416f-d067-437a-9f0e-1b1d5d1b6a23
  Warning  ExternalExpanding         5m23s  volume_expand                                                                                     waiting for an external controller to expand this PVC
  Normal   Resizing                  5m22s  external-resizer pd.csi.storage.gke.io                                                            External resizer is resizing volume pvc-77b6416f-d067-437a-9f0e-1b1d5d1b6a23
  Normal   FileSystemResizeRequired  5m19s  external-resizer pd.csi.storage.gke.io                                                            Require file system resize of volume on node

$ kubectl get pv | grep ns-ukltji
pvc-77b6416f-d067-437a-9f0e-1b1d5d1b6a23   7Gi        RWO            Delete           Bound      ns-ukltji/data-mongo-bkrcw-mongodb-0       standard-rwo                               18m
7. Edit the cluster again: set cpu to 1, memory to 1Gi, and storage to 5Gi; the storage value must equal the current PVC capacity, otherwise the change will be rejected (a patch sketch follows)
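A sketch of the equivalent patch (same index assumptions as in step 5):

kubectl -n ns-ukltji patch cluster mongo-bkrcw --type json -p '[
  {"op": "replace", "path": "/spec/componentSpecs/0/resources/limits/cpu", "value": "1"},
  {"op": "replace", "path": "/spec/componentSpecs/0/resources/limits/memory", "value": "1Gi"},
  {"op": "replace", "path": "/spec/componentSpecs/0/volumeClaimTemplates/0/spec/resources/requests/storage", "value": "5Gi"}
]'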

8. Check the cluster, sts, pod, PVC, and PV

The pod will be created, and there are three possible outcomes:

a. The PV and PVC are both Bound and the pod is running; everything is fine.

b. The PV reclaim policy is Retain, its status is Terminating, and the pod fails to attach the volume because the PV is marked for deletion.

$ kubectl -n ns-ukltji get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS       
pvc-5ad53b87-fcfa-4f1e-9ade-54c24d36397d   9Gi        RWO            Retain           Terminating   ns-ukltji/data-mongo-26hqj-mongodb-0       standard-rwo                               35m

$ kubectl describe pod mongo-26hqj-mongodb-0
...
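To confirm that the PV really is marked for deletion in this state, its deletionTimestamp and finalizers can be inspected:

# A non-empty deletionTimestamp means the PV object is being deleted and
# is only kept alive by its finalizers.
kubectl get pv pvc-5ad53b87-fcfa-4f1e-9ade-54c24d36397d -o jsonpath='{.metadata.deletionTimestamp}{" "}{.metadata.finalizers}{"\n"}'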

c. The PV reclaim policy is Delete, its status is Terminating, and the pod fails to attach the volume because the PV is marked for deletion. This is not expected: the PV is being deleted along with the backing disk.

NOTICE: The following output was captured after multiple retries, so the CAPACITY is 11Gi and the PVC name differs from the example above.

$ kubectl get pv | grep ns-ukltji/data-mongo-jblmt-mongodb-0
pvc-4fd8ec3b-0d6d-45e4-b12d-2df7246b4893   11Gi       RWO            Delete           Terminating   ns-ukltji/data-mongo-jblmt-mongodb-0       standard-rwo                               105m

$ kubectl describe pod mongo-***-mongodb-0
Events:
  Type     Reason              Age                    From                     Message
  ----     ------              ----                   ----                     -------
  Normal   Scheduled           2m40s                  default-scheduler        Successfully assigned ns-ukltji/mongo-jblmt-mongodb-0 to gke-yjtest-default-pool-c51609d3-jp0w
  Warning  FailedAttachVolume  2m39s                  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-4fd8ec3b-0d6d-45e4-b12d-2df7246b4893" : rpc error: code = Internal desc = ControllerPublish not permitted on node "projects/kubeblocks/zones/us-central1-c/instances/gke-yjtest-default-pool-c51609d3-jp0w" due to backoff condition
  Warning  FailedAttachVolume  2m36s (x2 over 2m38s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-4fd8ec3b-0d6d-45e4-b12d-2df7246b4893" : rpc error: code = NotFound desc = Could not find disk Key{"pvc-4fd8ec3b-0d6d-45e4-b12d-2df7246b4893", zone: "us-central1-c"}: googleapi: Error 404: The resource 'projects/kubeblocks/zones/us-central1-c/disks/pvc-4fd8ec3b-0d6d-45e4-b12d-2df7246b4893' was not found, notFound
  Warning  FailedMount         37s                    kubelet                  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data], failed to process volumes=[]: timed out waiting for the condition
  Warning  FailedAttachVolume  27s (x6 over 2m34s)    attachdetach-controller  AttachVolume.Attach failed for volume "pvc-4fd8ec3b-0d6d-45e4-b12d-2df7246b4893" : PersistentVolume "pvc-4fd8ec3b-0d6d-45e4-b12d-2df7246b4893" is marked for deletion

The PV-deletion outcome (c) may take multiple retries to encounter and is not reliably reproducible.
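As a defensive measure while this bug exists (an untested suggestion, not a verified workaround), switching the PV's reclaim policy to Retain before making such spec changes should at least preserve the underlying disk if the PV object gets deleted:

# Defensive only: with Retain, deleting the PV object no longer deletes
# the backing disk in the cloud provider.
kubectl patch pv pvc-77b6416f-d067-437a-9f0e-1b1d5d1b6a23 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

As outcome (b) shows, the PV object can still end up Terminating, but the disk itself survives.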
