Disable DR when a cluster is not responsive #1139

Closed
nirs opened this issue Nov 19, 2023 · 1 comment · Fixed by #1147
nirs commented Nov 19, 2023

So far we have tested disabling DR when both the primary and secondary clusters are up. In a disaster use case we may need to disable DR when one of the clusters is not responsive. In this case we may not be able to clean up the cluster, or even get its status using ManagedClusterView.

Simulating a non-responsive cluster is easy with virsh:

virsh -c qemu:///system suspend dr1

Recover a cluster:

virsh -c qemu:///system resume dr1

Tested during failover: suspend the cluster before the failover, and resume it after the application is running on the failover cluster.

Fix

Support marking a drcluster as unavailable (a hypothetical example is sketched after this list). When a cluster is unavailable:

  • On the remaining managed cluster, do not access the s3 store of the unavailable cluster. This allows the vrg delete flow to complete.
  • On the hub, ignore the vrg from the unavailable managed cluster, so that waiting for the vrg count to become zero succeeds and the drpc deletion flow completes.
  • On the hub, do not try to validate the unavailable drcluster.
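
For illustration only, marking a drcluster as unavailable could be done with an annotation on the DRCluster resource. The annotation name below is hypothetical; the actual mechanism is whatever the fix introduces:

# hypothetical annotation name, for illustration only
kubectl annotate drcluster dr1 drcluster.ramendr.openshift.io/unavailable=true --context hub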

Recommended flow

  1. Mark the cluster as unavailable
  2. Failover the application to the good cluster
  3. Fix the drpolicy predicates if needed
  4. Delete the drpc
  5. Delete the policy annotation disabling OCM scheduling
  6. When DR was disabled for all applications, delete the drpolicy referencing the unavailable drcluster and the drcluster resource (see the sketch after this list).
  7. Replace the unavailable cluster
  8. Enable DR again for the applications
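
A minimal sketch of steps 4 and 6 as kubectl commands, assuming a hub context named hub and the resource names used in the test below:

# step 4: delete the drpc of the application
kubectl delete drpc busybox-regional-rbd-deploy-drpc \
    --namespace busybox-regional-rbd-deploy \
    --context hub

# step 6: once DR is disabled for all applications, delete the drpolicy
# referencing the unavailable drcluster, and then the drcluster itself
kubectl delete drpolicy ramen-basic-test --context hub
kubectl delete drcluster dr1 --context hub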

Alternative flow

If the user forgets to mark a cluster as unavailable before disabling DR, disabling DR will get stuck:

  • The vrg on the remaining cluster will have the s3 profile name of the unavailable cluster, so the vrg will be stuck in a retry loop trying to access the unavailable s3 store.
  • The drpc on the hub will be stuck waiting for the stuck vrg and for the stale vrg from the unavailable cluster reported by the managedclusterview.

Marking the cluster as unavailable should fix the issue, but may require more manual work:

  1. Failover the application to the good cluster
  2. Fix the drpolicy predicates if needed
  3. Delete the drpc (it gets stuck because the cluster is unavailable)
  4. Mark the drcluster as unavailable so that the drpc deletion can finish
  5. Delete the policy annotation disabling OCM scheduling
  6. When DR was disabled for all applications, delete the drpolicy referencing the unavailable drcluster and the drcluster resource.
  7. Replace the unavailable cluster
  8. Enable DR again for the applications

Issues:

  • After deleting the drpc and the vrg manifestwork, changes in the manifestwork spec are not propagated to the managed cluster.
    • May need to edit the vrg and remove the s3 profile name of the bad cluster (sketched below)
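
A minimal sketch of that manual edit, assuming minio-on-dr1 is the first entry in spec.s3Profiles of the vrg on the surviving cluster (as in the output below):

# remove the s3 profile of the unavailable cluster from the vrg on dr2
# assumes minio-on-dr1 is at index 0 of spec.s3Profiles
kubectl patch volumereplicationgroups.ramendr.openshift.io busybox-regional-rbd-deploy-drpc \
    --type json \
    --patch '[{"op": "remove", "path": "/spec/s3Profiles/0"}]' \
    --namespace busybox-regional-rbd-deploy \
    --context dr2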

Tasks

  • Test primary cluster failure: failover + disable dr
  • Test secondary cluster failure: deploy + disable dr
    • Any change compared to the first case?
  • Support marking a drcluster as unavailable
  • Skip unavailable drcluster when reconciling drcluster
  • When creating VRG for manifestwork, include only s3 profiles from available drclusters
  • When waiting for vrg count to become zero, ignore vrgs from unavailable drclusters
  • Document replace cluster flow in docs/replace-cluster.md

Similar k8s flows:

nirs self-assigned this Nov 19, 2023

nirs commented Nov 20, 2023

Testing non-responsive cluster flow using #1133

Steps:

  1. Configure test

    $ git diff test/basic-test/config.yaml
    ...
     ---
    -repo: https://github.com/ramendr/ocm-ramen-samples.git
    -path: subscription
    -branch: main
    -name: busybox-sample
    -namespace: busybox-sample
    +repo: https://github.com/nirs/ocm-ramen-samples.git
    +path: k8s/busybox-regional-rbd-deploy/sub
    +branch: test
    +name: busybox-regional-rbd-deploy
    +namespace: busybox-regional-rbd-deploy
     dr_policy: ramen-basic-test
     pvc_label: busybox
    
  2. Deploy and enable dr with regional-dr env

    env=$PWD/test/envs/regional-dr.yaml
    test/basic-test/setup $env
    test/basic-test/deploy $env
    test/basic-test/enable-dr $env
    
  3. Simulate disaster in current cluster (dr1)

    virsh -c qemu:///system suspend dr1
    
  4. Failover application to secondary cluster (dr2)

    kubectl patch drpc busybox-regional-rbd-deploy-drpc \
        --patch '{"spec": {"action": "Failover", "failoverCluster": "dr2"}}' \
        --type merge \
        --namespace busybox-regional-rbd-deploy \
        --context hub
    
  5. Wait until the application is running on dr2 (a check is sketched after these steps)

  6. Disable dr

    test/basic-test/disable-dr $env
    

    (stuck)
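
To check step 5, a minimal verification, assuming the busybox workload runs in the busybox-regional-rbd-deploy namespace on dr2:

kubectl get pod --namespace busybox-regional-rbd-deploy --context dr2
kubectl get drpc busybox-regional-rbd-deploy-drpc \
    --namespace busybox-regional-rbd-deploy \
    --context hub \
    --output wide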

Actual result

Deleting the drpc is stuck.

In the ramen hub logs we see:

2023-11-20T13:31:20.864Z	INFO	controllers.DRPlacementControl	controllers/drplacementcontrol_controller.go:628	Error in deleting DRPC: (waiting for VRGs count to go to zero)	{"DRPC": "busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc", "rid": "25743882-77dc-4572-bdca-60d18d26c97d"}
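
To see which vrg views the hub is still polling, the ManagedClusterViews can be listed on the hub (a sketch):

kubectl get managedclusterview --all-namespaces --context hub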

ManagedClusterViews

We don't have any visibility into the cluster status on dr1; we simply see the last reported status.

$ kubectl get managedclusterview busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv -n dr1 --context hub -o yaml
apiVersion: view.open-cluster-management.io/v1beta1
kind: ManagedClusterView
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/drpc-name: busybox-regional-rbd-deploy-drpc
    drplacementcontrol.ramendr.openshift.io/drpc-namespace: busybox-regional-rbd-deploy
  creationTimestamp: "2023-11-20T13:09:33Z"
  generation: 1
  name: busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv
  namespace: dr1
  resourceVersion: "9665"
  uid: d039bf36-90ac-47eb-abed-fb044d3f5e03
spec:
  scope:
    apiGroup: ramendr.openshift.io
    kind: VolumeReplicationGroup
    name: busybox-regional-rbd-deploy-drpc
    namespace: busybox-regional-rbd-deploy
    version: v1alpha1
status:
  conditions:
  - lastTransitionTime: "2023-11-20T13:10:03Z"
    message: Watching resources successfully
    reason: GetResourceProcessing
    status: "True"
    type: Processing
  result:
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: VolumeReplicationGroup
    metadata:
      creationTimestamp: "2023-11-20T13:09:34Z"
      finalizers:
      - volumereplicationgroups.ramendr.openshift.io/vrg-protection
      generation: 1
      managedFields:
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:finalizers:
              .: {}
              v:"volumereplicationgroups.ramendr.openshift.io/vrg-protection": {}
        manager: manager
        operation: Update
        time: "2023-11-20T13:09:34Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:ownerReferences:
              .: {}
              k:{"uid":"d59b5f73-9b77-4643-a95f-cbeeb9439ac3"}: {}
          f:spec:
            .: {}
            f:async:
              .: {}
              f:replicationClassSelector: {}
              f:schedulingInterval: {}
              f:volumeSnapshotClassSelector: {}
            f:pvcSelector: {}
            f:replicationState: {}
            f:s3Profiles: {}
            f:volSync: {}
        manager: work
        operation: Update
        time: "2023-11-20T13:09:34Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:status:
            .: {}
            f:conditions: {}
            f:kubeObjectProtection: {}
            f:lastGroupSyncBytes: {}
            f:lastGroupSyncDuration: {}
            f:lastGroupSyncTime: {}
            f:lastUpdateTime: {}
            f:observedGeneration: {}
            f:protectedPVCs: {}
            f:state: {}
        manager: manager
        operation: Update
        subresource: status
        time: "2023-11-20T13:11:38Z"
      name: busybox-regional-rbd-deploy-drpc
      namespace: busybox-regional-rbd-deploy
      ownerReferences:
      - apiVersion: work.open-cluster-management.io/v1
        kind: AppliedManifestWork
        name: da6717d4434fc933ac3d041c0fe2591a3f6eb404c56acf93a31ce29681455949-busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mw
        uid: d59b5f73-9b77-4643-a95f-cbeeb9439ac3
      resourceVersion: "17240"
      uid: 0c9bf489-bde7-4c74-a644-ea94f39701b1
    spec:
      async:
        replicationClassSelector: {}
        schedulingInterval: 1m
        volumeSnapshotClassSelector: {}
      pvcSelector:
        matchLabels:
          appname: busybox
      replicationState: primary
      s3Profiles:
      - minio-on-dr1
      - minio-on-dr2
      volSync: {}
    status:
      conditions:
      - lastTransitionTime: "2023-11-20T13:09:37Z"
        message: PVCs in the VolumeReplicationGroup are ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-11-20T13:09:36Z"
        message: VolumeReplicationGroup is replicating
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2023-11-20T13:09:34Z"
        message: Restored cluster data
        observedGeneration: 1
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2023-11-20T13:09:36Z"
        message: Kube objects protected
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      kubeObjectProtection: {}
      lastGroupSyncBytes: 81920
      lastGroupSyncDuration: 0s
      lastGroupSyncTime: "2023-11-20T13:11:01Z"
      lastUpdateTime: "2023-11-20T13:11:38Z"
      observedGeneration: 1
      protectedPVCs:
      - accessModes:
        - ReadWriteOnce
        conditions:
        - lastTransitionTime: "2023-11-20T13:09:37Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Ready
          status: "True"
          type: DataReady
        - lastTransitionTime: "2023-11-20T13:09:36Z"
          message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [minio-on-dr1
            minio-on-dr2]'
          observedGeneration: 1
          reason: Uploaded
          status: "True"
          type: ClusterDataProtected
        - lastTransitionTime: "2023-11-20T13:09:37Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Replicating
          status: "False"
          type: DataProtected
        csiProvisioner: rook-ceph.rbd.csi.ceph.com
        labels:
          app: busybox-regional-rbd-deploy
          app.kubernetes.io/part-of: busybox-regional-rbd-deploy
          appname: busybox
          ramendr.openshift.io/owner-name: busybox-regional-rbd-deploy-drpc
          ramendr.openshift.io/owner-namespace-name: busybox-regional-rbd-deploy
        lastSyncBytes: 81920
        lastSyncDuration: 0s
        lastSyncTime: "2023-11-20T13:11:01Z"
        name: busybox-pvc
        namespace: busybox-regional-rbd-deploy
        replicationID:
          id: ""
        resources:
          requests:
            storage: 1Gi
        storageClassName: rook-ceph-block
        storageID:
          id: ""
      state: Primary

On dr2 we see an error condition when trying to upload data to the s3 store on dr1:

$ kubectl get managedclusterview busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv -n dr2 --context hub -o yaml
apiVersion: view.open-cluster-management.io/v1beta1
kind: ManagedClusterView
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/drpc-name: busybox-regional-rbd-deploy-drpc
    drplacementcontrol.ramendr.openshift.io/drpc-namespace: busybox-regional-rbd-deploy
  creationTimestamp: "2023-11-20T13:09:33Z"
  generation: 1
  name: busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mcv
  namespace: dr2
  resourceVersion: "10795"
  uid: c17d2721-c321-4328-90d9-dcda09eb2608
spec:
  scope:
    apiGroup: ramendr.openshift.io
    kind: VolumeReplicationGroup
    name: busybox-regional-rbd-deploy-drpc
    namespace: busybox-regional-rbd-deploy
    version: v1alpha1
status:
  conditions:
  - lastTransitionTime: "2023-11-20T13:12:33Z"
    message: Watching resources successfully
    reason: GetResourceProcessing
    status: "True"
    type: Processing
  result:
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: VolumeReplicationGroup
    metadata:
      creationTimestamp: "2023-11-20T13:12:27Z"
      deletionGracePeriodSeconds: 0
      deletionTimestamp: "2023-11-20T13:18:10Z"
      finalizers:
      - volumereplicationgroups.ramendr.openshift.io/vrg-protection
      generation: 2
      managedFields:
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:finalizers:
              .: {}
              v:"volumereplicationgroups.ramendr.openshift.io/vrg-protection": {}
        manager: manager
        operation: Update
        time: "2023-11-20T13:12:27Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:ownerReferences:
              .: {}
              k:{"uid":"35837d74-d9b2-49d9-be70-1b2cb8db754a"}: {}
          f:spec:
            .: {}
            f:action: {}
            f:async:
              .: {}
              f:replicationClassSelector: {}
              f:schedulingInterval: {}
              f:volumeSnapshotClassSelector: {}
            f:pvcSelector: {}
            f:replicationState: {}
            f:s3Profiles: {}
            f:volSync: {}
        manager: work
        operation: Update
        time: "2023-11-20T13:12:27Z"
      - apiVersion: ramendr.openshift.io/v1alpha1
        fieldsType: FieldsV1
        fieldsV1:
          f:status:
            .: {}
            f:conditions: {}
            f:kubeObjectProtection: {}
            f:lastUpdateTime: {}
            f:observedGeneration: {}
            f:protectedPVCs: {}
            f:state: {}
        manager: manager
        operation: Update
        subresource: status
        time: "2023-11-20T13:13:44Z"
      name: busybox-regional-rbd-deploy-drpc
      namespace: busybox-regional-rbd-deploy
      ownerReferences:
      - apiVersion: work.open-cluster-management.io/v1
        kind: AppliedManifestWork
        name: da6717d4434fc933ac3d041c0fe2591a3f6eb404c56acf93a31ce29681455949-busybox-regional-rbd-deploy-drpc-busybox-regional-rbd-deploy-vrg-mw
        uid: 35837d74-d9b2-49d9-be70-1b2cb8db754a
      resourceVersion: "18956"
      uid: 1527508b-cbaf-4edd-9ec1-83758d7c466a
    spec:
      action: Failover
      async:
        replicationClassSelector: {}
        schedulingInterval: 1m
        volumeSnapshotClassSelector: {}
      pvcSelector:
        matchLabels:
          appname: busybox
      replicationState: primary
      s3Profiles:
      - minio-on-dr1
      - minio-on-dr2
      volSync: {}
    status:
      conditions:
      - lastTransitionTime: "2023-11-20T13:13:44Z"
        message: PVCs in the VolumeReplicationGroup are ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-11-20T13:13:20Z"
        message: VolumeReplicationGroup is replicating
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2023-11-20T13:12:55Z"
        message: Restored cluster data
        observedGeneration: 1
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2023-11-20T13:13:20Z"
        message: Cluster data of one or more PVs are unprotected
        observedGeneration: 1
        reason: UploadError
        status: "False"
        type: ClusterDataProtected
      kubeObjectProtection: {}
      lastUpdateTime: "2023-11-20T13:13:44Z"
      observedGeneration: 1
      protectedPVCs:
      - accessModes:
        - ReadWriteOnce
        conditions:
        - lastTransitionTime: "2023-11-20T13:13:20Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Ready
          status: "True"
          type: DataReady
        - lastTransitionTime: "2023-11-20T13:13:07Z"
          message: |-
            error uploading PV to s3Profile minio-on-dr1, failed to protect cluster data for PVC busybox-pvc, failed to upload data of bucket:busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc/v1.PersistentVolume/pvc-410f8000-67d6-49bb-8574-cb57c6b4f13c, RequestError: send request failed
            caused by: Put "http://192.168.122.208:30000/bucket/busybox-regional-rbd-deploy/busybox-regional-rbd-deploy-drpc/v1.PersistentVolume/pvc-410f8000-67d6-49bb-8574-cb57c6b4f13c": dial tcp 192.168.122.208:30000: connect: no route to host
          observedGeneration: 1
          reason: UploadError
          status: "False"
          type: ClusterDataProtected
        - lastTransitionTime: "2023-11-20T13:13:20Z"
          message: PVC in the VolumeReplicationGroup is ready for use
          observedGeneration: 1
          reason: Replicating
          status: "False"
          type: DataProtected
        csiProvisioner: rook-ceph.rbd.csi.ceph.com
        labels:
          app: busybox-regional-rbd-deploy
          app.kubernetes.io/part-of: busybox-regional-rbd-deploy
          appname: busybox
          ramendr.openshift.io/owner-name: busybox-regional-rbd-deploy-drpc
          ramendr.openshift.io/owner-namespace-name: busybox-regional-rbd-deploy
        name: busybox-pvc
        namespace: busybox-regional-rbd-deploy
        replicationID:
          id: ""
        resources:
          requests:
            storage: 1Gi
        storageClassName: rook-ceph-block
        storageID:
          id: ""
      state: Primary

DRCluster

There is no visibility into the cluster status in the drcluster resources:

$ kubectl get drcluster --context hub -o yaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"DRCluster","metadata":{"annotations":{},"name":"dr1"},"spec":{"region":"west","s3ProfileName":"minio-on-dr1"}}
    creationTimestamp: "2023-11-20T13:03:21Z"
    finalizers:
    - drclusters.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
    name: dr1
    resourceVersion: "8064"
    uid: 2e1cfaec-d6b3-4b46-bca6-058281ca285f
  spec:
    region: west
    s3ProfileName: minio-on-dr1
  status:
    conditions:
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "False"
      type: Fenced
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "True"
      type: Clean
    - lastTransitionTime: "2023-11-20T13:03:22Z"
      message: Validated the cluster
      observedGeneration: 1
      reason: Succeeded
      status: "True"
      type: Validated
    phase: Available
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ramendr.openshift.io/v1alpha1","kind":"DRCluster","metadata":{"annotations":{},"name":"dr2"},"spec":{"region":"east","s3ProfileName":"minio-on-dr2"}}
    creationTimestamp: "2023-11-20T13:03:21Z"
    finalizers:
    - drclusters.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
    name: dr2
    resourceVersion: "8071"
    uid: f6b37742-35b9-4738-96e6-8a920739b9fc
  spec:
    region: east
    s3ProfileName: minio-on-dr2
  status:
    conditions:
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "False"
      type: Fenced
    - lastTransitionTime: "2023-11-20T13:03:21Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "True"
      type: Clean
    - lastTransitionTime: "2023-11-20T13:03:22Z"
      message: Validated the cluster
      observedGeneration: 1
      reason: Succeeded
      status: "True"
      type: Validated
    phase: Available
kind: List
metadata:
  resourceVersion: ""
