Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate PVC remap feature to replication controller for single cluster failover #147

Draft
wants to merge 43 commits into
base: main
Choose a base branch
from

Conversation

YichenYa0
Copy link
Contributor

@YichenYa0 YichenYa0 commented Jul 2, 2024

Description

  • The feature will detect a successful failover on a single cluster, and remap the direction of PV pairs for all PVCs under the same replication group.
  • Expected result after a planned/unplanned failover: The old PVC will be deleted and a new PVC with the corresponding remote PV, remote storage class name, and the remote replication group. The reclaim policy of both PVs will stay the same.
  • The feature can be disabled by adding an argument flag --disable-pvc-remap=True to the replication controller deployment config helm-charts/charts/csm-replication/templates/controller.yaml when deploying with helm chart, as shown below:
    image
    Alternatively, you can edit the replication controller after its deployment, and add that argument inside the yaml file with k edit -n dell-replication-controller deployment.

GitHub Issues

List the GitHub issues impacted by this PR:

GitHub Issue #
N/A

Checklist:

  • I have performed a self-review of my own code to ensure there are no formatting, vetting, linting, or security issues
  • I have verified that new and existing unit tests pass locally with my changes
  • I have not allowed coverage numbers to degenerate
  • I have maintained at least 90% code coverage
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have maintained backward compatibility

How Has This Been Tested?

  • Manual test planned failover for a replication group with 40 PVCs, costing 2m. Manual test unplanned failover.
  • Unit testing for planned, unplanned, disable pvcremap, pvc remote pv mismatched.
  • Integrated testing for init, synchronize, failover with pvc remap, reprotect, and cleanup.

Manual Testing Instruction

  • Deploy an app. Check if PV, PVC, and RG show up as expected. For example,
    image
  • Helm delete the app deployment pod.
  • Perform a failover: repctl --rg <rg-id> failover --target <replicated-rg-id>
  • The expected result: The Reclaim Policy for each PV should be what it originally was before failover and the PVC should be bound to the replicated PV.
    vd2gzgag
  • Note that the IsSource is not reflected until reprotect. To perform a reprotect: repctl --rg <replicated-rg-id> reprotect
  • To go back to the original state, run a second failback (with arguments switcher) and reprotect (on primary rg).

Test Coverage Change

image

/controllers/replication-controller/dellcsireplicationgroup_controller.go change:

Original test coverage of main branch Latest test coverage due to added code
58.4% 68.2%

falfaroc and others added 26 commits June 12, 2024 09:07
* Initial pvcremap script

* Update pvcremap script
* initial setup for command and get pv list

* retrieve all pvcs, experiment on listoption

* retrieve list of PVCs associated with RG name argument

* passed list of PVCs to swapPVC method

* test and complete

* added swapAllPVC method

* cleaned up unused code

---------

Co-authored-by: shainabanduri3 <[email protected]>
* update annotation and label of newly created pvc for remote-rg

* recognized single cluster + began integrating pvcremap into controller

* added back remote client

* integrate swapallpvc intc into controller

* added custom methods

* used connection package methods

* changed listpersistentvolumes method

* added functions to interface

* changed remoteClient to client

* modified ListPersistentVolumeClaims

* fix listoption

* add list option

* add list pvc

* adjust build ubi micro

* changes to controller

* changes to connections

* changes to interface

* adjust the error in deleting pvc

* organize function calls and documentation

* removed log parameter, cleaned up code

* remove logger

* remove blank lines

---------

Co-authored-by: shainabanduri3 <[email protected]>
Co-authored-by: hynguyenw4dell <[email protected]>
* update annotation and label of newly created pvc for remote-rg

* recognized single cluster + began integrating pvcremap into controller

* added back remote client

* integrate swapallpvc intc into controller

* added custom methods

* used connection package methods

* changed listpersistentvolumes method

* added functions to interface

* changed remoteClient to client

* modified ListPersistentVolumeClaims

* fix listoption

* add list option

* add list pvc

* adjust build ubi micro

* changes to controller

* changes to connections

* changes to interface

* adjust the error in deleting pvc

* organize function calls and documentation

* removed log parameter, cleaned up code

* adding disable-pvc-remap

* adding disablepvcremap

* adding disablepvcremap

* adding disablepvcremap

* clean up unnecessary changes

* remove csm-common

---------

Co-authored-by: shainabanduri3 <[email protected]>
Co-authored-by: hynguyenw4dell <[email protected]>
* add pvcremap test

* add go routines to remapAllPVC

* sucessfully create fake pv

* test with no pv/pvc

* remove html

* test swapPVC

* added code to unit test function to ensure reclaim policy remains after remap

* started testing if failover action is processed

* removed unnecessary Get in updatePVClaimRef and removePVClaimRef + fixed error message

* added complete code for failover test

* fix claim ref loop

* removed unnecessary reconciles, cleaned up code

* add test remap pvc unplanned

* add disabled remap pvc

* add  remap pvc with mismatched target

* tests include disablePVCRemap, unplanned failover, and refactoring

* linting and rename functions

* remove report

---------

Co-authored-by: hynguyenw4dell <[email protected]>
Co-authored-by: shainabanduri3 <[email protected]>
@YichenYa0 YichenYa0 marked this pull request as ready for review July 29, 2024 13:03
Copy link
Contributor

@santhoshatdell santhoshatdell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Tom @falfaroc @YichenYa0
I see that rep-controller deletes the local PVC and there is a manual test step to delete application pods. If the pods are running, delete pvc call would not succeed.
As per my understanding, the purpose of this PR is to automatically remap target PVs to running application pods after failover operation. Please correct me if I got it wrong.

With the current changes, I see that user has to perform below steps when storage array on source side becomes unavailable:

  • Delete application pods,
  • run repctl failover cmd (once completed, replication controller remaps new PVCs to target PVs),
  • Recreate application pods with same yaml which also triggers exports from target storage array.

Copy link
Contributor

@santhoshatdell santhoshatdell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lint check failed. Please check.

Copy link
Contributor

@santhoshatdell santhoshatdell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Two things to check before the PR can be merged.

  1. Need to check with POs if this feature can be merged.
  2. There were multiple commits made after the initial testing. We may need to retest once.

@falfaroc
Copy link
Contributor

LGTM

Two things to check before the PR can be merged.

  1. Need to check with POs if this feature can be merged.
  2. There were multiple commits made after the initial testing. We may need to retest once.

Unit tests are still passing and look good, I will update the description with coverage. I will run the e2e as well to ensure that everything is working there as well.

@santhoshatdell santhoshatdell marked this pull request as draft October 15, 2024 13:34
@santhoshatdell
Copy link
Contributor

Moving to Draft state, need to check with POs regarding this feature's milestone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants