-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete Velero Backup when user sets the deleteVeleroBackup #67
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mpryc The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Depends on #65 as it's rebased on it. |
34133b5
to
72ef619
Compare
@mpryc I am confused, how this fixes related issue? This will delete Velero Backup on demand, right? But NAB will be still around Should not be a mechanism that when NAB is deleted, Velero Backup is also deleted? |
No, we discussed this with @shubham-pampattiwar and @shawn-hurley. We really can't do the above as we don't want end up in situation where there is namespace removed that triggers the NAB to be deleted and we also loose Bacups from which namespace could be restored. In k8s there isn't really separation of the events that triggered NAB deletion. |
@mpryc Thoughts on why we made this feature enablement via |
@@ -280,6 +281,29 @@ func CheckVeleroBackupLabels(backup *velerov1api.Backup) bool { | |||
return exists && value == constant.ManagedByLabelValue | |||
} | |||
|
|||
// DeleteVeleroBackup deletes Velero Backup object based on the NonAdminBackup object name and namespace | |||
func DeleteVeleroBackup(ctx context.Context, r client.Client, logger logr.Logger, nab *nacv1alpha1.NonAdminBackup) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the Deletion logic flawed here ? Why are we using GenerateVeleroBackupName
function to fetch the backup to be deleted ? Isnt there a high possibility that the Generated Name won't match the actual VeleroBackup Name that we want to delete ? Shouldn't we List the Velero backups and then check the annotations of these Velero backup objects and match the nab name with the annotation value of the key "openshift.io/oadp-nab-origin-name"
to get the VeleroBackup to be deleted ? Let me know if I am missing something ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our velero backup name is always generated based on https://github.com/migtools/oadp-non-admin/blob/master/internal/common/function/function.go#L115-L145
fmt.Sprintf("%s-%s-%s", constant.VeleroBackupNamePrefix, namespace, nameHash)
So we have always nab-NAMESPACE-HASH
where the HASH is sha256 from NonAdminBackup name.
VeleroBackup could get deleted only if one matches that entire 3 parts.
If we want to add extra protection we could, but then I would argue the UUID of the NonAdminBackup needs to be used from the VeleroBackup to match the UUID of the requestor.
It is currently possible, because we store this annotation within VeleroBackup:
openshift.io/oadp-nab-origin-uuid=<UUID_OF_NAB>
Let me know if I should add that protection on top of current name match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not to just use
VeleroBackupName string `json:"veleroBackupName,omitempty"` |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that, but with the current implementation I can:
- Create NAB
- VeleroBackup will get created
- Remove NAB
At this stage I have still can remove VeleroBackup from 2 if I create new NAB with the same name as the previous one with flag deleteVeleroBackup: true
.
Using status will disallow me to do so, because 2 will never get created as this object will be there forever until sysadmin removes it. It's because Status can only be updated by controller when 2 is created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to have corner case where for some reason NAB did not got proper update of the Status.
To be honest we may want also add some safety feature to only delete VeleroBackup if it has the UID or other linkage to VeleroBackup from NAB object, this imo is separate design ?
There needs to be one source of truth here and I've chosen our function, where it could also be Status, so either way. Function is less check if status exists as we have name and only check if VeleroBackup exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NAB will not have a status only when the NAB controller is not running right ? or if the NAB controller is running and the status update call fails ? I think this makes it even more prudent that we check for the status of NAB first:
- check if NAB has status
- lookup of VeleroBackup object from NAB.status.VeleroBackupName
- VeleroBackup is in terminal state
- then delete the VeleroBackup object and then subsequently NAB
Picking up the VeleroBackupName from NAB status at the very least makes sure that VeleroBackup exists for that NAB object and Delete Call makes sense from user perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a time period when the NAB was created, but Backup was not yet created. If we are in that time window there is a problem with getting proper velero backup name from status. If that happens we won't reconcile again and that object will be out of sync. That is the only additional use case which I can think of when it comes to inconsistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See thats the thing, we are not allowing Delete for such cases, grabbing the name from status would not allow a delete for such cases(because there is no VeleroBackup object in terminal phase)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, more then half of myself tend to agree with you now :-) Thanks.
@@ -86,31 +87,91 @@ func (r *NonAdminBackupReconciler) Reconcile(ctx context.Context, req ctrl.Reque | |||
return ctrl.Result{}, err | |||
} | |||
|
|||
reconcileExit, reconcileErr := r.DeleteVeleroBackup(ctx, rLog, &nab) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't we have a case to requeue here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lint job complained that the requeue was not used in the r.DeleteVeleroBackup function, so I removed it from the logic.
/hold |
72ef619
to
3db9cc3
Compare
Rebase and small update to the PR to include design. @mateusoliveira43 if you could check the design update and let me know if this is fine. To be done before is ready to merge:
|
Sure will move that update there, but now I wonder if we should completely drop status updates while deleting? Now the way it currently works is: |
We need some sort of status update on NAB object once VeleroBackup is deleted because if there is an err for VeleroBackup deletion call we should be able to tell about that to the user via NAB status. |
If there is any error with VeleroBackup deletion, I think we should not try to delete NAB and only then update NAB status with: |
yes correct that's exactly what I meant, deletion of NAB should be stopped and only status should be updated for NAB object. |
Enhancement which will delete original VeleroBackup when the user sets the deleteVeleroBackup witin NonAdminBackup Spec. Fixes Issue migtools#58 Signed-off-by: Michal Pryc <[email protected]>
3db9cc3
to
c5fb526
Compare
Updated the logic to delete NonAdminBackup when there is no error while removing VeleroBackup and when the VeleroBackup was not in the cluster. Updated manual tests in the description of this PR that were performed. |
@@ -39,6 +39,10 @@ type NonAdminBackupSpec struct { | |||
// BackupSpec defines the specification for a Velero backup. | |||
BackupSpec *velerov1api.BackupSpec `json:"backupSpec,omitempty"` | |||
|
|||
// DeleteVeleroBackup tells the controller to remove created Velero Backup. | |||
// +optional | |||
DeleteVeleroBackup *bool `json:"deleteVeleroBackup,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also deletes NAB itself
Would be better to inform this in doc comment and change field name to simply delete
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete isn't enough imo, we need to be more precise here, however I agree that current name isn't ideal as we also remove NAB so we need to find out better one.
How about DeleteBackup
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah DeleteBackup
sounds good !
// if one exists in the s3 storage, and this will as well cause NonAdminBackup | ||
// to be recreated by the NAC sync controller, but at this stage we have current | ||
// proper state reflected. | ||
logger.V(1).Info(fmt.Sprintf("Velero Backup not deleted for NAB: %s", nab.Name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be interesting to add reason here?
something like "Velero Backup %s, referenced in for %s, was not found in %s namespace and was not deleted"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this will cause NAB to be in phase Backing Off, and all the info about the referenced velero backup is in the status of that NAB, do we want to duplicate this info within debug logs?
logger := logrLogger.WithValues("DeleteVeleroBackup", nab.Namespace) | ||
|
||
// Delete Velero Backup if the deleteVeleroBackup was set to true | ||
if nab.Spec.DeleteVeleroBackup != nil && *nab.Spec.DeleteVeleroBackup { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check here NAB status?
for example, I create a NAB with delete: true, it will be immediately deleted, right? should we only delete finished NABs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, if someone want to abuse spec then it's free to remove object that has been just created. Similarly to:
mkdir /tmp/test && rmdir /tmp/test
Nobody stops anyone from doing above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @mateusoliveira43 we should only delete VeleroBackup that are in terminal phase. Currently, I don't think we can delete an in-progress VeleroBackup
} | ||
} else if apierrors.IsNotFound(err) { | ||
return false, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think implementation is flawed here
we ignore case that err
is not nil and different than apierrors.IsNotFound
, not?
that means that Velero Backup was not even retrieved, but we say it got deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a suggestion is to always check if err != nil
when dealing with errors to avoid forgetting anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% agree on that, will add that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
found a little error in code flow checking err
Small fix for the situation where fetching Backup object gave error other then Non existing backup object. Also reverte nab_status_update.md design. Signed-off-by: Michal Pryc <[email protected]>
Closing this in favor of #108 |
Enhancement which will delete original VeleroBackup when the user sets the deleteVeleroBackup witin NonAdminBackup Spec.
Fixes #58
This allows NonAdminBackup object to have additional field within spec which will trigger deletion of the original VeleroBackup object:
To test:
Test 1:
deleteVeleroBackup on non existing Velero Backup object
BackingOff
Type: Accepted
Status: False
Reason: InvalidBackupSpec
Message: BackupSpec is not defined
DEBUG Deleted NonAdminBackup: backupwithoutspec
DEBUG NonAdminBackupPredicate: Accepted Delete event
DEBUG >>> Reconcile NonAdminBackup - loop start
DEBUG Non existing NonAdminBackup CR
Test 2:
deleteVeleroBackup on existing Velero Backup object set to true
Created
Status: True
Reason: BackupAccepted
Message: Backup accepted
Status: True
Reason: BackupAccepted
Message: Backup accepted
Ensure VeleroBackup is created in the openshift-adp namespace:
$ oc get Backup -n openshift-adp nab-nacproject-eeaa9cd35c77f4
NAME AGE
nab-nacproject-eeaa9cd35c77f4 12s
Update the NonAdminBackup object to include:
DEBUG Deleted NonAdminBackup: backupwithoutspec
DEBUG NonAdminBackupPredicate: Accepted Delete event
DEBUG >>> Reconcile NonAdminBackup - loop start
DEBUG Non existing NonAdminBackup CR
Test 3:
deleteVeleroBackup on existing Velero Backup object set to false
DEBUG: NonAdminBackup BackupSpec and BackupStatus - nothing to update
Test 4:
deleteVeleroBackup on NON existing Velero Backup object set to true
$ oc delete -n openshift-adp nab-nacproject-eeaa9cd35c77f4
DEBUG Deleted NonAdminBackup: backupwithoutspec
DEBUG NonAdminBackupPredicate: Accepted Delete event
DEBUG >>> Reconcile NonAdminBackup - loop start
DEBUG Non existing NonAdminBackup CR