Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbd: Dont depend on image state to issue resync #4076

Merged
merged 2 commits into from
Aug 30, 2023

Conversation

Madhu-1
Copy link
Collaborator

@Madhu-1 Madhu-1 commented Aug 24, 2023

Previously we were dependent on the image mirror state to issue the resync command, What we came to know is that we cannot depend on the image mirror state as the state can change anytime from one to another, We are following the below steps to fix this problem.

  • During the Demote volume store the image creation timestamp.
  • During Resync do the below operation
    • Check the image creation timestamp stored during the Demote operation and the current creation timestamp during Resync and check both are equal and it for force resync then issue resync
    • If the image on both sides is not in the unknown state, check last_snapshot_timestamp on the local mirror description, if it's present send volumeReady as false, or else return an error message.
    • If both the images are in up+unknown the send volumeReady as true.

Storing the image creation timestamp on the rbd image as metadata and it gets cleaned up when the image is recreated as it gets synced up from the primary cluster.

@Madhu-1 Madhu-1 requested review from idryomov, ShyamsundarR and a team August 24, 2023 11:30
@mergify mergify bot added the component/rbd Issues related to RBD label Aug 24, 2023
@Madhu-1 Madhu-1 force-pushed the fix-rbd-resync branch 2 times, most recently from b765827 to 6f3af4d Compare August 24, 2023 11:41
@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Aug 24, 2023

Looks like yum update on base image is broken :(

@@ -637,14 +658,28 @@ func (rs *ReplicationServer) ResyncVolume(ctx context.Context,
ready = checkRemoteSiteStatus(ctx, mirrorStatus)
}

err = rbdVol.ResyncVol(localStatus, req.Force)
err = rbdVol.GetImageInfo()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetImageInfo() only returns an error? Can you not return whatever information you need? Checking attributes like rbdVol.CreatedAt is dangerous, specially because of the required calling of GetImageInfo() (that probably was a reason the function was not public).

Maybe introduce rbdVol.GetCreationTime() that internally calls getImageInfo() if the creation time is not set yet?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nixpanic yes it internally updates the image details in local structure and returns an error if that fails. am not sure we should start introduces functions to get a field of the rbdVol.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename GetImageInfo() to UpdateImageInfo(), but accessing attributes of rbdVolume outside of the rbs package is not very clean. There is no guarantee that the attribute has been set (unless you remember to call getImageInfo(), which is rather error prone.

Copy link
Member

@nixpanic nixpanic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetImgeInfo() really should be an internal function of rbdVolume. A Get...() function should return something in addition to an error.

Using attributes from rbdVolume outside of its own package is not very clean. Use functions that return requires attributes instead (so that the can be populated/checked in rbdVolume functions).

@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Aug 24, 2023

GetImgeInfo() really should be an internal function of rbdVolume. A Get...() function should return something in addition to an error.

Using attributes from rbdVolume outside of its own package is not very clean. Use functions that return requires attributes instead (so that the can be populated/checked in rbdVolume functions).

Added one extra function to return CreationTimestamp, @nixpanic PTAL

return false, errors.New("failed to get image creation time")
}

log.UsefulLog(ctx, "savedImageTime=%v, currentImageTime=%v", savedImageTime, currentImageTime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove this, or phrase it a little nicer, or move it to debug level.

@@ -126,3 +98,23 @@ func (rv *rbdVolume) DisableVolumeReplication(

return nil
}

func (rv *rbdVolume) CheckImageIsPrimary() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused by the naming of the function. I expected a true/false return in case the image is (not) primary...

Instead, if the image is primary, it returns a ErrInvalidArgument?

What is the intention of this function, and can it be made clearer by renaming and/or returning a bool? A comment with a description of the function would also help.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

things got messed up when I tried to refractor, for now removed refractoring as it looks complicated to do as part of this PR.

@Madhu-1 Madhu-1 force-pushed the fix-rbd-resync branch 3 times, most recently from f32f641 to daa2550 Compare August 24, 2023 13:15
@Madhu-1 Madhu-1 requested a review from nixpanic August 24, 2023 13:15
if len(strings.Split(parts[0], ":")) != 2 || len(strings.Split(parts[1], ":")) != 2 {
return nil, status.Errorf(codes.Internal, "failed to parse image creation time: %s", savedImageTime)
}
secondsStr := strings.Split(parts[0], ":")[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion it would be to good to have a different function to parse the creation time and get the required format, we can also further add unit test for it. What do you think @Madhu-1 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, definitely a good suggestion!

if len(strings.Split(parts[0], ":")) != 2 || len(strings.Split(parts[1], ":")) != 2 {
return nil, status.Errorf(codes.Internal, "failed to parse image creation time: %s", savedImageTime)
}
secondsStr := strings.Split(parts[0], ":")[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, definitely a good suggestion!

// store the image creation time for resync
_, err = rbdVol.GetMetadata(imageCreationTimeKey)
if err != nil && errors.Is(err, librbd.ErrNotFound) {
err = rbdVol.SetMetadata(imageCreationTimeKey, creationTime.String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using your own timestampToString() and timestampFromString() functions. There is no guarantee that creationTime.String() produces the same string format across package versions.

@@ -1558,6 +1558,19 @@ func (rv *rbdVolume) setImageOptions(ctx context.Context, options *librbd.ImageO
return nil
}

// getImageCreationTime returns the creation time of the image. if the image
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getImageCreationTime -> GetImageCreationTime

if err != nil {
return nil, status.Error(codes.Internal, err.Error())
return nil, status.Errorf(codes.Internal, "failed to get image creation time: %s", err.Error())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider including the image name and key in the error message, if something goes wrong it gives a hint to check while troubleshooting

if err != nil {
return nil, getGRPCError(err)
return nil, status.Errorf(codes.Internal, "failed to get image info: %s", err.Error())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the error message to include the name of the image?

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-cephfs

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.27

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.26

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.28

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.28

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.28

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Aug 29, 2023
@mergify
Copy link
Contributor

mergify bot commented Aug 29, 2023

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at e013cfe

Not sure why but go-lint is failing
with below error and this fix is required
to make it pass

```
directive `//nolint:staticcheck // See comment above.`
is unused for linter "staticcheck" (nolintlint)
```

Signed-off-by: Madhu Rajanna <[email protected]>
During the Demote volume store
the image creation timestamp.

During Resync do below operation

* Check image creation timestamp
stored during Demote operation
and current creation timestamp during Resync
and check both are equal and its for
force resync then issue resync
* If the image on both sides is
not in unknown state, check
last_snapshot_timestamp on the
local mirror description, if its present
send volumeReady as false or else return
error message.

If both the images are in up+unknown the
send volumeReady as true.

Signed-off-by: Madhu Rajanna <[email protected]>
@mergify mergify bot added the ok-to-test Label to trigger E2E tests label Aug 29, 2023
@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-cephfs

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.28

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.26

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.26

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.28

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.27

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.26

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.28

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.27

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.27

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Aug 29, 2023
@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Aug 30, 2023

/test ci/centos/mini-e2e/k8s-1.27

@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Aug 30, 2023

/test ci/centos/mini-e2e/k8s-1.26

@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Aug 30, 2023

/test ci/centos/k8s-e2e-external-storage/1.27

@mergify mergify bot merged commit e013cfe into ceph:devel Aug 30, 2023
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/rbd Issues related to RBD
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants