fix: data lost caused by Longhorn CSI plugin doing a wrong re-encryption of volume in rare race condition (backport #3566) #3567
longhorn/longhorn#10416
RCA
This happens with encrypted volumes only.
While the Longhorn CSI plugin is executing `NodeStageVolume`, if the Longhorn engine crashes right before `GetDiskFormat` and recovers quickly afterwards, the following race condition can happen:
1. `/dev/longhorn/volume-name` disappears when the engine crashes.
2. `getDiskFormat` returns an empty value for `diskFormat` and a `nil` error. Ref: longhorn-manager/vendor/k8s.io/mount-utils/mount_linux.go, lines 687 to 693 in 5f9ec86
3. `/dev/longhorn/volume-name` re-appears when the engine recovers.
4. Because `diskFormat` is empty, Longhorn re-encrypts the volume, which wipes out all the data. Ref: longhorn-manager/csi/node_server.go, lines 513 to 517 in 5f9ec86

`getDiskFormat` uses `blkid`. The `blkid` command cannot differentiate between 2 cases: the device doesn't exist vs. the device exists but has no filesystem on it.
Proposal
Use cryptsetup isLuks instead. It can differentiate between:
Ref: https://gitlab.com/cryptsetup/cryptsetup/-/blob/main/FAQ.md?plain=1#L2848
With this proposal, we can safely decide to perform encryption only in the 2nd case.
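A minimal Go sketch of how such a check could look, not the code from this PR. The names `deviceState`, `classifyIsLuksExit`, `probeLuks`, and `shouldEncrypt` are hypothetical. The exit-code mapping is also an assumption: 0 is taken to mean the device is LUKS, 1 that the device exists but is not LUKS, and anything else (for example a missing device) is treated as unknown, so encryption is refused.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// deviceState is a hypothetical type introduced for this sketch.
type deviceState int

const (
	deviceIsLuks deviceState = iota
	deviceNotLuks
	deviceUnknown
)

// classifyIsLuksExit maps an exit code of `cryptsetup isLuks` to a state.
// Assumption: 0 means LUKS and 1 means the device exists but is not LUKS.
// Every other code (for example a missing device) is treated as unknown.
func classifyIsLuksExit(code int) deviceState {
	switch code {
	case 0:
		return deviceIsLuks
	case 1:
		return deviceNotLuks
	default:
		return deviceUnknown
	}
}

// probeLuks runs `cryptsetup isLuks <devicePath>` and classifies the result.
func probeLuks(devicePath string) (deviceState, error) {
	err := exec.Command("cryptsetup", "isLuks", devicePath).Run()
	if err == nil {
		return deviceIsLuks, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return classifyIsLuksExit(exitErr.ExitCode()), nil
	}
	// cryptsetup could not be started at all (for example, it is not installed).
	return deviceUnknown, err
}

// shouldEncrypt allows encryption only when the device is known to be
// present and known not to be LUKS. An unknown state never triggers it.
func shouldEncrypt(s deviceState) bool {
	return s == deviceNotLuks
}

func main() {
	state, err := probeLuks("/dev/longhorn/volume-name")
	fmt.Println(shouldEncrypt(state), err)
}
```

The point of the design is that an ambiguous or failed probe falls into `deviceUnknown`, so a device that vanished mid-check cannot be mistaken for an unformatted one.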
Additional documentation or context
We discussed the ideas below with @derekbit and @shuo-wu, but neither seems to remove the race condition completely.
Idea 1: Check whether the device path exists before calling GetDiskFormat
-> Cannot avoid the race condition, as the device can still disappear after the check and before GetDiskFormat, then re-appear after GetDiskFormat
Idea 2: Check whether the device path exists after calling GetDiskFormat
-> Cannot avoid the race condition, as the device can still disappear before GetDiskFormat and re-appear after GetDiskFormat, before the check runs
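The weakness of both ideas can be illustrated with a small deterministic simulation. This is a sketch only: `fakeDevice` and its methods are hypothetical stand-ins, not Longhorn code, and the scripted timeline replaces a real engine crash and recovery. Each observation of the device consumes one step of the timeline.

```go
package main

import "fmt"

// fakeDevice is a hypothetical stand-in for /dev/longhorn/volume-name.
// timeline[i] says whether the device is present at the i-th observation.
type fakeDevice struct {
	timeline []bool
	tick     int
}

// observe returns the presence at the current step and advances the clock,
// staying on the last entry once the timeline is exhausted.
func (d *fakeDevice) observe() bool {
	present := d.timeline[d.tick]
	if d.tick < len(d.timeline)-1 {
		d.tick++
	}
	return present
}

// exists models a device-path existence check.
func (d *fakeDevice) exists() bool { return d.observe() }

// getDiskFormat mimics the ambiguity described above: a missing device
// yields an empty format, exactly like a device without a filesystem.
func (d *fakeDevice) getDiskFormat() string {
	if !d.observe() {
		return ""
	}
	return "crypto_LUKS"
}

// idea1WouldEncrypt checks existence first, then reads the format.
func idea1WouldEncrypt(d *fakeDevice) bool {
	if !d.exists() {
		return false
	}
	return d.getDiskFormat() == ""
}

// idea2WouldEncrypt reads the format first, then checks existence.
func idea2WouldEncrypt(d *fakeDevice) bool {
	format := d.getDiskFormat()
	return format == "" && d.exists()
}

func main() {
	// Idea 1: device present at the check, gone during the format read.
	fmt.Println(idea1WouldEncrypt(&fakeDevice{timeline: []bool{true, false}}))
	// Idea 2: device gone during the format read, back at the check.
	fmt.Println(idea2WouldEncrypt(&fakeDevice{timeline: []bool{false, true}}))
}
```

Both calls print `true`: in each ordering there is a timeline in which an already encrypted device is judged to be empty and would be re-encrypted.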
This is an automatic backport of pull request #3566 done by [Mergify](https://mergify.com).