fix: data lost caused by Longhorn CSI plugin doing a wrong re-encryption of volume in rare race condition #1365
Triggered via pull request
February 14, 2025 05:36
PhanLe1010
synchronize
#3566
Status
Failure
Total duration
24s
Artifacts
–
conventional_commits.yml
on: pull_request_target
commit-lint
16s
Annotations
1 error
commit-lint
You have commit messages with errors
⧗ input: fix: data lost caused by Longhorn CSI plugin doing a wrong re-encryption of volume in rare race condition
This happens with encrypted volume only
While Longhorn CSI plugin is doing NodeStageVolume, if the Longhorn
engine crash right before GetDiskFormat
(https://github.com/longhorn/longhorn-manager/blob/5f9ec86d18612afaa9f97c78cee3e34f7b653ad6/csi/node_server.go#L486)
and recover quickly after that, the following race condition can happen:
1. When engine is crash, the engine removes the Longhorn device at
/dev/longhorn/volume-name
2. Because the device path is gone, getDiskFormat returns empty value for diskFormat and nil error.
Ref https://github.com/longhorn/longhorn-manager/blob/5f9ec86d18612afaa9f97c78cee3e34f7b653ad6/vendor/k8s.io/mount-utils/mount_linux.go#L687-L693
3. Now engine is recovered, so it recreate the device at
/dev/longhorn/volume-name
4. However, due to diskFormat having empty value, Longhorn re-encrypts
the volume which wipes out the all data. Ref https://github.com/longhorn/longhorn-manager/blob/5f9ec86d18612afaa9f97c78cee3e34f7b653ad6/csi/node_server.go#L513-L517
5. Finally, Longhorn cannot detect a filesystem for the volume any more
so it reformat the filesystem
RCA
The root cause of the issue is that getDiskFormat is using blkid. The
blkid command cannot differentiate 2 cases: the device doesn't exist VS
the device doesn't have filesystem inside it
Proposal
Use cryptsetup isLuks instead. can differentiate between:
1. The device is already encrypted -> return exit code 0
2. The device is not encrypted -> return exit code 1
3. All other errors -> return exit code 4
Ref: https://gitlab.com/cryptsetup/cryptsetup/-/blob/main/FAQ.md?plain=1#L2848
Now we can safely make decision of only doing encryption in the 2nd case
longhorn-10416
Signed-off-by: Phan Le <[email protected]>
✖ footer's lines must not be longer than 100 characters [footer-max-line-length]
✖ header must not be longer than 100 characters, current length is 105 [header-max-length]
⚠ footer must have leading blank line [footer-leading-blank]
✖ found 2 problems, 1 warnings
ⓘ Get help: https://github.com/conventional-changelog/commitlint/#what-is-commitlint
|