Skip to content

fix: data lost caused by Longhorn CSI plugin doing a wrong re-encryption of volume in rare race condition #1365

fix: data lost caused by Longhorn CSI plugin doing a wrong re-encryption of volume in rare race condition

fix: data lost caused by Longhorn CSI plugin doing a wrong re-encryption of volume in rare race condition #1365

Triggered via pull request February 14, 2025 05:36
@PhanLe1010PhanLe1010
synchronize #3566
Status Failure
Total duration 24s
Artifacts

conventional_commits.yml

on: pull_request_target
commit-lint
16s
commit-lint
Fit to window
Zoom out
Zoom in

Annotations

1 error
commit-lint
You have commit messages with errors ⧗ input: fix: data lost caused by Longhorn CSI plugin doing a wrong re-encryption of volume in rare race condition This happens with encrypted volume only While Longhorn CSI plugin is doing NodeStageVolume, if the Longhorn engine crash right before GetDiskFormat (https://github.com/longhorn/longhorn-manager/blob/5f9ec86d18612afaa9f97c78cee3e34f7b653ad6/csi/node_server.go#L486) and recover quickly after that, the following race condition can happen: 1. When engine is crash, the engine removes the Longhorn device at /dev/longhorn/volume-name 2. Because the device path is gone, getDiskFormat returns empty value for diskFormat and nil error. Ref https://github.com/longhorn/longhorn-manager/blob/5f9ec86d18612afaa9f97c78cee3e34f7b653ad6/vendor/k8s.io/mount-utils/mount_linux.go#L687-L693 3. Now engine is recovered, so it recreate the device at /dev/longhorn/volume-name 4. However, due to diskFormat having empty value, Longhorn re-encrypts the volume which wipes out the all data. Ref https://github.com/longhorn/longhorn-manager/blob/5f9ec86d18612afaa9f97c78cee3e34f7b653ad6/csi/node_server.go#L513-L517 5. Finally, Longhorn cannot detect a filesystem for the volume any more so it reformat the filesystem RCA The root cause of the issue is that getDiskFormat is using blkid. The blkid command cannot differentiate 2 cases: the device doesn't exist VS the device doesn't have filesystem inside it Proposal Use cryptsetup isLuks instead. can differentiate between: 1. The device is already encrypted -> return exit code 0 2. The device is not encrypted -> return exit code 1 3. All other errors -> return exit code 4 Ref: https://gitlab.com/cryptsetup/cryptsetup/-/blob/main/FAQ.md?plain=1#L2848 Now we can safely make decision of only doing encryption in the 2nd case longhorn-10416 Signed-off-by: Phan Le <[email protected]> ✖ footer's lines must not be longer than 100 characters [footer-max-line-length] ✖ header must not be longer than 100 characters, current length is 105 [header-max-length] ⚠ footer must have leading blank line [footer-leading-blank] ✖ found 2 problems, 1 warnings ⓘ Get help: https://github.com/conventional-changelog/commitlint/#what-is-commitlint