[mount] Addition of "checkAndRepairXfsFilesystem" inadvertently prevents XFS self-recovery via mounting #141
Comments
@nktpro, you are actually hitting a "dirty log" issue on an XFS device. It is handled in PR #132, which is still under review. You are welcome to add your comments there.
/assign
Thanks @27149chen, either #132 or #137 would address this. I believe we need to raise more awareness of how serious this regression is, since it's a ticking time bomb in production for any recent k8s cluster with XFS PVs. High availability is critically compromised. Dirty logs happen all the time during node loss / unclean shutdowns, and instead of the volumes simply being mounted and fixing themselves by auto-replaying logs as they used to, they are now stuck waiting for human operators to manually mount and unmount them. @gnufied @dims could you help accelerate the review of @27149chen's PR, given the level of impact this has? A rollback of the
@nktpro I think we can't revert the previous PR, because there is another issue that can be fixed by xfs_repair.
@27149chen Testing your branch would be rather complicated. Specifically in my case, this is a transitive dependency of That's a bit tedious but it's all doable. However, to actually verify the fix I'd also need to force an XFS volume into an inconsistent state with dirty logs, maybe by intentionally triggering a kernel panic on a node while lots of writes are coming in? Do you have any suggestion for a better way to induce that deterministically? This also brings up a good conversation about the need to automate an integration / e2e test suite for this.
@nktpro how did you encounter this issue before? You said it happened all the time during node loss / unclean shutdowns, so I was thinking it would be easy to reproduce. Sorry, I don't know how to deterministically cause a dirty log. But according to the documentation and your manual attempt, mounting and unmounting is the right way to fix it. What do you think?
@27149chen We hit it in production, and never had to manually intervene until this change made it into
Yes, hence that's one known way to test this (hard-resetting a node, triggering a kernel panic, pulling the power cord in the middle of database writes, etc.). However, it's non-deterministic, manual, and hard to automate as a scripted test suite to prevent similar regressions in the future. Anyway, I'll report back the result if we have some bandwidth to test your fix. In the meantime, we are forced to either downgrade to a version right before this change, or switch to Ext4, since this only affects XFS.
Thanks for opening this issue. I still think that doing filesystem repairs should be opt-in rather than the default. I am not an XFS expert, but if there is a chance that #132 could worsen the problem rather than allowing an admin to fix it manually, we should be careful. It should also be noted that because mount operations are retried, the filesystem repair will be retried in a continuous loop as well.
@gnufied, we have been doing filesystem repairs (by fsck) by default all along. I agree that it is expensive, as I mentioned in issue #137, and it would be great if we could find a way to avoid unnecessary repairs. But until then, I think we should try to repair the filesystem by default. Regarding xfs_repair, I think replaying the dirty logs by default (by mounting and immediately unmounting the filesystem) won't worsen the problem, because it is the officially recommended way, and if we don't do that, the subsequent operations will still try to mount this unhealthy disk.
To unblock the k8s 1.18.0 release, the recommendation from SIG Storage is to roll back PR #126, which caused this issue (it was fixing a non-critical corner case). Working with the k8s release team to get the go-ahead for that. Once 1.18.0 is cut, we can revisit this.
For tracking on the release side and more importantly: thank you to everyone who has been working on this!
This fixes bug with xfs mount failing because of xfs_repair being called. Fixes kubernetes/utils#141
This fixes bug with xfs mount failing because of xfs_repair being called. Fixes kubernetes/utils#141 Kubernetes-commit: a1ae67d691d514d859fce68299d7bd3830686b38
This fixes bug with xfs mount failing because of xfs_repair being called. Fixes kubernetes/utils#141 Kubernetes-commit: 0630031f85ba508559abcb40a1adca4ac2350056
This PR updates the kubernetes utils packages we are using. we had hit an issue in xfs_repair as this is fixed in recent kubernetes utils we are updating it for the same reason more info at kubernetes/utils#141 Signed-off-by: Madhu Rajanna <[email protected]>
we are using. we had hit an issue in xfs_repair as this is fixed in recent kubernetes utils we are updating it for the same reason more info at kubernetes/utils#141 fixes #859 updates rook/rook#4914 Signed-off-by: Humble Chirammal <[email protected]>
NOTE: This PR also updates the kubernetes utils packages we are using. we had hit an issue in xfs_repair as this is fixed in recent kubernetes utils we are updating it for the same reason more info at kubernetes/utils#141 fixes ceph#859 updates rook/rook#4914 Signed-off-by: Humble Chirammal <[email protected]>
PR #126 added an extra step to run xfs_repair before mounting an XFS file system. However, instead of helping to automatically correct FS issues caused by prior unclean shutdowns, it actually prevented auto-recovery from happening, which led to complete unavailability of the corresponding volume and subsequently required manual human intervention.
The sequence of events is as follows:
Note that step #5 is what always happened prior to this change: the volume is simply mounted without any attempt to perform an FS check / xfs_repair. It can correct itself as part of being mounted, as per XFS design.
The recommended fix is to only attempt to run xfs_repair if mounting actually fails, as the last resort. There shouldn't be any need to run xfs_repair prior to a mount failure.
Alternatively, don't bail out if an error occurs when running xfs_repair. Let the mount attempt happen anyway. It'll then either fix itself, or fail mounting with another error.
Relevant issue from rook-ceph repo: rook/rook#4914
CC'ing @27149chen