
NodeStageVolume fails if xfs_repair returns an error after cluster reboot #859

Closed
whymatter opened this issue Mar 14, 2020 · 6 comments
Labels
bug (Something isn't working), dependency/k8s (depends on Kubernetes features)

Comments

@whymatter

Describe the bug

When the XFS filesystem has "valuable metadata changes in a log which needs to be replayed", pod creation fails (see the log below).

I consider this a bug rather than expected behavior because it happens after a sudden cluster reboot.

Environment details

Kubernetes Version:

serverVersion:
  buildDate: "2019-10-15T19:09:08Z"
  compiler: gc
  gitCommit: c97fe5036ef3df2967d086711e6c0c405941e14b
  gitTreeState: clean
  gitVersion: v1.16.2
  goVersion: go1.12.10
  major: "1"
  minor: "16"
  platform: linux/amd64

Image/version of Ceph CSI driver

quay.io/cephcsi/cephcsi:v2.0.0

Deployed using rook.io

Logs

I0313 23:56:07.116518   19237 utils.go:157] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 GRPC call: /csi.v1.Node/NodeStageVolume
I0313 23:56:07.116553   19237 utils.go:158] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b3f222c4-1c92-4365-a58f-3f8d354e7703/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"apiVersion":"ceph.rook.io/v1","clusterID":"rook-ceph","imageFormat":"2","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1583893252987-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71"}
I0313 23:56:07.119265   19237 rbd_util.go:487] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 setting disableInUseChecks on rbd volume to: false
I0313 23:56:07.197517   19237 rbd_util.go:150] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 rbd: status csi-vol-5c35924f-63c7-11ea-ab23-f2b4435d7b71 using mon 10.233.8.103:6789,10.233.15.101:6789,10.233.45.118:6789, pool replicapool
W0313 23:56:07.379853   19237 rbd_util.go:172] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 rbd: no watchers on csi-vol-5c35924f-63c7-11ea-ab23-f2b4435d7b71
I0313 23:56:07.379993   19237 rbd_attach.go:208] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 rbd: map mon 10.233.8.103:6789,10.233.15.101:6789,10.233.45.118:6789
I0313 23:56:07.543358   19237 nodeserver.go:139] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 rbd image: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71/replicapool was successfully mapped at /dev/rbd0
I0313 23:56:07.543654   19237 mount_linux.go:390] Attempting to determine if disk "/dev/rbd0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/rbd0])
I0313 23:56:07.629332   19237 mount_linux.go:393] Output: "DEVNAME=/dev/rbd0\nTYPE=xfs\n", err: <nil>
I0313 23:56:07.629525   19237 mount_linux.go:390] Attempting to determine if disk "/dev/rbd0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/rbd0])
I0313 23:56:07.704376   19237 mount_linux.go:393] Output: "DEVNAME=/dev/rbd0\nTYPE=xfs\n", err: <nil>
I0313 23:56:07.704482   19237 mount_linux.go:282] Checking for issues with xfs_repair on disk: /dev/rbd0
W0313 23:56:08.081928   19237 mount_linux.go:294] Filesystem corruption was detected for /dev/rbd0, running xfs_repair to repair
E0313 23:56:08.411413   19237 nodeserver.go:344] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 failed to mount device path (/dev/rbd0) to staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b3f222c4-1c92-4365-a58f-3f8d354e7703/globalmount/0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71) for volume (0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71) error 'xfs_repair' found errors on device /dev/rbd0 but could not correct them: Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

Steps to reproduce

Steps to reproduce the behavior:

For me this happens every time I have a simple pod connected to a Ceph block PV that uses the XFS filesystem: after a reboot, the pod cannot be recreated.

Actual results

The CSI driver tries to run xfs_repair, which reports an error stating that the filesystem has to be mounted first to replay the log.

Expected behavior

In my case, simply mounting the device manually resolved the problem, so it should be possible to fix this automatically by temporarily mounting the volume.
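
For illustration, a minimal sketch of that manual fix, assuming the image is still mapped at /dev/rbd0 as in the log above and using /mnt/xfs-replay as a scratch mount point (both paths are assumptions for the example, not part of the report):

mkdir -p /mnt/xfs-replay
mount -t xfs /dev/rbd0 /mnt/xfs-replay   # mounting replays the dirty XFS log
umount /mnt/xfs-replay                   # the filesystem is now clean, so the driver's xfs_repair check can pass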


@revog

revog commented Mar 18, 2020

I can confirm this issue. A simple "rbd map" and "mount -t xfs ..", followed by unmount/unmap, seems to replay the log and fixes the issue. No xfs_repair needed!

I'm currently not sure which action (pod recreation, etc.) triggers this error.
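
For reference, that sequence looks roughly like the following when run from a node with client access to the Ceph cluster. The pool and image name are taken from the logs above; authentication flags (e.g. --id/--keyring in a Rook deployment) are omitted and depend on the setup:

rbd map replicapool/csi-vol-5c35924f-63c7-11ea-ab23-f2b4435d7b71   # prints the mapped device, e.g. /dev/rbd0
mkdir -p /mnt/xfs-replay
mount -t xfs /dev/rbd0 /mnt/xfs-replay                             # mounting replays the log
umount /mnt/xfs-replay
rbd unmap /dev/rbd0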

@whymatter
Author

whymatter commented Mar 19, 2020

I believe this is not an error in the CSI code itself. The xfs_repair command is executed here: https://github.com/kubernetes/utils/blob/d1ab8797c55812f4fefe2c7b00a0d04a4740a93c/mount/mount_linux.go#L416.

kubernetes/utils#141

humblec added a commit that referenced this issue Apr 9, 2020

NOTE:

This PR also updates the kubernetes utils package we are using. We had hit an issue in xfs_repair; as this is fixed in the recent kubernetes utils, we are updating it for the same reason.

more info at kubernetes/utils#141

fixes #859
updates rook/rook#4914

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>

humblec added a commit to humblec/ceph-csi that referenced this issue Apr 9, 2020 (same commit message; fixes ceph#859)
humblec added a commit to humblec/ceph-csi that referenced this issue Apr 13, 2020 (same commit message; fixes ceph#859)
@cristichiru

cristichiru commented Apr 16, 2020

I had the same problem, and it is a pain to manually mount the volume on a host node using rbd map when running Kubernetes.
For that reason, we have decided to stay with the default ext4 for new volumes until the fix is released.

@nixpanic
Member

@whymatter can you try with cephcsi:v2.1.0 and let us know if that resolves the issue for you?

nixpanic added the bug (Something isn't working) and dependency/k8s (depends on Kubernetes features) labels Apr 17, 2020
@whymatter
Author

I will give it a try

@Madhu-1
Collaborator

Madhu-1 commented Apr 20, 2020

Fixed in v2.1.0. If not, please feel free to reopen it.

Madhu-1 closed this as completed Apr 20, 2020