
NodeStageVolume fails if xfs_repair returns an error after cluster reboot #859

Closed
whymatter opened this issue Mar 14, 2020 · 6 comments
Labels
bug (Something isn't working), dependency/k8s (depends on Kubernetes features)

Comments

@whymatter

Describe the bug

When the XFS filesystem has "valuable metadata changes in a log which needs to be replayed", pod creation fails (see the log below).

I consider this a bug rather than expected behavior because it happens after a sudden cluster reboot.

Environment details

Kubernetes Version:

serverVersion:
  buildDate: "2019-10-15T19:09:08Z"
  compiler: gc
  gitCommit: c97fe5036ef3df2967d086711e6c0c405941e14b
  gitTreeState: clean
  gitVersion: v1.16.2
  goVersion: go1.12.10
  major: "1"
  minor: "16"
  platform: linux/amd64

Image/version of Ceph CSI driver

quay.io/cephcsi/cephcsi:v2.0.0

Deployed using rook.io

Logs

I0313 23:56:07.116518   19237 utils.go:157] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 GRPC call: /csi.v1.Node/NodeStageVolume
I0313 23:56:07.116553   19237 utils.go:158] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b3f222c4-1c92-4365-a58f-3f8d354e7703/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"apiVersion":"ceph.rook.io/v1","clusterID":"rook-ceph","imageFormat":"2","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1583893252987-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71"}
I0313 23:56:07.119265   19237 rbd_util.go:487] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 setting disableInUseChecks on rbd volume to: false
I0313 23:56:07.197517   19237 rbd_util.go:150] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 rbd: status csi-vol-5c35924f-63c7-11ea-ab23-f2b4435d7b71 using mon 10.233.8.103:6789,10.233.15.101:6789,10.233.45.118:6789, pool replicapool
W0313 23:56:07.379853   19237 rbd_util.go:172] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 rbd: no watchers on csi-vol-5c35924f-63c7-11ea-ab23-f2b4435d7b71
I0313 23:56:07.379993   19237 rbd_attach.go:208] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 rbd: map mon 10.233.8.103:6789,10.233.15.101:6789,10.233.45.118:6789
I0313 23:56:07.543358   19237 nodeserver.go:139] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 rbd image: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71/replicapool was successfully mapped at /dev/rbd0
I0313 23:56:07.543654   19237 mount_linux.go:390] Attempting to determine if disk "/dev/rbd0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/rbd0])
I0313 23:56:07.629332   19237 mount_linux.go:393] Output: "DEVNAME=/dev/rbd0\nTYPE=xfs\n", err: <nil>
I0313 23:56:07.629525   19237 mount_linux.go:390] Attempting to determine if disk "/dev/rbd0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/rbd0])
I0313 23:56:07.704376   19237 mount_linux.go:393] Output: "DEVNAME=/dev/rbd0\nTYPE=xfs\n", err: <nil>
I0313 23:56:07.704482   19237 mount_linux.go:282] Checking for issues with xfs_repair on disk: /dev/rbd0
W0313 23:56:08.081928   19237 mount_linux.go:294] Filesystem corruption was detected for /dev/rbd0, running xfs_repair to repair
E0313 23:56:08.411413   19237 nodeserver.go:344] ID: 4 Req-ID: 0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71 failed to mount device path (/dev/rbd0) to staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b3f222c4-1c92-4365-a58f-3f8d354e7703/globalmount/0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71) for volume (0001-0009-rook-ceph-0000000000000001-5c35924f-63c7-11ea-ab23-f2b4435d7b71) error 'xfs_repair' found errors on device /dev/rbd0 but could not correct them: Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

Steps to reproduce

Steps to reproduce the behavior:

For me this happens every time I have a simple pod connected to a Ceph block PV that uses the XFS filesystem: after a reboot, the pod cannot be recreated.

Actual results

The CSI driver tries to run xfs_repair, which reports an error stating that the filesystem has to be mounted first to replay the log.

Expected behavior

In my case, simply mounting the device manually resolved the problem, so it should be possible to fix this automatically by temporarily mounting the volume.
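
For illustration, a minimal sketch of that manual fix, assuming the image is still mapped at /dev/rbd0 as in the log above and using /mnt/xfs-replay as a scratch mount point (both paths are assumptions for the example, not part of the report):

mkdir -p /mnt/xfs-replay
mount -t xfs /dev/rbd0 /mnt/xfs-replay   # mounting replays the dirty XFS log
umount /mnt/xfs-replay                   # the filesystem is now clean, so the driver's xfs_repair check can pass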


@revog

revog commented Mar 18, 2020

I can confirm this issue. A simple "rbd map" and "mount -t xfs ..", followed by unmount/unmap, seems to replay the log and fixes the issue. No xfs_repair needed!

I'm currently not sure which action (pod recreation, etc.) triggers this error.
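
For reference, that sequence looks roughly like the following when run from a node with client access to the Ceph cluster. The pool and image name are taken from the logs above; authentication flags (e.g. --id/--keyring in a Rook deployment) are omitted and depend on the setup:

rbd map replicapool/csi-vol-5c35924f-63c7-11ea-ab23-f2b4435d7b71   # prints the mapped device, e.g. /dev/rbd0
mkdir -p /mnt/xfs-replay
mount -t xfs /dev/rbd0 /mnt/xfs-replay                             # mounting replays the log
umount /mnt/xfs-replay
rbd unmap /dev/rbd0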

@whymatter
Author

whymatter commented Mar 19, 2020

I believe this is not an error in the CSI code itself. The xfs_repair command is executed here: https://github.com/kubernetes/utils/blob/d1ab8797c55812f4fefe2c7b00a0d04a4740a93c/mount/mount_linux.go#L416.

kubernetes/utils#141

humblec added a commit that referenced this issue Apr 9, 2020

NOTE:

This PR also updates the kubernetes utils package we are using. We had hit an issue in xfs_repair; as this is fixed in the recent kubernetes utils, we are updating it for the same reason.

more info at kubernetes/utils#141

fixes #859
updates rook/rook#4914

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>

humblec added a commit to humblec/ceph-csi that referenced this issue Apr 9, 2020 (same commit message; fixes ceph#859)
humblec added a commit to humblec/ceph-csi that referenced this issue Apr 13, 2020 (same commit message; fixes ceph#859)
@cristichiru

cristichiru commented Apr 16, 2020

I had the same problem, and it is a pain to manually mount the volume on a host node using rbd map when running Kubernetes.
For that reason, we have decided to stay with the default ext4 for new volumes until the fix is released.

@nixpanic
Member

@whymatter can you try with cephcsi:v2.1.0 and let us know if that resolves the issue for you?

nixpanic added the bug (Something isn't working) and dependency/k8s (depends on Kubernetes features) labels Apr 17, 2020
@whymatter
Author

I will give it a try

@Madhu-1
Collaborator

Madhu-1 commented Apr 20, 2020

Fixed in v2.1.0. If not, please feel free to reopen it.

Madhu-1 closed this as completed Apr 20, 2020