Replies: 2 comments
-
I don't think the CSI driver has such functionality to auto-recover a broken mount. For blobfuse mounts it depends on the blobfuse driver; I'm not sure whether the blobfuse driver has such auto-recovery functionality. cc @cvvz
-
We've realized that a liveness check that confirms the mounts are working inside the pod solves the 'alerting' portion of my question. However, it appears that a pod restart due to a liveness check failure does not cleanly remount blobfuse. Does anyone know if there's a way to hook into the pod lifecycle to issue a restart that actually re-mounts the blobfuse-based PVC?
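For reference, a minimal sketch of the kind of exec-based liveness probe that covers the alerting half, assuming the blobfuse-backed PVC is mounted at `/mnt/blob` (the pod, image, and PVC names here are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: blobfuse-liveness-demo          # hypothetical name
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "while true; do sleep 3600; done"]
      volumeMounts:
        - name: blob
          mountPath: /mnt/blob
      # Once the blobfuse2 process behind the mount dies, stat on the
      # mount point fails ("Transport endpoint is not connected"),
      # so the probe fails and the kubelet restarts the container.
      livenessProbe:
        exec:
          command: ["sh", "-c", "stat /mnt/blob"]
        initialDelaySeconds: 10
        periodSeconds: 30
        failureThreshold: 3
  volumes:
    - name: blob
      persistentVolumeClaim:
        claimName: blob-pvc               # hypothetical blobfuse-backed PVC
```

As noted above, the restart this probe triggers may not cleanly remount blobfuse, so it detects the failure rather than fixing it.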
-
Hi blob-csi-developers,
Is there a way to have the CSI driver periodically check that all blobfuse mounts are still functioning? I am using the Azure-managed blob CSI driver on an AKS 1.25 cluster. I ran an experiment to see what happens if a blobfuse2 process is killed on the node, and it appears that neither the `csi-blob-node` daemonset nor the `blobfuse-proxy` systemd service on the node detects that the process has been killed. Within the pod the `mount` command still shows the mount as active, and the only indication that the mount has failed is when you try to do anything in the mount point, which returns `cannot open directory '.': Transport endpoint is not connected`.

To reproduce: run `sudo pkill -9 blobfuse2` on the node to kill the blobfuse2 mount processes.

This is an admittedly contrived example, but I've run into an actual manifestation of this problem before. An AKS node temporarily lost internet connectivity, causing all blobfuse2 processes to fail. The only way to force a reconnect was to restart the pod, which unmounts and re-mounts the PVC.
While it would be nice for the CSI driver to actually recover a failed mount, I am more concerned about proactively detecting the failure. Right now I can either periodically `ls` the mount point to see if I get an error, or check whether the number of blobfuse2 processes on a node matches the number of expected PVCs, but both approaches feel very janky. It would be a lot more convenient if the driver could detect a failed or missing process and raise an event on the pod. A rough sketch of the node-side process check is below.
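For illustration only, here is one way the node-side process check could be run as a DaemonSet. All names are hypothetical, it assumes blobfuse2 mounts appear with `blobfuse2` as the mount source in the host mount table (verify on your nodes), and it merely logs a warning rather than raising the pod event I'd actually want:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: blobfuse-mount-check            # hypothetical name
spec:
  selector:
    matchLabels:
      app: blobfuse-mount-check
  template:
    metadata:
      labels:
        app: blobfuse-mount-check
    spec:
      hostPID: true                     # share the node's PID namespace
      containers:
        - name: check
          image: busybox
          command:
            - sh
            - -c
            - |
              # With hostPID, /proc/1/mounts is the host's mount table and
              # pgrep sees host processes. Warn whenever there are fewer
              # blobfuse2 processes than blobfuse2 mount entries.
              while true; do
                mounts=$(grep -c '^blobfuse2 ' /proc/1/mounts)
                procs=$(pgrep -x blobfuse2 | wc -l)
                if [ "$procs" -lt "$mounts" ]; then
                  echo "WARNING: $mounts blobfuse2 mounts but only $procs blobfuse2 processes"
                fi
                sleep 60
              done
```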