blobfuse multiple mounts race condition issue #37
There should be
|
and there could be a race condition issue since you are using different Azure storage accounts. Could you remove this and try again:
|
Thanks @andyzhangx but I can't find that pod.
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
blobfuse-flexvol-installer-lk6dr 1/1 Running 0 5h22m
coredns-69b5b66fd8-9qtbt 1/1 Running 0 22h
coredns-69b5b66fd8-fd7fb 1/1 Running 0 22h
coredns-autoscaler-65d7986c6b-979k8 1/1 Running 0 22h
external-dns-8cd545cfb-n5b2c 1/1 Running 0 5h22m
kube-proxy-gsvtd 1/1 Running 0 74m
kubernetes-dashboard-cc4cc9f58-n77lz 1/1 Running 3 22h
metrics-server-66dbbb67db-7hs4t 1/1 Running 0 22h
nginx-ingress-controller-5zfgq 1/1 Running 0 5h22m
nginx-ingress-default-backend-6b8dc9d88f-zksqr 1/1 Running 0 5h22m
omsagent-gm8p9 1/1 Running 0 75m
omsagent-rs-6f4b46d595-jk5gm 1/1 Running 0 75m
tiller-deploy-9bf6fb76d-d5nxx 1/1 Running 0 5h22m
tunnelfront-65bd6b97d-jhkff 1/1 Running 0 22h
$ kubectl get pods --all-namespaces | grep azure
$
No results for the above. I'm not able to access the master, if that makes a difference:
|
Please provide that log and let me check whether it's due to the race condition issue. |
@andyzhangx - thanks. Looking at the above examples
|
I can't see anything obvious in there, but I'm not an expert. I'm going to kill that pod, then redeploy only the failing version of the pod to simplify the log, and see what happens. |
It's due to the race condition issue:
I think the current workaround is for all blobfuse flexvolume mounts in the same pod to use the same storage account. Since you are using 2 storage accounts, that's why it sometimes fails. I will add a mutex lock in the driver code and publish a new release. |
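For illustration only - a minimal sketch of the suggested workaround, not a spec taken from this issue: every blobfuse flexVolume in the pod references the same secret, i.e. the same storage account. The driver name follows the blobfuse flexvolume examples; the volume names, secret name and options below are hypothetical placeholders.
volumes:
  - name: data-blob                # hypothetical read-only mount
    flexVolume:
      driver: azure/blobfuse
      readOnly: true
      secretRef:
        name: blobfuse-secret      # same storage-account secret ...
      options:
        container: data            # placeholder container name
  - name: scratch-blob             # hypothetical writable mount
    flexVolume:
      driver: azure/blobfuse
      readOnly: false
      secretRef:
        name: blobfuse-secret      # ... shared by every blobfuse volume in the pod
      options:
        container: scratch         # placeholder container name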
Ok fab. That would explain why it used to work (before using an extra account). Thanks! |
created a PR to fix the above: #103 |
What happened:
When creating a pod with a writable blobfuse mount, the container won't start, with errors like
Unable to mount volumes for pod... list of unmounted volumes=[MY_WRITE_MOUNTED_BLOB-CONTAINER]...
Here is an example from a describe on the pod. Note this is a complicated pod (we are using this in a JupyterHub environment) with multiple mounts, including multiple read-only blobfuse mounts. Note that scratch-blob is the only writable blob mount.
This container will now take hours to be deleted. It's still alive 55 minutes after running a kubectl delete po; 3 hours later it was gone. I don't know how long it took - most times I tear down the cluster because, although slow, that is still quicker than waiting. Running a force delete makes it disappear, but I don't think it's really deleting it; for example, you will not be able to delete the namespace. The only solutions I've found are tearing down my cluster (AKS) and building it up again from scratch, or waiting hours (exact number unknown).
I don't believe this is an issue with the secrets/permissions because, as detailed below, I can make variants that do work with the same blob containers and secrets.
What you expected to happen:
The pod to start normally, and to be deleted promptly when the delete command is sent.
How to reproduce it:
It's difficult to reproduce, as things work in a simpler pod. However, below is the output of kubectl get po -o yaml for a pod that works and one that doesn't. The only difference is that scratch-blob has changed from readOnly: false (fails) to readOnly: true (works).
Some ENV vars, etc. have been redacted because I was concerned they were sensitive.
This works:
This doesn't:
These pods are spawned from JupyterHub so they are a little funky. The only difference in the JupyterHub configuration between these two is readOnly: true on the scratch-blob volume under extraVolumes and extraVolumeMounts.
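For illustration only - a hypothetical sketch, not the redacted JupyterHub config: the flag that differs sits on the volume entry and its mount, roughly like this. All names and paths are placeholders.
extraVolumes:
  - name: scratch-blob
    flexVolume:
      driver: azure/blobfuse
      readOnly: true                    # true works; false reproduces the mount failure
      secretRef:
        name: blobfuse-secret           # hypothetical secret name
extraVolumeMounts:
  - name: scratch-blob
    mountPath: /home/jovyan/scratch     # hypothetical mount path
    readOnly: true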
Interestingly (and surprisingly) this pod below does work (it's basically the same as the one above that doesn't work, but without the init container). It also runs sleep 600 rather than Jupyter, but I don't think that's relevant as the failing pod doesn't get to start anyway.
Anything else we need to know?:
I realise this is a complicated issue and there isn't much to go on, but I feel there should be more log information somewhere; I just don't know where to find it.
I have destroyed the cluster (and resource group) multiple times, and this issue is repeatable every time.
Environment:
Kubernetes version (e.g. kubectl version): Running on AKS
Kernel (e.g. uname -a): From inside a container (as mentioned above, I cannot access the host):
Linux works 4.15.0-1055-azure #60-Ubuntu SMP Thu Aug 8 18:29:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
AKS / AZ Cli
Helm
Zero2Jupyter helm chart
Pangeo helm chart
Others: