Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pso-csi-node pods restarts frequently #168

Open
rpatil72 opened this issue May 8, 2021 · 17 comments
Open

pso-csi-node pods restarts frequently #168

rpatil72 opened this issue May 8, 2021 · 17 comments

Comments

@rpatil72
Copy link

rpatil72 commented May 8, 2021

$ kubectl get pods -o wide -n storage
NAME                                         READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES
pso-csi-controller-0                         6/6     Running   6          6h1m    172.20.2.53    init22b-8       <none>           <none>
pso-csi-node-24qd8                           2/3     Running   53         6h1m    10.64.112.52   k8s-worker-02   <none>           <none>
pso-csi-node-mfjg7                           3/3     Running   0          6h1m    10.64.230.41   init22b-8       <none>           <none>
pso-csi-node-ncqmn                           3/3     Running   0          6h1m    10.64.230.40   init22b-7       <none>           <none>
pso-csi-node-p6b97                           2/3     Running   56         6h1m    10.64.112.51   k8s-worker-01   <none>           <none>
pso-db-0-0                                   1/1     Running   0          6h1m    172.20.1.47    init22b-7       <none>           <none>
pso-db-1-0                                   1/1     Running   0          6h1m    172.20.2.54    init22b-8       <none>           <none>
pso-db-2-0                                   1/1     Running   0          6h1m    172.20.3.4     k8s-worker-01   <none>           <none>
pso-db-3-0                                   1/1     Running   0          6h1m    172.20.4.5     k8s-worker-02   <none>           <none>
pso-db-4-0                                   1/1     Running   0          6h1m    172.20.1.48    init22b-7       <none>           <none>
pso-db-cockroach-operator-5f494b6b5b-wpskw   1/1     Running   0          6h1m    172.20.1.45    init22b-7       <none>           <none>
pso-db-deployer-5df48fb79b-5hj82             1/1     Running   0          6h1m    172.20.1.46    init22b-7       <none>           <none>
pso-explorer-84d6cf69f9-lgqjc                2/2     Running   0          5h35m   172.20.2.57    init22b-8       <none>           <none>
psox-agent-2bj4j                             1/1     Running   0          5h51m   172.20.3.5     k8s-worker-01   <none>           <none>
psox-agent-2vx4x                             1/1     Running   0          5h51m   172.20.4.7     k8s-worker-02   <none>           <none>
psox-agent-pc5bh                             1/1     Running   0          5h51m   172.20.2.56    init22b-8       <none>           <none>
psox-agent-z2hnz                             1/1     Running   0          5h51m   172.20.1.49    init22b-7       <none>           <none>
psox-agent-zqlnl                             1/1     Running   1          5h51m   172.20.0.9     k8s-master-01   <none>           <none>

node-driver-registrar

W0508 00:42:08.409027   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:18.409174   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:28.409061   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:38.409045   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:48.409081   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:58.408966   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:08.408997   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:18.409026   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:28.409013   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:38.409098   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
I0508 00:43:42.144113   56628 node_register.go:51] Starting Registration Server at: /registration/pure-csi-reg.sock
I0508 00:43:42.144325   56628 node_register.go:60] Registration Server started at: /registration/pure-csi-reg.sock
I0508 00:43:43.563343   56628 main.go:77] Received GetInfo call: &InfoRequest{}
I0508 00:43:43.584165   56628 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E0508 00:49:40.845200   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:00:54.861780   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:05:12.763918   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:14:36.757668   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:33:18.765391   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:28:59.840505   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:46:42.755703   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:57:45.872492   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.

pso-csi-container

Original NodeID: k8s-worker-01
processIPHost
time="2021-05-08T06:50:10Z" level=error msg="Failed to retrieve configmap from cluster." configmap="&ConfigMap{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},Data:map[string]string{},BinaryData:map[string][]byte{},}" configmap_name=pure-csi-container-configmap error="an empty namespace may not be set when a resource name is provided"
time="2021-05-08T06:50:11Z" level=info msg="Starting server" debug=false endpoint="unix:///csi/csi.sock" nodeID=k8s-worker-01 span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Created listener" addr=//csi/csi.sock protocol=unix span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Listening for connections" address=//csi/csi.sock network=unix span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Parsing ISCSI target ranges for CIDR filter." ISCSITargetRanges=
time="2021-05-08T06:50:11Z" level=info msg="CIDR Filters constructed." CIDRs="[]"
(/go/src/pso.purestorage.com/k8s/vendor/pso.purestorage.com/nsm/pkg/datastores/services/cockroach/collection.go:166)
[2021-05-08 06:50:12]  sql: no rows in result set 
time="2021-05-08T06:51:00Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:00Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:00Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:02Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:02Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:02Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:04Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:04Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:04Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:06Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:06Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:06Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:08Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:08Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:08Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:10Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:10Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:10Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:12Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
@dsupure
Copy link
Contributor

dsupure commented May 9, 2021

Hi, @rpatil72. Can you turn on the debug flag in the values.yaml, run the script here to collect the log and send to [email protected]?

Thanks.

@rpatil72
Copy link
Author

rpatil72 commented May 9, 2021

Thanks @dsupure for quick response.
Have enabled debug in values.yaml and have sent the collected logs to that email

@dsupure
Copy link
Contributor

dsupure commented May 12, 2021

Thank you @rpatil72 . We got the logs. Our support team is debugging the logs. We will contact you shortly once we find the issue. Thanks for your patience.

@zatricky
Copy link

zatricky commented Jun 10, 2021

Hi, @dsupure @rpatil72. Was there a solution to this problem before we need to contact support directly?

@dsupure
Copy link
Contributor

dsupure commented Jun 11, 2021

Hi @zatricky :
My team is investigating it. We will give you an update tomorrow. Thanks for your patient.

@pure-jliao
Copy link
Contributor

hello @rpatil72 @zatricky after investigation we probably need more info from you:

  • did you start with PSO 5 or 6?
  • did a fresh installed PSO 6 use to work fine? if yes what is the env setup? e.g. k8s, os versions
  • though it restarts frequently, is it functioning or not?
  • did those pod restart before or after it was ready?

From the latest log I got from the email

   1 NAME                                         READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES
   2 pso-csi-controller-0                         5/6     Running   0          6m26s   172.20.2.58    init22b-8       <none>           <none>
   3 pso-csi-node-dndw2                           2/3     Running   1          6m35s   10.64.112.52   k8s-worker-02   <none>           <none>
   4 pso-csi-node-hvg85                           2/3     Running   0          6m26s   10.64.230.41   init22b-8       <none>           <none>
   5 pso-csi-node-p2xjb                           3/3     Running   0          6m35s   10.64.230.40   init22b-7       <none>           <none>
   6 pso-csi-node-xx56s                           2/3     Running   1          6m35s   10.64.112.51   k8s-worker-01   <none>           <none>
   7 pso-db-0-0                                   1/1     Running   0          2d17h   172.20.1.47    init22b-7       <none>           <none>
   8 pso-db-1-0                                   1/1     Running   0          2d17h   172.20.2.54    init22b-8       <none>           <none>
   9 pso-db-2-0                                   1/1     Running   0          2d17h   172.20.3.4     k8s-worker-01   <none>           <none>
  10 pso-db-3-0                                   1/1     Running   0          2d17h   172.20.4.5     k8s-worker-02   <none>           <none>
  11 pso-db-4-0                                   1/1     Running   0          2d17h   172.20.1.48    init22b-7       <none>           <none>
  • pso pods started for 6 mins while the db pods started 2 days ago, did you just delete those pod? or something else was done?
  • from this log we didn't find anything useful, if you could reach the state in the first post on this page (controller is ready, only 2 node pods keep restarting) then could you run the log collection script again and send us the log? since 6 minutes probably is too short to grab the logs, sometimes it could take longer.
  • meanwhile did you try to do a helm uninstall and then reinstall PSO 6 again?

@zatricky
Copy link

Hi, @pure-jliao. Our org has sent in a support request. Note that I have similar errors to those @rpatil72 submitted. In our case the original cause was a hypervisor cluster failure affecting many unrelated hosts, including an entire kubernetes cluster making use of the storage. We had already recovered all core services. I was hoping that a resolution was found for @rpatil72 so that we could at least know where to look next before submitting that support request (which has of course already been sent by now).

@pure-jliao
Copy link
Contributor

thanks @zatricky , FYI we do found something wrong related to this error log line on our side Failed to retrieve configmap from cluster but it shouldn't be the root cause of the problem stated here and kind of misleading, so that's why we need more info here in my previous post.

@dsupure
Copy link
Contributor

dsupure commented Jun 15, 2021

@rpatil72 : can you help to answer the questions from @pure-jliao ? Thanks.

@rpatil72
Copy link
Author

@pure-jliao please find the replies below:

  • did you start with PSO 5 or 6?
    Started with PSO 6

  • did a fresh installed PSO 6 use to work fine? if yes what is the env setup? e.g. k8s, os versions
    This was a new setup with PSO 6; k8s 1.21 on Ubuntu 20.04.2 LTS / 5.4.0-72-generic, CRI is docker://20.10.2

  • though it restarts frequently, is it functioning or not?
    It was not functioning

  • did those pod restart before or after it was ready?
    it not got ready

  • pso pods started for 6 mins while the db pods started 2 days ago, did you just delete those pod? or something else was done?
    tried to redeploy couple of times

  • from this log we didn't find anything useful, if you could reach the state in the first post on this page (controller is ready, only 2 node pods keep restarting) then could you run the log collection script again and send us the log? since 6 minutes probably is too short to grab the logs, sometimes it could take longer.
    I can try this, but I switched to
    https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner/blob/master/charts/nfs-subdir-external-provisioner/README.md
    and this work fine in the same environment

  • meanwhile did you try to do a helm uninstall and then reinstall PSO 6 again?
    Yes but it did not help then

@pure-jliao
Copy link
Contributor

@rpatil72 thanks for applying, one thing to note right now we haven't tested PSO 6 on k8s 1.21 and it's not officially supported yet. but not 100% sure it's caused by a newer and unstable version of k8s, so if possible do you have any chance to try a fresh install on k8s 1.20? meanwhile in your first post, only 2 node pods are restarting and all other PSO pods are healthy, if you could reach that state again, could you try to grab the logs again? it might be a little bit hard to get logs if the pods keep crashing. also could you try to run PSO without the explorer to rule out some other effects?

and it still cannot be resolved we could schedule a call cc: @dsupure @dingyin

@smrutimandal
Copy link

smrutimandal commented Jun 29, 2021

Facing the same issue.

bash-3.2$ helm list -n pure-pso
NAME    	NAMESPACE	REVISION	UPDATED                              	STATUS  	CHART         	APP VERSION
pure-pso	pure-pso 	1       	2021-06-30 01:00:21.279272 +0200 CEST	deployed	pure-pso-6.1.1	6.1.1
bash-3.2$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:11:29Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
bash-3.2$

Any updates?

@pure-jliao
Copy link
Contributor

@smrutimandal sorry we haven't narrowed down the problem here, there could be a lot of things causing PSO pod failures. So please collect the logs and contact support team for more investigation.

@smrutimandal
Copy link

Hi,

Thanks for getting back to me. I am in the process of getting my support contract in order. In the meantime, I found that that the following feature-gate defaults have changed between 1.20 and 1.21.

Kube API Server v1.20:
CSIServiceAccountToken=true|false (ALPHA - default=false)
CSIStorageCapacity=true|false (ALPHA - default=false)

Kube API Server v1.21:
CSIServiceAccountToken=true|false (BETA - default=true)
CSIStorageCapacity=true|false (BETA - default=true)

Can these changes have any impact on the liveness-probes?

Any help is greatly appreciated.

Thanks

@dsupure
Copy link
Contributor

dsupure commented Jul 14, 2021

@smrutimandal: As @pure-jliao mentioned, there are many possibilities and it's hard to troubleshoot your problem via github. Please contact pure support and we will create a support case for you. We will follow up there. Thank you.

@smrutimandal
Copy link

Hi @dsupure ,

I understand that. If compatibility is really an issue though, the https://github.com/purestorage/pso-csi#environments-supported section, assuring support for v1.17+, is a bit confusing.

I will get in touch with support.

Thanks

@mscbpi
Copy link

mscbpi commented Jan 26, 2022

1.21 is supposed to be supported now, do you have any update ? I get the same liveness probe errors since K8s Upgrade (1.19 is EoL), though the setup seems to work OK (no restart -to be monitored- and volumes get attached)


time="2022-01-26T23:08:16Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2022-01-26T23:08:16Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2022-01-26T23:08:16Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants