
HPA Adapter gets into broken state after some time with cert errors #8

Open

randallt opened this issue Jul 20, 2020 · 10 comments

@randallt commented Jul 20, 2020

After functioning fine for up to weeks at a time, our HPA Adapters stop working completely and just show these errors in the log:

E0720 13:23:27.337788       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]

Obviously it is something to do with certs, but we are at a loss as to what is causing it.

If we delete the pod, a new one that functions fine takes its place.

We are on OpenShift 4.3 (K8S 1.16).

@randallt (Author)

I have opened a few tickets with [email protected] regarding this, but we haven't found any solution yet.

@randallt (Author)

Actually, it turns out I found a difference in an argument that might account for this. `--cert-dir` was set to /etc/ssl instead of /etc/ssl/certs, which could be the reason; I noticed it differed from the declared default. We had initially changed it because of file system permissions. That shouldn't require cluster-admin privileges on the service account, should it?
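
For context, here's a minimal sketch of how that flag and a writable mount could be wired into the adapter Deployment; the volume name and the emptyDir choice are illustrative assumptions, not taken from our actual manifests:

    containers:
    - name: wavefront-hpa-adapter
      args:
      - --cert-dir=/etc/ssl/certs  # directory the adapter uses for its serving certs
      volumeMounts:
      - name: cert-dir
        mountPath: /etc/ssl/certs
    volumes:
    - name: cert-dir
      emptyDir: {}  # writable without special file system permissions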

@vikramraman (Contributor)

@randallt No, this shouldn't need cluster-admin privileges.

@randallt (Author)

Looks like we are still getting these cert errors after a few days or so, even with the updated --cert-dir setting.

@randallt (Author)

This is a critical issue for us in getting Wavefront and OpenShift to play nicely; without a fix, I'm afraid we'll have to look for other scaling solutions. I found some past issues that seem to report similar symptoms, but I'm not familiar enough with OpenShift/Kubernetes to know whether they apply directly. Does the Wavefront HPA Adapter pod need to check for changes in its kubeconfig and reload, as some of the issues below had to (they link to several other similar issues)?

https://bugzilla.redhat.com/show_bug.cgi?id=1688820
https://bugzilla.redhat.com/show_bug.cgi?id=1668632

@vikramraman (Contributor)

The HPA Adapter uses InClusterConfig by default (not a remote kubeconfig file) to communicate with the API server.

Basically, all pods in Kubernetes are injected with a service account token and a root ca.crt on startup under `/var/run/secrets/kubernetes.io/serviceaccount/ca.crt`. If the root cert changes for some reason, pods using InClusterConfig need to be restarted to pick up the change.
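
For reference, that auto-injected mount shows up in the pod spec roughly like this (the volume name suffix is generated by Kubernetes, so the one below is hypothetical):

    volumeMounts:
    - name: default-token-abcde  # hypothetical name; the suffix is generated
      mountPath: /var/run/secrets/kubernetes.io/serviceaccount  # holds ca.crt, token, namespace
      readOnly: true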

We are targeting issue #7 in the near term, which will add liveness/readiness probes. If it's possible to detect this state internally, we can have the container fail the liveness probe and be restarted on its own.

As a temporary workaround, you could add a liveness probe yourself to the adapter container to restart the pod once a day. See this link for an example.
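
A minimal sketch of such a once-a-day probe (the path and port are placeholders chosen so the probe always fails; adjust to taste):

    livenessProbe:
      httpGet:
        path: /error  # no handler at this path, so the probe always fails
        port: 9999    # assumed-unused port, so the connection is refused
      failureThreshold: 1
      initialDelaySeconds: 86400  # first probe ~24h after each container start
      periodSeconds: 60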

@randallt commented Jul 24, 2020

Any guesses why we get into this state intermittently after only a few days? Is the service account's CA really getting changed to cause this? Our admin says he doesn't see a correlation.

@randallt (Author)

Can you give me a liveness probe definition that will work with this container? Apparently the container doesn't have a shell (/bin/sh).

@randallt commented Aug 3, 2020

I was able to use a liveness probe that always fails and adjust `initialDelaySeconds` to control how often it restarts (the delay is measured from container start, so after each restart the first probe fires at the one-hour mark, fails, and triggers the next restart). Currently restarting every hour:

        livenessProbe: # temp: restart the pod every hour via a bad http endpoint; waiting on https://github.com/wavefrontHQ/wavefront-kubernetes-adapter/issues/8
          httpGet:
            path: /error
            port: 9999
          failureThreshold: 1
          initialDelaySeconds: 3600
          periodSeconds: 60

@priyaselvaganesan (Member)

Created K8SSAAS-1054 to prioritize this bug.
