
HPA Adapter gets into broken state after some time with cert errors #8

Open

randallt opened this issue Jul 20, 2020 · 10 comments

@randallt commented Jul 20, 2020

After functioning fine for up to weeks at a time, our HPA Adapters stop working completely and just show these errors in the log:

E0720 13:23:27.337788       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]

Obviously it is something to do with certs, but we are at a loss as to what is causing it.

If we delete the pod, a new one that functions fine takes its place.

We are on OpenShift 4.3 (K8S 1.16).

@randallt (Author)

I have opened a few tickets with [email protected] regarding this, but we haven't found any solution yet.

@randallt (Author)

Actually, it turns out I found a difference in an argument that might account for this. `--cert-dir` was set to /etc/ssl instead of /etc/ssl/certs, which could be the reason; I noticed it differed from the declared default. We had initially changed it because of file system permissions. That shouldn't require cluster-admin privileges on the service account, should it?
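
For context, here's a minimal sketch of how that flag and a writable mount could be wired into the adapter Deployment; the volume name and the emptyDir choice are illustrative assumptions, not taken from our actual manifests:

    containers:
    - name: wavefront-hpa-adapter
      args:
      - --cert-dir=/etc/ssl/certs  # directory the adapter uses for its serving certs
      volumeMounts:
      - name: cert-dir
        mountPath: /etc/ssl/certs
    volumes:
    - name: cert-dir
      emptyDir: {}  # writable without special file system permissions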

@vikramraman (Contributor)

@randallt No, this shouldn't need cluster-admin privileges.

@randallt (Author)

Looks like we are still getting these cert errors after a few days or so, even with the updated --cert-dir setting.

@randallt (Author)

This is a critical issue for us in getting Wavefront and OpenShift to play nicely; without a fix, I'm afraid we'll have to look for other scaling solutions. I found some past issues that seem to report similar symptoms, but I'm not familiar enough with OpenShift/Kubernetes to know whether they apply directly. Does the Wavefront HPA Adapter pod need to check for changes in its kubeconfig and reload, as some of the issues below had to (they link to several other similar issues)?

https://bugzilla.redhat.com/show_bug.cgi?id=1688820
https://bugzilla.redhat.com/show_bug.cgi?id=1668632

@vikramraman (Contributor)

The HPA Adapter uses InClusterConfig by default (not a remote kubeconfig file) to communicate with the API server.

Basically, all pods in Kubernetes are injected with a service account token and a root ca.crt on startup under `/var/run/secrets/kubernetes.io/serviceaccount/ca.crt`. If the root cert changes for some reason, pods using InClusterConfig need to be restarted to pick up the change.
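
For reference, that auto-injected mount shows up in the pod spec roughly like this (the volume name suffix is generated by Kubernetes, so the one below is hypothetical):

    volumeMounts:
    - name: default-token-abcde  # hypothetical name; the suffix is generated
      mountPath: /var/run/secrets/kubernetes.io/serviceaccount  # holds ca.crt, token, namespace
      readOnly: true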

We are targeting issue #7 in the near term, which will add liveness/readiness probes. If it's possible to detect this state internally, we can have the container fail the liveness probe and be restarted on its own.

As a temporary workaround, you could add a liveness probe yourself to the adapter container to restart the pod once a day. See this link for an example.
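
A minimal sketch of such a once-a-day probe (the path and port are placeholders chosen so the probe always fails; adjust to taste):

    livenessProbe:
      httpGet:
        path: /error  # no handler at this path, so the probe always fails
        port: 9999    # assumed-unused port, so the connection is refused
      failureThreshold: 1
      initialDelaySeconds: 86400  # first probe ~24h after each container start
      periodSeconds: 60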

@randallt commented Jul 24, 2020

Any guesses why we get into this state intermittently after only a few days? Is the service account's CA really getting changed to cause this? Our admin says he doesn't see a correlation.

@randallt (Author)

Can you give me a liveness probe definition that will work with this container? Apparently the container doesn't have a shell (/bin/sh).

@randallt commented Aug 3, 2020

I was able to use a liveness probe that always fails and adjust `initialDelaySeconds` to control how often it restarts (the delay is measured from container start, so after each restart the first probe fires at the one-hour mark, fails, and triggers the next restart). Currently restarting every hour:

        livenessProbe: # temp: restart the pod every hour via a bad http endpoint; waiting on https://github.com/wavefrontHQ/wavefront-kubernetes-adapter/issues/8
          httpGet:
            path: /error
            port: 9999
          failureThreshold: 1
          initialDelaySeconds: 3600
          periodSeconds: 60

@priyaselvaganesan (Member)

Created K8SSAAS-1054 to prioritize this bug.
