HPA Adapter gets into broken state after some time with cert errors #8
Comments
I have opened a few tickets with [email protected] regarding this, but we haven't found any solution yet.
Actually, it turns out I found a difference in an argument that might account for this. The --cert-dir was set to /etc/ssl instead of /etc/ssl/certs, which could be the reason; I noticed this differs from the declared default. We initially changed it because of file system permissions. This shouldn't require cluster-admin privileges on the service account, should it?
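For reference, the change amounts to something like the sketch below in the adapter's container spec (container name and image are placeholders, not taken from our actual manifest; only the --cert-dir value is the point):

```yaml
# Sketch only: container name and image are placeholders.
containers:
  - name: wavefront-hpa-adapter
    image: <adapter-image>
    args:
      - --cert-dir=/etc/ssl/certs   # was /etc/ssl; this matches the declared default
```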
@randallt No, this shouldn't need cluster-admin privileges.
Looks like we are still getting these cert errors after a few days or so, even with the updated --cert-dir setting.
This is a critical issue for us in getting Wavefront and OpenShift to play nicely; without it, I'm afraid we'll have to look for other scaling solutions. I found some past issues that seem to report similar symptoms, but I'm not familiar enough with OpenShift/Kubernetes to know if they apply directly. Does the Wavefront HPA Adapter pod need to watch for changes in kubeconfig and reload, like some of the issues below had to do (they link to several other similar issues)? https://bugzilla.redhat.com/show_bug.cgi?id=1688820
The HPA Adapter uses InClusterConfig by default (not a remote kubeconfig file) to communicate with the API server. Basically, all pods in Kubernetes are injected on startup with a serviceaccount and a root ca.crt under `/var/run/secrets/kubernetes.io/serviceaccount/ca.crt`. If the root cert changes for some reason, pods using InClusterConfig need to be restarted to pick up the change. We are targeting issue #7 in the near term, which will add liveness / readiness probes; if we can detect this state internally, we can make the container fail the liveness probe and be restarted on its own. As a temporary workaround, you could add a liveness probe yourself to the adapter container to restart the pod once a day. See this link for an example.
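For context, that injection looks roughly like the sketch below in the running pod's spec (illustrative only; the container name is a placeholder and the volume name is auto-generated, so yours will differ):

```yaml
# Roughly what the injected service account volume looks like in a pod spec.
containers:
  - name: wavefront-hpa-adapter            # placeholder container name
    volumeMounts:
      - name: default-token-xxxxx          # auto-generated volume name; varies per pod
        mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        readOnly: true
```

The mount contains ca.crt, namespace, and token; InClusterConfig reads them from that fixed path, which is why a changed root cert is only picked up after a restart.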
Any guesses why we get into this state intermittently after only a few days? Is the service account's CA really getting changed to cause this? Our admin says he doesn't see a correlation.
Can you give me a liveness probe definition that will work with this container? Apparently the container doesn't have a shell (/bin/sh).
I was able to use a liveness probe that always fails and adjust the 'initialDelaySeconds' to control how often it restarts. Currently restarting every hour:
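A sketch of roughly what that looks like (the exec command and exact timings here are assumptions rather than the exact values in use; any probe that reliably fails will do, since the image has no shell):

```yaml
# Sketch: a liveness probe that always fails, used purely to force a periodic restart.
livenessProbe:
  exec:
    command: ["/nonexistent"]   # no shell in the image, so exec a path that can never succeed
  initialDelaySeconds: 3600     # let the container run ~1 hour before the first (failing) probe
  periodSeconds: 30
  failureThreshold: 1           # restart on the first failure
```

Adjusting initialDelaySeconds is the only knob needed to change the restart interval, since the delay starts over with each new container.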
Created K8SSAAS-1054 to prioritize this bug.
After functioning fine for up to weeks at a time, our HPA Adapters stop working completely and just show repeated cert errors in the log.
Obviously it is something to do with certs, but we are at a loss as to what is causing it.
If we delete the pod, a new one that functions fine takes its place.
We are on OpenShift 4.3 (K8S 1.16).