Unstable behavior. More testing needed. #3

Open
kozyraki opened this issue May 26, 2017 · 1 comment

Comments

@kozyraki
Contributor

In general, the UI is quite unstable. Every now and then, some stats disappear and never come back. For example, after a while the CPU utilization statistics disappear. If I try to restart the UI (kubectl delete -> create), then I don't get the app-level QoS statistics. I'm not sure what the problem is, but it would be nice to do some more stability testing. For example, test bringing down and restarting the UI to see if it works consistently.

@adrianliaw
Contributor

@kozyraki Thanks for pointing these out. I've been working on fixing this issue throughout last week, and it seems to be working for me now. Please deploy the latest adrianliaw/be-controller-ui image and see if it works consistently, and please let me know if any issues come up with the UI.

I've tested it by running the cluster for 2 hours and showing the statistics in the browser with the UI. I've also tested bringing the be-controller-ui deployment down and recreating it. Right now all the charts are showing as expected and never disappear (as long as there's data coming in).
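For anyone who wants to repeat the down-and-recreate check, here is a minimal sketch of the delete -> create cycle. It assumes the deployment is defined in a manifest file you pass in; `restart_commands`, `restart_deployment`, and the `run` parameter are hypothetical names for illustration, not something shipped with be-controller-ui.

```python
import subprocess


def restart_commands(manifest_path):
    """Build the kubectl delete -> create command pair for one manifest.

    manifest_path is an assumption: point it at whatever file defines
    the UI deployment in your cluster.
    """
    return [
        ["kubectl", "delete", "-f", manifest_path],
        ["kubectl", "create", "-f", manifest_path],
    ]


def restart_deployment(manifest_path, run=subprocess.check_call):
    """Run the delete -> create cycle once.

    The runner is injectable so the cycle can be dry-run or logged
    without touching a real cluster.
    """
    for cmd in restart_commands(manifest_path):
        run(cmd)
```

Calling `restart_deployment("path/to/ui-manifest.yaml")` in a loop, and checking the charts after each pass, would be one way to see whether the UI comes back consistently. The manifest path here is a placeholder.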

However, over the 2-hour run, the CPU utilisation percentage graph eventually goes empty because of a failure in snap. I'm not sure why, but most of the pods inside the hyperpilot namespace crash after the cluster has been running for a while, including locust, influx, load controllers, grafana and demo-ui. The empty graph can be fixed by restarting the snap pods.

Also, I'm not quite sure why, but sometimes the per-container CPU utilisation rates calculated within Influx come out as -Infinity (only for spark pods, from what I saw). For now I'm just replacing those infinities with 0.
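A minimal sketch of that kind of replacement, assuming the rates arrive as a plain list of floats. `sanitize_rates` and `fallback` are hypothetical names, and this is not the actual code in be-controller-ui. It also treats +Infinity and NaN the same way, which goes slightly beyond the -Infinity case described above.

```python
import math


def sanitize_rates(rates, fallback=0.0):
    """Return a copy of rates with non-finite values replaced.

    Non-finite means inf, -inf, or NaN; finite values pass through
    unchanged. fallback defaults to 0.0 to match the workaround above.
    """
    return [r if math.isfinite(r) else fallback for r in rates]
```

Note that substituting 0 hides the bad samples rather than explaining them, so it may be worth logging how often the fallback is used until the cause of the -Infinity values is understood.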
