Does WPA kill pods if the queue length decreases? #144
Hi Robin, the WPA operator periodically checks queue metrics and the other configuration parameters to calculate the number of pods that should be running, and then adjusts the deployment's desired replica count. With maxDisruption set to "0%", it does not partially scale down; scale up is still allowed and is not affected by the maxDisruption value. With maxDisruption set to "0%", only a complete scale down (to the configured minimum, once the queue is fully drained) is performed.

The pod disruption you are facing could be unrelated to WPA and could instead be caused by the Kubernetes cluster autoscaler or some other factor (like spot instance replacement or node-pressure eviction) that reschedules the pod and thereby interrupts the execution. You can confirm this by checking the Prometheus metrics or the cluster autoscaler logs. Additionally, you can increase the verbosity of the WPA controller logs by changing the -v flag.
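For reference, maxDisruption is a field on the WorkerPodAutoScaler spec itself; a minimal fragment, assuming the practo worker-pod-autoscaler CRD, with the other required fields omitted:

```yaml
# Fragment only: the rest of the WorkerPodAutoScaler spec is omitted.
spec:
  maxDisruption: "0%"   # no partial scale-down of busy workers; scale-up is unaffected
```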
Thanks for the reply @justjkk, yes this is the settings template file.
Hi @justjkk, I hope this image makes my question clear. When WPA decides to scale down completely, as you said, does that mean all the running pods will be killed? And if the answer is yes, how can I prevent this? I want all pods to run until they exit (end of program).
Does the consumer delete the job as soon as it receives it, or does it remove the job from the queue only after the message has been processed?
At present the consumer removes the job from the queue only after it is processed. Which is the better way?
Yes, that is the right way: delete the job from the queue only when the processing finishes. With maxDisruption=0%, pods should scale down only when all the jobs in the queue are processed, queueSize=0, and nothing is being processed at that moment by any worker. Also, as @justjkk said:
Can you share the WPA logs for this queue after setting -v=4 verbosity?
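For reference, the verbosity is a klog-style flag on the controller binary, so it can be raised by editing the controller Deployment. A rough fragment; the container name, image, and subcommand here are assumptions, so keep whatever flags your installation already passes and just add or bump -v:

```yaml
# Fragment of the WPA controller Deployment; only the -v flag is the point here.
spec:
  containers:
    - name: wpa                                    # placeholder container name
      image: practodev/workerpodautoscaler:latest  # placeholder image/tag
      args:
        - run                                      # controller subcommand (verify against your manifest)
        - -v=4                                     # raise klog verbosity as requested above
```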
I will check these things and get back, thank you.
We have deployed WPA on AWS EKS and are using it with some success. We typically ran with a maximum of 10 pods (replicas) and all went well. Now we have increased this to 50 and we are noticing that pods are randomly quitting for no discernible reason. It's like there is a service that is shutting down a pod midway.
Does WPA kill pods midway as the queue length decreases? We have long-running tasks that need around 1-2 hours to process, so we need the pods to finish completely and then quit. WPA should only be responsible for scaling up; for scaling down, it should not kill existing running pods. Is this what you mean by "disruption"? Does "disruption" mean you will:
A. Kill running pods
B. Not create additional pods and leave the running pods alone
We are running a Python app on EKS; this is the YAML for the app:
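(Simplified stand-in for the real manifest: the image, names, and resource values below are placeholders.)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                  # placeholder name, referenced by the WPA config below
spec:
  replicas: 1                   # WPA adjusts this based on queue length
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
        - name: worker
          image: registry.example.com/worker:latest   # placeholder image
          command: ["python", "worker.py"]            # long-running job consumer (placeholder)
          resources:
            requests:
              cpu: "500m"       # placeholder values
              memory: "512Mi"
```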
This is the current WPA config:
And the other WPA config:
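(Both WPA configs follow the same shape; the sketch below uses field names from the practo worker-pod-autoscaler CRD, with the queue URI, thresholds, and names as placeholders. Verify the fields against the CRD version you have installed.)

```yaml
apiVersion: k8s.practo.dev/v1
kind: WorkerPodAutoScaler
metadata:
  name: worker-wpa              # placeholder
spec:
  deploymentName: worker        # the Deployment above
  queueURI: https://sqs.us-east-1.amazonaws.com/000000000000/worker-queue   # placeholder
  minReplicas: 0                # placeholder
  maxReplicas: 50               # we cap at 50 workers
  targetMessagesPerWorker: 1    # placeholder: one long job per pod
  maxDisruption: "0%"           # do not scale down pods that are mid-job
```

With maxDisruption set to "0%" as above, partial scale-down is disabled, so workers should only be removed once the queue is fully drained and nothing is mid-processing.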
Any help in resolving this issue of pods randomly shutting down midway through processing would be greatly appreciated.