Prevent 5xx errors while evicting deployments that have a small replica count #33
Comments
Hi, I'm following up here with quotes from your linked comment.
It's interesting that it worked during a cluster upgrade which did create new nodes but not when recreating a node group independently. Is this process different between both of those methods?
This seems pretty scary because with a manually sized cluster there might not be room to add +1 replica for every deployment at once, or would this happen 1 deployment at a time so there's only ever +1 replica for 1 deployment at a time? If it's the second option that might work but is still a little scary since you might have a node with 8GB of memory with 2 pods running that requires 3GB of memory each, the node couldn't fit a 3rd pod -- even temporarily.
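To make the arithmetic in that example concrete, here is a minimal sketch in Python. The function name is hypothetical, and it assumes only memory requests matter; a real scheduler also accounts for CPU, system-reserved memory, and other constraints, so treat this as an illustration rather than how any particular tool decides.

```python
def fits_extra_replica(node_allocatable_gb, running_pod_requests_gb, new_pod_request_gb):
    """Return True if one more pod fits in the node's unrequested memory.

    Hypothetical helper: considers memory requests only (an assumption).
    """
    free_gb = node_allocatable_gb - sum(running_pod_requests_gb)
    return new_pod_request_gb <= free_gb


# The 8GB node from the example, already running two pods that request 3GB each.
# Only 2GB remains, so a temporary third 3GB replica does not fit.
print(fits_extra_replica(8, [3, 3], 3))
```

With a single 3GB pod on the same node, the same check passes, which is why surging one deployment at a time is less risky than surging every deployment at once.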
At the moment I'm running a 2 node cluster across 2 AZs with 2 replicas of everything. That's just enough safety for me to feel comfortable without breaking the bank. I suppose where I'm going with this one is that ideally we shouldn't expect there to always be a large number of replicas across a bunch of nodes? I'll admit most of what your library does is a bit above my pay grade in Kubernetes knowledge. I have a surface level knowledge of draining / evicting from an end user's POV. In the end I'm happy to do whatever testing you want and offer any use case examples I can to help resolve this in a way that you feel is solid and safe at an implementation level.
Thank you for the kind words! Since you have 2 replicas for everything, I guess that the downtime during the node group renaming was due to the skipping webhooks.
If your deployments are well replicated, then this is not a problem.
However, when they are not, there will still be ALB 5xx errors while they are evicted.
These evictions can be triggered by kubectl drain, node termination, etc.
Evicted pods should be handled in this order:
Related comment: #31 (comment)