Prevent 5xx error during evicting deployments that has small replica count #33

foriequal0 · 2022-04-07T11:46:06Z

If your deployments are well replicated, then this is not a problem.
However, when they are not, then there's still be ALB 5xx errors while they are evicted.
These eviction can be triggered by kubectl drain, node termination, etc.

Evicted pods should be handled in this order:

The pod is requested to be evicted.
Increase the replica count of the deployment of the requested pod. However, you can't isolate the pod yet since it'll drain the pod.
When the new pods are ready, you can isolate the pod, restoring the replica count, and starts delayed deletion.

Related comment: #31 (comment)

The text was updated successfully, but these errors were encountered:

nickjj · 2022-04-07T12:03:28Z

Hi, I'm following up here with quotes from your linked comment.

However, the eviction process seems to be different. Pods are evicted from the node first (and pods are drained) then replicaset controller tries to reconcile the pod replica count later without being noticed by eviction processors. I didn't recognized this until now.

It's interesting that it worked during a cluster upgrade which did create new nodes but not when recreating a node group independently. Is this process different between both of those methods?

I might be able to temporarily increase the replica count of the deployment when the pod is requested to be evicted in later versions.

This seems pretty scary because with a manually sized cluster there might not be room to add +1 replica for every deployment at once, or would this happen 1 deployment at a time so there's only ever +1 replica for 1 deployment at a time? If it's the second option that might work but is still a little scary since you might have a node with 8GB of memory with 2 pods running that requires 3GB of memory each, the node couldn't fit a 3rd pod -- even temporarily.

To mitigate this, I think we should have enough replicas and make sure they are distributed across multiple nodes.

At the moment I'm running a 2 node cluster across 2 azs with 2 replicas of everything. That's just enough safety for me to feel comfortable without breaking the bank. I suppose where I'm going with this one is ideally maybe we shouldn't expect there always being a large amount of replicas across a bunch of nodes?

I'll admit most of what your library does is a bit above my pay grade in Kubernetes knowledge. I have a surface level knowledge of draining / evicting from an end user's POV. In the end I'm happy to do whatever testing you want and offer any use case examples I can to help resolve this in a way that you feel is solid and safe at an implementation level.

foriequal0 · 2022-04-07T14:18:56Z

Thank you for the kind words!
I am also not sure about the solution I came up with for this issue. Let's sit still and think for some time.

Since you have 2 replicas for everything, I guess that the downtime during the node group renaming was due to the skipping webhooks.
Actually, I couldn't reproduce downtime reliably with renaming the node group. That's why I think you might have been lucky when you upgrade the cluster.
I'll create another issue for this here #34 😅 I think that would be more appropriate for this situation.

foriequal0 mentioned this issue Apr 7, 2022

Unable to upgrade cluster because of a drain error related to admission webhook denied request: no kind "Eviction" is registered for version #31

Closed

nickjj mentioned this issue Apr 26, 2022

Proper HA setup #34

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent 5xx error during evicting deployments that has small replica count #33

Prevent 5xx error during evicting deployments that has small replica count #33

foriequal0 commented Apr 7, 2022 •

edited

Loading

nickjj commented Apr 7, 2022 •

edited

Loading

foriequal0 commented Apr 7, 2022

Prevent 5xx error during evicting deployments that has small replica count #33

Prevent 5xx error during evicting deployments that has small replica count #33

Comments

foriequal0 commented Apr 7, 2022 • edited Loading

nickjj commented Apr 7, 2022 • edited Loading

foriequal0 commented Apr 7, 2022

foriequal0 commented Apr 7, 2022 •

edited

Loading

nickjj commented Apr 7, 2022 •

edited

Loading