
Proper handling of reboot scenarios / drain and cordon max. one node etc. #282

Open
Martin-Weiss opened this issue Dec 22, 2023 · 2 comments

Comments

@Martin-Weiss

Is your feature request related to a problem? Please describe.

We need to patch and reboot the nodes in the cluster sequentially, ensuring that at most one master and one worker are drained / cordoned / rebooted in parallel, so that no more than one node is unavailable at a time during the process.

Looking at the example https://github.com/rancher/system-upgrade-controller/blob/master/examples/ubuntu/bionic/linux-kernel-virtual-hwe-18.04.yaml (as well as other examples that reboot the node), the problem is that SUC will start draining the next node even before the first one is back in the cluster.
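For reference, the knobs a Plan exposes today look roughly like this - a minimal sketch, not the exact manifest from the linked example; the name, node selector, image and package commands are placeholders:

```yaml
# Minimal Plan sketch: concurrency caps how many nodes this Plan upgrades
# at once, and drain cordons/drains each node before its upgrade job runs.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: worker-kernel-upgrade          # placeholder name
  namespace: system-upgrade
spec:
  concurrency: 1                       # at most one node per Plan in flight
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: DoesNotExist}
  drain:
    force: true
  version: "placeholder"               # bump this to re-run the plan
  upgrade:
    image: ubuntu:18.04                # placeholder image
    command: ["chroot", "/host"]
    args: ["sh", "-c", "apt-get update -y && apt-get install -y linux-virtual-hwe-18.04 && reboot"]
```

Two such Plans (one selecting masters, one selecting workers) keep SUC at one master and one worker at a time, but a node counts as upgraded as soon as its job pod exits, so the next node can still be drained while the previous one is rebooting.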

Describe the solution you'd like
We need an option in the drain / cordon process to wait until the last updated/rebooted node is back and healthy. This also needs to take into account that a node might still show "Ready" even though it has just been rebooted, because Kubernetes may not notice a down / unavailable node for some time.

It would be great if we could specify in the "drain" settings how long to wait before running the job on the next node after the first one has completed, and if we could specify conditions such as "wait to drain until at least 90% of the nodes are Ready and not cordoned" and "no more than one node unavailable / cordoned". A hypothetical sketch of what that could look like follows.
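To make the request concrete, the "drain" block could hypothetically grow fields along these lines; none of these fields exist in SUC today, they only illustrate the desired semantics:

```yaml
# Hypothetical fields only - they illustrate the request, not an existing API.
spec:
  drain:
    force: true
    waitAfterJob: 10m          # hypothetical: wait this long before draining the next node
    minReadyPercent: 90        # hypothetical: require >= 90% of nodes Ready and uncordoned
    maxUnavailable: 1          # hypothetical: never more than one node unavailable / cordoned
```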

Describe alternatives you've considered
Using kured instead of system upgrade controller.

@brandond
Member

I suspect that the SUC considers the upgrade job complete as soon as the image exits successfully - a 0 exit code is its sole success criterion.

If you want it to wait until after the reboot, you'll probably have to tweak your image so that it "fails" after triggering the reboot, and then "retries" and exits cleanly with a no-op after the reboot is complete.

The SUC itself doesn't know anything other than whether the job has succeeded. If the job succeeds before the reboot is complete, the SUC considers the work done and will move along to the next node. Any additional checks belong in your upgrade image, not in the SUC.
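One way to build that into the upgrade image, assuming the host root filesystem is mounted at /host (as in the Ubuntu examples) and with placeholder package commands, is a persistent sentinel file plus a boot-time check - a sketch, not a drop-in script:

```sh
#!/bin/sh
# Sketch of the "fail until rebooted" pattern: the first run applies the
# upgrade, drops a sentinel on the host, schedules the reboot and exits
# non-zero so the Job retries; after the reboot the retry sees a boot time
# newer than the sentinel and exits 0, letting SUC move on to the next node.
set -e

SENTINEL=/host/var/lib/suc-reboot-requested   # assumption: node root fs mounted at /host

if [ -f "$SENTINEL" ]; then
    # Succeed only if the node actually rebooted after the sentinel was written.
    boot_time=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))
    sentinel_time=$(date -r "$SENTINEL" +%s)
    if [ "$boot_time" -gt "$sentinel_time" ]; then
        rm -f "$SENTINEL"
        echo "node rebooted after upgrade, reporting success"
        exit 0
    fi
    echo "reboot not finished yet, failing so the job retries" >&2
    exit 1
fi

# First run: upgrade (placeholder commands), then request the reboot.
chroot /host sh -c "apt-get update -y && apt-get upgrade -y"
touch "$SENTINEL"
chroot /host systemctl reboot
sleep 30   # give the reboot time to terminate this pod
exit 1     # reached only if the reboot has not killed the pod yet
```

This keeps the "has the node come back after the reboot" check inside the image, similar in spirit to kured's sentinel-file approach, while SUC only ever sees a job that eventually exits 0.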

@brandond
Member

brandond commented Apr 23, 2024
