scheduler: Shorten tolerations for node failure #1146

sharnoff · 2024-11-18T04:52:57Z

Similar to what was done in #1055, we need to explicitly add tolerations to the scheduler to get it to be recreated more quickly on node failure.

This is particularly necessary because we don't have #995. We could wait for that, but it's a lot of work, and this is a small thing we can do in the meantime.

Fixes neondatabase/cloud#17298, part of neondatabase/cloud#14114.

Notes for review: For some context, I was looking into #995, trying to figure out clearly why we can't "just" use leader election in the scheduler with things as they are today. The short version is that there's no guarantee the autoscaler-agents will switch to a replacement leader, because the old one will still be marked as "ready" until the kubelet reports otherwise (or, in case of failure, until the "node not ready" toleration expires). But because of that, I figured that without #995 we can get almost all the possible benefit from leader election just by having shorter tolerations — so that's what this PR is for.
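
For reviewers less familiar with the mechanism, here is a rough sketch of what "shorter tolerations" means in a pod spec. By default Kubernetes gives every pod `NoExecute` tolerations for the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` taints with `tolerationSeconds: 300`, so a pod on a failed node is only evicted after about five minutes. Setting the tolerations explicitly with a smaller value shortens that wait. The helper name `node_failure_tolerations` and the value `EXAMPLE_SECONDS = 30` are assumptions made for illustration only; they are not the values or code from this PR.

```python
import json

# Kubernetes applies this value by default when a pod does not set
# its own tolerations for the two node-failure taints.
DEFAULT_TOLERATION_SECONDS = 300

# The two taints the node controller adds when a node stops being healthy.
NODE_FAILURE_TAINTS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)


def node_failure_tolerations(seconds):
    """Build explicit NoExecute tolerations for the node-failure taints.

    Hypothetical helper: returns plain dicts shaped like the
    `tolerations` entries of a pod spec.
    """
    return [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": seconds,
        }
        for key in NODE_FAILURE_TAINTS
    ]


# Assumed value for illustration; the setting chosen in the PR is not shown here.
EXAMPLE_SECONDS = 30

if __name__ == "__main__":
    # Print a fragment that could sit under `spec.template.spec` of a Deployment.
    fragment = {"tolerations": node_failure_tolerations(EXAMPLE_SECONDS)}
    print(json.dumps(fragment, indent=2))
```

With a fragment like this in the scheduler's pod template, the pod would be evicted from a not-ready or unreachable node after the configured number of seconds rather than the default, which is what lets a replacement be created sooner.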

sharnoff requested a review from Omrigan November 18, 2024 04:52

mikhail-sakhnov approved these changes Nov 18, 2024

Merge branch 'main' into scheduler-node-failure-tolerations (10d71d8)

sharnoff merged commit 9ac98e7 into main Nov 19, 2024
22 checks passed

sharnoff deleted the sharnoff/scheduler-node-failure-tolerations branch November 19, 2024 00:11
