scheduler: Shorten tolerations for node failure #1146

Merged
sharnoff merged 2 commits into main from sharnoff/scheduler-node-failure-tolerations
Nov 19, 2024


sharnoff
Member

Similar to what was done in #1055, we need to explicitly add tolerations to the scheduler to get it to be recreated more quickly on node failure.

This is particularly necessary because we don't have #995. We could wait for that, but it's a lot of work, and this is a small thing we can do in the meantime.

Fixes neondatabase/cloud#17298, part of neondatabase/cloud#14114.


Notes for review: For some context, I was looking into #995, trying to figure out clearly why we can't "just" use leader election in the scheduler with things as they are today. The short version is that there's no guarantee the autoscaler-agents will switch to a replacement leader, because the old one will still be marked as "ready" until the kubelet reports otherwise (or, in case of failure, until the "node not ready" toleration expires). But because of that, I figured that without #995 we can get almost all the possible benefit from leader election just by having shorter tolerations — so that's what this PR is for.
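For illustration, here is a minimal sketch of what "shorter tolerations" could look like when built programmatically. It is not the change made in this PR. The `Toleration` struct is a local stand-in that mirrors the relevant fields of the Kubernetes `core/v1` type, `nodeFailureTolerations` is a hypothetical helper, and the 30-second value is an assumption chosen only for the example. For context, when a pod sets no tolerations for these taints, Kubernetes' `DefaultTolerationSeconds` admission plugin normally adds them with a 300-second window, which is why a replacement scheduler can take minutes to appear after a node fails.

```go
package main

import "fmt"

// Toleration mirrors the relevant fields of the Kubernetes core/v1 Toleration
// type. It is redeclared locally so this sketch has no external dependencies.
type Toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64
}

// nodeFailureTolerations (hypothetical helper) returns NoExecute tolerations
// for the two taints applied to a node that has stopped reporting as healthy.
// Once `seconds` have elapsed with the taint present, the pod is evicted and
// can be recreated on another node.
func nodeFailureTolerations(seconds int64) []Toleration {
	keys := []string{
		"node.kubernetes.io/not-ready",
		"node.kubernetes.io/unreachable",
	}
	out := make([]Toleration, 0, len(keys))
	for _, key := range keys {
		s := seconds // separate copy so each toleration owns its pointer
		out = append(out, Toleration{
			Key:               key,
			Operator:          "Exists",
			Effect:            "NoExecute",
			TolerationSeconds: &s,
		})
	}
	return out
}

func main() {
	// 30 is an assumed value for illustration, not the one used in this PR.
	for _, t := range nodeFailureTolerations(30) {
		fmt.Printf("%s %s %s %ds\n", t.Key, t.Operator, t.Effect, *t.TolerationSeconds)
	}
}
```

The same two entries would go under `tolerations` in the scheduler's pod spec. The trade-off with a short window is that a brief node blip may trigger an eviction that a longer window would have ridden out.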

Similar to what was done in #1055, we need to explicitly add tolerations
to the scheduler to get it to be recreated more quickly on node failure.

This is particularly necessary because we don't have #995. We could wait
for that, but it's a lot of work, and this is a small thing we can do in
the meantime.

Fixes neondatabase/cloud#17298.
@sharnoff sharnoff merged commit 9ac98e7 into main Nov 19, 2024
22 checks passed
@sharnoff sharnoff deleted the sharnoff/scheduler-node-failure-tolerations branch November 19, 2024 00:11