Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "operator: always delete terminated pending nodes (#2545)" #2596

Merged
merged 1 commit into from
Nov 13, 2023

Conversation

3u13r
Copy link
Member

@3u13r 3u13r commented Nov 13, 2023

This reverts commit 5267ad0.

Context

Proposed change(s)

  • We need to revert this. With this patch we've observed 30+ nodes being scheduled during the upgrade of a 1:1 cluster on GCP. This presumably happens because if a node is not found then it's state within the operator is "Terminated". This likely happens in the first seconds of the new node being created. With is patch the "Terminated" node already has reached its goal. It's true goal was joining the cluster though.

The idea before (and after) this patch is that the joiningNode has a deadline set until it must join the cluster. Otherwise it's goal is set to "Leaving" which is reached if the node never joined the cluster because it was shut down. This should've already handled the AWS case.

Additional info

Checklist

  • Add labels (e.g., for changelog category)
  • Link to Milestone

@3u13r 3u13r added the no changelog Change won't be listed in release changelog label Nov 13, 2023
@3u13r 3u13r added this to the v2.13.0 milestone Nov 13, 2023
@3u13r 3u13r requested a review from malt3 as a code owner November 13, 2023 18:42
Copy link

netlify bot commented Nov 13, 2023

Deploy Preview for constellation-docs canceled.

Name Link
🔨 Latest commit c2e2ae3
🔍 Latest deploy log https://app.netlify.com/sites/constellation-docs/deploys/65526e036c278b00084f3205

@3u13r 3u13r requested a review from daniel-weisse November 13, 2023 18:45
@malt3
Copy link
Contributor

malt3 commented Nov 13, 2023

I agree that reverting is better. However, we should not forget about the AWS bug, where we will now schedule too many nodes. The AWS case is not handled because of weird behavior of the AWS state machine (the per instance state can be unexpected).

Copy link
Contributor

@msanft msanft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving here while workaround for AWS should be implemented separately

@3u13r 3u13r merged commit 56ab3e9 into main Nov 13, 2023
8 checks passed
@3u13r 3u13r deleted the fix/revert/2545 branch November 13, 2023 19:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no changelog Change won't be listed in release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants