Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSU-2424: AKS drift detection improvements #422

Merged
merged 6 commits into from
Nov 14, 2024
Merged

Conversation

Tsonov
Copy link
Contributor

@Tsonov Tsonov commented Nov 14, 2024

Problem
Observed issues when a customer "updated" credentials for a cluster but the update failed. Afterwards customer was stuck - terraform had stale credentials in state but thought they match what Cast had on SaaS side.

Observed symptomps:

  • After failed update, Terraform does not detect drift anymore so it will not attempt to re-apply the credentials. User has to manually reset credentials to trigger re-apply. Makes for bad UX.
  • Even remotely resetting the credentials does not work as credentials_id differences have no effect on plan since it's computed.
  • Failed update was stuck until context deadline exceeded, showing no info on what failed for debugging (without setting TF_LOG).

For now, we decided not to expose credentials hash in API until it is 100% required so this MR works around most issues but will not catch credentials content drift directly.

Changes in MR

  • On failed cluster update, we force drift in client_id. This means next plan will see an apply is required and retry updating, hence resolving the drift.
  • On Read, if the credentials ID on Cast side does not match what terraform last saved in state, we consider that some drift in credentials happened. Since credentials_id is a computed value, we cannot force state updates through it. To work around this, the client_id is reset, which will force terraform to see the drift and re-apply the client credentials.
  • When UpdateCluster fails continuously and a context deadline was reached, we would surface the context deadline error without any context to user. Changed it so we surface the last non-context error observed.
  • Changed the error handling in Update a bit to match other providers. Non-credential 400 errors are treated as permanent and surfaced immediately to avoid 20m wait.

TODOs
Add the same drift logic for EKS/GKE. Add unit tests.

Given time constraints, these TODOs will be in next MR, I want to fix customer issue for AKS.

@Tsonov Tsonov requested a review from a team as a code owner November 14, 2024 09:07
@Tsonov Tsonov merged commit ad777a3 into master Nov 14, 2024
10 checks passed
@Tsonov Tsonov deleted the CSU-2424-drift-detection branch November 14, 2024 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants