CSU-2424: AKS drift detection improvements #422

Tsonov · 2024-11-14T09:07:25Z

Problem
We observed issues when a customer "updated" credentials for a cluster but the update failed. Afterwards the customer was stuck: Terraform had stale credentials in state but believed they matched what Cast had on the SaaS side.

Observed symptoms:

After a failed update, Terraform no longer detects drift, so it does not attempt to re-apply the credentials. The user has to manually reset the credentials to trigger a re-apply, which makes for bad UX.
Even resetting the credentials remotely does not work, because credentials_id is a computed attribute and differences in it have no effect on the plan.
The failed update was stuck until the context deadline was exceeded and showed no information about what failed, so debugging required setting TF_LOG.

For now, we decided not to expose the credentials hash in the API until it is 100% required, so this MR works around most issues but will not catch drift in the credentials content directly.

Changes in MR

On a failed cluster update, we force drift in client_id. The next plan then sees that an apply is required and retries the update, resolving the drift.
On Read, if the credentials ID on the Cast side does not match what Terraform last saved in state, we treat it as credentials drift. Since credentials_id is a computed value, we cannot force state updates through it. To work around this, client_id is reset, which forces Terraform to see the drift and re-apply the client credentials.
When UpdateCluster failed continuously until the context deadline was reached, we surfaced the context deadline error to the user without any detail. We now surface the last non-context error observed.
Changed the error handling in Update slightly to match other providers. Non-credential 400 errors are treated as permanent and surfaced immediately to avoid a 20m wait.
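
For reviewers who want the gist without reading the diff, here is a minimal, standalone Go sketch of the two ideas above: the Read-side drift check and the retry loop that keeps the last real error. It is not the provider's code. `clientIDForState`, `driftSentinel`, `apiError` (including its `Credentials` flag), and `retryUpdate` are hypothetical names invented for illustration, and the value that client_id is reset to is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// driftSentinel is a hypothetical placeholder value for client_id (assumption).
const driftSentinel = "credentials-drift-detected"

// clientIDForState returns the client_id to keep in state during Read.
// When the remote credentials ID differs from the one saved in state, it
// returns the sentinel so the next plan shows a diff on client_id.
func clientIDForState(stateCredsID, remoteCredsID, stateClientID string) string {
	if stateCredsID != remoteCredsID {
		return driftSentinel
	}
	return stateClientID
}

// apiError is a hypothetical error type carrying the HTTP status of a failed call.
type apiError struct {
	Status      int
	Credentials bool // assumption: marks credential-related failures
	Msg         string
}

func (e *apiError) Error() string { return fmt.Sprintf("status %d: %s", e.Status, e.Msg) }

// isPermanent reports whether err is a non-credential 400, which is not retried.
func isPermanent(err error) bool {
	var ae *apiError
	return errors.As(err, &ae) && ae.Status == http.StatusBadRequest && !ae.Credentials
}

// retryUpdate calls update until it succeeds, fails permanently, or ctx ends.
// When ctx ends, it reports the last error returned by update instead of only
// the bare context error.
func retryUpdate(ctx context.Context, interval time.Duration, update func() error) error {
	for {
		err := update()
		if err == nil {
			return nil
		}
		if isPermanent(err) {
			return err // surfaced immediately, no waiting for the deadline
		}
		lastErr := err
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up after %v: %w", ctx.Err(), lastErr)
		case <-time.After(interval):
		}
	}
}

// demoDeadline retries a credential-related 400 until a short deadline expires.
func demoDeadline() string {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	err := retryUpdate(ctx, 10*time.Millisecond, func() error {
		return &apiError{Status: 400, Credentials: true, Msg: "invalid client secret"}
	})
	return err.Error()
}

// demoPermanent shows that a non-credential 400 stops after a single attempt.
func demoPermanent() (string, int) {
	calls := 0
	err := retryUpdate(context.Background(), 10*time.Millisecond, func() error {
		calls++
		return &apiError{Status: 400, Msg: "bad region"}
	})
	return err.Error(), calls
}

func main() {
	fmt.Println(clientIDForState("cred-1", "cred-2", "client-a"))
	fmt.Println(demoDeadline())
	fmt.Println(demoPermanent())
}
```

Wrapping the last error with `%w` keeps it available to `errors.As` for callers, while the context error still appears in the message so the user can see that the deadline was the reason the retries stopped.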

TODOs
Add the same drift logic for EKS/GKE. Add unit tests.

Given time constraints, these TODOs will be handled in the next MR; I want to fix the customer issue for AKS first.

Tsonov added 5 commits November 13, 2024 15:51

On context cancel/deadlineexceeded, show the last observed error from service

4ff7002

Save lastErr when response is not OK as well

63b9f67

Drift detection when credentials ID does not match (AKS)

5c43420

Add unit test for read drift detection

d65b795

Save generated credentialsID from update response in state before re-reading from server to avoid perpetual drift

0340737

Tsonov requested a review from a team as a code owner November 14, 2024 09:07

Handle err from Set

9f4c876

aldor007 approved these changes Nov 14, 2024

Tsonov merged commit ad777a3 into master Nov 14, 2024
10 checks passed

Tsonov deleted the CSU-2424-drift-detection branch November 14, 2024 11:40

Tsonov mentioned this pull request Nov 15, 2024

CSU-2024: Drift detection for other clouds #423

Merged
