CSU-2424: AKS drift detection improvements #422
Merged
Problem
Observed issues when a customer "updated" credentials for a cluster but the update failed. Afterwards, the customer was stuck: Terraform had stale credentials in state but assumed they matched what Cast had on the SaaS side.
Observed symptoms:
- `credentials_id` differences have no effect on plan since it's computed.
- … `TF_LOG`).

For now, we decided not to expose a credentials hash in the API until it is strictly required. This MR therefore works around most issues but will not directly catch drift in the credentials' content.
Changes in MR
- … `client_id`. This means the next plan will see that an apply is required and retry the update, resolving the drift.
- Since `credentials_id` is a computed value, we cannot force state updates through it. To work around this, `client_id` is reset, which forces Terraform to see the drift and re-apply the client credentials.
- Previously, if `UpdateCluster` failed continuously until the context deadline was reached, we surfaced the context deadline error without any context for the user. Changed it so we surface the last non-context error observed.
- … `Update` a bit to match other providers. Non-credential 400 errors are treated as permanent and surfaced immediately to avoid the 20m wait.

TODOs
- Add the same drift logic for EKS/GKE.
- Add unit tests.

Given time constraints, these TODOs will be addressed in the next MR. I want to fix the customer issue for AKS first.