
Incremental version checks during migration #2399

Draft · Zash wants to merge 30 commits into main from ka/in-cluster-version

Conversation

@Zash (Contributor) commented Jan 17, 2025

Warning

This is a public repository; ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • [kind/adr](set-me)

What does this PR do / why do we need this PR?

  • Ensure that ck8s upgrade prepare is run before ck8s upgrade apply
  • Ensure that ck8s upgrade apply is run before ck8s apply TODO
digraph "ck8s upgrade" {
	init [label="ck8s init"]
	edit [label="edit config"]
	apply [label="ck8s apply"]
	uprep [label="ck8s upgrade prepare"]
	upgrade [label="ck8s upgrade apply"]

	init -> edit -> apply -> edit
	edit -> uprep -> upgrade -> apply

# init -> uprep [label="NO!"]
# edit -> init [label="NO!"]
}

This uses a pair of ConfigMaps to keep track of progress during apps upgrades,
in order to enforce order (prepare first, then apply, and for the same version)
and atomicity (only allow prepare once) per step; a sketch of the mechanism
follows the list below.

  • apps-meta stores the current version of apps
    • .data.version the version number itself
  • apps-upgrade stores details about an in-progress upgrade, and is deleted to signify a completed migration
    • .data.prepare records that ck8s upgrade ... prepare has been started
    • .data.prepared records that ck8s upgrade ... prepare has completed
    • .data.apply records that ck8s upgrade ... apply has started
    • .data.last_step records the last snippet run by ck8s upgrade ...
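
As a sketch of the atomicity mechanism (record_migration_begin appears in the diffs below; record_migration_done, the kubeconfig handling, and the exact literals are assumptions, not taken from this PR):

# "Take the lock": kubectl create fails if the ConfigMap already
# exists, so only the first prepare can start a migration.
record_migration_begin() {
  local cluster="${1}" # "sc" or "wc"; kubeconfig selection omitted
  kubectl create --namespace kube-system configmap apps-upgrade \
    --from-literal="prepare=${CK8S_TARGET_VERSION}"
}

# "Release the lock": record the new version in apps-meta, then
# delete apps-upgrade to signify a completed migration.
record_migration_done() {
  local cluster="${1}"
  kubectl create --namespace kube-system configmap apps-meta \
    --from-literal="version=${CK8S_TARGET_VERSION}" \
    --dry-run=client --output=yaml | kubectl apply -f -
  kubectl delete --namespace kube-system configmap apps-upgrade
}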

Information to reviewers

This is intended to behave as a state machine, going from start to prepare to
apply (for each step) to done, at which point the upgraded-to version is
recorded.

How to test: I applied the changes on top of the release-0.42 branch and upgraded from 0.41.x to 0.42.1, then again to 0.43 by creating a local tag (not to be published).

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@Zash Zash force-pushed the ka/in-cluster-version branch 8 times, most recently from c3dca19 to 21653d6 Compare January 22, 2025 10:58
@Zash Zash marked this pull request as ready for review January 22, 2025 10:58
@Zash Zash requested review from a team as code owners January 22, 2025 10:58
@aarnq (Contributor) commented Jan 22, 2025

What is the motivation for keeping the phase of prepare in-cluster?

Isn't it just critical to track the apply phase?

@Zash (Contributor, Author) commented Jan 22, 2025

> What is the motivation for keeping the phase of prepare in-cluster?

Seemed a natural starting point for these checks.

> Isn't it just critical to track the apply phase?

Yes.

@Zash (Contributor, Author) commented Jan 22, 2025

#2315 (comment) is probably where I got that from, to track the whole upgrade process from the start.

@simonklb (Contributor) left a comment

This LGTM!

Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

No strong opinion here though. It doesn't hurt but might add some extra complexity.

bin/upgrade.bash Outdated
Comment on lines 118 to 121
prepared_version="$(get_prepared_version sc || true)"
if [ -z "${prepared_version}" ]; then
  log_fatal "'prepare' step does not appear to have been run, do so first"
fi

if [[ "${prepared_version}" != "${CK8S_TARGET_VERSION}" ]]; then
  log_fatal "'prepare' step in sc appears to have been run for version ${prepared_version}, not ${CK8S_TARGET_VERSION}"
fi
Contributor:

suggestion: Extract into a separate function and reuse for both sc and wc.
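
A minimal sketch of that extraction (the function name and the cluster argument are assumed, not taken from the PR):

# Fail unless 'prepare' has completed for the target version in the
# given cluster ("sc" or "wc").
assert_prepared_version() {
  local cluster="${1}"
  local prepared_version
  prepared_version="$(get_prepared_version "${cluster}" || true)"
  if [ -z "${prepared_version}" ]; then
    log_fatal "'prepare' step does not appear to have been run in ${cluster}, do so first"
  fi
  if [[ "${prepared_version}" != "${CK8S_TARGET_VERSION}" ]]; then
    log_fatal "'prepare' step in ${cluster} appears to have been run for version ${prepared_version}, not ${CK8S_TARGET_VERSION}"
  fi
}

The apply path would then call assert_prepared_version sc and assert_prepared_version wc as applicable.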

@Zash Zash requested a review from simonklb January 23, 2025 13:06
@Zash Zash force-pushed the ka/in-cluster-version branch from 3162bf7 to 53ac173 Compare January 23, 2025 13:08
@simonklb (Contributor) left a comment

Still approved after removing the prepare phase tracking and keeping the reliance on ck8sVersion in the config file. However, I still think that getting rid of the ck8sVersion config value and relying only on the apps-meta ConfigMap would have been neat.

@aarnq (Contributor) commented Jan 24, 2025

> Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

But if it is in-cluster it doesn't say anything, as the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is that config you mean.
Or we have to move the entire config in-cluster.
Otherwise, if someone has done prepare, someone else without that config can run apply.

bin/upgrade.bash Outdated
# Create a configmap. This fails if done twice.
if [[ "${CK8S_CLUSTER:-}" =~ ^(sc|both)$ ]]; then
  if ! record_migration_begin sc; then
    log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
Contributor:

Suggested change:

- log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
+ log_fatal "apply already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"

Also, would it make sense to add a ck8s upgrade <prefix> unlock-apply to do this?
(With some warning+confirm that you probably shouldn't unless you know what you are doing.)

Zash (Author):

I pushed a thing with --force, would a separate sub-command be better?

Contributor:

I think so, because otherwise some people might just always run with --force, and that would defeat the purpose.

Zash (Author):

Pushed a ck8s reset-upgrade command to bypass some of the argument-related logic that ck8s upgrade ... does. How does it look?

@simonklb (Contributor) commented Jan 24, 2025

> Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

> But if it is in-cluster it doesn't say anything, as the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is that config you mean. Or we have to move the entire config in-cluster. Otherwise, if someone has done prepare, someone else without that config can run apply.

Yes, you're right, the inherent issue of working with distributed configuration files still applies. At least now we can make sure that the cluster has been migrated to a new version, which is a big win.

I think my main concern with the ck8sVersion config is that it leads to a false sense of security and a bad habit of thinking that just running ck8s init means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

bin/ck8s Outdated
Comment on lines 157 to 175
version)
  shift
  case "${1:-both}" in
  both)
    log_info "sc version:" "$(get_version "sc")"
    log_info "wc version:" "$(get_version "wc")"
    ;;
  sc | wc)
    log_info "${1} version:" "$(get_version "${1}")"
    ;;
  esac
  ;;
Contributor:

Nice to have this command. I think it could be even better if it also printed the version in the config, so you can easily see if they are the same. Does that sound good to you as well?
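
For illustration, the sc | wc arm could then look something like this (get_config_version is a hypothetical helper; see the merge sketch further down):

sc | wc)
  # Show the in-cluster version next to the configured one.
  log_info "${1} version:" \
    "$(get_version "${1}") (config: $(get_config_version "${1}"))"
  ;;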

Zash (Author):

Sounds good.

Should it just be the one version in defaults/common-config.yaml, or are there cases where wc and sc may differ?

Contributor:

I think that the only logical place to override the version is in common. It might be technically possible to have it in sc or wc, but that is not intended.

So just comparing with common should be fine, but doing the regular merge of sc with common and wc with common might be safest.
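
A sketch of that "regular merge" using yq v4 (the file layout under CK8S_CONFIG_PATH is an assumption; .global.ck8sVersion is from this thread):

# Effective version for one cluster: cluster config merged over common.
get_config_version() {
  local cluster="${1}" # "sc" or "wc"
  yq eval-all \
    'select(fileIndex == 0) * select(fileIndex == 1) | .global.ck8sVersion' \
    "${CK8S_CONFIG_PATH}/common-config.yaml" \
    "${CK8S_CONFIG_PATH}/${cluster}-config.yaml"
}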

Comment on lines 285 to 292
cluster_version="$(get_apps_version "${1}" 2>/dev/null || true)"
if [ -n "${cluster_version}" ]; then
  log_info "Currently running ${cluster_version} according to cluster"
  if [ "${common_override}" != "null" ] && [ "${cluster_version%.*}" != "${common_override}" ]; then
    log_warn "Version mismatch, cluster ${cluster_version}, config ${common_override}"
  fi
fi

Contributor:

Question: Could you elaborate on why we have this block? If I understand it correctly then it will warn if there is a version override in the common config and that is not the same as the applied version. But it will not exit and is not part of a prompt, so it would just be informational when looking at what is happening.

Could it perhaps make sense to merge this check into the prompt above it? Then you would still get this information, but it can be used as more context to decide if you want to abort or not. What do you think of that?

Zash (Author):

Hm, probably missed that it's the override config it checks against. The intent was probably to check the pre-upgrade config version (defaults/common-config.yaml → .global.ck8sVersion) against the in-cluster version (if available). E.g. if you try to upgrade to v0.43 and your config says v0.42 but your cluster says v0.41 or similar. This gets complicated and hard to think about since the config is changed by ck8s init.

Maybe just drop this, or what do you think is the way forward?

Contributor:

I'm not sure what the best way forward is.
Merging it with the block above might still be best, so that you get a chance to abort. Because I think that there is some value to having this check.
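
One possible shape of that merge (a sketch only; the prompt wiring and exact messages are assumptions):

# Surface the in-cluster version as context, then give the operator a
# single chance to abort instead of a separate informational warning.
if [ -n "${cluster_version}" ]; then
  log_info "Currently running ${cluster_version} according to cluster"
fi
log_info "Upgrading to ${CK8S_TARGET_VERSION}"
read -r -p "Continue? [y/N] " reply
if [[ ! "${reply}" =~ ^[Yy]$ ]]; then
  log_fatal "Aborted by user"
fi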

Zash (Author):

I moved the block above the abort check but removed the mismatch check, since it would be comparing the to-be-upgraded-to version recorded by ck8s init to the currently running version. That wasn't the intent, and it would always mismatch unless you try to migrate a second time.

@aarnq (Contributor) commented Feb 12, 2025

> Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

> But if it is in-cluster it doesn't say anything, as the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is that config you mean. Or we have to move the entire config in-cluster. Otherwise, if someone has done prepare, someone else without that config can run apply.

> Yes, you're right, the inherent issue of working with distributed configuration files still applies. At least now we can make sure that the cluster has been migrated to a new version, which is a big win.

> I think my main concern with the ck8sVersion config is that it leads to a false sense of security and a bad habit of thinking that just running ck8s init means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

If that is the concern, then we should change how ck8sVersion is handled. I do agree that this is an issue: init is not supposed to be the prepare step.

Zash added 3 commits February 19, 2025 14:28
This is meant to atomically record the state transition into a migration (by creation) and then out of it (by deletion).

The general apps version is recorded at the end, and compared to the config version on next upgrade.
This reverts commit 53ac173.

Decided to go with the way where it starts tracking at the prepare stage.
Zash added 21 commits February 19, 2025 14:28
should apply have one too?
separate command to discourage always running with --force
going through ck8s upgrade invokes a number of behaviors that are not needed
did not mean to commit this
one of the config checks was wrong, better to only have one
it's going to mismatch every time you upgrade because it compares the version you have with the version you are upgrading to
since ck8s init changes it
@Zash Zash force-pushed the ka/in-cluster-version branch from 608960d to 4150a95 Compare February 19, 2025 13:28
@Zash (Contributor, Author) commented Feb 19, 2025

> I think my main concern with the ck8sVersion config is that it leads to a false sense of security and a bad habit of thinking that just running ck8s init means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

> If that is the concern, then we should change how ck8sVersion is handled. I do agree that this is an issue: init is not supposed to be the prepare step.

Would it be enough to add something here that checks in the apply steps that the config version == cluster version, or do we need to rethink migrations from ck8s init onwards?

@aarnq (Contributor) commented Feb 24, 2025

> I think my main concern with the ck8sVersion config is that it leads to a false sense of security and a bad habit of thinking that just running ck8s init means you're good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

> If that is the concern, then we should change how ck8sVersion is handled. I do agree that this is an issue: init is not supposed to be the prepare step.

> Would it be enough to add something here that checks in the apply steps that the config version == cluster version, or do we need to rethink migrations from ck8s init onwards?

Or just move the update of ck8sVersion to be done as a final step when upgrade prepare completes.
Init should still be able to set it in a new config though.
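
A sketch of that ordering (run_prepare_steps is a hypothetical stand-in for the existing prepare loop; defaults/common-config.yaml and .global.ck8sVersion are from this thread, the CK8S_CONFIG_PATH prefix is assumed):

# Only stamp the config version after every prepare step has
# succeeded, so ck8s init alone never marks a migration as prepared.
run_prepare_steps "${CK8S_TARGET_VERSION}"
yq eval --inplace \
  ".global.ck8sVersion = \"${CK8S_TARGET_VERSION}\"" \
  "${CK8S_CONFIG_PATH}/defaults/common-config.yaml"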

bin/ck8s Outdated
@@ -35,7 +35,9 @@ usage() {
echo " update-ips <wc|sc|both> <apply|dry-run> Automatically fetches and applies the IPs for network policies" 1>&2
echo " upgrade <wc|sc|both> <vX.Y> apply runs all apply steps upgrading the environment" 1>&2
echo " upgrade <wc|sc|both> <vX.Y> prepare runs all prepare steps upgrading the configuration" 1>&2
echo " unlock-upgrade <wc|sc|both> reset a migration to allow trying again" 1>&2
Contributor:

Nit, could this be upgrade-unlock instead? Makes similar commands stay together.
Or perhaps upgrade <wc|sc|both> unlock to be more similar in structure to the rest of the upgrade commands.

Zash (Author):

> Nit, could this be upgrade-unlock instead?

Sure.

> Or perhaps upgrade <wc|sc|both> unlock

Finding a good way to bypass the existing path that requires a version number was a bit awkward.

[[ "${2}" =~ ^(wc|sc|both)$ ]] || usage
export CK8S_CLUSTER="${2}"
"${here}/upgrade.bash" "unlock"
;;
Zash (Author):

  echo "  upgrade <wc|sc|both> unlock                     reset a migration to allow trying again" 1>&2
####
upgrade)
  [[ "${2}" =~ ^(wc|sc|both)$ ]] || usage
  if [[ "${3}" == "unlock" ]]; then
    export CK8S_CLUSTER="${2}"
    "${here}/upgrade.bash" "unlock"
    exit 0
  fi
  [[ "${3}" =~ ^(v[0-9]+\.[0-9]+)$ ]] || usage
  [[ "${4}" =~ ^(prepare|apply)$ ]] || usage
  check_tools
  export CK8S_CLUSTER="${2}"
  "${here}/upgrade.bash" "${3}" "${4}"
  ;;

I guess this would work, though it feels a bit awkward to jump out in the middle here. @aarnq?

Contributor:

Depends on how you write it; you don't have to jump out in the middle.

[[ "${2}" =~ ^(wc|sc|both)$ ]] || usage
[[ "${3}" =~ ^(v[0-9]+\.[0-9]+|unlock)$ ]] || usage
if [[ "${3}" != "unlock" ]]; then
  [[ "${4}" =~ ^(prepare|apply)$ ]] || usage
fi
check_tools
export CK8S_CLUSTER="${2}"
"${here}/upgrade.bash" "${3}" "${4:-}"

@Zash Zash marked this pull request as draft February 24, 2025 13:49
Successfully merging this pull request may close these issues.

[3] Add applied apps version to cluster state