
OCPBUGS-54611: pkg/operator/status: Drop kubelet skew guard #4970

Merged
merged 1 commit into openshift:main on Apr 9, 2025

Conversation

@wking (Member) commented Apr 3, 2025

The kubelet skew guard is from 1471d2c (#2658). But the Kube API server also landed a similar guard in
openshift/cluster-kube-apiserver-operator@9ce4f74775 (openshift/cluster-kube-apiserver-operator#1199).
openshift/enhancements@0ba744e750 (openshift/enhancements#762) had shifted the proposal from MCO-guards to KAS-guards, so I'm not clear on why the MCO guard landed. This pull request drops it to consolidate around the KAS-side guard.

Closes: OCPBUGS-54611

- What I did

Dropped the MCO's kubelet-skew guard, because the Kube API server's kubelet-skew guard is enough on its own.

- How to verify it

  1. Install a 4.19 cluster.
  2. Create a MachineConfigPool where you override the RHCOS image to use one from 4.17 (maybe there's some fancy MCO way to do this?).
  3. The cluster's kube-apiserver ClusterOperator will be Upgradeable=False, and oc adm upgrade will point out the old Nodes. Without this fix, the machine-config ClusterOperator will be Upgradeable=True, but with a reason and message complaining about the old Node versions. With this fix, the machine-config ClusterOperator will still be Upgradeable=True, without the old-kubelet/Node complaints. Both conditions can be inspected with the commands sketched below.
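
For example (a sketch using standard oc jsonpath queries; the exact condition content will vary by cluster):

  $ oc get clusteroperator kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}{"\n"}'
  $ oc get clusteroperator machine-config -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}{"\n"}'
  $ oc adm upgrade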

- Description for the changelog

Probably not worth a change-log entry, because so few folks are likely to bump up against these skew constraints, and we're not adding or removing a guard at the OCP level, just simplifying the OCP-scoped messaging by consolidating around the KAS-side wording.

@openshift-ci-robot added the jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) and jira/valid-bug (indicates that a referenced Jira bug is valid for the branch this PR is targeting) labels on Apr 3, 2025
@openshift-ci-robot (Contributor)

@wking: This pull request references Jira Issue OCPBUGS-54611, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


The kubelet skew guard is from 1471d2c (Bug 1986453: Check for API
server and node versions skew, 2021-07-27, openshift#2658).  But the Kube API
server also landed a similar guard in
openshift/cluster-kube-apiserver-operator@9ce4f74775 (add
KubeletVersionSkewController, 2021-08-26,
openshift/cluster-kube-apiserver-operator#1199).
openshift/enhancements@0ba744e750 (eus-upgrades-mvp: don't enforce
skew check in MCO, 2021-04-29, openshift/enhancements#762) had shifted
the proposal from MCO-guards to KAS-guards, so I'm not clear on why
the MCO guard landed.  This commit drops it, to consolidate around the
KAS-side guard.
@wking force-pushed the drop-kubelet-skew-guard branch from 2c9e7d6 to 0c21907 on April 4, 2025 16:36
@wking (Member, Author) commented Apr 4, 2025

The unit test case doesn't give much context on what went wrong:

helpers: TestAssertions/Fails_immediately_with_Now	0s
{Failed  }

bootstrap-unit seems to have choked on an InsightsDataGatherer CRD:

Path: /go/src/github.com/openshift/machine-config-operator/test/e2e-bootstrap/.local/share/kubebuilder-envtest/k8s/1.31.1-linux-amd64
    envtest.go:108: setup-envtest complete!
    bootstrap_test.go:80: 
        	Error Trace:	/go/src/github.com/openshift/machine-config-operator/test/e2e-bootstrap/bootstrap_test.go:80
        	Error:      	Received unexpected error:
        	            	unable to install CRDs onto control plane: unable to create CRD instances: unable to create CRD "insightsdatagathers.config.openshift.io": CustomResourceDefinition.apiextensions.k8s.io "insightsdatagathers.config.openshift.io" is invalid: spec.validation.openAPIV3Schema.properties[spec].properties[gatherConfig].properties[storage].properties[persistentVolume].properties[claim].properties[name].x-kubernetes-validations[0].rule: Invalid value: apiextensions.ValidationRule{Rule:"!format.dns1123Subdomain().validate(self).hasValue()", Message:"a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character.", MessageExpression:"", Reason:(*apiextensions.FieldValueErrorReason)(nil), FieldPath:"", OptionalOldSelf:(*bool)(nil)}: compilation failed: ERROR: <input>:1:2: undeclared reference to 'format' (in container '')
        	            	 | !format.dns1123Subdomain().validate(self).hasValue()
        	            	 | .^
        	            	ERROR: <input>:1:25: undeclared reference to 'dns1123Subdomain' (in container '')
        	            	 | !format.dns1123Subdomain().validate(self).hasValue()
        	            	 | ........................^
        	            	ERROR: <input>:1:36: undeclared reference to 'validate' (in container '')
        	            	 | !format.dns1123Subdomain().validate(self).hasValue()
        	            	 | ...................................^
        	Test:       	TestE2EBootstrap
--- FAIL: TestE2EBootstrap (5.11s)

Neither seems related to my change, so retesting both:

/test unit
/test bootstrap-unit

@wking (Member, Author) commented Apr 4, 2025

/retest-required

@djoshy (Contributor) commented Apr 7, 2025

/lgtm

Seems sane to me

/hold

Holding for pre-merge QE

@openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Apr 7, 2025
@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Apr 7, 2025
@openshift-ci bot (Contributor) commented Apr 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 7, 2025
@openshift-ci-robot (Contributor)

@wking: This pull request references Jira Issue OCPBUGS-54611, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.


@ptalgulk01

Pre-merge verification:
Verified on a 4.19 AWS IPI cluster.

Created a custom MCP using the infra pool template below:

$ oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
EOF
machineconfigpool.machineconfiguration.openshift.io/infra created

$ oc label node/ip-10-0-1-183.us-east-2.compute.internal node-role.kubernetes.io/infra=
node/ip-10-0-1-183.us-east-2.compute.internal labeled
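
To confirm the label took effect (a sketch; the selector matches the pool's nodeSelector above):

$ oc get nodes -l node-role.kubernetes.io/infra=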

Applied the MachineConfig below to pin the infra pool to the 4.17 RHCOS image.

To get the RHCOS image, run this against a 4.17 cluster:
$ oc adm release info --image-for rhel-coreos
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6074e4da57d9e290d18928335a308762ca3ef840c12ca0d81a186868656d7956
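
If a 4.17 cluster isn't handy, the same image can be read from a 4.17 release pullspec instead (a sketch; the 4.17.0 tag below is a hypothetical example):

$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.17.0-x86_64 --image-for=rhel-coreos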

$ oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: infra 
  name: os-layer-custom
spec:
  osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6074e4da57d9e290d18928335a308762ca3ef840c12ca0d81a186868656d7956
EOF
machineconfig.machineconfiguration.openshift.io/os-layer-custom created

Check that the osImageURL was applied to the rendered config:

$ oc describe mc rendered-infra-74f50497da2b65ef722900dae96812a9 | tail -n 10
        Name:     kubelet-cleanup.service
  Extensions:
  Fips:  false
  Kernel Arguments:
    systemd.unified_cgroup_hierarchy=1
    cgroup_no_v1="all"
    psi=0
  Kernel Type:   default
  Os Image URL:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6074e4da57d9e290d18928335a308762ca3ef840c12ca0d81a186868656d7956
Events:          
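
One way to wait for the pool rollout to finish before checking (a sketch; Updated is the pool's standard completion condition, and the timeout is arbitrary):

$ oc wait mcp/infra --for=condition=Updated --timeout=30m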

After the MCP update completes, check that the kube-apiserver ClusterOperator reports Upgradeable=False while machine-config reports Upgradeable=True:

$ oc get co -o yaml
...
- apiVersion: config.openshift.io/v1
  kind: ClusterOperator
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2025-04-08T06:59:28Z"
    generation: 1
    name: kube-apiserver
 .....
  status:
    conditions:
    - lastTransitionTime: "2025-04-08T10:22:49Z"
      message: 'KubeletMinorVersionUpgradeable: Unsupported kubelet minor version
        (1.30.11) on node ip-10-0-21-153.us-east-2.compute.internal is too far behind
        the target API server version (1.32.3).'
      reason: KubeletMinorVersion_KubeletMinorVersionUnsupported
      status: "False"
      type: Upgradeable
 ....
       version: 4.19.0-0.test-2025-04-08-063432-ci-ln-b1nzigk-latest
- apiVersion: config.openshift.io/v1
  kind: ClusterOperator
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2025-04-08T06:59:28Z"
    generation: 1
    name: machine-config
......
  status:
    conditions:
.....
    - lastTransitionTime: "2025-04-08T07:04:43Z"
      reason: AsExpected
      status: "True"
      type: Upgradeable
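
The same check as a one-liner (a sketch; standard jsonpath over the returned ClusterOperator list):

$ oc get co kube-apiserver machine-config -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Upgradeable")].status}{"\n"}{end}'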

@ptalgulk01

/label qe-approved

@openshift-ci bot added the qe-approved label (signifies that QE has signed off on this PR) on Apr 8, 2025
@djoshy (Contributor) commented Apr 8, 2025

/unhold

@openshift-ci bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Apr 8, 2025
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 71563e7 and 2 for PR HEAD 0c21907 in total

1 similar comment

@openshift-ci bot (Contributor) commented Apr 8, 2025

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name               Commit   Details  Required  Rerun command
ci/prow/e2e-gcp-op-ocl  0c21907  link     false     /test e2e-gcp-op-ocl
ci/prow/bootstrap-unit  0c21907  link     false     /test bootstrap-unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD ddc18e8 and 1 for PR HEAD 0c21907 in total

@openshift-merge-bot merged commit ecc0b92 into openshift:main on Apr 9, 2025
15 of 17 checks passed
@openshift-ci-robot (Contributor)

@wking: Jira Issue OCPBUGS-54611: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-54611 has been moved to the MODIFIED state.


@openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.20.0-202504091047.p0.gecc0b92.assembly.stream.el9.
All builds following this will include this PR.

@wking deleted the drop-kubelet-skew-guard branch on April 9, 2025 16:41