
OCPBUGS-54611: pkg/operator/status: Drop kubelet skew guard #4970

Merged
merged 1 commit into openshift:main on Apr 9, 2025

Conversation

@wking (Member) commented Apr 3, 2025

The kubelet skew guard is from 1471d2c (#2658). But the Kube API server also landed a similar guard in
openshift/cluster-kube-apiserver-operator@9ce4f74775 (openshift/cluster-kube-apiserver-operator#1199).
openshift/enhancements@0ba744e750 (openshift/enhancements#762) had shifted the proposal from MCO-guards to KAS-guards, so I'm not clear on why the MCO guard landed. This pull request drops it to consolidate around the KAS-side guard.

Closes: OCPBUGS-54611

- What I did

Dropped the MCO's kubelet-skew guard, because the Kube API server's kubelet-skew guard is enough on its own.

- How to verify it

  1. Install a 4.19 cluster.
  2. Create a MachineConfigPool where you override the RHCOS image to use one from 4.17 (maybe there's some fancy MCO way to do this?).
  3. The cluster's kube-apiserver ClusterOperator will be Upgradeable=False, and oc adm upgrade will point out the old Nodes. Without this fix, the machine-config ClusterOperator will be Upgradeable=True, but with a reason and message complaining about the old Node versions. With this fix, the machine-config ClusterOperator will still be Upgradeable=True, without the old-kubelet/Node complaints. Both conditions can be inspected with the commands sketched below.
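
For example (a sketch using standard oc jsonpath queries; the exact condition content will vary by cluster):

  $ oc get clusteroperator kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}{"\n"}'
  $ oc get clusteroperator machine-config -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}{"\n"}'
  $ oc adm upgrade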

- Description for the changelog

Probably not worth a change-log entry, because so few folks are likely to bump up against these skew constraints, and we're not adding or removing a guard at the OCP level, just simplifying the OCP-scoped messaging by consolidating around the KAS-side wording.

@openshift-ci-robot added the jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) and jira/valid-bug (indicates that a referenced Jira bug is valid for the branch this PR is targeting) labels on Apr 3, 2025
@openshift-ci-robot (Contributor)

@wking: This pull request references Jira Issue OCPBUGS-54611, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


The kubelet skew guard is from 1471d2c (Bug 1986453: Check for API
server and node versions skew, 2021-07-27, openshift#2658).  But the Kube API
server also landed a similar guard in
openshift/cluster-kube-apiserver-operator@9ce4f74775 (add
KubeletVersionSkewController, 2021-08-26,
openshift/cluster-kube-apiserver-operator#1199).
openshift/enhancements@0ba744e750 (eus-upgrades-mvp: don't enforce
skew check in MCO, 2021-04-29, openshift/enhancements#762) had shifted
the proposal from MCO-guards to KAS-guards, so I'm not clear on why
the MCO guard landed.  This commit drops it, to consolidate around the
KAS-side guard.
@wking force-pushed the drop-kubelet-skew-guard branch from 2c9e7d6 to 0c21907 on April 4, 2025 16:36
@wking (Member, Author) commented Apr 4, 2025

The unit test case doesn't give much context on what went wrong:

helpers: TestAssertions/Fails_immediately_with_Now	0s
{Failed  }

bootstrap-unit seems to have choked on an InsightsDataGatherer CRD:

Path: /go/src/github.com/openshift/machine-config-operator/test/e2e-bootstrap/.local/share/kubebuilder-envtest/k8s/1.31.1-linux-amd64
    envtest.go:108: setup-envtest complete!
    bootstrap_test.go:80: 
        	Error Trace:	/go/src/github.com/openshift/machine-config-operator/test/e2e-bootstrap/bootstrap_test.go:80
        	Error:      	Received unexpected error:
        	            	unable to install CRDs onto control plane: unable to create CRD instances: unable to create CRD "insightsdatagathers.config.openshift.io": CustomResourceDefinition.apiextensions.k8s.io "insightsdatagathers.config.openshift.io" is invalid: spec.validation.openAPIV3Schema.properties[spec].properties[gatherConfig].properties[storage].properties[persistentVolume].properties[claim].properties[name].x-kubernetes-validations[0].rule: Invalid value: apiextensions.ValidationRule{Rule:"!format.dns1123Subdomain().validate(self).hasValue()", Message:"a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character.", MessageExpression:"", Reason:(*apiextensions.FieldValueErrorReason)(nil), FieldPath:"", OptionalOldSelf:(*bool)(nil)}: compilation failed: ERROR: <input>:1:2: undeclared reference to 'format' (in container '')
        	            	 | !format.dns1123Subdomain().validate(self).hasValue()
        	            	 | .^
        	            	ERROR: <input>:1:25: undeclared reference to 'dns1123Subdomain' (in container '')
        	            	 | !format.dns1123Subdomain().validate(self).hasValue()
        	            	 | ........................^
        	            	ERROR: <input>:1:36: undeclared reference to 'validate' (in container '')
        	            	 | !format.dns1123Subdomain().validate(self).hasValue()
        	            	 | ...................................^
        	Test:       	TestE2EBootstrap
--- FAIL: TestE2EBootstrap (5.11s)

Neither seems related to my change, so retesting both:

/test unit
/test bootstrap-unit

@wking (Member, Author) commented Apr 4, 2025

/retest-required

@djoshy (Contributor) commented Apr 7, 2025

/lgtm

Seems sane to me

/hold

Holding for pre-merge QE

@openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Apr 7, 2025
@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Apr 7, 2025
@openshift-ci bot (Contributor) commented Apr 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 7, 2025
@openshift-ci-robot (Contributor)

@wking: This pull request references Jira Issue OCPBUGS-54611, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.


@ptalgulk01

Pre-merge verification:
Verified on a 4.19 AWS IPI cluster.

Created a custom MCP using the infra pool template below:

$ oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
EOF
machineconfigpool.machineconfiguration.openshift.io/infra created

$ oc label node/ip-10-0-1-183.us-east-2.compute.internal node-role.kubernetes.io/infra=
node/ip-10-0-1-183.us-east-2.compute.internal labeled
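
To confirm the label took effect (a sketch; the selector matches the pool's nodeSelector above):

$ oc get nodes -l node-role.kubernetes.io/infra=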

Applied the MachineConfig below to pin the infra pool to the 4.17 RHCOS image.

To get the RHCOS image, run this against a 4.17 cluster:
$ oc adm release info --image-for rhel-coreos
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6074e4da57d9e290d18928335a308762ca3ef840c12ca0d81a186868656d7956
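
If a 4.17 cluster isn't handy, the same image can be read from a 4.17 release pullspec instead (a sketch; the 4.17.0 tag below is a hypothetical example):

$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.17.0-x86_64 --image-for=rhel-coreos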

$ oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: infra 
  name: os-layer-custom
spec:
  osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6074e4da57d9e290d18928335a308762ca3ef840c12ca0d81a186868656d7956
EOF
machineconfig.machineconfiguration.openshift.io/os-layer-custom created

Check that the osImageURL was applied to the rendered config:

$ oc describe mc rendered-infra-74f50497da2b65ef722900dae96812a9 | tail -n 10
        Name:     kubelet-cleanup.service
  Extensions:
  Fips:  false
  Kernel Arguments:
    systemd.unified_cgroup_hierarchy=1
    cgroup_no_v1="all"
    psi=0
  Kernel Type:   default
  Os Image URL:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6074e4da57d9e290d18928335a308762ca3ef840c12ca0d81a186868656d7956
Events:          
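
One way to wait for the pool rollout to finish before checking (a sketch; Updated is the pool's standard completion condition, and the timeout is arbitrary):

$ oc wait mcp/infra --for=condition=Updated --timeout=30m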

After the MCP update completes, check that the kube-apiserver ClusterOperator reports Upgradeable=False while machine-config reports Upgradeable=True:

$ oc get co -o yaml
...
- apiVersion: config.openshift.io/v1
  kind: ClusterOperator
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2025-04-08T06:59:28Z"
    generation: 1
    name: kube-apiserver
 .....
  status:
    conditions:
    - lastTransitionTime: "2025-04-08T10:22:49Z"
      message: 'KubeletMinorVersionUpgradeable: Unsupported kubelet minor version
        (1.30.11) on node ip-10-0-21-153.us-east-2.compute.internal is too far behind
        the target API server version (1.32.3).'
      reason: KubeletMinorVersion_KubeletMinorVersionUnsupported
      status: "False"
      type: Upgradeable
 ....
       version: 4.19.0-0.test-2025-04-08-063432-ci-ln-b1nzigk-latest
- apiVersion: config.openshift.io/v1
  kind: ClusterOperator
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2025-04-08T06:59:28Z"
    generation: 1
    name: machine-config
......
  status:
    conditions:
.....
    - lastTransitionTime: "2025-04-08T07:04:43Z"
      reason: AsExpected
      status: "True"
      type: Upgradeable
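
The same check as a one-liner (a sketch; standard jsonpath over the returned ClusterOperator list):

$ oc get co kube-apiserver machine-config -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Upgradeable")].status}{"\n"}{end}'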

@ptalgulk01

/label qe-approved

@openshift-ci bot added the qe-approved label (signifies that QE has signed off on this PR) on Apr 8, 2025
@djoshy (Contributor) commented Apr 8, 2025

/unhold

@openshift-ci bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Apr 8, 2025
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 71563e7 and 2 for PR HEAD 0c21907 in total

1 similar comment

@openshift-ci bot (Contributor) commented Apr 8, 2025

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name               Commit   Details  Required  Rerun command
ci/prow/e2e-gcp-op-ocl  0c21907  link     false     /test e2e-gcp-op-ocl
ci/prow/bootstrap-unit  0c21907  link     false     /test bootstrap-unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD ddc18e8 and 1 for PR HEAD 0c21907 in total

@openshift-merge-bot merged commit ecc0b92 into openshift:main on Apr 9, 2025
15 of 17 checks passed
@openshift-ci-robot (Contributor)

@wking: Jira Issue OCPBUGS-54611: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-54611 has been moved to the MODIFIED state.


@openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.20.0-202504091047.p0.gecc0b92.assembly.stream.el9.
All builds following this will include this PR.

@wking deleted the drop-kubelet-skew-guard branch on April 9, 2025 16:41