Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backport v0.9] Drift detection/correction should omit status fields #2545

Closed
1 task done
rancherbot opened this issue Jun 21, 2024 · 2 comments
Closed
1 task done
Assignees
Milestone

Comments

@rancherbot
Copy link
Collaborator

This is a backport issue for #2521, automatically created via GitHub Actions workflow initiated by @aruiz14

Original issue body:

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Sometimes will detect false positives about modified resources, reporting differences inside the status field.
This situation is impossible to remediate (status is normally a subresource and is not part of the deployment specs), which also causes a redeployment loop when drift correction (self-healing) is active.

However, this does not happen for all resources, e.g. deployments, but I experienced when installing a chart that contained CRDs.

Expected Behavior

Changes in status fields are not reported as modified resources.

Steps To Reproduce

  1. Install Fleet
  2. Add a GitRepo whose fleet.yaml is like:
defaultNamespace: longhorn-system
helm:
  chart: longhorn-crd
  repo: https://charts.rancher.io
  releaseName: longhorn-crd
  version: "103.3.1+up1.6.2"
  1. Check how it will stay as NotReady, reporting some CRDs were modified, e.g.:
  modifiedStatus:
  - apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    name: engineimages.longhorn.io
    patch: '{"status":{"acceptedNames":{"kind":"","plural":""},"conditions":[],"storedVersions":[]}}'
  - apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    name: engines.longhorn.io
    patch: '{"status":{"acceptedNames":{"kind":"","plural":""},"conditions":[],"storedVersions":[]}}'
  - apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    name: instancemanagers.longhorn.io
    patch: '{"status":{"acceptedNames":{"kind":"","plural":""},"conditions":[],"storedVersions":[]}}'
  - apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    name: nodes.longhorn.io
    patch: '{"status":{"acceptedNames":{"kind":"","plural":""},"conditions":[],"storedVersions":[]}}'
  - apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    name: replicas.longhorn.io
    patch: '{"status":{"acceptedNames":{"kind":"","plural":""},"conditions":[],"storedVersions":[]}}'
  - apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    name: settings.longhorn.io
    patch: '{"status":{"acceptedNames":{"kind":"","plural":""},"conditions":[],"storedVersions":[]}}'
  - apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    name: volumes.longhorn.io
    patch: '{"status":{"acceptedNames":{"kind":"","plural":""},"conditions":[],"storedVersions":[]}}'

Environment

- Architecture: *
- Fleet Version: 0.9.5
- Cluster:
  - Provider: K3D
  - Options: basic setup, installed Rancher 2.8.3
  - Kubernetes Version: v1.26.9+k3s1

Logs

No response

Anything else?

Screenshot 2024-06-14 at 14 24 23

@rancherbot rancherbot added this to the v2.8-Next1 milestone Jun 21, 2024
@rancherbot rancherbot added this to Fleet Jun 21, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Fleet Jun 21, 2024
@aruiz14 aruiz14 changed the title [backport v2.8] Drift detection/correction should omit status fields [backport v0.9] Drift detection/correction should omit status fields Jun 21, 2024
@aruiz14 aruiz14 moved this from 🆕 New to 🏗 In progress in Fleet Jun 21, 2024
@aruiz14 aruiz14 moved this from 🏗 In progress to 👀 In review in Fleet Jun 21, 2024
@aruiz14
Copy link
Contributor

aruiz14 commented Jun 21, 2024

Additional QA

Problem

Some just-deployed resources will be reported as Modified by Fleet, never reaching a Ready status. The reported diff contains only changes in the status field of the resource, which is not really actionable by Fleet.

Further investigation has shown that 2 conditions must be met in order to reproduce this issue:

  1. The resource spec must explicitly set at least one field to its default value, e.g. preserveUnknownFields: false in CRDs.
  2. The same manifest also specifies a status field, despite having empty values, e.g.:
status:
  acceptedNames:
    kind: ""
    plural: ""
  conditions: []
  storedVersions: []

https://github.com/longhorn/longhorn/blob/f5109e31d29f0ad0dbcdab6870afa1916796908e/deploy/longhorn.yaml#L266-L271

Solution

Fleet Agent should ignores status fields when comparingHelm Rollback operations, used internally by Fleet to correct drift, now obey Fleet's global limit on Helm history, restricting the number of kept history items to 2.

Testing

(See repro steps above)

  1. Create a fleet.yaml which will deploy any resource matching the above criteria, e.g. the longhorn-crd chart:
defaultNamespace: longhorn-system
helm:
  chart: longhorn-crd
  repo: https://charts.rancher.io
  releaseName: longhorn-crd
  version: "103.3.1+up1.6.2"
  1. Create a GitRepo pointing to that fleet.yaml and wait for its deployment.
  • Alternatively, you can use the fleetcli to create a Bundle from the fleet.yaml file and directly apply to the cluster:
mkdir longhorn-crd-bundle && cp fleet.yaml longhorn-crd-bundle/
fleet apply -o - longhorn-crd ./longhorn-crd-bundle | kubectl -n fleet-local apply -f -
  1. Check how the Bundle status successfully becomes Ready:
❯ kubectl -n fleet-local get bundles longhorn-crd-testcli
NAME                   BUNDLEDEPLOYMENTS-READY   STATUS
longhorn-crd-testcli   1/1

@aruiz14 aruiz14 moved this from 👀 In review to Needs QA review in Fleet Jun 21, 2024
@mmartin24
Copy link
Collaborator

Verified in:

  • Rancher v2.8-d72c5157164a9eb798aab685e23af818f95804e4-head with fleet:103.1.6+up0.9.6-rc.3 and k3s v1.27.10+k3s1
  • Rancher v2.8.6-alpha3 with fleet fleet:103.1.5+up0.9.5 and later upgraded to fleet:103.1.6+up0.9.6-rc.3 and k3s v1.27.10 +k3s1

Checked using fleet.yaml provided and confirmed there is no process in modified state:

defaultNamespace: longhorn-system
helm:
  chart: longhorn-crd
  repo: https://charts.rancher.io
  releaseName: longhorn-crd
  version: "103.3.1+up1.6.2"
  • Verification in v2.8-d72c5157164a9eb798aab685e23af818f95804e4-head:
    image

  • Verication in 2.8.6-alpha3 :

... failing with old version of fleet:103.1.5+up0.9.5
image (1)

... and later succeeding when upgraded to fleet:103.1.6+up0.9.6-rc.3:
image (2)

@github-project-automation github-project-automation bot moved this from Needs QA review to ✅ Done in Fleet Jul 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

3 participants