
Helm release unable to rollback manual changes after enabling drift detection #742

Open
sojinss4u opened this issue Jul 22, 2023 · 8 comments

Comments

@sojinss4u

sojinss4u commented Jul 22, 2023

Hello,

I am currently using Flux version v0.41.2 (Helm Controller version v0.31.2). I have enabled '--feature-gates=DetectDrift=true' on the helm-controller so that manual changes to Helm-managed deployments are automatically rolled back. After enabling this feature I see the following error on our Kyverno HelmRelease.
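(For context, enabling the feature gate boils down to adding that flag to the helm-controller container args; the kustomize patch below is only a sketch of a typical flux-system setup, not the exact configuration used here.)

patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --feature-gates=DetectDrift=true
    target:
      kind: Deployment
      name: helm-controller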

$ flux get hr | grep -i kyverno                                                               

kyverno    2.5.5+2dbac08dfae2.2    False    False    failed to diff release against cluster resources: Deployment/kyverno/kyverno-kyverno dry-run failed, reason: Invalid, error: Deployment.apps "kyverno-kyverno" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "500Mi": must be less than or equal to memory limit

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kyverno
  namespace: flux-system
spec:
  serviceAccountName: helm-controller
  storageNamespace: kyverno
  targetNamespace: kyverno
  interval: 3m
  install:
    createNamespace: false
  chart:
    spec:
      reconcileStrategy: Revision
      chart: ./helm/kyverno/v1.7.5
      sourceRef:
        kind: GitRepository
        name: kyverno-infra-kube-manifest
        namespace: flux-system
      interval: 1m
      valuesFiles:
        - ./envs/<name>/stage/stage-white/helm/kyverno/values-v1.7.5.yaml

In the HelmChart, the kyverno deployment's memory limit is 200Mi and the request is 100Mi. I manually edited the deployment in the cluster and increased the request to 500Mi and the limit to 1000Mi. After that change, Helm is unable to automatically revert it and shows the error above. I believe the error occurs because the actual memory request present in the cluster (500Mi) is greater than the memory limit present in the HelmChart (200Mi). If I make the request in the cluster less than 200Mi, the error disappears and the rollback happens automatically. However, this is a problem, as these manual changes can't be reverted by drift detection. Is there any solution to this problem?
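(For anyone trying to reproduce the drift: the manual edit I made is roughly equivalent to the following, with kubectl set resources standing in for editing the Deployment by hand; the Deployment name is taken from the error above.)

# Chart values: requests.memory=100Mi, limits.memory=200Mi.
# Bump the live Deployment in a single patch:
kubectl -n kyverno set resources deployment/kyverno-kyverno \
  --requests=memory=500Mi --limits=memory=1000Mi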
cc: @hiddeco

@somtochiama
Member

I am trying to reproduce this, @sojinss4u.
I have installed the kyverno Helm chart with the following manifest:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kyverno
  namespace: flux-system
spec:
  targetNamespace: kyverno
  releaseName: kyverno
  interval: 30m
  install:
    createNamespace: true
  chart:
    spec:
      chart: kyverno
      version: 2.6.0
      sourceRef:
        kind: HelmRepository
        name: kyverno
        namespace: flux-system
      interval: 30m

Then I edited the kyverno deployment in the cluster to use a memory limit of 1000Mi and a request of 500Mi. The controller detected the following diff:

Deployment/kyverno/kyverno diff:
.metadata.generation
-2
+3

.spec.template.spec.containers[0].resources.limits.memory
-1000Mi
+384Mi

.spec.template.spec.containers[0].resources.requests.memory
-500Mi
+128Mi

Are there any steps I might be missing?
It's pretty weird that the value on the cluster causes an error during the dry run, since the manifest used for the dry run comes from the previous release.

@sojinss4u
Author

The steps are correct. Basically, this error occurs when the limit set manually on the cluster is greater than the limit in the last release.

@somtochiama
Member

@sojinss4u I am unable to replicate your error then. The helm controller reverts the changes for me.

What memory requests/limits do you see when you run helm get manifest <release-name> --revision <last-revision> --namespace flux-system? You can get the last revision by looking at .status.lastReleaseRevision in the HelmRelease resource.
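For example (release name and namespace as in the manifest above; the revision number below is just a placeholder):

# Read the last deployed revision from the HelmRelease status
kubectl -n flux-system get helmrelease kyverno -o jsonpath='{.status.lastReleaseRevision}'

# Dump the manifest Helm stored for that revision (replace 2 with the revision printed above)
helm get manifest kyverno --revision 2 --namespace flux-system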

@rsafonseca

rsafonseca commented Nov 9, 2023

I've run into a similar issue; it seems this happens when not using the replacement strategy:

With upgrade.force = true, changes are reverted as expected.

With upgrade.force = false, drift detection enters a loop: on every iteration of spec.interval it detects the changes and issues a patch; the patch apparently fails silently, but the upgrade is logged as successful and no changes are actually applied.
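For reference, the knob in question is spec.upgrade.force on the HelmRelease; a minimal sketch:

spec:
  upgrade:
    # maps to helm upgrade --force, i.e. the replacement strategy mentioned above
    force: true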

I've run into another caveat with upgrade.force=true; I'll open a separate issue.

@rsafonseca

FYI, I only have metadata being logged on the API servers and not the full payload (that would put an immense toll on logs in our clusters), but I can see that the patch command is issued and returns a 200, while the spec of my object isn't actually changed. So most likely the issue is with the patch payload itself. If I find any further info I'll drop it here later :)

@rsafonseca

I've built a custom version of the latest helm-controller with a custom build of Helm 3.12.3 that outputs the patches it's going to send to the API server. Here are the relevant bits:

helm-controller detects drift on .spec.pod.resources.requests.cpu and triggers an upgrade:
{"level":"debug","ts":"2023-11-09T13:55:38.012Z","msg":"EtcdCluster/etcd/etcd-cluster diff:\n.metadata.generation\n-115\n+116\n\n.spec.pod.resources.requests.cpu\n-1801m\n+1800m","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"pitaya-cluster","namespace":"etcd"},"namespace":"etcd","name":"pitaya-cluster","reconcileID":"3ebe1de3-580f-4a12-8b04-46b99ba35b36"}

helm upgrade sends this patch to the API server:

{"level":"debug","ts":"2023-11-09T13:55:38.461Z","msg":"Patch EtcdCluster \"etcd-cluster\" in namespace etcd","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"pitaya-cluster","namespace":"etcd"},"namespace":"etcd","name":"pitaya-cluster","reconcileID":"3ebe1de3-580f-4a12-8b04-46b99ba35b36"}
{"level":"debug","ts":"2023-11-09T13:55:38.461Z","msg":"Patch: {\"metadata\":{\"annotations\":{\"meta.helm.sh/release-name\":\"pitaya-cluster\",\"meta.helm.sh/release-namespace\":\"etcd\"}}} ","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"pitaya-cluster","namespace":"etcd"},"namespace":"etcd","name":"pitaya-cluster","reconcileID":"3ebe1de3-580f-4a12-8b04-46b99ba35b36"}

@rsafonseca

Just to add: from my testing, this seems to affect the patching of CRDs but not of native resources like Deployment. Seems like a Helm bug to me...

@rsafonseca

Upon closer analysis, it seems this is unrelated to the original issue; I've opened #805 to address that problem specifically.
