Move webhook registration behind feature gate flag #5099

bryan-cox · 2024-08-28T15:24:14Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
Move webhook registration behind feature gate flags similar to controller registration.

Without this PR, in a self-managed / externally managed infrastructure setup, excluding the CRDs behind the MachinePool and ASOAPI feature flags produces an error because the webhooks for them are still registered.

E0828 10:05:27.972237       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
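
A minimal sketch of the idea in this description, not the actual diff to main.go: each webhook is registered only when the feature gate that guards its kind is enabled, the same way controllers are. The `webhookSetup` type, the `registerWebhooks` helper, and the plain `map[string]bool` of gates are hypothetical names for illustration; the real code works against the controller-runtime manager and CAPZ's `feature.Gates`. Treating AzureCluster as ungated is an assumption made for the example.

```go
package main

import "fmt"

// webhookSetup pairs a resource kind with the feature gate that guards it.
// An empty gate means the webhook is always registered. (Hypothetical type.)
type webhookSetup struct {
	kind string
	gate string
}

// registerWebhooks returns the kinds whose webhooks would be registered for
// the given gate settings. In a real manager, each returned kind would stand
// for one webhook setup call made during startup.
func registerWebhooks(setups []webhookSetup, gates map[string]bool) []string {
	var registered []string
	for _, s := range setups {
		if s.gate != "" && !gates[s.gate] {
			// Gate disabled: skip registration entirely, so a missing CRD
			// for this kind is never touched.
			continue
		}
		registered = append(registered, s.kind)
	}
	return registered
}

func main() {
	setups := []webhookSetup{
		{kind: "AzureCluster"}, // assumed ungated for this example
		{kind: "AzureMachinePool", gate: "MachinePool"},
	}
	// Mirrors --feature-gates=MachinePool=false.
	gates := map[string]bool{"MachinePool": false}
	fmt.Println(registerWebhooks(setups, gates))
}
```

With the MachinePool gate disabled, only the ungated kind is returned, so nothing is set up for a kind whose CRD may have been left out of the install.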

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

cherry-pick candidate

TODOs:

squashed commits
includes documentation
adds unit tests

Release note:

Moves webhook registration behind feature gate flags, as is already done for controller registration.

k8s-ci-robot · 2024-08-28T15:24:24Z

Hi @bryan-cox. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

muraee · 2024-08-28T15:31:59Z

/ok-to-test

nojnhuh · 2024-08-29T15:13:54Z

We use the webhooks to forbid creating resources disabled by feature flags. That's also what CAPI does so I think we should align with that: https://github.com/kubernetes-sigs/cluster-api/blob/be86b82e7e30a844bca141ff8bcdc450b0499549/exp/internal/webhooks/machinepool.go#L168. Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?

This seems fine as long as users do some extra work to ensure those CRDs are not installed at all when the feature flags are disabled, but that would force users to adapt to keep the existing behavior and clusterctl doesn't make that easy.

Are you seeing any adverse behavior besides the error message?

bryan-cox · 2024-08-29T15:58:56Z

> We use the webhooks to forbid creating resources disabled by feature flags. That's also what CAPI does so I think we should align with that: https://github.com/kubernetes-sigs/cluster-api/blob/be86b82e7e30a844bca141ff8bcdc450b0499549/exp/internal/webhooks/machinepool.go#L168. Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?
>
> This seems fine as long as users do some extra work to ensure those CRDs are not installed at all when the feature flags are disabled, but that would force users to adapt to keep the existing behavior and clusterctl doesn't make that easy.
>
> Are you seeing any adverse behavior besides the error message?

We aren't using AzureMachinePool. Yeah, we are seeing more than just the log message; the CAPZ pod restarts constantly. Here are some additional logs before the pod restarts:

E0829 15:50:31.089094       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
I0829 15:50:38.588560       1 azuremachine_controller.go:243] "Reconciling AzureMachine" logger="controllers.AzureMachineReconciler.reconcileNormal" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine" AzureMachine="clusters-generic-hc/generic-hc-9npwz-8z465" namespace="clusters-generic-hc" name="generic-hc-9npwz-8z465" reconcileID="743788a0-e979-4c1e-9ca4-0c854d575fc0" x-ms-correlation-request-id="0951596f-73b2-4a57-801b-40faca63ef50"
I0829 15:50:38.809896       1 azuremachine_controller.go:243] "Reconciling AzureMachine" logger="controllers.AzureMachineReconciler.reconcileNormal" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine" AzureMachine="clusters-generic-hc/generic-hc-9npwz-7p4fb" namespace="clusters-generic-hc" name="generic-hc-9npwz-7p4fb" reconcileID="f9e21048-ea5d-44bf-9c2d-195d7ad86e74" x-ms-correlation-request-id="1e6492aa-fe1d-413c-9cac-292107e030f7"
E0829 15:50:41.091628       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
E0829 15:50:41.235638       1 controller.go:203] "Could not wait for Cache to sync" err="failed to wait for ASOSecret caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.AzureManagedControlPlane" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.235695       1 internal.go:516] "Stopping and waiting for non leader election runnables"
I0829 15:50:41.235829       1 internal.go:520] "Stopping and waiting for leader election runnables"
I0829 15:50:41.235949       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.236026       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate"
I0829 15:50:41.236232       1 controller.go:242] "All workers finished" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate"
I0829 15:50:41.236158       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.236386       1 controller.go:242] "All workers finished" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.236177       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.236823       1 controller.go:242] "All workers finished" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.237036       1 controller.go:242] "All workers finished" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.237121       1 internal.go:528] "Stopping and waiting for caches"
I0829 15:50:41.237583       1 internal.go:532] "Stopping and waiting for webhooks"
I0829 15:50:41.237981       1 server.go:249] "Shutting down webhook server with timeout of 1 minute" logger="controller-runtime.webhook"
I0829 15:50:41.238191       1 internal.go:535] "Stopping and waiting for HTTP servers"
I0829 15:50:41.238323       1 server.go:231] "Shutting down metrics server with timeout of 1 minute" logger="controller-runtime.metrics"
I0829 15:50:41.238458       1 server.go:43] "shutting down server" kind="health probe" addr="[::]:9440"
I0829 15:50:41.238568       1 internal.go:539] "Wait completed, proceeding to shutdown the manager"
E0829 15:50:41.238677       1 main.go:353] "problem running manager" err="failed to wait for ASOSecret caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.AzureManagedControlPlane" logger="setup"

We have the MachinePool feature turned off in our pod deployment:

      containers:
      - args:
        - --namespace=$(MY_NAMESPACE)
        - --leader-elect=true
        - --feature-gates=MachinePool=false
...
        name: manager

bryan-cox · 2024-08-29T16:00:04Z

FWIW the machines do get provisioned and join our cluster. The CAPZ pod just consistently restarts.

bryan-cox · 2024-08-29T16:25:43Z

CAPA seems to follow this same pattern as well - https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/4507c0bc7371dd44e7b7b719c393f86452be60dd/main.go#L323.

mboersma

/lgtm
/assign @nojnhuh

This is causing pod restarts and the fix follows an approach similar to CAPA's, so I think it's reasonable.

k8s-ci-robot · 2024-09-26T16:21:17Z

LGTM label has been added.

Git tree hash: 11c709afba588489dedad6cf904e145c3451676a

codecov · 2024-09-27T17:39:29Z

Codecov Report

Attention: Patch coverage is 0% with 50 lines in your changes missing coverage. Please review.

Project coverage is 52.98%. Comparing base (dbc4d54) to head (32f73b1).
Report is 43 commits behind head on main.

Files with missing lines	Patch %	Lines
main.go	0.00%	50 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #5099   +/-   ##
=======================================
  Coverage   52.98%   52.98%           
=======================================
  Files         273      273           
  Lines       29197    29145   -52     
=======================================
- Hits        15469    15443   -26     
+ Misses      12926    12900   -26     
  Partials      802      802


nojnhuh · 2024-09-27T18:42:34Z

If they're no longer reachable, can we remove the checks in the webhooks that the corresponding feature gates for resources are enabled? e.g.

cluster-api-provider-azure/exp/api/v1beta1/azuremachinepool_webhook.go

Lines 75 to 82 in 83f3f66

    
    // NOTE: AzureMachinePool is behind MachinePool feature gate flag; the webhook
    // must prevent creating new objects in case the feature flag is disabled.
    if !feature.Gates.Enabled(capifeature.MachinePool) {
        return nil, field.Forbidden(
            field.NewPath("spec"),
            "can be set only if the MachinePool feature flag is enabled",
        )
    }

bryan-cox · 2024-10-23T11:55:19Z

> If they're no longer reachable, can we remove the checks in the webhooks that the corresponding feature gates for resources are enabled? e.g.
>
> cluster-api-provider-azure/exp/api/v1beta1/azuremachinepool_webhook.go
>
> Lines 75 to 82 in 83f3f66
>
>     // NOTE: AzureMachinePool is behind MachinePool feature gate flag; the webhook
>     // must prevent creating new objects in case the feature flag is disabled.
>     if !feature.Gates.Enabled(capifeature.MachinePool) {
>         return nil, field.Forbidden(
>             field.NewPath("spec"),
>             "can be set only if the MachinePool feature flag is enabled",
>         )
>     }

@nojnhuh - I don't understand. Could you clarify a bit more? Wouldn't those checks still be valid since those are behind a featuregate flag?

nojnhuh · 2024-10-23T15:36:53Z

> > If they're no longer reachable, can we remove the checks in the webhooks that the corresponding feature gates for resources are enabled? e.g.
> >
> > cluster-api-provider-azure/exp/api/v1beta1/azuremachinepool_webhook.go
> >
> > Lines 75 to 82 in 83f3f66
> >
> >     // NOTE: AzureMachinePool is behind MachinePool feature gate flag; the webhook
> >     // must prevent creating new objects in case the feature flag is disabled.
> >     if !feature.Gates.Enabled(capifeature.MachinePool) {
> >         return nil, field.Forbidden(
> >             field.NewPath("spec"),
> >             "can be set only if the MachinePool feature flag is enabled",
> >         )
> >     }
>
> @nojnhuh - I don't understand. Could you clarify a bit more? Wouldn't those checks still be valid since those are behind a featuregate flag?

If we only start the webhooks in main.go when the feature gate is enabled, is it possible for the feature gate to be disabled when we're inside the webhook? The feature gates can't be toggled at runtime AFAIK.
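
The reasoning above can be sketched as follows, assuming feature gates are read once at startup and never change afterwards. All names here (`gateSet`, `validateCreate`, `startWebhooks`) are hypothetical stand-ins, not CAPZ or controller-runtime APIs. The point of the sketch is that once registration itself is gated, the handler only exists when the gate is enabled, so the check inside the handler cannot fail.

```go
package main

import (
	"errors"
	"fmt"
)

// gateSet is a snapshot of feature gates taken once at startup.
// (Hypothetical type; assumes gates cannot be toggled at runtime.)
type gateSet map[string]bool

// validateCreate mimics a webhook that rejects objects when its gate is off.
func validateCreate(gates gateSet) error {
	if !gates["MachinePool"] {
		return errors.New("can be set only if the MachinePool feature flag is enabled")
	}
	return nil
}

// startWebhooks returns the handler only when the gate is enabled, mirroring
// registration that is gated at startup. A nil handler means "not registered".
func startWebhooks(gates gateSet) func() error {
	if !gates["MachinePool"] {
		return nil
	}
	return func() error { return validateCreate(gates) }
}

func main() {
	for _, enabled := range []bool{true, false} {
		handler := startWebhooks(gateSet{"MachinePool": enabled})
		if handler == nil {
			fmt.Printf("MachinePool=%v: webhook not registered\n", enabled)
			continue
		}
		// Reached only when the gate is enabled, so the gate check inside
		// validateCreate always passes here.
		fmt.Printf("MachinePool=%v: validateCreate error = %v\n", enabled, handler())
	}
}
```

In this model, a disabled gate leaves no handler to call, so nothing rejects the object at admission time. That matches the concern raised earlier in the thread about installations where the CRD is present but the gate is off.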

k8s-ci-robot · 2024-10-23T18:26:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from mboersma. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bryan-cox · 2024-11-01T12:52:38Z

/test pull-cluster-api-provider-azure-e2e-aks

mboersma

/lgtm

k8s-ci-robot · 2024-11-01T19:36:27Z

LGTM label has been added.

Git tree hash: 802baeabc173b13515eb73bc8556cf322c7e37db

bryan-cox · 2024-11-06T12:22:27Z

Hey @nojnhuh 👋🏻 - if you're good with the changes, can I get a /approve on this PR please?

nojnhuh · 2024-11-07T16:48:27Z

Sorry, still working my way back to this.

/assign

nojnhuh · 2024-11-12T22:33:08Z

> Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?

^ (and when the AzureMachinePool CRD is installed, as it would be for default CAPZ installations?)

@bryan-cox Can you confirm that this is the case?

bryan-cox · 2024-11-14T17:46:43Z

> > Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?
>
> ^ (and when the AzureMachinePool CRD is installed, as it would be for default CAPZ installations?)
>
> @bryan-cox Can you confirm that this is the case?

@nojnhuh I can try and take a look. The project I'm on doesn't use that CRD in our self-managed setup.

bryan-cox · 2024-11-26T11:40:50Z

> > > Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?
> >
> > ^ (and when the AzureMachinePool CRD is installed, as it would be for default CAPZ installations?)
> >
> > @bryan-cox Can you confirm that this is the case?
>
> @nojnhuh I can try and take a look. The project I'm on doesn't use that CRD in our self-managed setup.

@nojnhuh - Sorry for the delay, but I tested this yesterday. If I install the AzureMachinePool CRD, there are no errors in the CAPZ pod, and machines are provisioned and join a nodepool in my self-managed setup.

However, if the AzureMachinePool CRD isn't installed, the issue is still there: the pod restarts continually.

k8s-ci-robot added the release-note, kind/bug, and cncf-cla: yes labels Aug 28, 2024

k8s-ci-robot requested review from jackfrancis and Jont828 August 28, 2024 15:24

k8s-ci-robot added the needs-ok-to-test label Aug 28, 2024

k8s-ci-robot added the size/L label Aug 28, 2024

bryan-cox commented on main.go Aug 28, 2024 (outdated, resolved)

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Aug 28, 2024

bryan-cox mentioned this pull request Aug 28, 2024

HOSTEDCP-1921: Remove unused CAPZ CRDs from HyperShift install openshift/hypershift#4618

Closed

2 tasks

mboersma approved these changes Sep 26, 2024

k8s-ci-robot assigned nojnhuh and mboersma Sep 26, 2024

k8s-ci-robot added the lgtm label Sep 26, 2024

bryan-cox force-pushed the fix-webhook-registration branch from 83f3f66 to 2f2e523 Compare October 23, 2024 18:26

k8s-ci-robot removed the lgtm label Oct 23, 2024

k8s-ci-robot requested review from mboersma and nojnhuh October 23, 2024 18:26

k8s-ci-robot added the needs-rebase label Oct 23, 2024

bryan-cox force-pushed the fix-webhook-registration branch from 2f2e523 to 362e74d Compare October 23, 2024 19:09

k8s-ci-robot removed the needs-rebase label Oct 23, 2024

bryan-cox force-pushed the fix-webhook-registration branch from 362e74d to 26db138 Compare October 24, 2024 12:48

Move webhook registration behind feature gate flag

32f73b1

Move webhook registration behind feature gate flags similar to controller registration. Signed-off-by: Bryan Cox <[email protected]>

bryan-cox force-pushed the fix-webhook-registration branch from 26db138 to 32f73b1 Compare October 24, 2024 13:45

mboersma approved these changes Nov 1, 2024

k8s-ci-robot added the lgtm label Nov 1, 2024

nawazkh added this to the v1.18 milestone Nov 7, 2024

bryan-cox mentioned this pull request Nov 15, 2024

REQUEST: New membership for @bryan-cox kubernetes/org#5260

Closed

11 tasks

dtzar modified the milestones: v1.18, next Nov 19, 2024

bryan-cox mentioned this pull request Nov 21, 2024

Enable granularity for which controllers to run #5294

Open
