
Fix race condition between service account availability and webhook invocation #236

Merged — 9 commits merged into aws:master on Sep 12, 2024

Conversation

@roehrijn (Contributor) commented Aug 13, 2024

In #174, a race condition is described in which the webhook tries to mutate a Pod whose service account reference has not yet been populated into the ServiceAccountCache. This results in the Pod not being mutated.

In our scenario it is a matter of only a few milliseconds before the cache is properly populated.

As an alternative to #178, this PR fixes the issue by waiting up to a certain amount of time for the ServiceAccountCache to be populated, with a notification sent upon population. Additionally, to wait only for SAs that are absent from the cache entirely (not merely missing a RoleARN), I had to restructure the ServiceAccountCache methods to return structured data instead of multiple return values, so the required information can be carried without adding two further return values.

Jan Roehrich [email protected], Mercedes-Benz Tech Innovation GmbH, legal info/Impressum

@roehrijn roehrijn requested a review from a team as a code owner August 13, 2024 19:56
@roehrijn roehrijn force-pushed the roehrijn/fix-with-notification branch 2 times, most recently from 91754ec to 79fcca1 on August 13, 2024 20:40
@roehrijn roehrijn force-pushed the roehrijn/fix-with-notification branch from 79fcca1 to ea82b2f on August 13, 2024 20:45
@roehrijn roehrijn changed the title [WIP] Fix race condition between service account availability and webhook invocation Fix race condition between service account availability and webhook invocation Aug 14, 2024
@modulitos (Contributor) left a comment

I have some minor questions/suggestions, but the overall approach looks good. I'm happy to work with you on getting this merged 😁

Thank you for the contribution!

Review threads on: pkg/handler/handler_test.go, pkg/cache/cache_test.go, pkg/cache/cache.go (multiple), pkg/cache/fake.go, pkg/handler/handler.go
@roehrijn (Contributor, Author) replied:

> I have some minor questions/suggestions, but the overall approach looks good. I'm happy to work with you on getting this merged 😁
>
> Thank you for the contribution!

Hi @modulitos. This is good news — I'm glad to see this heading toward a merge. In my eyes, every single one of your comments is valid, and I'm going to resolve them step by step. I'm also going to add a test that exercises exactly the fixed behavior. I don't yet know how long it will take, but I'll keep you posted.

@roehrijn roehrijn force-pushed the roehrijn/fix-with-notification branch from e3fe4ce to dccceff on August 21, 2024 13:53
@roehrijn (Contributor, Author) commented:

@modulitos, I have also added a test that exercises the asynchronous cache-notification behavior. I'm now waiting for your thoughts on the open topics above so I don't waste time working in the wrong direction.

Review threads on: pkg/cache/cache.go (multiple), pkg/cache/cache_test.go
@roehrijn roehrijn force-pushed the roehrijn/fix-with-notification branch from e20c44f to 18f345b on August 29, 2024 06:40
Review thread on: main.go
@modulitos (Contributor) commented Sep 10, 2024

Just adding a note here that an alternative solution for handling cache misses is to fetch the service account from the Kubernetes API server and populate the cache with it. Here's a quick list of tradeoffs between the two approaches:

1. Wait for the cache to populate via a lookup grace period (implemented in this PR):
   - pros: the grace period gives the customer control over the performance tradeoff
   - cons: adds complexity to the webhook's cache implementation
2. Fetch and populate cache misses from the API server:
   - pros: simpler implementation
   - cons: adding an RPC on every cache miss carries perf/reliability risks (are we going to make too many calls? Is GET blocked by some webhooks?)

I'm in favor of proceeding with the lookup grace period in this PR, but open to feedback.
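For contrast, option 2 above (fetch-on-miss) could look roughly like the following Go sketch. The fetcher is injected so the example runs without a cluster; in the real webhook it would stand in for a GET against the Kubernetes API server, and all names here (`fallbackCache`, `Result`, `fetcher`) are illustrative assumptions, not the project's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Result is a hypothetical structured cache entry.
type Result struct {
	RoleARN string
	Found   bool
}

// fetcher stands in for a call to the Kubernetes API server; it is
// injectable so this sketch runs without a cluster.
type fetcher func(key string) (Result, error)

type fallbackCache struct {
	mu      sync.Mutex
	entries map[string]Result
	fetch   fetcher
}

func newFallbackCache(f fetcher) *fallbackCache {
	return &fallbackCache{entries: make(map[string]Result), fetch: f}
}

// Get serves from the cache; on a miss it issues one RPC to fetch the
// service account and populates the cache with the response.
func (c *fallbackCache) Get(key string) (Result, error) {
	c.mu.Lock()
	if r, ok := c.entries[key]; ok {
		c.mu.Unlock()
		return r, nil // cache hit: no RPC
	}
	c.mu.Unlock()

	r, err := c.fetch(key) // the extra RPC: the perf/reliability risk noted above
	if err != nil {
		return Result{}, err
	}
	c.mu.Lock()
	c.entries[key] = r
	c.mu.Unlock()
	return r, nil
}

func main() {
	calls := 0
	c := newFallbackCache(func(key string) (Result, error) {
		calls++
		if key == "default/my-sa" {
			return Result{RoleARN: "arn:aws:iam::123456789012:role/demo", Found: true}, nil
		}
		return Result{}, errors.New("service account not found")
	})
	r, _ := c.Get("default/my-sa")
	c.Get("default/my-sa") // second lookup is served from the cache
	fmt.Println(r.Found, calls) // → true 1
}
```

The tradeoff is visible in the sketch: the implementation is shorter than a notify-and-wait cache, but every genuine miss costs a round trip to the API server.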

@modulitos (Contributor) left a comment

Great work! Thank you for the contribution. Fixes a bug impacting multiple users, without adding overhead to the API server 🎉

@modulitos modulitos merged commit ebac929 into aws:master Sep 12, 2024
1 check passed
@roehrijn roehrijn deleted the roehrijn/fix-with-notification branch September 13, 2024 15:24