Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support kubernetes workload identity #4

Closed
wants to merge 11 commits into from

Conversation

testower
Copy link

@testower testower commented Mar 24, 2023

This solves issue #3

Summary

What this PR aims to achieve is to make it possible to use the library in a GKE container that uses workload identity, instead of a credentials json file.

Here is the outline of the solution:

  1. Add a configurable option to use workload identity (detect presence of config)
  2. If enabled, use the following strategy to get a bearer token:
    a. Perform an HTTP GET request to http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token with the header "Metadata-Flavor: Google"
    b. Retrieve token and expiry from the response : { "access_token": "...", "expires_in":3090,"token_type":"Bearer" }
    c. Update the oauth dict

The cleanest way to achieve this is probably to create a new module called "workload_identity_client", along with a new configuration table for it. The client can have the same "get_oauth_token", perhaps renamed to just "get_token", so either client can be passed in to request constructor from producer.

@testower testower marked this pull request as draft March 24, 2023 12:27
@ankneo ankneo requested a review from Vasu7052 March 24, 2023 12:38
@testower testower force-pushed the oauth-token-from-path branch 4 times, most recently from 1be21da to 2eae9c2 Compare March 25, 2023 12:10
@testower testower force-pushed the oauth-token-from-path branch from 2eae9c2 to 21baa11 Compare March 25, 2023 12:12
Caveat: Unable to use callbacks to assert success because of race condition
@testower testower marked this pull request as ready for review March 25, 2023 22:03
@testower
Copy link
Author

Everything seems to be working as expected locally - I still need to test it in GKE, I will do that on Monday.

@testower testower force-pushed the oauth-token-from-path branch from 51aeb54 to 7a2efaf Compare March 27, 2023 09:46
@testower
Copy link
Author

Today I have tested this live in our dev environment and everything seems to be working fine, except one little caveat. The default token_url does not work because nginx is not able to resolve metadata.google.internal.

I have tried the following solutions:

  • enable the local resolver (local=on)
  • add the VM's internal dns to the resolver (169.254.169.254)

Both of these works some, or even most, of the time, but intermittently fail with "metadata.google.internal could not be resolved (3: Host not found)".

However, using the internal name server address instead of metadata.google.internal in the toke_url, seems to always work.

@testower testower force-pushed the oauth-token-from-path branch from fee68c0 to 7a2efaf Compare March 27, 2023 10:15
@testower
Copy link
Author

Another issue I am seeing now is quite a few of these in my logs:

producer.lua:149: _timer_flush(): failed to create timer at _timer_flush, err: too many pending timers

I have not seen them before, so therefore suspecting it's something with my code. Trying to investigate.

@testower
Copy link
Author

Update: The issue with pending timers were resolved when I reverted to the default settings for the producer config. My hypothesis is that a combination of a short timer interval (1000ms) together with my updates (token retrieval?) resulted in a build up of timers. I'm just wondering, is this the same issue as this on the kafka project: doujiang24/lua-resty-kafka#22 ? Meanwhile, I have to fine-tune the parameters, because the defaults gave me issues with buffer overflow in production.

@testower
Copy link
Author

Update: Seems like the fix of reverting to default settings only postponed the problem. It now runs for about 20-30 minutes without issues before starting to issue the same warnings (too many pending timers).

@testower
Copy link
Author

Update: Issue resolved, it was a silly mistake during testing where I used dofile instead of require on the producer, which resultet in an increasing number of producer instances with runaway timers.

This PR is now ready for review 🎉

@testower testower changed the title Support getting token from path Support kubernetes workload identity Mar 29, 2023
@testower
Copy link
Author

@Vasu7052 Is this something you are interested in as a feature?

@testower testower closed this Apr 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants