[Kubernetes secret provider] Add cache for the secrets #3822
Conversation
This pull request does not have a backport label. Could you fix it @constanca-m? 🙏
NOTE: Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Reminder that we also need a documentation update for this change in https://www.elastic.co/guide/en/fleet/current/kubernetes_secrets-provider.html
The biggest issue I see in this PR is how a key gets removed from the cache if it is no longer referenced from the policy. Let's say the policy references kubernetes_secrets.somenamespace.somesecret.value1 and then changes to kubernetes_secrets.somenamespace.somesecret.value2. This change will keep caching kubernetes_secrets.somenamespace.somesecret.value1 forever.
To solve the problem, you should add a last-accessed time per cached key, and during each update cycle, based on an interval, remove keys that have not been accessed. That will solve the issue I described above.
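A minimal sketch of what that could look like, assuming a per-key lastAccess timestamp and a TTL checked during the update cycle (type and field names are illustrative, not the PR's actual code):

```go
package kubernetessecrets

import (
	"sync"
	"time"
)

// cacheEntry holds a secret value together with the time it was last read.
type cacheEntry struct {
	value      string
	lastAccess time.Time
}

// secretsCache is a TTL-based cache keyed by the reference used in the policy,
// e.g. "kubernetes_secrets.somenamespace.somesecret.value1".
type secretsCache struct {
	mx  sync.RWMutex
	ttl time.Duration
	m   map[string]*cacheEntry
}

// pruneExpired is meant to run during each update cycle: any key that has not
// been accessed within the TTL is dropped, so references removed from the
// policy do not stay cached forever.
func (c *secretsCache) pruneExpired(now time.Time) {
	c.mx.Lock()
	defer c.mx.Unlock()
	for key, entry := range c.m {
		if now.Sub(entry.lastAccess) > c.ttl {
			delete(c.m, key)
		}
	}
}
```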
@blakerouse thanks for this. Makes sense.
And should we give that option to the user as well, or should we enforce that time ourselves, @gizas?
Edit: I enforced it as 1h. Currently, the user cannot change it.
- Update duration config format
This is also missing unit test coverage for these additions; that needs to be added as well.
- Add ttl for both update and delete
Yes, I am still waiting on my comment about holding the lock the entire time the code is refreshing the cache. I think that is still a big issue that needs to be looked at.
I don't have another explanation for that other than this comment. Is that a blocker? @blakerouse
Took another look; there is still at least one bug, and a few cases I could spot where splitting the cache changes across separate "read lock" and "write lock" sections introduces concurrency problems.
This is because the manipulation you are doing to the cache is no longer atomic across a single function body, allowing the cache to change in the middle of the function execution when you unlock the read lock and acquire the write lock.
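Building on the cache sketch above, an illustration of how this kind of bug shows up: a presence check made under the read lock may be stale by the time the write lock is taken, so either the check has to be repeated under the write lock or the whole body kept under a single lock.

```go
// touch updates lastAccess for a key. Between RUnlock and Lock no lock is
// held, so the map may have changed (e.g. the key was pruned); the presence
// check must therefore be repeated under the write lock.
func (c *secretsCache) touch(key string, now time.Time) bool {
	c.mx.RLock()
	_, ok := c.m[key]
	c.mx.RUnlock()
	if !ok {
		return false
	}

	c.mx.Lock()
	defer c.mx.Unlock()
	entry, ok := c.m[key] // re-check under the write lock
	if !ok {
		return false
	}
	entry.lastAccess = now
	return true
}
```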
One last thought: it seems like it would be straightforward to add a configuration entry that disables the cache.
That way, if there is some unexpected surprise, we can revert to the previous behavior without needing a release.
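A sketch of what such a configuration entry could look like; the field names and config keys here are assumptions for illustration, not the provider's actual settings:

```go
// Config for the kubernetes_secrets provider (illustrative names only).
type Config struct {
	RequestTimeout time.Duration `config:"request_timeout"`
	TTLDelete      time.Duration `config:"ttl_delete"`
	DisableCache   bool          `config:"disable_cache"` // when true, fall back to a direct API request per Fetch
}
```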
- Remove goroutine
- Refactor timeout name to requestTimeout
I added this option in my last commit. I also added the proper unit test for it.
The code excerpt under discussion in the thread below:

```go
// if value is still not present in cache, it is possible we haven't tried to fetch it yet
if !ok {
	value, ok := p.addToCache(key)
```
Why grab the lock inside of the function to set the new value, and then only 4 lines below grab it again to set the lastAccess time? Why not just set lastAccess inside of addToCache and then return inside of the if statement? That would prevent the need to grab the lock twice for the same key.
> Why not just set lastAccess inside of addToCache and then return inside of the if statement?

Because then we would be changing lastAccess in two functions, when we don't need to. The logic is that lastAccess is only changed when it is required (by calling the get from cache).
To me it's worse to grab a lock, release the lock, grab the lock again, and then release it again, all in the same path. It should grab the lock, do what it needs to do, and release it.
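For reference, a hedged sketch of the single-lock path being described, building on the cache sketch earlier in the thread; the fetch callback stands in for the real API call and is an assumption:

```go
// getOrAdd returns the cached value, setting lastAccess in the same critical
// section as the lookup, so the lock is taken only once per key on the hot
// path. On a miss, the value is fetched outside the lock and inserted with
// lastAccess already set.
func (c *secretsCache) getOrAdd(key string, fetch func(string) (string, bool)) (string, bool) {
	now := time.Now()

	c.mx.Lock()
	if entry, ok := c.m[key]; ok {
		entry.lastAccess = now
		value := entry.value
		c.mx.Unlock()
		return value, true
	}
	c.mx.Unlock()

	value, ok := fetch(key) // e.g. a request to the API server; no lock held here
	if !ok {
		return "", false
	}

	c.mx.Lock()
	c.m[key] = &cacheEntry{value: value, lastAccess: now}
	c.mx.Unlock()
	return value, true
}
```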
Quality Gate passed. Kudos, no new issues were introduced! 0 new issues.
Thanks for all the changes and fixes. I am glad a solution could be created for not holding the lock the entire time a refresh loop occurs.
My last point about a hold/release followed by another hold/release is not a blocker; it just seems like it would be fewer CPU cycles to grab the lock once. I understand you're trying to be DRY with the code, but for something as simple as setting a single timestamp attribute, I think less lock contention would be worth it.
What does this PR do?
Currently, we make a request to the API Server every time we need to get the value of a secret. This issue is better explained here. The discussion is also present in the issue.
With this PR, we use a map to store the secrets like this:
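(Illustrative shape only; the actual field and type names in the PR may differ, echoing the cache sketch in the review thread above.)

```go
// Keyed by the reference used in the policy,
// e.g. "kubernetes_secrets.somenamespace.somesecret.value1".
type secretsData struct {
	value      string    // secret value fetched from the API Server
	lastAccess time.Time // used to expire entries no longer referenced by the policy
}

var cache map[string]*secretsData
```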
The secrets are accessed from the outside through the function Fetch. If we try to retrieve a secret that is not present in our map, we make a request to the API Server to obtain its value and store it in the map. This function never triggers an update of already-cached secret values.
The only difference between this approach and the current one is that we now also store the secret values.
Warning: Sometimes the secret values are not updated in time, and we may be using stale ones until the update happens. This also happens with the current implementation.
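For context on how the periodic update can avoid holding the lock for the whole refresh (the concern resolved in the review above), a hedged sketch building on the earlier cache type; updateFn stands in for the real API call and is an assumption:

```go
// refresh re-reads every cached secret without holding the lock during the
// API calls: snapshot the keys under the read lock, fetch outside any lock,
// then write the new values back under the write lock.
func (c *secretsCache) refresh(updateFn func(key string) (string, bool)) {
	c.mx.RLock()
	keys := make([]string, 0, len(c.m))
	for key := range c.m {
		keys = append(keys, key)
	}
	c.mx.RUnlock()

	for _, key := range keys {
		value, ok := updateFn(key) // network call; no lock held here
		if !ok {
			continue
		}
		c.mx.Lock()
		if entry, exists := c.m[key]; exists { // the key may have been pruned meanwhile
			entry.value = value
		}
		c.mx.Unlock()
	}
}
```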
Why is it important?
To stop overwhelming the API Server with secret requests.
Checklist
- Changelog entry added in ./changelog/fragments using the changelog tool

How to test this PR locally
Related issues
Screenshots
There should not be a change in behavior apart from a decrease in calls to the API Server. Everything else works as expected/before.