-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] KEP-3926: show how we address client cache inconsistency issue #4949
base: master
Are you sure you want to change the base?
Conversation
tkashem
commented
Nov 5, 2024
- One-line PR description:
- Issue link:
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: tkashem The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
7bb5a5a
to
d1f216d
Compare
A client backed up an informer already has the object in its cache, since the | ||
client never receives a `watch.DELETED` event the object remains in the lsiter | ||
cache. This creates an inconsistency - retrieving the object from the cache | ||
yields the object, but if we get it from the storage we see a `corrupt object` | ||
error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the proposed solution, this means that all existing clients will remain in a corrupted state indefinitely without the ability to recover. To have a safe rollout for our own clients (not even considering external clients yet), this means we need to promote handling for this special error to locked-to-true status until every supported kubelet (n-3) has the value locked-to-true before we can start enabling the server-side capability. Otherwise we can end up with unrecoverable corruption, correct?
3c01a2e
to
1064dac
Compare
A client backed up by an informer already has the object in its cache, since the | ||
client never receives a `watch.DELETED` event the object remains in the lsiter | ||
cache. This creates an inconsistency - retrieving the object from the cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I can parse the sentence. You are saying the cache will be stuck?
the old cache be replaced. | ||
|
||
- Pros: the existing clients will work without any code change | ||
- Cons: relisting is expensive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is acceptable for the situation of a single object being unsafely deleted. Wondering, for the usecase of a lost encryption key, aren't we confronted with potentially many objects? Each will cause a relist 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the whole discussion about downsides of the approaches, wondering about the alternative of doing a manual delete directly through etcd. Where are we with that today? How does the apiserver and clients (reflector) behave in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe manually deleting the etcd key directly will have the same effect, etcd will relay the delete event to any client watching for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words: we don't regress even if we need a client change? That is crucial IMO and should be documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rereading your option description: A does not need that client change. And your implementation PR kubernetes/kubernetes#127513 has no client-go changes (beyond the added test).
1064dac
to
6f31437
Compare
yields the object, but if we get it from the storage we see a `corrupt object` | ||
error. | ||
|
||
There are a few factors we need to consider to understan how the client is impacted: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few factors we need to consider to understan how the client is impacted: | |
There are a few factors we need to consider to understand how the client is impacted: |
6f31437
to
397bba3
Compare
397bba3
to
2be03c0
Compare
@tkashem: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |