-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2008: Graduate "Forensic Container Checkpointing" to Beta #4288
Conversation
Hi @adrianreber. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
74301a9
to
9c1211a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a. should we move beyond forensics use case(s) before moving to beta? b. should we add scenarios for managing security/encryption (contents include live memory in container..) c. how do we manage the checkpoints (rm/gc) d. do we enable use cases that require restore first?
/ok-to-test |
So @mrunalp and I discussed this. I think that the scope of this KEP will drastically change if we try to support restore as part of this KEP. There is a lot of interest in restore but I think there needs to be a different design for that case. I know there were security implications and it warrants a discussion. Could we consider promoting this KEP to beta (with gc questions answered)? And draft a new KEP to cover checkpoint/restore in more details? For promotion to beta, I think we should answer the question how do we gc checkpoints and how is storage monitored for it? One suggestion we have is to maybe consider an operator for managing checkpoints rather than putting it in upstream kubelet. |
@adrianreber Where are we on garbage collecting old container checkpoints? |
I understood it that it should be done in an operator. That sounded to me like it might be independent of this PR. |
So even if its done in an operator, would we require mention of the operator? My main concern is that kubelet is not going to monitor checkpoints and if they fill up the disk then we effectively take down the node. I don't think we want gc in the main repo (as this is an optional feature) but we may want to have some kind of suggestion or operator that people could use? We could document this as a known issue and suggest ways to mitigate it if your disk fills up due to checkpoints. |
9c1211a
to
0ff3da6
Compare
@kannon92 I added a short paragraph concerning garbage collection and a possible operator. Something like that? |
@adrianreber the other thing that is crucial for beta is to fill out the PRR questions in this KEP. I think you are missing quite a few since this feature was created. I think everyone one of this questions needs a response to push to beta. |
I created an operator which replicates the functionality I tried to bring into the kubelet: https://github.com/checkpoint-restore/checkpoint-restore-operator At this point it is really simple. It has a parameter to define the maximum number of checkpoints for a container and if more checkpoints are created older checkpoints are deleted.
At first I liked the idea, but unconditionally deleting all checkpoint archives seems a bit harsh and maybe unexpected. Which would mean we need some logic to decide which checkpoint archive to keep. Not sure we want that. If people agree that this would be a good idea I am happy to implement it. At this point I am not 100% convinced it should be done.
Thanks for pointing that out. I will look into that. |
Several PRR sections are missing: https://github.com/kubernetes/enhancements/blame/master/keps/NNNN-kep-template/README.md#L454 Please copy/paste the new template and fill out all the alpha and beta questions. |
0ff3da6
to
7239c99
Compare
0b30ea5
to
447f338
Compare
@kannon92 Again, thanks a lot for your suggestions. Makes it easy for me to finally understand how this needs to work. It also makes updating the text in the PR easy. Thanks. |
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? | ||
|
||
Does not apply as the enhancement will only be called when requested. Not a service. | ||
|
||
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? | ||
|
||
Does not apply as the enhancement will only be called when requested. Not a service. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two need updates to indicate that the kubelet will add (or does have) metrics for the kubelet endpoints indicating usage counts and failure counts. Prior to going to beta, the exact metric names must be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded as suggested
As defined in the existing KEP the steps to graduate from Alpha to Beta are At least one container engine has to have implemented the corresponding CRI APIs to introduce e2e test for checkpointing. - [ ] Enable the feature per default - [ ] No major bugs reported in the previous cycle CRI-O implemented the corresponding CRI RPC and no major bugs have been reported since the initial release in 1.25. Signed-off-by: Adrian Reber <[email protected]>
447f338
to
67cf1ec
Compare
@deads2k Reworked based on your suggestions. |
PRR is complete for beta. /approve |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: adrianreber, deads2k, dims The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
LGTM.. would like to talk about documenting how an admin can direct these checkpoints to a secure/encrypted mount the keys for which can be managed by another party. Will be watching for the changes and help get it into containerd cheers! |
@mikebrow we are currently working on enabling built-in support for encryption in CRIU (checkpoint-restore/criu#2297). In the current implementation, we use a certificate with a public key to encrypt the content of the checkpoint. |
I want to thank everyone involved for all the feedback and quick turnaround to get this merged in time. Thanks! |
As defined in the existing KEP the steps to graduate from Alpha to Beta are
At least one container engine has to have implemented the
corresponding CRI APIs to introduce e2e test for checkpointing.
CRI-O implemented the corresponding CRI RPC and no major bugs have been reported since the initial release in 1.25.
One-line PR description: Graduate "Forensic Container Checkpointing" to Beta
Issue link: Forensic Container Checkpointing #2008