Cannot checkpoint container: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1 #2397
Comments
Checkpointing Kubernetes containers with Nvidia GPUs is not working as far as we know. We have seen success with AMD GPUs.
I just want to keep the environment inside the container, mainly files.
Then checkpointing is the wrong approach.
What I mean is that I want to preserve the environment inside the container. After checkpointing, the export is then built into an image. This method is very fast for building a runtime image.
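For illustration, a checkpoint archive produced by the kubelet can be wrapped into an OCI image. This is only a minimal sketch, assuming the default checkpoint location under /var/lib/kubelet/checkpoints and the CRI-O restore annotation described in the Kubernetes forensic-checkpointing blog post; the placeholder path segments and containerd's restore behavior are not confirmed here:
# start from an empty image and add the checkpoint archive at its root
newcontainer=$(buildah from scratch)
buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod>_<namespace>-<container>-<timestamp>.tar /
# annotation used by CRI-O to recognize the image as a checkpoint (assumption: CRI-O based restore)
buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<container> $newcontainer
buildah commit $newcontainer checkpoint-image:latest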
Sorry, I do not understand what you want to do. First you said you want to checkpoint the container, then you said you just want to keep the environment inside of the container. Anyway, checkpointing containers with Nvidia GPUs does not work. You need to talk to Nvidia to enable it.
thanks
Hi! I want to put my 2 cents in here. Nvidia recently uploaded a utility (binary only) to GitHub called cuda-checkpoint, which provides a method for checkpointing applications when they do not have any kernel running. This method uses new capabilities of the Nvidia driver (550). After the application's data stored on the GPU has been copied to host memory, the application can be safely dumped with CRIU. The restore process looks the same as in the common case, but after the application has been restored, it needs to be toggled again with cuda-checkpoint. However, to be able to use this in Docker in the future, it seems that some more work has to be done. Currently, the error when
Some more thoughts on that. Another potentially interesting case is a Docker container running an application that generates periodic GPU load, forking a new process each time, and relying on some process(es) that keep temporary data in host memory, e.g. process A (stores temporary data in host memory, does not make CUDA calls). In this case snapshotting the Docker container would be useful to preserve the state of process A, but right now it is not possible.
@alexfrolov thanks for your thoughts. Today we can already checkpoint and restore AMD GPU containers with Podman, so we know it is doable, but from my point of view Nvidia needs to do the work to make it fully functional, just like AMD came along and implemented it. We are also following closely what Nvidia does with their checkpoint tool. It is extremely limited at this point, but it looks promising for the future. The actual error about the mount point looks fixable by correctly specifying all mount points in the config.json from runc.
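A minimal sketch of one way to pass the option that the CRIU log below suggests (--enable-external-masters) through runc, assuming runc still hands /etc/criu/runc.conf to CRIU as a configuration file; the path, the option spelling, and whether this actually resolves the Nvidia /proc mount error are unverified assumptions:
# CRIU configuration files take one option per line, without the leading "--"
# (the file path is an assumption based on runc's default CRIU config file location)
echo "enable-external-masters" | sudo tee -a /etc/criu/runc.conf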
@adrianreber Nvidia chose another way to implement C/R, and there's nothing wrong with that. I looked at the https://github.com/NVIDIA/cuda-checkpoint tool, and I think we need to implement support for it in CRIU. The only thing we need to do is run this tool for all processes that use CUDA and NVML (NVML isn't supported yet, but they are working on that). It has to be done before the dump and after the restore. Even without support for this tool in CRIU, users can checkpoint/restore CUDA workloads, but they will need to run this tool for the CUDA processes themselves.
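A rough sketch of that manual workflow for a standalone CUDA process, assuming cuda-checkpoint's --toggle/--pid interface; <pid> and the image directory are placeholders, not a tested recipe:
# move the application's GPU state into host memory so CRIU can dump it
cuda-checkpoint --toggle --pid <pid>
criu dump --tree <pid> --images-dir /tmp/ckpt --shell-job
# ... later: restore the process (CRIU keeps the same PID) and toggle the GPU state back
criu restore --images-dir /tmp/ckpt --shell-job --restore-detached
cuda-checkpoint --toggle --pid <pid>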
cc: @sgurfinkel
Adding NVIDIA GPU support to CRIU is a fundamental goal of the cuda-checkpoint project. |
@sgurfinkel sounds great, thanks. |
A friendly reminder that this issue had no activity for 30 days. |
Description
k8s 1.28
containerd 2.0
I want to call the Kubernetes (kubelet) checkpoint API with curl to create a container checkpoint.
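For illustration, such a request against the kubelet checkpoint API could look like the following; the certificate paths are placeholders and the ContainerCheckpoint feature gate must be enabled on the kubelet:
# POST /checkpoint/{namespace}/{pod}/{container} on the kubelet port
curl -sk -X POST \
  --cert /path/to/kubelet-client.crt \
  --key /path/to/kubelet-client.key \
  "https://localhost:10250/checkpoint/default/gpu-base-02/gpu-base-02"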
Steps to reproduce the issue:
Describe the results you received:
An error occurs: checkpointing of default/gpu-base-02/gpu-base-02 failed (rpc error: code = Unknown desc = checkpointing container "208a82339ddc590e460b89912304f56ad64924f89a959f982b17aeb6ab0c2aa8" failed: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/k8s.io/208a82339ddc590e460b89912304f56ad64924f89a959f982b17aeb6ab0c2aa8/criu-dump.log: unknown)
Describe the results you expected:
A container checkpoint is created successfully.
Additional information you deem important (e.g. issue happens only occasionally):
CRIU logs and information:
CRIU full dump/restore logs:
(00.011105) mnt: Inspecting sharing on 1494 shared_id 0 master_id 0 (@./proc/sys)
(00.011109) mnt: Inspecting sharing on 1493 shared_id 0 master_id 0 (@./proc/irq)
(00.011113) mnt: Inspecting sharing on 1492 shared_id 0 master_id 0 (@./proc/fs)
(00.011116) mnt: Inspecting sharing on 1491 shared_id 0 master_id 0 (@./proc/bus)
(00.011120) mnt: Inspecting sharing on 1611 shared_id 0 master_id 13 (@./proc/driver/nvidia/gpus/0000:b1:00.0)
(00.011124) Error (criu/mount.c:1088): mnt: Mount 1611 ./proc/driver/nvidia/gpus/0000:b1:00.0 (master_id: 13 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
(00.011142) net: Unlock network
(00.011146) Running network-unlock scripts
(00.011149) RPC
(00.072541) Unfreezing tasks into 1
(00.072552) Unseizing 1641382 into 1
(00.072562) Unseizing 1641424 into 1
(00.072568) Unseizing 1641533 into 1
(00.072580) Unseizing 1641475 into 1
(00.072586) Unseizing 1641500 into 1
(00.072599) Unseizing 2157578 into 1
(00.072632) Error (criu/cr-dump.c:2093): Dumping FAILED.
Output of `criu --version`:
Version: 3.18
Output of `criu check --all`:
Looks good.
Additional environment details: