-
Notifications
You must be signed in to change notification settings - Fork 612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
criu/plugin: Add NVIDIA CUDA plugin #2416
Conversation
If you rebase the "CentOS Stream 8" test error should disappear as we removed it (it is EOL). A README.md with some information about what is going on and how to use it would be helpful in the plugin directory. Maybe a link to the cuda-checkpoint tool. I would also recommend to at least have two commits here. One with the changes to CRIU and a more detailed commit message why you need the changes and one commits with actual plugin. |
For the CRIU changes commit would you just like the 2 changes to cr-dump.c and cr-restore.c or should I include the new plugin hooks/calls in that one as well? |
One goal is to have each commit still compile. If your changes are not directly dependent on each other moving it into extra commits sounds correct. It really depends and is not always a yes or no decision. But if you already ask the question then it would probably make sense to split it, because that seems to indicate to me you also see it as two independent things. |
5a6a20a
to
c49326c
Compare
…and run the plugin finalizer later in the dump sequence Restore rseq_cs state before calling RESUME_DEVICES_LATE as the CUDA plugin will temporarily unfreeze a thread during the plugin hook to assist with device restore Run the plugin finalizer later in the dump sequence since the finalizer is used by the CUDA plugin to handle some process cleanup Signed-off-by: Jesus Ramos <[email protected]>
…DEVICES to be used during pstree collection PAUSE_DEVICES is called before a process is frozen and is used by the CUDA plugin to place the process in a state that's ready to be checkpointed and quiesce any pending work CHECKPOINT_DEVICES is called after all processes in the tree have been frozen and PAUSE'd and performs the actual checkpointing operation for CUDA applications Signed-off-by: Jesus Ramos <[email protected]>
5cdb90e
to
30b4df8
Compare
@jesus-ramos It would be great if we could add tests for this functionality. We currently don't have GPUs available in CI, but we could add tests to verify that the |
Thank you for contributing to the project. Overall, this PR looks good to me. |
Here is a POC patch how we can test the cuda plugin with a mocked cuda-checkpoint tool: |
4eba0bc
to
09ff693
Compare
#11 17.04 In file included from ../../compel/include/uapi/compel/infect.h:6, For the ppc and arm builds do I need to add anything to the plugin Makefile to include the proper directory for this? |
This error does not seem to be related to the changes in this pull request. It also occurs on the criu-dev branch. |
Adding support for the NVIDIA cuda-checkpoint utility, requires the use of an r555 or higher driver along with the cuda-checkpoint binary. Signed-off-by: Jesus Ramos <[email protected]>
@ jesus-ramos I think we can merge this pr and do other improvements in separate pr-s. Does it sound good to you? |
That works for me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
It breaks build in CI:
https://github.com/checkpoint-restore/criu/actions/runs/9754988212/job/26922763707?pr=2427 |
Is it an issue with the include order or something else? I can see about getting some time on an aarch64 machine to check it out. |
Likely, in ./plugins/cuda/cuda_plugin.c:
criu headers should probably go AFTER system headers, but I'm not fully sure how it causes the problem. |
Andrei opened a pull request to fix this problem: #2428 |
Adding support for the NVIDIA cuda-checkpoint utility, requires the use of an r555 or higher driver along with the cuda-checkpoint binary.