AMDGPU: add parallel restore of BO content to accelerate restore #2527

Open
wants to merge 8 commits into
base: criu-dev

wweewrwer

TL;DR:

This pull request extends CRIU to support parallel restore of AMDGPU buffer object content alongside other restore operations to accelerate the restoration.

The target issue:

In the current restore procedure for AMDGPU applications, the content of AMDGPU buffer objects (BOs) is restored synchronously in CR_PLUGIN_HOOK__RESTORE_EXT_FILE. This step usually takes a significant amount of time, during which the target process cannot perform any other restore operations. However, it has no logical dependencies on other restore operations, so running it in parallel with them can speed up the restore.

The parallel restore approach in this PR:

The core idea of this patch series is to offload the restore of the BO content from the target process to the main CRIU process (the main CRIU process is the parent process, and the target process is the child process created by the fork). To achieve this, we introduce a new hook, CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS, in the main CRIU process. For the AMDGPU plugin, the target process no longer restores BO contents in CR_PLUGIN_HOOK__RESTORE_EXT_FILE; it only sends the relevant BOs to the main CRIU process. The main CRIU process receives the corresponding BOs in CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS and begins the restoration. Meanwhile, the target process continues with the other parts of the restore without being blocked by the BO content restoration. The full design is also described in the ACM SoCC'24 paper: On-demand and Parallel Checkpoint/Restore for GPU Applications.
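The overlap described above can be illustrated with a minimal sketch. It is not the plugin's code: a Python thread stands in for the main CRIU process, and `restore_bo_content` and `restore_other_state` are hypothetical placeholders for the slow BO restore and the remaining restore steps.

```python
import threading
import time


def restore_bo_content(bos, done):
    """Hypothetical stand-in for the slow BO content restore."""
    for bo in bos:
        time.sleep(0.01)  # pretend to copy the BO content to the GPU
        done.append(bo)


def restore_other_state(log):
    """Hypothetical stand-in for the independent restore steps."""
    for step in ("memory", "files", "signals"):
        time.sleep(0.01)
        log.append(step)


def parallel_restore(bos):
    done, log = [], []
    # Offload the BO restore so that it no longer blocks the caller.
    worker = threading.Thread(target=restore_bo_content, args=(bos, done))
    worker.start()
    restore_other_state(log)  # runs while the BO restore is in flight
    worker.join()             # both parts must finish before resuming
    return done, log
```

Because the two parts are independent, the total time approaches the longer of the two rather than their sum.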

Tests:

We evaluated the performance with the settings below. Parallel restore reduces restore time by 34.3% when the images are cached in the page cache, and by 7.6% when restoring from disk.

Results:

| | From disk | From page cache |
|---|---|---|
| Sequential restore | 1728 ms | 254 ms |
| Parallel restore | 1596 ms | 167 ms |
| Speed-up | 7.6% | 34.3% |
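The speed-up figures follow from the measured times as a relative reduction in restore time. A short check, where `speedup_percent` is a helper name introduced only for this illustration:

```python
def speedup_percent(sequential_ms, parallel_ms):
    """Relative reduction in restore time, in percent."""
    return (sequential_ms - parallel_ms) / sequential_ms * 100.0


for label, seq, par in [("disk", 1728, 1596), ("page cache", 254, 167)]:
    print(f"{label}: {speedup_percent(seq, par):.1f}%")
# disk: 7.6%
# page cache: 34.3%
```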

Settings:

CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz

Memory: DDR4, 2x8GB

GPU: AMD MI50

Disk: 512GB, Samsung SSD 860

Docker image: rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_1.12.1

Example program:

example.py: a ResNet18 application. Enter 'y' to exit, or press any other key to perform inference.

import time
import os
import sys
import torch
import torchvision.models as models
import torchvision.transforms as transforms
torch.set_grad_enabled(False)

device = "cuda:0"

model = models.resnet18(weights='DEFAULT')
model = model.to(device)
model.eval()

batch_size = 1
channels = 3
height = 224
width = 224
input_tensor = torch.randn(batch_size, channels, height, width)
preprocess = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
input_tensor = preprocess(input_tensor)

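# Read one line per iteration: "y" exits, anything else runs one inference.
while input() != "y":
    st = time.time()
    input_tensor = input_tensor.to(device)
    output = model(input_tensor)
    output = output.to("cpu")
    _, predicted_idx = torch.max(output, 1)
    # Wait for all queued GPU work so the measured time is complete.
    torch.cuda.synchronize()
    ed = time.time()
    print("test time:", ed - st)
    sys.stdout.flush()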

Steps:

  1. Install CRIU

    Follow the standard CRIU installation process. Ensure you set the environment variable CRIU_LIBS_DIR to the plugins/amdgpu path.

  2. Dump checkpoint image

    #In one shell
    python3 example.py
    #In another shell
    mkdir -p /tmp/criu-dump
    criu dump -t $(pgrep python3) -D /tmp/criu-dump -j --file-locks
    
  3. Restore from disk

    Test for sequential restore:

    #Clear page cache
    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
    criu restore -D /tmp/criu-dump -j --file-locks
    cat stats-restore | crit decode --pretty | grep restore_time
    

    Test for parallel restore:

    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
    criu restore -D /tmp/criu-dump -j --file-locks --parallel
    cat stats-restore | crit decode --pretty | grep restore_time
    
  4. Restore from page cache

    Install vmtouch for caching images:

    sudo apt install vmtouch
    

    Test:

    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
    #Cache image in memory
    vmtouch -l /tmp/criu-dump
    #Warm up environment 
    criu restore -D /tmp/criu-dump -j --file-locks
    #Begin to Test
    criu restore -D /tmp/criu-dump -j --file-locks
    cat stats-restore | crit decode --pretty | grep restore_time
    criu restore -D /tmp/criu-dump -j --file-locks --parallel
    cat stats-restore | crit decode --pretty | grep restore_time
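As an alternative to `grep`, the decoded statistics can be parsed as JSON. The sketch below searches the decoded tree for `restore_time` instead of assuming a fixed layout. The `SAMPLE` string is made up for illustration and is not real output; the unit of the value is whatever `crit` reports.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a decoded JSON tree."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            else:
                yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def restore_time(decoded_text):
    """Return the first restore_time value found in `crit decode` output."""
    values = list(find_key(json.loads(decoded_text), "restore_time"))
    if not values:
        raise KeyError("restore_time not found")
    return int(values[0])


# Hypothetical sample, shaped like decoded statistics (not real output).
SAMPLE = '{"magic": "STATS", "entries": [{"restore": {"restore_time": 167000}}]}'
print(restore_time(SAMPLE))
```

In practice the text passed to `restore_time` would be the output of the `crit decode --pretty` command from the steps above.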
    

criu/crtools.c Outdated Show resolved Hide resolved
criu/cr-restore.c Outdated Show resolved Hide resolved
criu/cr-restore.c Outdated Show resolved Hide resolved
@Ddnirvana

Thanks for the comments above, @avagin @rst0git. We are fixing and polishing the PR and will update ASAP.

@rst0git
Member

rst0git commented Nov 25, 2024

@Ddnirvana @wweewrwer Thank you for your contributions! It might be good to also update the content of the following files to reflect these changes:

@Ddnirvana

> @Ddnirvana @wweewrwer Thank you for your contributions! It might be good to also update the content of the following files to reflect these changes:

@rst0git No problem. We will add a proper description in the next version.

Contributor

@dayatsin-amd dayatsin-amd left a comment

Thank you @wweewrwer. Some minor nitpicks, but overall the code looks good to me.

plugins/amdgpu/amdgpu_socket_utils.c Outdated Show resolved Hide resolved
plugins/amdgpu/amdgpu_socket_utils.c Outdated Show resolved Hide resolved
plugins/amdgpu/amdgpu_socket_utils.c Outdated Show resolved Hide resolved
plugins/amdgpu/amdgpu_socket_utils.c Outdated Show resolved Hide resolved
plugins/amdgpu/amdgpu_socket_utils.c Outdated Show resolved Hide resolved
@wweewrwer
Author

@rst0git @avagin @dayatsin-amd Hi maintainers, thanks for your earlier reviews and comments. We have fixed all the issues, as follows:

  1. Use the proper APIs to allocate (xmalloc, etc.)
  2. Enable the optimizations by default
  3. Change the name of hook
  4. Fix the issues to run in Podman containers
  5. Other fixes (line width, comments, etc.)
  6. Add descriptions in README to explain the optimizations.

Please let us know if you have any further comments.

Contributor

@dayatsin-amd dayatsin-amd left a comment

Thank you @wweewrwer

@rst0git
Member

rst0git commented Nov 28, 2024

@wweewrwer Would you be able to merge the fixup commits into the previous commits using git rebase?
https://github.com/checkpoint-restore/criu/blob/criu-dev/CONTRIBUTING.md#submit-your-work-upstream

@wweewrwer
Author

wweewrwer commented Nov 29, 2024

> @wweewrwer Would you be able to merge the fixup commits into the previous commits using git rebase? https://github.com/checkpoint-restore/criu/blob/criu-dev/CONTRIBUTING.md#submit-your-work-upstream

@rst0git Thanks for your comment! I have merged the fixup commits into the previous commits using git rebase. Please let me know if you have any further comments.

criu/cr-restore.c Outdated Show resolved Hide resolved
Currently, in the target process, device-related restore operations and
other restore operations run almost sequentially. When the target
process executes the corresponding CRIU hook functions, it can't perform
other restore operations. However, for GPU applications, some device
restore operations have no logical dependencies on other common restore
operations and can be offloaded to the main CRIU process, allowing the
target process to perform other restore operations in parallel.

- POST_FORKING

*POST_FORKING: Hook to enable the main CRIU process to perform some
restore operations of plugins.

Signed-off-by: Yanning Yang <[email protected]>
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not
initialized. However, during the plugin restore procedure, there may be
some common file operations used in multiple hooks. This patch moves
`cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init` to use
`fdstore` to place these file operations.

Signed-off-by: Yanning Yang <[email protected]>
@wweewrwer wweewrwer force-pushed the parallel_restore branch 2 times, most recently from cb6b91d to 37e3813 Compare December 5, 2024 13:47
plugins/amdgpu/README.md Outdated Show resolved Hide resolved
Currently, `restore_wait_inprogress_tasks` is a static function and can
only be called within `cr-restore.c`. However, to implement parallel
restore, the amdgpu plugin also needs to check the tasks' state to decide
whether to stop the parallel restore server. Therefore, this patch moves
the declaration of `restore_wait_inprogress_tasks` to `restore.h` so
that it can be called by the plugin.

Signed-off-by: Yanning Yang <[email protected]>
Parallel restore needs an interface to know if there is only one process
to restore. This patch adds a `has_children` function in `pstree.h`.

Signed-off-by: Yanning Yang <[email protected]>
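The single-process check can be pictured with a small sketch. This is not the CRIU implementation: `PstreeItem` is a hypothetical stand-in for a process-tree node, and `parallel_disabled` mirrors the flag of the same name mentioned in this conversation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PstreeItem:
    """Hypothetical process-tree node: a PID and its child processes."""
    pid: int
    children: List["PstreeItem"] = field(default_factory=list)


def has_children(item):
    """True if the process has at least one child to restore."""
    return bool(item.children)


def parallel_disabled(root):
    """Use the parallel fast path only when a single process is restored."""
    return has_children(root)


single = PstreeItem(pid=100)
multi = PstreeItem(pid=100, children=[PstreeItem(pid=101)])
print(parallel_disabled(single), parallel_disabled(multi))  # False True
```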
When enabling `POST_FORKING`, the target process and the main CRIU
process need an IPC interface to communicate and transfer file
descriptors. This patch adds a Unix domain stream socket and stores this
socket in `fdstore`.

Signed-off-by: Yanning Yang <[email protected]>
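For readers unfamiliar with passing file descriptors over a Unix domain socket, here is a minimal sketch of the mechanism using Python's `socket.send_fds` and `socket.recv_fds` (Python 3.9+, Unix only). It does not reproduce the plugin's C code; the `send_fd`, `recv_fd` and `demo` names and the `b"bo"` tag are assumptions made for this illustration.

```python
import os
import socket


def send_fd(sock, fd, tag=b"bo"):
    """Send one file descriptor plus a short tag over a Unix socket."""
    socket.send_fds(sock, [tag], [fd])


def recv_fd(sock):
    """Receive the tag and one file descriptor from a Unix socket."""
    data, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return data, fds[0]


def demo():
    # A socketpair stands in for the socket shared by the two processes.
    main_end, target_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_fd, write_fd = os.pipe()
    received = -1
    try:
        send_fd(target_end, read_fd)      # "target process" side
        tag, received = recv_fd(main_end)  # "main CRIU process" side
        os.write(write_fd, b"hello")
        # The received descriptor refers to the same pipe as read_fd.
        payload = os.read(received, 5)
    finally:
        for fd in (read_fd, write_fd, received):
            if fd >= 0:
                os.close(fd)
        main_end.close()
        target_end.close()
    return tag, payload
```

The receiver gets a new descriptor number that refers to the same open file, which is what lets one process hand an open file to another.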
Currently, the restore of buffer objects consumes a significant amount
of time. However, this part has no logical dependencies on other restore
operations. This patch introduces some structures and helper functions
for the target process to offload this task to the main CRIU process.

Signed-off-by: Yanning Yang <[email protected]>
This patch implements the entire logic to enable the offloading of
buffer object content restoration. It has two parts: the first replaces
the restoration of buffer objects in the target process by sending a
parallel restore command to the main CRIU process; the second implements
the `POST_FORKING` hook in the amdgpu plugin to enable buffer object
content restoration in the main CRIU process.

Signed-off-by: Yanning Yang <[email protected]>
@wweewrwer
Author

@rst0git @avagin
Dear maintainers,

We have pushed version V4 of the PR, which addresses all the issues raised since the last version. Specifically, we: (1) support multiple commands (from a single process), (2) support multi-process restore, and (3) fix the other minor issues mentioned.

Details:

  • Replaced the datagram (UDP-like) socket with a stream (TCP-like) socket to distinguish messages between different processes and commands.
  • Multiple-command support: Instead of receiving a command only once, the hook function now launches a dedicated thread that keeps receiving commands until all tasks finish their restore stage. The main thread in this hook uses restore_wait_inprogress_tasks to determine when the tasks have finished. Once they have, it sends an exit command to the parallel restore thread to stop it from receiving commands.
  • Multi-process support: When multiple processes are restored, they are already restored in parallel (one process each) by default, so they do not benefit from this optimization. We therefore introduce a flag, parallel_disabled, that enables the optimization only for the single-process case (the common case) as a fast path and falls back to the original restore otherwise.
  • Multi-GPU parallel restore support: In the original restore, when a process has multiple GPUs, the content on each GPU is restored in parallel. This version supports multi-GPU parallel restore by reusing the original design.
  • Other issues: Big thanks to Andrei and Radostin for raising the other issues and suggestions, all of which have been addressed.
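The multiple-command design above can be sketched as a receive loop that runs until an exit command arrives. This is only an illustration under stated assumptions: the command ids, the 8-byte length-prefixed framing, and the function names are hypothetical and do not come from the plugin. For brevity, the same caller sends the exit command; in the PR, the main thread sends it after waiting for the tasks to finish.

```python
import socket
import struct
import threading

CMD_RESTORE, CMD_EXIT = 1, 2  # hypothetical command ids


def send_cmd(sock, cmd, payload=b""):
    """Frame a command as (id, payload length) followed by the payload."""
    sock.sendall(struct.pack("!II", cmd, len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf += chunk
    return buf


def serve(sock, handled):
    """Receive commands until CMD_EXIT arrives."""
    while True:
        cmd, size = struct.unpack("!II", recv_exact(sock, 8))
        payload = recv_exact(sock, size)
        if cmd == CMD_EXIT:
            break
        handled.append(payload)  # a real server would restore the BO here


def demo():
    server_end, client_end = socket.socketpair()
    handled = []
    thread = threading.Thread(target=serve, args=(server_end, handled))
    thread.start()
    send_cmd(client_end, CMD_RESTORE, b"bo-0")
    send_cmd(client_end, CMD_RESTORE, b"bo-1")
    send_cmd(client_end, CMD_EXIT)
    thread.join()
    server_end.close()
    client_end.close()
    return handled
```

A stream socket delivers bytes in order without message boundaries, which is why the sketch adds an explicit length prefix to separate commands.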

We have rerun all the tests with the above changes. The PR still reduces restore latency by 31% in the single-process case and achieves the same results as the original restore in multi-process scenarios.

Please let me know if you have any further comments.

5 participants