cuda: fix check for GPU device availability #2510

Merged: 1 commit, Nov 12, 2024

rst0git (Member) commented Nov 2, 2024

Checking for `/dev/nvidiactl` to decide whether the CUDA plugin can be used is unreliable, because some setups, such as the NVIDIA GPU Operator [1], use a different default path for driver installation. This pull request instead checks whether a GPU device is listed in `/proc/driver/nvidia/gpus/`. This approach is similar to `torch.cuda.is_available()` and is a more accurate indicator. The subsequent check for support of the `cuda-checkpoint --action` option then confirms whether the driver supports checkpoint/restore.

Fixes: #2509

Footnotes

  1. https://github.com/NVIDIA/gpu-operator
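
For illustration only, the directory check described above could be sketched roughly as follows. This is a minimal sketch, not the code merged in this pull request: the helper names `dir_has_entries()` and `gpu_device_available()` are hypothetical, and it assumes that a GPU counts as present whenever `/proc/driver/nvidia/gpus` contains at least one entry besides `.` and `..`.

```c
#include <dirent.h>
#include <stdbool.h>
#include <string.h>

/* Path named in the description above. */
#define NVIDIA_GPUS_DIR "/proc/driver/nvidia/gpus"

/*
 * Hypothetical helper: returns true if `path` is a readable directory
 * holding at least one entry other than "." and "..".
 */
static bool dir_has_entries(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *ent;
	bool found = false;

	if (!dir)
		return false; /* missing or unreadable: treated as "no GPU" */

	while ((ent = readdir(dir)) != NULL) {
		if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
			continue;
		found = true;
		break;
	}

	closedir(dir);
	return found;
}

/* Hypothetical wrapper; not the plugin's actual function name. */
static bool gpu_device_available(void)
{
	return dir_has_entries(NVIDIA_GPUS_DIR);
}
```

A positive result here only says that a GPU is listed; whether the driver supports checkpoint/restore is still left to the later `cuda-checkpoint --action` check.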

rst0git marked this pull request as ready for review November 2, 2024 09:37
rst0git requested review from jesus-ramos and avagin and removed request for jesus-ramos November 2, 2024 09:37
Review thread on plugins/cuda/cuda_plugin.c (outdated, resolved)
rst0git force-pushed the 2024-11-02-cuda-check branch 4 times, most recently from 9a50892 to a69ea00 November 4, 2024 23:40
Review thread on plugins/cuda/cuda_plugin.c (outdated, resolved)
Review thread on plugins/cuda/cuda_plugin.c (outdated, resolved)
avagin (Member) commented Nov 8, 2024

LGTM. Thanks.

rst0git force-pushed the 2024-11-02-cuda-check branch 4 times, most recently from 3d103d0 to 3940ee7 November 10, 2024 17:08
rst0git changed the title from "cuda: check for libcuda instead of /dev/nvidiactl" to "cuda: fix check for GPU device availability" November 10, 2024
The check for `/dev/nvidiactl` to determine if the CUDA plugin can be
used is unreliable because in some cases the default path for driver
installation is different [1]. This patch changes the logic to check
if a GPU device is available in `/proc/driver/nvidia/gpus/`. This
approach is similar to `torch.cuda.is_available()` and it is a more
accurate indicator.

The subsequent check for support of the `cuda-checkpoint --action`
option would confirm if the driver supports checkpoint/restore.

[1] https://github.com/NVIDIA/gpu-operator

Fixes: checkpoint-restore#2509

Signed-off-by: Radostin Stoyanov <[email protected]>
rst0git force-pushed the 2024-11-02-cuda-check branch from 3940ee7 to de9d552 November 10, 2024 17:13
avagin merged commit 26dcc21 into checkpoint-restore:criu-dev Nov 12, 2024
38 of 41 checks passed
rst0git deleted the 2024-11-02-cuda-check branch November 12, 2024 09:20
Linked issue: The gpu-operator for Kubernetes uses /run/nvidia/driver as default path for driver installation.