Support running on nodes with host-installed GPU drivers #40

Open
zerodayyy opened this issue Jun 29, 2023 · 0 comments

zerodayyy commented Jun 29, 2023

Nos currently fails on nodes where the GPU drivers are pre-installed on the host, for example on AKS. The symptom is that the gpu-agent Pod does not start because the /run/nvidia path is missing on the host.

According to the NVIDIA DRA driver documentation, the /run/nvidia folder is provided by the driver container. When the drivers are installed on the host instead of via a container, the path is missing, and /run/nvidia/driver has to be symlinked to the host root manually:

Ensure your NVIDIA driver installation is rooted at /run/nvidia/driver

For deployments running a driver container this is a noop.
The driver container should already mount the driver installation at /run/nvidia/driver.

For deployments running with a host-installed driver, the following is sufficient to meet this requirement:

mkdir -p /run/nvidia
sudo ln -s / /run/nvidia/driver

NOTE: This is only currently necessary due to a limitation of how our CDI
generation library works. This restriction will be removed very soon.

To implement support for host-installed drivers, we can simply mount the host's / at /run/nvidia/driver inside the gpu-agent container.
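
A minimal sketch of what that mount could look like in the gpu-agent Pod template, assuming a plain hostPath volume is acceptable. The volume name `host-root`, the container name, and the `readOnly` setting are illustrative assumptions, not taken from the current nos manifests:

```yaml
# Hypothetical fragment of the gpu-agent Pod template (names are illustrative).
spec:
  containers:
    - name: gpu-agent
      volumeMounts:
        - name: host-root
          # The path where the driver installation is expected to be rooted
          mountPath: /run/nvidia/driver
          # Assumption: read-only access to the host root is sufficient
          readOnly: true
  volumes:
    - name: host-root
      hostPath:
        # Host root, equivalent to `ln -s / /run/nvidia/driver` on the node
        path: /
        type: Directory
```

This would presumably need to be optional, since on nodes that run a driver container /run/nvidia/driver is already provided and should not be replaced by the host root.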
