nvidia-container-toolkit: Some cuda libraries do not work without extra LD_LIBRARY_PATH #366109
Comments
Hello @sliedes! Thanks for the bug report. Can you check if using CDI instead fixes this problem for you? The container runtime wrappers are somewhat out of date. Note that after enabling CDI (see https://nixos.wiki/wiki/Docker, "GPU Pass-through (Nvidia)"), your Docker Compose file will need a bit of adaptation: https://nixos.org/manual/nixpkgs/stable/#using-docker-compose.

That being said, regardless of whether it's reproducible with CDI or not, we should fix this issue.
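For reference, here is a minimal sketch of what the CDI-based setup can look like on the NixOS side. The option names follow the current nvidia-container-toolkit module; the `features.cdi` daemon setting is an assumption and may only be needed on Docker versions where CDI support is still experimental:

```nix
{ ... }:
{
  # Generate CDI specifications for the NVIDIA GPUs instead of relying on
  # the legacy container runtime wrappers.
  hardware.nvidia-container-toolkit.enable = true;

  # Keep the wrapper-based integration off so the two mechanisms do not
  # conflict.
  virtualisation.docker.enableNvidia = false;

  virtualisation.docker.enable = true;

  # Assumption: on Docker versions where CDI is still experimental, it may
  # need to be switched on explicitly in the daemon settings.
  virtualisation.docker.daemon.settings.features.cdi = true;
}
```

With this in place, containers request GPUs through CDI device IDs (such as nvidia.com/gpu=all) rather than through the NVIDIA runtime wrapper.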
Apologies, I should have specified that I did use (I think!) CDI. I'm feeling a bit in the dark here, not being that familiar with Docker. But this is what I have in my docker-compose file:

```yaml
celery-gpu: &celery-gpu
  <<: *app
  environment:
    - KRAKEN_TRAINING_DEVICE=cuda:0
    - NVIDIA_VISIBLE_DEVICES=all
    - NVIDIA_DRIVER_CAPABILITIES=all
    - LD_LIBRARY_PATH=/usr/local/nvidia/lib64
  command: "celery -A escriptorium worker -l INFO -E -Ofair --prefetch-multiplier 1 -Q gpu -c 1 --max-tasks-per-child=1"
  shm_size: '3gb'
  deploy:
    resources:
      reservations:
        devices:
          - driver: cdi
            device_ids:
              - nvidia.com/gpu=all
            capabilities: [gpu]
```

The shell log above and the error about not being able to open libcuda.so come from a container started with this definition.

By the way, is this CDI approach NixOS-independent? I.e. is the above docker-compose definition something one might reasonably try to upstream?
Can you share your NixOS configuration? If you have virtualisation.docker.enableNvidia enabled in your configuration, disable it, so that you get rid of the runtime wrappers. You can also clean up your docker compose by removing the environment variables that only apply to those wrappers:

```yaml
celery-gpu: &celery-gpu
  <<: *app
  environment:
    - KRAKEN_TRAINING_DEVICE=cuda:0
  command: "celery -A escriptorium worker -l INFO -E -Ofair --prefetch-multiplier 1 -Q gpu -c 1 --max-tasks-per-child=1"
  shm_size: '3gb'
  deploy:
    resources:
      reservations:
        devices:
          - driver: cdi
            device_ids:
              - nvidia.com/gpu=all
```

CDI is a spec pushed by some vendors, and some of them, like Nvidia, implement it; it's certainly NixOS-independent. However, we need to tweak some things here and there because of how NixOS works. I'd say it would be fine to try to upstream it; the device ID in your case (nvidia.com/gpu=all) is a very common one that the Nvidia tooling always generates.
I have that configuration in place. I believe we can observe the problem without escriptorium, thus:
That is, the container does not have the libcuda.so symlink.
Thanks for the details @sliedes. I observe the same thing. I'd like to understand why nvidia-container-toolkit cdi generate is not adding libcuda.so to the generated spec.

In the meantime, although not ideal, you can use the hardware.nvidia-container-toolkit.mounts option (https://search.nixos.org/options?channel=24.11&show=hardware.nvidia-container-toolkit.mounts&from=0&size=50&sort=relevance&type=packages&query=nvidia-container-toolkit) to mount the missing library into the container yourself. Let me know if you have any issues using that option.
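For illustration, a minimal sketch of that workaround; the hostPath below is an assumption, so point it at wherever libcuda.so actually lives on your host (for example the NVIDIA driver package's lib directory):

```nix
{ config, ... }:
{
  hardware.nvidia-container-toolkit.mounts = [
    {
      # Assumption: the driver package referenced by hardware.nvidia.package
      # ships libcuda.so in its lib/ directory; adjust if your system differs.
      hostPath = "${config.hardware.nvidia.package}/lib/libcuda.so";
      containerPath = "/usr/lib/libcuda.so";
    }
  ];
}
```

With a mount like this, the missing libcuda.so should show up inside CDI-enabled containers without having to override LD_LIBRARY_PATH.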
@sliedes I have followed the instructions at the escriptorium website and I have both the application and the celery-gpu worker running. How can I trigger the GPU code path so I can reproduce the error?
I think escriptorium only uses the GPU when training either a segmentation or a recognition (OCR) model. For that, create a document with some annotations, select the page(s) in the document view, and click train. It's not the most intuitive piece of software, and I've only looked into it for a few days, too. Presumably it would also be possible from the command line somehow... I could dig a bit further given the error message:
I'm not in front of a keyboard now so I can't check, but it sounds like maybe the missing libcuda.so symlink is the culprit. OTOH, libcuda.so is present in /usr/local/nvidia/lib64. Is there some clear purpose for having both locations? I can see from the CDI file that it picks specific files to be put in /nix/store and mounts the whole directory in /usr/local/nvidia/lib.

Sorry for not being able to give an easier reproducer now :(
Hello @sliedes! Turns out our nvidia-container-toolkit version is quite old (1.15.0-rc.3). It looks like the missing libcuda.so is handled in newer releases. I have tested bumping nvidia-container-toolkit in NixOS, and it is indeed added after the change:
However, in the updated version,
I don't know if this will be an issue for projects like escriptorium though. WIP PR for bumping nvidia-container-toolkit: #367769
Describe the bug
I'm running this in a docker image with GPU:
https://gitlab.com/scripta/escriptorium/-/wikis/docker-install
GPU training failed out of the box suggesting libcuda.so cannot be loaded:

Adding LD_LIBRARY_PATH=/usr/local/nvidia/lib64 to the environment fixes this issue.

I believe this happens because the /nix/store path that is on the ld.so search path only contains libcuda.so.1, while /usr/local/nvidia also contains libcuda.so:

... while cudnn wants libcuda.so.

Metadata

- system: `"x86_64-linux"`
- host os: `Linux 6.6.64, NixOS, 25.05 (Warbler), 25.05.20241213.3566ab7`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.24.10`
- channels: `"nixos"`
- nixpkgs: `/nix/store/22r7q7s9552gn1vpjigkbhfgcvhsrz68-source`
Notify maintainers
@SomeoneSerge @ereslibre
Relevant tracking bug: #290609
Note for maintainers: Please tag this issue in your PR.
Add a 👍 reaction to issues you find important.