nvidia-container-toolkit: Some cuda libraries do not work without extra LD_LIBRARY_PATH #366109

Open
sliedes opened this issue Dec 18, 2024 · 9 comments

sliedes commented Dec 18, 2024

Describe the bug

I'm running this in a Docker image with a GPU:

https://gitlab.com/scripta/escriptorium/-/wikis/docker-install

GPU training failed out of the box, suggesting that libcuda.so cannot be loaded:

celery-gpu-1           | GPU available: True (cuda), used: True
celery-gpu-1           | TPU available: False, using: 0 TPU cores
celery-gpu-1           | IPU available: False, using: 0 IPUs
celery-gpu-1           | HPU available: False, using: 0 HPUs
celery-gpu-1           | `Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
celery-gpu-1           | You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
celery-gpu-1           | [2024-12-18 09:09:03,469: INFO/ForkPoolWorker-1] Creating new model [1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do0.1,2 Lbx200 Do] with 77 outputs
celery-gpu-1           | [2024-12-18 09:09:03,680: INFO/ForkPoolWorker-1] Adding 1 dummy labels to validation set codec.
celery-gpu-1           | [2024-12-18 09:09:03,686: INFO/ForkPoolWorker-1] Setting seg_type to baselines.
celery-gpu-1           | LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
celery-gpu-1           | Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
celery-gpu-1           | [2024-12-18 09:09:17,657: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:221 exited with 'signal 6 (SIGABRT)'
celery-gpu-1           | [2024-12-18 09:09:17,669: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 6 (SIGABRT) Job: 0.')
celery-gpu-1           | Traceback (most recent call last):
celery-gpu-1           |   File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
celery-gpu-1           |     raise WorkerLostError(
celery-gpu-1           | billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 6 (SIGABRT) Job: 0.

Adding LD_LIBRARY_PATH=/usr/local/nvidia/lib64 to the environment fixes this issue.

I believe this happens because the /nix/store path that is in the ld.so search path only contains libcuda.so.1, while /usr/local/nvidia also contains the unversioned libcuda.so symlink:

# ls -l /usr/local/nvidia/lib64/libcuda.so*
lrwxrwxrwx 1 root root       12 Jan  1  1970 /usr/local/nvidia/lib64/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       17 Jan  1  1970 /usr/local/nvidia/lib64/libcuda.so.1 -> libcuda.so.565.77
-r-xr-xr-x 1 root root 49572768 Jan  1  1970 /usr/local/nvidia/lib64/libcuda.so.565.77
# cat /etc/ld.so.conf.d/nvcr-3734471176.conf
/nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib
# ls -l /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so*
lrwxrwxrwx 1 root root       17 Dec 18 09:34 /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so.1 -> libcuda.so.565.77
-r-xr-xr-x 1 root root 49572768 Jan  1  1970 /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so.565.77

... while cudnn wants libcuda.so.
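
For reference, something like this should make the mismatch visible from inside the affected container without going through a training run (untested as written; celery-gpu is the service name from the escriptorium compose setup, and the store path glob will match whatever driver version is mounted):

docker compose run --rm celery-gpu sh -c '
  ls -l /usr/local/nvidia/lib64/libcuda.so*           # has the unversioned symlink
  cat /etc/ld.so.conf.d/*.conf                        # only lists the /nix/store path
  ls -l /nix/store/*-nvidia-x11-*/lib/libcuda.so*     # no unversioned symlink
'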

Metadata

  • system: "x86_64-linux"
  • host os: Linux 6.6.64, NixOS, 25.05 (Warbler), 25.05.20241213.3566ab7
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.24.10
  • channels(root): "nixos"
  • nixpkgs: /nix/store/22r7q7s9552gn1vpjigkbhfgcvhsrz68-source

Notify maintainers

@SomeoneSerge @ereslibre

Relevant tracking bug: #290609


@sliedes added the 0.kind: bug (Something is broken) label on Dec 18, 2024
ereslibre (Member) commented:

Hello @sliedes!

Thanks for the bug report. Can you check whether using CDI instead fixes this problem for you? The container runtime wrappers are somewhat out of date.

Note that after enabling CDI (see the "GPU Pass-through (Nvidia)" section of https://nixos.wiki/wiki/Docker), your Docker Compose will need a bit of adaptation: https://nixos.org/manual/nixpkgs/stable/#using-docker-compose.
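
In case it helps, here are a couple of rough sanity checks after enabling CDI and restarting Docker (the availability of nvidia-ctk on the host and of nvidia-smi inside the container can vary, so treat this as a sketch rather than a definitive test):

nvidia-ctk cdi list                                                     # should list nvidia.com/gpu=... devices
docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi    # only if nvidia-smi is mounted by the spec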

ereslibre (Member) commented:

That being said, regardless of whether it's reproducible with CDI or not, we should fix this issue.

sliedes commented Dec 18, 2024

Apologies, I should have specified that I did use (I think!) CDI. I'm feeling a bit in the dark here, not being that familiar with Docker. But this is what I have in my docker-compose.yml:

  celery-gpu: &celery-gpu
    <<: *app
    environment:
      - KRAKEN_TRAINING_DEVICE=cuda:0
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
      - LD_LIBRARY_PATH=/usr/local/nvidia/lib64
    command: "celery -A escriptorium worker -l INFO -E -Ofair --prefetch-multiplier 1 -Q gpu -c 1 --max-tasks-per-child=1"
    shm_size: '3gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids:
                - nvidia.com/gpu=all
              capabilities: [gpu]

The shell log above and the error about not being able to open libcuda.so are from this container; without that LD_LIBRARY_PATH, it fails to open libcuda.so.

By the way, is this CDI approach NixOS independent? I.e. is the above docker-compose definition something one might reasonably try to upstream?

ereslibre commented Dec 19, 2024

> Apologies, I should have specified that I did use (I think!) CDI

Can you share your NixOS configuration? If /var/run/cdi/nvidia-container-toolkit.json exists on your system and contains valid JSON, it means the CDI spec is being generated correctly.
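
For example, assuming jq is installed (python3 -m json.tool works just as well), a quick validity check would be:

jq empty /var/run/cdi/nvidia-container-toolkit.json && echo 'CDI spec parses as valid JSON'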

If you have virtualisation.docker.enableNvidia enabled in your configuration, disable it so that you get rid of the runtime wrappers. You can also clean up your Docker Compose file by removing the environment variables that only apply to them:

  celery-gpu: &celery-gpu
    <<: *app
    environment:
      - KRAKEN_TRAINING_DEVICE=cuda:0
    command: "celery -A escriptorium worker -l INFO -E -Ofair --prefetch-multiplier 1 -Q gpu -c 1 --max-tasks-per-child=1"
    shm_size: '3gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids:
                - nvidia.com/gpu=all

> By the way, is this CDI approach NixOS independent? I.e. is the above docker-compose definition something one might reasonably try to upstream?

CDI is a spec pushed by several vendors, and some, like Nvidia, implement it; it's certainly NixOS-independent. However, we need to tweak some things here and there because of how NixOS works.

I'd say it would be fine to try to upstream it. The device ID in your case, nvidia.com/gpu=all, is a very common one that the Nvidia tooling always generates, so it should not be a problem in that sense from my point of view, but that's up to the project's documentation team.

sliedes commented Dec 21, 2024

I have virtualisation.docker.enableNvidia = false, and /var/run/cdi/nvidia-container-toolkit.json exists.

I believe we can observe the problem without escriptorium, as follows:

$ grep libcuda.so /var/run/cdi/nvidia-container-toolkit.json
        "hostPath": "/nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so.565.77",
        "containerPath": "/nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so.565.77",

$ ls -l /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/ |grep libcuda.so
lrwxrwxrwx 1 root root       12 Jan  1  1970 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       17 Jan  1  1970 libcuda.so.1 -> libcuda.so.565.77
-r-xr-xr-x 1 root root 49572768 Jan  1  1970 libcuda.so.565.77

$ docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest ls -l /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/ |grep libcuda.so
lrwxrwxrwx 1 root root       17 Dec 21 18:09 libcuda.so.1 -> libcuda.so.565.77
-r-xr-xr-x 1 root root 49572768 Jan  1  1970 libcuda.so.565.77

That is, unlike the host, the container does not have the libcuda.so symlink, except in /usr/local/nvidia/lib, where CDI mounts the entire /nix/store/...-nvidia-x11-565.77-6.6.64/lib directory; but unlike the store directory, that path is not in the ld.so search path.
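
The same split should be visible with a plain image, something like this (the mount point may be lib or lib64 depending on the generated spec, so I'm globbing here):

docker run --rm --device=nvidia.com/gpu=all ubuntu:latest sh -c '
  ls -l /usr/local/nvidia/lib*/libcuda.so   # the unversioned symlink is here
  cat /etc/ld.so.conf.d/*.conf              # but only the /nix/store path is listed
'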

ereslibre commented Dec 21, 2024

Thanks for the details @sliedes. I observe the same thing. I'd like to understand why nvidia-container-toolkit cdi generate is not adding libcuda.so to the mount list.

In the meantime, although not ideal, you can use https://search.nixos.org/options?channel=24.11&show=hardware.nvidia-container-toolkit.mounts&from=0&size=50&sort=relevance&type=packages&query=nvidia-container-toolkit.

Let me know if you have any issues using hardware.nvidia-container-toolkit.mounts directly.

Note that the fact that libcuda.so is not available within the container does not impact other images such as ollama/ollama (docker run --rm -it --device=nvidia.com/gpu=all ollama/ollama), which is able to detect the available GPUs and use them successfully. This probably has to do with the way the containerized application tries to discover GPUs or access libcuda; and this is in fact what is impacting escriptorium.

ereslibre (Member) commented:

@sliedes I have followed the instructions on the escriptorium website and I have both the docker-compose.override.yml and the variables.env file with the required modifications. What is the most straightforward way to check that celery-gpu is working as intended?

sliedes commented Dec 21, 2024

I think escriptorium only uses the GPU when training either a segmentation or a recognition (OCR) model. To do that, create a document with some annotations, select the page(s) in the document view, and click train. It's not the most intuitive piece of software, and I've only looked into it for a few days, too. It would presumably also be possible from the command line somehow...
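
Something along these lines might work as a UI-free trigger, though I haven't tried it (it assumes the image ships PyTorch, which kraken uses for training, and that a cuDNN convolution hits the same libcuda.so lookup that fails above):

docker compose run --rm celery-gpu python -c "
import torch
# untested guess at a minimal cuDNN exercise
conv = torch.nn.Conv2d(1, 1, 3).cuda()
print(conv(torch.zeros(1, 1, 8, 8).cuda()).shape)
"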

I could dig a bit further given the error message:

Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory

I'm not in front of a keyboard now so I can't check, but it sounds like maybe libcudnn_cnn_infer.so.8 just has a DT_NEEDED entry for (i.e. is linked against) libcuda.so? That would be a clear indication that nothing that uses it can work. Generally, I'd be surprised if any software that uses libcudnn_cnn_infer.so doesn't fail in the same way.
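
Something like this should settle it, if someone can run it (ldd should be available even where readelf/binutils isn't; a "libcuda.so => not found" line would confirm the DT_NEEDED guess, while no libcuda line at all would point at a dlopen inside cudnn instead):

docker compose run --rm celery-gpu sh -c '
  lib=$(find / -name libcudnn_cnn_infer.so.8 2>/dev/null | head -n1)
  echo "found: $lib"
  ldd "$lib" | grep -i cuda
'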

OTOH, libcuda.so is present in /usr/local/nvidia/lib, since the entire nvidia-x11 lib directory gets mounted there.

Is there some clear purpose for having both /nix/store/...-nvidia-x11*/ and /usr/local/nvidia/lib, with only the first of those in ld.so.conf.d? In this case, having the other path in the ld.so search path would solve the problem, but I don't know what problems it would cause.
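
A throwaway experiment along these lines could test that hypothesis without touching the host configuration (again guessing lib64 vs lib for the mount point, and it obviously needs root inside the container):

docker run --rm --device=nvidia.com/gpu=all ubuntu:latest sh -c '
  echo /usr/local/nvidia/lib64 > /etc/ld.so.conf.d/zz-nvidia-local.conf
  ldconfig
  ldconfig -p | grep "libcuda.so "   # check whether the unversioned entry now shows up
'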

I can see from the CDI file that it picks specific files to be placed under /nix/store, and mounts the whole directory at /usr/local/nvidia/lib.

Sorry for not being able to give an easier reproducer now :(

ereslibre commented Dec 23, 2024

Hello @sliedes!

It turns out our nvidia-container-toolkit version is quite old (1.15.0-rc.3).

Looks like the missing libcuda.so was fixed in 1.17.0 (NVIDIA/nvidia-container-toolkit@v1.16.2...v1.17.0#diff-f9bb1d7101f90a959d16e109635c8c50d146bc23ac7b4cfdda37fab81858914bR84).

I have tested bumping nvidia-container-toolkit in NixOS, and the libcuda.so symlink is indeed added after the change:

  • Before:
❯ docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest ls -1 /nix/store/avgfkx6cww5skkmamnapqjbjzqlr2jk7-nvidia-x11-565.77-6.6.67/lib | grep libcuda.so
libcuda.so.1
libcuda.so.565.77

  • After:
❯ docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest ls -1 /nix/store/avgfkx6cww5skkmamnapqjbjzqlr2jk7-nvidia-x11-565.77-6.6.67/lib | grep libcuda.so
libcuda.so
libcuda.so.1
libcuda.so.565.77

However, in the updated version, ldconfig -p does not report it until ldconfig is executed explicitly:

❯ docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest ldconfig -p | grep libcuda
	libcudadebugger.so.1 (libc6,x86-64) => /nix/store/avgfkx6cww5skkmamnapqjbjzqlr2jk7-nvidia-x11-565.77-6.6.67/lib/libcudadebugger.so.1
	libcuda.so.1 (libc6,x86-64) => /nix/store/avgfkx6cww5skkmamnapqjbjzqlr2jk7-nvidia-x11-565.77-6.6.67/lib/libcuda.so.1
❯ docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest sh -c 'ldconfig && ldconfig -p | grep libcuda'
	libcudadebugger.so.1 (libc6,x86-64) => /nix/store/avgfkx6cww5skkmamnapqjbjzqlr2jk7-nvidia-x11-565.77-6.6.67/lib/libcudadebugger.so.1
	libcuda.so.1 (libc6,x86-64) => /nix/store/avgfkx6cww5skkmamnapqjbjzqlr2jk7-nvidia-x11-565.77-6.6.67/lib/libcuda.so.1
	libcuda.so (libc6,x86-64) => /nix/store/avgfkx6cww5skkmamnapqjbjzqlr2jk7-nvidia-x11-565.77-6.6.67/lib/libcuda.so

I don't know if this will be an issue for projects like escriptorium though.
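
If the stale cache turns out to matter for some images once the bump lands, a possible stopgap (untested with escriptorium itself) would be to refresh the linker cache before launching the worker, e.g. by wrapping the compose command; this requires the container to run as root, or ldconfig to be otherwise permitted:

sh -c 'ldconfig && exec celery -A escriptorium worker -l INFO -E -Ofair --prefetch-multiplier 1 -Q gpu -c 1 --max-tasks-per-child=1'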

WIP PR for bumping nvidia-container-toolkit: #367769
