
ARM aarch-64 server build failed (host OS: Ubuntu22.04.3) #2021

Closed

zhudy opened this issue Dec 11, 2023 · 54 comments · Fixed by #8713

Comments

@zhudy

zhudy commented Dec 11, 2023

Followed the steps at https://docs.vllm.ai/en/latest/getting_started/installation.html:

  1. docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
  2. git clone https://github.com/vllm-project/vllm.git
  3. cd vllm
  4. pip install -e .

Here are the details inside the Docker instance:
root@f8c2e06fbf8b:/mnt/vllm# pip install -e .
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining file:///mnt/vllm
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... error
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> [22 lines of output]
/tmp/pip-build-env-4xoxai9j/overlay/local/lib/python3.10/dist-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
device: torch.device = torch.device(torch._C._get_default_device()), # torch.device('cpu'),
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
:142: UserWarning: Unsupported CUDA/ROCM architectures ({'6.1', '7.2', '8.7', '5.2', '6.0'}) are excluded from the TORCH_CUDA_ARCH_LIST env variable (5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX). Supported CUDA/ROCM architectures are: {'7.5', '8.0', '9.0', '7.0', '8.6+PTX', '9.0+PTX', '8.6', '8.0+PTX', '8.9+PTX', '8.9', '7.0+PTX', '7.5+PTX'}.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in
main()
File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 132, in get_requires_for_build_editable
return hook(config_settings)
File "/tmp/pip-build-env-4xoxai9j/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 441, in get_requires_for_build_editable
return self.get_requires_for_build_wheel(config_settings)
File "/tmp/pip-build-env-4xoxai9j/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-4xoxai9j/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 295, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-4xoxai9j/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 311, in run_setup
exec(code, locals())
File "", line 297, in
File "", line 267, in get_vllm_version
NameError: name 'nvcc_cuda_version' is not defined. Did you mean: 'cuda_version'?
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: python -m pip install --upgrade pip

@zhudy
Author

zhudy commented Dec 11, 2023

Actually, nvcc runs fine:

root@f8c2e06fbf8b:/mnt/vllm# nvcc -v
nvcc fatal : No input files specified; use option --help for more information
root@f8c2e06fbf8b:/mnt/vllm# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:10:07_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

@zhudy
Author

zhudy commented Dec 11, 2023

CUDA is present:

root@f8c2e06fbf8b:/mnt/vllm# echo $CUDA_HOME
/usr/local/cuda

root@f8c2e06fbf8b:/mnt/vllm# type nvcc
nvcc is /usr/local/cuda/bin/nvcc

github.com/vllm# python3 -c "import torch; print(torch.cuda.is_available()); print(torch.__version__);"
True
2.1.0a0+32f93b1

@yexing

yexing commented Dec 13, 2023

add

nvcc_cuda_version = get_nvcc_cuda_version(CUDA_HOME) 

to setup.py at line 268
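For context, here is a rough sketch of where that assignment sits relative to get_vllm_version() (illustrative only, not vLLM's actual setup.py; the helper below is a simplified stand-in):

# Illustrative sketch -- not vLLM's real setup.py. The NameError above happens
# because get_vllm_version() reads nvcc_cuda_version before it has been assigned.
import os
import re
import subprocess

CUDA_HOME = os.environ.get("CUDA_HOME", "/usr/local/cuda")

def get_nvcc_cuda_version(cuda_home: str) -> str:
    """Parse the release number (e.g. '12.2') out of `nvcc -V` (simplified stand-in)."""
    out = subprocess.check_output([os.path.join(cuda_home, "bin", "nvcc"), "-V"], text=True)
    return re.search(r"release (\d+\.\d+)", out).group(1)

nvcc_cuda_version = get_nvcc_cuda_version(CUDA_HOME)  # the line the fix adds

def get_vllm_version() -> str:
    base = "0.2.x"  # placeholder; the real function reads the version from vllm/__init__.py
    return f"{base}+cu{nvcc_cuda_version.replace('.', '')}"  # e.g. '0.2.x+cu122'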

@cyc00518

cyc00518 commented Feb 22, 2024

@yexing @zhudy
Excuse me, I'm facing the same problem.
I cloned vllm into my project and added

nvcc_cuda_version = get_nvcc_cuda_version(CUDA_HOME)

to setup.py at line 268, but I still get the same error. Did I miss something?

@Wetzr

Wetzr commented Mar 4, 2024

I have the same problem and would be glad of any help.
Setup:
Aarch64 GH200
OS: Ubuntu 22.04.3 LTS (Jammy Jellyfish)
nvcc: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_11:03:34_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
cuda home: /usr/local/cuda
Torch: 2.2.0a0+81ea7a4

I am running inside the nvidia pytorch_23.12 Container.

@haileyschoelkopf

Got it working with the changes in this branch: https://github.com/haileyschoelkopf/vllm/tree/aarm64-dockerfile . Built images are here: https://hub.docker.com/r/haileysch/vllm-aarch64-base and https://hub.docker.com/r/haileysch/vllm-aarch64-openai . Hopefully this'll be helpful to others!

@tuanhe

tuanhe commented Mar 29, 2024

do as: https://docs.vllm.ai/en/latest/getting_started/installation.html

  1. docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
  2. git clone https://github.com/vllm-project/vllm.git
  3. cd vllm
  4. pip install -e .

here is the details in side the docker instance: root@f8c2e06fbf8b:/mnt/vllm# pip install -e . Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Obtaining file:///mnt/vllm Installing build dependencies ... done Checking if build backend supports build_editable ... done Getting requirements to build editable ... error error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully. │ exit code: 1 ╰─> [22 lines of output] /tmp/pip-build-env-4xoxai9j/overlay/local/lib/python3.10/dist-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:84.) device: torch.device = torch.device(torch._C._get_default_device()), # torch.device('cpu'), No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda' :142: UserWarning: Unsupported CUDA/ROCM architectures ({'6.1', '7.2', '8.7', '5.2', '6.0'}) are excluded from the TORCH_CUDA_ARCH_LIST env variable (5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX). Supported CUDA/ROCM architectures are: {'7.5', '8.0', '9.0', '7.0', '8.6+PTX', '9.0+PTX', '8.6', '8.0+PTX', '8.9+PTX', '8.9', '7.0+PTX', '7.5+PTX'}. Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in main() File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main json_out['return_val'] = hook(**hook_input['kwargs']) File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 132, in get_requires_for_build_editable return hook(config_settings) File "/tmp/pip-build-env-4xoxai9j/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 441, in get_requires_for_build_editable return self.get_requires_for_build_wheel(config_settings) File "/tmp/pip-build-env-4xoxai9j/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel return self._get_build_requires(config_settings, requirements=['wheel']) File "/tmp/pip-build-env-4xoxai9j/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 295, in _get_build_requires self.run_setup() File "/tmp/pip-build-env-4xoxai9j/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 311, in run_setup exec(code, locals()) File "", line 297, in File "", line 267, in get_vllm_version NameError: name 'nvcc_cuda_version' is not defined. Did you mean: 'cuda_version'? [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip. error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully. │ exit code: 1 ╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

[notice] A new release of pip is available: 23.2.1 -> 23.3.1 [notice] To update, run: python -m pip install --upgrade pip

Hi guys, have you solved this issue?

@cyc00518

cyc00518 commented Jun 6, 2024

@tuanhe
Still facing the same problem.
Does anyone know whether vllm supports aarch64 now?

@drikster80
Contributor

Had a similar problem on the GH200 (aarch64 Grace CPU).
Similar to @haileyschoelkopf, I updated the Dockerfile and requirements to work with v0.5.1. Here is the forked version:
https://github.com/drikster80/vllm/tree/gh200-docker

Main issues that needed to be overcome:

  • Use Nvidia's PyTorch container, since upstream PyTorch doesn't ship ARM64 wheels with CUDA support (specifically nvcr.io/nvidia/pytorch:24.04-py3, to get a PyTorch 2.3 build and the latest optimizations, e.g. Lightning-Thunder). Release Notes for 24.04-py3
  • xformers hangs on pip install. Not sure why (maybe just taking forever to compile?)
  • Triton needs to be installed from source
  • vllm-flash-attn needs to be built from source
  • Comment out "torch", "xformers", and "vllm-flash-attn" in requirements files (handling that in the Dockerfile directly).

For future updating, you can see the changes here: drikster80@359fd4f
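A rough sketch of the from-source builds listed above (illustrative only; the MAX_JOBS value and install flags here are assumptions, the fork's Dockerfile has the exact steps):

# Illustrative sketch, not the fork's exact Dockerfile steps. Inside the NGC PyTorch
# container, build the packages that ship no aarch64 wheels against the preinstalled torch.
export MAX_JOBS=8                       # cap parallel compile jobs to avoid OOM
pip install ninja packaging             # common build-time helpers
pip install -v --no-deps --no-build-isolation \
    "git+https://github.com/facebookresearch/xformers.git"
# triton and vllm-flash-attn are handled the same way (cloned and built from source),
# and torch/xformers/vllm-flash-attn are commented out of the requirements files so the
# vllm install doesn't try to pull replacement wheels from PyPI.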

@ZihaoZhou

ZihaoZhou commented Jul 19, 2024

Thank you all.

I have built the image using the script provided by @drikster80; it took about 12 hours (most of the time was spent on the mamba builder and xformers). So to save time for others, I have made the image public at https://hub.docker.com/r/zihaokevinzhou/vllm-aarch64-openai . I have validated that it works well for my personal hosting of an fp8-quantized version of llama-3-70b.

@drikster80
Contributor

@ZihaoZhou, thank you.

It normally only takes ~80 min on my system. 12 hrs seems excessive. I'm working on an update for v0.5.2, but haven't gotten the new flash-infer to build yet. I'll update the script when that's solved and post back here.

I haven't been uploading since the container is ~33GB. It looks like the one you uploaded is 13GB? Is that just from native compression? I'm sure there are some ways to cut it down (e.g. remove some of the build artifacts from the last image?).

@cyc00518

@ZihaoZhou
You should have appeared earlier!

@drikster80
In fact, I am also using a GH200, and today I used your forked version to build it.

The step that took me the most time was:
RUN python3 setup.py bdist_wheel --dist-dir=dist which took a total of 40 minutes.
Installing Triton also took a very long time.

Additionally, for the xformers part, I spent an entire afternoon, and it also seemed to be stuck there.
So in the end, I commented out this part.

Now, vllm is successfully running on GH200, thanks to your selfless contribution!

May I ask, regarding the Docker image on aarch64, compared to the original version, is the main difference just commenting out the items you mentioned in the requirements.txt?
Why is this necessary?

@drikster80
Contributor

drikster80 commented Jul 19, 2024

@cyc00518 You can see the list of full changes here: main...drikster80:vllm:gh200-docker

Effectively, xformers and vllm-flash-attention don't release ARM64 wheels, so those need to be built from source. Also, since Nvidia's PyTorch container already contains torch, torchvision, and some other stuff, those need to be commented out in the requirements file. The only 3 files that are changed are Dockerfile, requirements-build.txt, and requirements-cuda.txt.

As a side note, if you're using a GH200 bare metal, you might also want to checkout my auto-install for GH200s. Getting it setup with optimizations, NCCL, OFED, for high-speed distributed training/inference was a pain, so automated it for people to use or reference: https://github.com/drikster80/gh200-Ubuntu-22.04-autoinstall


@cyc00518

@drikster80
Thank you very much for your patient replies.

I have learned a lot, and I also appreciate the additional information you provided!

@drikster80
Contributor

Updated the aarch64 remote branch to v0.5.2: https://github.com/drikster80/vllm/tree/gh200-docker

Pushed up a GH200 specific version (built for SM 9.0+PTX) to https://hub.docker.com/r/drikster80/vllm-gh200-openai

Building a more generic version now and will update this comment when complete.

@drikster80
Contributor

If anyone comes across this and is trying to get Llama-3.1 to work with the GH200 (or aarch64 + H100), I have the latest working container (v0.5.3-post1 with a couple more commits) image up at https://hub.docker.com/r/drikster80/vllm-gh200-openai
Pull it with docker pull drikster80/vllm-gh200-openai:latest

Code is still in the https://github.com/drikster80/vllm/tree/gh200-docker branch.

Validated that Llama-3.1-8B-Instruct works, and now trying to test 405B-FP8 (with cpu-offload).

@FanZhang91

@tuanhe Still face same problem, Anyone know vllm support aarch-64 now?

+1

@skandermoalla

Also built some images for arm64 with cuda arch 9.0 (for GH200/H100) and for amd64 for cuda arch 8.0 and 9.0 (A100 and H100) from a fork of @drikster80 's installation to focus on the reproducibility of the build and to have both architectures start from the NGC PyTorch images.
Code: https://github.com/skandermoalla/vllm-build
Images: https://hub.docker.com/repository/docker/skandermoalla/vllm/general

@drikster80
Contributor

@FanZhang91, I still maintain two docker images for aarch64 on DockerHub. These have both been updated to v0.6.1 as of 30 min ago.

All Supported CUDA caps: drikster80/vllm-aarch64-openai:latest
GH200/H100+ only (smaller): drikster80/vllm-gh200-openai:latest

They are slightly different from upstream in a couple small ways:

  • Based on Nvidia Pytorch container 24.07
  • Python 3.10 (haven't upgraded to 3.12 yet due to source-compiling problems)
  • Using main FlashInfer instead of release... just haven't gotten around to pinning that to a release.
  • Xformers, Flashinfer, and a couple other things needed to be built from source

You can pull and build yourself with:

git clone -b gh200-docker https://github.com/drikster80/vllm.git
cd ./vllm

# Update the max_jobs and nvcc_threads as needed to prevent OOM. This is good for a GH200.
docker build . --target vllm-openai -t drikster80/vllm-aarch64-openai:v0.6.1 --build-arg max_jobs=10 --build-arg nvcc_threads=8

# Can also pin to a specific Nvidia GPU Capability:
# docker build . --target vllm-openai -t drikster80/vllm-gh200-openai:v0.6.1 --build-arg max_jobs=10 --build-arg nvcc_threads=8 --build-arg torch_cuda_arch_list="9.0+PTX"

It takes ~1 hr to build on a pinned capability, and ~3+ hours to build for all GPU capability levels. Longer if you reduce the max_jobs variable.
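For reference, a minimal run of the resulting image might look like this (illustrative; the model name and flags are placeholders, not taken from the fork):

docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    drikster80/vllm-aarch64-openai:v0.6.1 \
    --model meta-llama/Llama-3.1-8B-Instruct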

@skandermoalla, I've been meaning to make a PR for a merged Dockerfile that can produce both arm64 & amd64... just haven't had the time to work on it. This was requested by some of the vllm maintainers and would make my life a lot easier to not need to maintain a separate fork. Is this something you'd be interested in collaborating on?

@skandermoalla

There weren't any changes in the Dockerfile or dependencies to compile for arm64 and amd64, as most of the tricky packages are compiled from source.
For me what's important is to start from the NGC image for both architectures. If this is something the vllm team is happy to have then I'm happy to collaborate on producing one!
You did all the hard work already of figuring out what to compile and what not and in which order to install the packages and skip their pip deps when needed.

@gongchengli

@FanZhang91, I still maintain two docker images for aarch64 on DockerHub. These have both been updated to v0.6.1 as of 30 min ago.

All Supported CUDA caps: drikster80/vllm-aarch64-openai:latest GH200/H100+ only (smaller): drikster80/vllm-gh200-openai:latest

They are slightly different from upstream in a couple small ways:

  • Based on Nvidia Pytorch container 24.07
  • Python 3.10 (haven't upgraded to 3.12 yet to to source compiling problems
  • Using main FlashInfer instead of release... just haven't gotten around to pinning that to a release.
  • Xformers, Flashinfer, and a couple other things needed to be built from source

You can pull and build yourself with:

git clone -b gh200-docker https://github.com/drikster80/vllm.git
cd ./vllm\

# Update the max_jobs and nvvc_threads as needed to prevent OOM. This is good for a GH200.
docker build . --target vllm-openai -t drikster80/vllm-aarch64-openai:v0.6.1 --build-arg max_jobs=10 --build-arg nvcc_threads=8

# Can also pin to a specific Nvidia GPU Capability:
# docker build . --target vllm-openai -t drikster80/vllm-gh200-openai:v0.6.1 --build-arg max_jobs=10 --build-arg nvcc_threads=8 --build-arg torch_cuda_arch_list="9.0+PTX"

It takes ~1 hr to build on a pinned capability, and ~3+ hours to build for all GPU capability levels. Longer if you reduce the max_jobs variable.

@skandermoalla, I've been meaning to make a PR for a merged DockerFile that can product both arm64 & amd64... just haven't had the time to work it. This was requested by some of the vllm maintainers and would make my life a lot easier to not need to maintain a separate fork. Is this something you'd be interested in collaborating on?

Hi @drikster80, thanks for your docker images. After pulling the docker image, is it still necessary to rebuild or recompile from the source code? I got this error:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

I am using an NVIDIA Jetson Orin with the vllm v0.6.1 Docker image, and the device is different from yours.
If it is necessary to do this, could you please provide any files?

@youkaichao
Member

can you please try out #8713 ? @drikster80 @gongchengli

I spared some time to investigate the issue, and it looks like the most complicated part is bringing your own pytorch (@drikster80 does this by using the NGC pytorch container). Other than that, it is pretty straightforward.

On that branch, I can easily build vllm from scratch with nightly pytorch, in a fresh new environment:

$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py
$ pip install -r requirements-build.txt
$ pip install -vvv -e . --no-build-isolation
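A quick sanity check after the editable install (illustrative; any small import test works) is to confirm that the nightly torch was reused rather than replaced during the build:

$ python3 -c "import torch, vllm; print(torch.__version__, vllm.__version__)"  # both should import cleanly
$ pip list | grep -i torch  # should still show the nightly build installed in step 1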

@KungFuPandaPro

can you please try out #8713 ? @drikster80 @gongchengli

I spare some time to investigate the issue, and it looks the most complicated part is to bring your own pytorch ( @drikster80 does this by using ngc pytorch container). other than that, it is pretty straight-forward.

on that branch, I can easily build vllm from scratch with nightly pytorch, in a fresh new environment:

$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py
$ pip install -r requirements-build.txt
$ pip install -vvv -e . --no-build-isolation

Is your environment ARM?

@youkaichao
Member

yes, I built it on GH200 successfully.

@KungFuPandaPro

pip install -vvv -e . --no-build-isolation

So many errors: 32 errors detected in the compilation of "/home/qz/zww/vllm/csrc/quantization/gptq/q_gemm.cu",
11 errors detected in the compilation of "/home/qz/zww/vllm/csrc/quantization/fp8/common.cu". Is that normal?

@KungFuPandaPro

yes, I built it on GH200 successfully.

I failed

@KungFuPandaPro

@drikster80 when I met the problem Unknown runtime environment, it is usually because pip installs torch from pypi directly, and it does not have aarch64 wheel with cuda support.

make sure you control all the torch installation.

my solution is:

$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124 # install pytorch
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py # remove all vllm dependency specification of pytorch
$ pip install -r requirements-build.txt # install the rest build time dependency
$ pip install -vvv -e . --no-build-isolation # use --no-build-isolation to build with the current pytorch

make sure you followed these steps.

ideally, you should not see any pytorch install/uninstall during the build, because your dockerfile already has pytorch installed.

Which dockerfile do you use?

@KungFuPandaPro

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124

My platform is aarch64-linux on a Jetson.

@KungFuPandaPro

can you please try out #8713 ? @drikster80 @gongchengli

I spare some time to investigate the issue, and it looks the most complicated part is to bring your own pytorch ( @drikster80 does this by using ngc pytorch container). other than that, it is pretty straight-forward.

on that branch, I can easily build vllm from scratch with nightly pytorch, in a fresh new environment:

$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py
$ pip install -r requirements-build.txt
$ pip install -vvv -e . --no-build-isolation

I failed: Feature 'f16 arithemetic and compare instructions' requires .target sm_53 or higher

@Jerrrrykun

@drikster80 when I met the problem Unknown runtime environment, it is usually because pip installs torch from pypi directly, and it does not have aarch64 wheel with cuda support.

make sure you control all the torch installation.

my solution is:

$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124 # install pytorch
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py # remove all vllm dependency specification of pytorch
$ pip install -r requirements-build.txt # install the rest build time dependency
$ pip install -vvv -e . --no-build-isolation # use --no-build-isolation to build with the current pytorch

make sure you followed these steps.

ideally, you should not see any pytorch install/uninstall during the build, because your dockerfile already has pytorch installed.

Hi. I followed your steps here, but I either got stuck at the first step (installing pytorch in a new python=3.10 virtual env from scratch) or failed at the last step with a pre-installed torch=2.3.1+cuda12.0 and python=3.9 virtual env.

  • For the first case, pip could NOT figure out which torch to install and seemed to be downloading every available torch package from the index URL, leading to endless downloads of pytorch into the cache.

  • For the second case, the error at the end was: subprocess.CalledProcessError: Command '['cmake', '--build', '.', '-j=72', '--target=_moe_C', '--target=vllm_flash_attn_c', '--target=_C']' returned non-zero exit status 1. (Tried using a smaller MAX_JOBS but it did not work; still the same error with another MAX_JOBS value for -j=.)

Info. of my machine:

GH200 aarch64 node (Linux 5.14.0-427.37.1.el9_4.aarch64+64k aarch64)

Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:24:28_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

pre-installed env:

_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                      51_gnu  
blas                      1.0                    openblas  
bzip2                     1.0.8                h998d150_6  
ca-certificates           2024.3.11            hd43f75c_0  
cmake                     3.30.4                   pypi_0    pypi
cuda-cudart               12.4.127             h7d4d7f0_0  
cuda-cudart_linux-aarch64 12.4.127             h7d4d7f0_0  
cuda-nvrtc                12.4.127             hc07b681_1  
cuda-nvtx                 12.4.127             ha8f0017_1  
cuda-version              12.4                 hbda6634_3  
cudnn                     8.9.7.29             h00485f9_3    conda-forge
filelock                  3.13.1           py39hd43f75c_0  
fsspec                    2024.3.1         py39hd43f75c_0  
gmp                       6.2.1                h22f4aa5_3  
gmpy2                     2.1.2            py39h2673b40_0  
jinja2                    3.1.4            py39hd43f75c_0  
ld_impl_linux-aarch64     2.38                 h8131f2d_1  
libabseil                 20240116.2      cxx17_h419075a_0  
libblas                   3.9.0           22_linuxaarch64_openblas    conda-forge
libcblas                  3.9.0           22_linuxaarch64_openblas    conda-forge
libcublas                 12.4.5.8             h7d4d7f0_1  
libcufft                  11.2.1.3             h7d4d7f0_1  
libcurand                 10.3.5.147           h7d4d7f0_1  
libcusolver               11.6.1.9             h7d4d7f0_1  
libcusparse               12.3.1.170           h7d4d7f0_1  
libffi                    3.4.4                h419075a_1  
libgcc-ng                 13.2.0              he277a41_10    conda-forge
libgfortran-ng            13.2.0              he9431aa_10    conda-forge
libgfortran5              13.2.0              h2af0866_10    conda-forge
libgomp                   13.2.0              he277a41_10    conda-forge
liblapack                 3.9.0           22_linuxaarch64_openblas    conda-forge
libmagma                  2.7.2                hd3076f5_2    conda-forge
libmagma_sparse           2.7.2                hd3076f5_3    conda-forge
libnsl                    2.0.1                h31becfc_0    conda-forge
libnvjitlink              12.4.127             h7d4d7f0_1  
libopenblas               0.3.27          pthreads_h5a5ec62_0    conda-forge
libprotobuf               4.25.3               h648ac29_0    conda-forge
libsqlite                 3.46.0               hf51ef55_0    conda-forge
libstdcxx-ng              13.2.0              h3f4de04_10    conda-forge
libtorch                  2.3.1           cuda120_h9f053a3_200    conda-forge
libuuid                   2.38.1               hb4cce97_0    conda-forge
libuv                     1.48.0               h31becfc_0    conda-forge
libxcrypt                 4.4.36               h31becfc_1    conda-forge
libzlib                   1.3.1                h68df207_1    conda-forge
markupsafe                2.1.3            py39h998d150_0  
mpc                       1.1.0                h3d095b0_1  
mpfr                      4.0.2                h51dc842_1  
mpmath                    1.3.0            py39hd43f75c_0  
nccl                      2.21.5.1             h1843a27_0  
ncurses                   6.5                  h0425590_0    conda-forge
networkx                  3.2.1            py39hd43f75c_0  
ninja                     1.11.1.1                 pypi_0    pypi
nomkl                     3.0                           0  
numpy                     1.26.4           py39he45c16d_0  
numpy-base                1.26.4           py39h15d264d_0  
openssl                   3.3.1                h68df207_0    conda-forge
packaging                 24.1                     pypi_0    pypi
pip                       24.0             py39hd43f75c_0  
python                    3.9.19          h4ac3b42_0_cpython    conda-forge
python_abi                3.9                      4_cp39    conda-forge
pytorch                   2.3.1           cuda120_py39h4684420_200    conda-forge
pytorch-gpu               2.3.1           cuda120py39hecaec72_200    conda-forge
readline                  8.2                  h998d150_0  
setuptools                69.5.1           py39hd43f75c_0  
setuptools-scm            8.1.0                    pypi_0    pypi
sleef                     3.5.1                h998d150_2  
sympy                     1.12             py39hd43f75c_0  
tk                        8.6.13               h194ca79_0    conda-forge
tomli                     2.0.2                    pypi_0    pypi
typing_extensions         4.11.0           py39hd43f75c_0  
tzdata                    2024a                h04d1e81_0  
wheel                     0.43.0           py39hd43f75c_0  
xz                        5.4.6                h998d150_1

@youkaichao
Member

For the first case, the pip could NOT figure out which torch to install and seemed to be downloading all available torch packages in the index URL, leading to the endless downloading pytorch to caches.

This might be a problem with your pip. You can try to manually specify the version of pytorch you want to install.
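For example, pinning one explicit nightly build instead of letting pip scan the whole index (illustrative; the version tag here is just an example, use whichever build you need):

$ pip3 install --pre torch==2.6.0.dev20241104+cu124 --index-url https://download.pytorch.org/whl/nightly/cu124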

@Jerrrrykun

For the first case, the pip could NOT figure out which torch to install and seemed to be downloading all available torch packages in the index URL, leading to the endless downloading pytorch to caches.

this might be a problem of your pip . you can try to manually specify the version of pytorch you want to install.

Thanks for replying. Also tried this before but got the same error as case 2.

@SamSoup

SamSoup commented Oct 24, 2024

I am experiencing the same issue as @Jerrrrykun on a GH200 node (kernel 5.14.0-362.24.1.el9_3.aarch64). I have the same CUDA version, 12.4. I tried PyTorch wheels from the official release (https://download.pytorch.org/whl/cu124) and nightly (https://download.pytorch.org/whl/nightly/cu124) and then followed the code above exactly.

The cmake command fails. I run without parallelism as suggested with -j1 and observe that the build fails on the first step:

[ 16%] Building CXX object CMakeFiles/_moe_C.dir/csrc/moe/torch_bindings.cpp.o
"<conda_environ>/lib/python3.10/site-packages/torch/include/c10/util/Half.h", line 334: error: identifier "float16_t" is undefined
  inline float16_t fp16_from_bits(uint16_t h) {
         ^

"<conda_environ>/lib/python3.10/site-packages/torch/include/c10/util/Half.h", line 335: error: identifier "float16_t" is undefined
    return c10::bit_cast<float16_t>(h);
                         ^

"<conda_environ>/lib/python3.10/site-packages/torch/include/c10/util/Half.h", line 338: error: identifier "float16_t" is undefined
  inline uint16_t fp16_to_bits(float16_t f) {
                               ^

"<conda_environ>/lib/python3.10/site-packages/torch/include/c10/util/Half.h", line 339: error: no instance of function template "c10::bit_cast" matches the argument list
            argument types are: (<error-type>)
    return c10::bit_cast<uint16_t>(f);
...
78 errors detected in the compilation of "<WORK_DIR>/vllm/csrc/moe/torch_bindings.cpp".
gmake[3]: *** [CMakeFiles/_moe_C.dir/build.make:76: CMakeFiles/_moe_C.dir/csrc/moe/torch_bindings.cpp.o] Error 2
gmake[2]: *** [CMakeFiles/Makefile2:205: CMakeFiles/_moe_C.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:212: CMakeFiles/_moe_C.dir/rule] Error 2
gmake: *** [Makefile:202: _moe_C] Error 2

A bunch of errors occur in Half.h and Half-inl.h, and a few in torch/include/torch/csrc/api/include/torch/nn/functional/loss.h and torch/include/c10/util/StringUtil.h. Does this mean we are somehow not installing torch correctly?

I can verify that my Pytorch installation is working:

$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.version.cuda)"
2.4.0
True
12.4

@youkaichao
Member

You can run a small PyTorch CUDA program to verify that the installation is correct.
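Something small like this is enough (a sketch; any CUDA op that forces a kernel launch works):

# Minimal PyTorch CUDA smoke test (illustrative)
import torch

print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))
x = torch.randn(1024, 1024, device="cuda")
y = x @ x                      # runs a CUDA matmul kernel
torch.cuda.synchronize()       # surfaces any asynchronous CUDA errors here
print("ok:", float(y.sum()))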

@samos123
Contributor

samos123 commented Oct 31, 2024

Should we reopen this until we have a docker image that just works on GH200? Can also file a new issue.

I tried the vllm/vllm-openai:v0.6.3.post1 docker image and it doesn't work on GH200. It throws this error:

exec /usr/bin/python3: exec format error

The image from @drikster80 works great!

Is someone working on a PR to merge the required changes that @drikster80 made back into upstream?

@yajuvendrarawat

can you please try out #8713 ? @drikster80 @gongchengli
I spare some time to investigate the issue, and it looks the most complicated part is to bring your own pytorch ( @drikster80 does this by using ngc pytorch container). other than that, it is pretty straight-forward.
on that branch, I can easily build vllm from scratch with nightly pytorch, in a fresh new environment:

$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py
$ pip install -r requirements-build.txt
$ pip install -vvv -e . --no-build-isolation

I failed :Feature 'f16 arithemetic and compare instructions' requires .target sm_53 or higher

It worked for me on a Jetson Orin Nano; vllm is able to start, but I am facing issues running models.

vllm serve --device cuda ibm-granite/granite-3.0-2b-instruct

Error I get
ValueError: Model architectures ['GraniteForCausalLM'] failed to be inspected. Please check the logs for more details.

or
vllm serve --device cuda distilbert/distilbert-base-uncased

raise ValueError(
ValueError: Model architectures ['DistilBertForMaskedLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'InternLM2VEForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'FalconMambaForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'Florence2ForConditionalGeneration', 'BertModel', 'Gemma2Model', 'LlamaModel', 'MistralModel', 'Qwen2ForRewardModel', 'Qwen2ForSequenceClassification', 'LlavaNextForConditionalGeneration', 'Phi3VForCausalLM', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'H2OVLChatModel', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration', 'Qwen2AudioForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel']

environment:

(myenv) [yajuvendra@llmhost vllm]$ python3 collect_env.py
Collecting environment information...
INFO 11-06 08:05:53 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available.
PyTorch version: 2.6.0.dev20241104+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux 9.4 (Plow) (aarch64)
GCC version: (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
Clang version: Could not collect
CMake version: version 3.30.5
Libc version: glibc-2.34

Python version: 3.10.15 (main, Oct 3 2024, 07:21:53) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.14.0-427.22.1.el9_4.aarch64-aarch64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Vendor ID: ARM
Model name: Cortex-A78AE
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 6
Socket(s): -
Cluster(s): 1
Stepping: r0p1
CPU(s) scaling MHz: 62%
CPU max MHz: 1510.4000
CPU min MHz: 115.2000
BogoMIPS: 62.50
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp uscat ilrcpc flagm
L1d cache: 384 KiB (6 instances)
L1i cache: 384 KiB (6 instances)
L2 cache: 1.5 MiB (6 instances)
L3 cache: 4 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-5
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-ml-py==12.560.30
[pip3] pyzmq==26.2.0
[pip3] torch==2.6.0.dev20241104+cu124
[pip3] torchaudio==2.5.0.dev20241105
[pip3] torchvision==0.20.0.dev20241105
[pip3] transformers==4.46.2
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.6.0.dev20241104+cu124 pypi_0 pypi
[conda] torchaudio 2.5.0.dev20241105 pypi_0 pypi
[conda] torchvision 0.20.0.dev20241105 pypi_0 pypi
[conda] transformers 4.46.2 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post2.dev257+g21063c11.d20241106
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

@youkaichao
Member

@yajuvendrarawat

Model architectures ['DistilBertForMaskedLM']

This error is clear: it is not supported.

ValueError: Model architectures ['GraniteForCausalLM'] failed to be inspected. Please check the logs for more details.

This is strange; we need more logging output for this.

@yajuvendrarawat

yajuvendrarawat commented Nov 7, 2024

@yajuvendrarawat

Model architectures ['DistilBertForMaskedLM']

this error is clear. it is not supported.

ValueError: Model architectures ['GraniteForCausalLM'] failed to be inspected. Please check the logs for more details.

this is strange. we need more logging output for this.

Thanks a lot @youkaichao. No matter what model I use, I land on the same error.

Attached are two logs; thanks a lot for your help.
error - granite-3b-code-base-2k.txt
error - MiniCPM3-4B.txt

(myenv) [yajuvendra@llmhost vllm]$ git remote -v
origin https://github.com/vllm-project/vllm.git (fetch)
origin https://github.com/vllm-project/vllm.git (push)
(myenv) [yajuvendra@llmhost vllm]$ git branch --show-current
main

br..
Yaju

@youkaichao
Member

ERROR 11-06 21:40:45 registry.py:286] File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/site-packages/torch/init.py", line 51, in
ERROR 11-06 21:40:45 registry.py:286] from torch._utils import (
ERROR 11-06 21:40:45 registry.py:286] File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/site-packages/torch/_utils.py", line 4, in
ERROR 11-06 21:40:45 registry.py:286] import logging
ERROR 11-06 21:40:45 registry.py:286] File "/home/yajuvendra/vllm/vllm/logging/init.py", line 1, in
ERROR 11-06 21:40:45 registry.py:286] from vllm.logging.formatter import NewLineFormatter
ERROR 11-06 21:40:45 registry.py:286] File "/home/yajuvendra/vllm/vllm/logging/init.py", line 1, in
ERROR 11-06 21:40:45 registry.py:286] from vllm.logging.formatter import NewLineFormatter
ERROR 11-06 21:40:45 registry.py:286] File "/home/yajuvendra/vllm/vllm/logging/formatter.py", line 4, in
ERROR 11-06 21:40:45 registry.py:286] class NewLineFormatter(logging.Formatter):
ERROR 11-06 21:40:45 registry.py:286] AttributeError: partially initialized module 'logging' has no attribute 'Formatter' (most likely due to a circular import)

@yajuvendrarawat you are running your code under vllm/vllm , and then Python incorrectly treats vllm/logging.py as the builtin logging module.
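A quick way around it (illustrative) is to launch from any directory outside the checkout, so the repo's vllm/logging package can no longer shadow the standard library:

$ cd ~   # anywhere outside the vllm source tree
$ vllm serve --device cuda ibm-granite/granite-3.0-2b-instruct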

@drikster80
Contributor

can you please try out #8713 ? @drikster80 @gongchengli

I spare some time to investigate the issue, and it looks the most complicated part is to bring your own pytorch ( @drikster80 does this by using ngc pytorch container). other than that, it is pretty straight-forward.

on that branch, I can easily build vllm from scratch with nightly pytorch, in a fresh new environment:

$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py
$ pip install -r requirements-build.txt
$ pip install -vvv -e . --no-build-isolation

@youkaichao @samos123
I created a new PR for building Docker with the updates. Still some testing to do on it to confirm it doesn't break anything on x86_64 systems, but I've confirmed that it's working on the GH200. Also, I haven't tested exhaustive models, but it does work with a couple of Llamas that I tested.

#10499

@drikster80
Contributor

If anyone with an MGX or GH200 wants to test the latest version with different models, let me know if any supported models fail.

Uploaded the latest (v0.6.4.post1) to dockerhub: https://hub.docker.com/r/drikster80/vllm-gh200-openai/tags docker pull drikster80/vllm-gh200-openai:latest

The new version (PR #10499) is unique in that it doesn't use the Nvidia PyTorch container anymore and matches the standard vLLM container more closely. It uses the nightly version of PyTorch (which now supports aarch64) and compiles the modules that don't release an aarch64 wheel (e.g. bitsandbytes, flashinfer, triton, mamba, causal-conv1d, etc.). So it should be considered experimental, but I've run some tests on a couple of models and haven't had any issues.

@yajuvendrarawat

ERROR 11-06 21:40:45 registry.py:286] File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/site-packages/torch/init.py", line 51, in
ERROR 11-06 21:40:45 registry.py:286] from torch._utils import (
ERROR 11-06 21:40:45 registry.py:286] File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/site-packages/torch/_utils.py", line 4, in
ERROR 11-06 21:40:45 registry.py:286] import logging
ERROR 11-06 21:40:45 registry.py:286] File "/home/yajuvendra/vllm/vllm/logging/init.py", line 1, in
ERROR 11-06 21:40:45 registry.py:286] from vllm.logging.formatter import NewLineFormatter
ERROR 11-06 21:40:45 registry.py:286] File "/home/yajuvendra/vllm/vllm/logging/init.py", line 1, in
ERROR 11-06 21:40:45 registry.py:286] from vllm.logging.formatter import NewLineFormatter
ERROR 11-06 21:40:45 registry.py:286] File "/home/yajuvendra/vllm/vllm/logging/formatter.py", line 4, in
ERROR 11-06 21:40:45 registry.py:286] class NewLineFormatter(logging.Formatter):
ERROR 11-06 21:40:45 registry.py:286] AttributeError: partially initialized module 'logging' has no attribute 'Formatter' (most likely due to a circular import)

@yajuvendrarawat you are running your code under vllm/vllm , and then Python incorrectly treats vllm/logging.py as the builtin logging module.

Hello @youkaichao ,

I am not sure what I am doing wrong now.

My device is a Jetson Orin Nano. Compilation went fine; one additional step I had to take was to install xformers, since the device is being detected as Volta and Turing.

I am getting an error while executing vllm serve. Any advice?

(myenv) [yajuvendra@llmhost ~]$ vllm serve --device cuda ibm-granite/granite-3.0-2b-instruct
INFO 11-26 09:22:19 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 11-26 09:22:22 api_server.py:585] vLLM API server version 0.6.5.dev0+g02dbf30e.d20241126
INFO 11-26 09:22:22 api_server.py:586] args: Namespace(subparser='serve', model_tag='ibm-granite/granite-3.0-2b-instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='ibm-granite/granite-3.0-2b-instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='cuda', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0xfffedbf4aa70>)
INFO 11-26 09:22:22 api_server.py:175] Multiprocessing frontend to use ipc:///tmp/40e8a44a-c8fc-4453-a9e9-63e9259e31fd for IPC Path.
INFO 11-26 09:22:22 api_server.py:194] Started engine process with PID 14682
INFO 11-26 09:22:29 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-26 09:22:36 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-26 09:22:36 config.py:791] Possibly too large swap space. 4.00 GiB out of the 7.12 GiB total CPU memory is allocated for the swap space.
WARNING 11-26 09:22:46 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-26 09:22:46 config.py:791] Possibly too large swap space. 4.00 GiB out of the 7.12 GiB total CPU memory is allocated for the swap space.
INFO 11-26 09:22:46 llm_engine.py:249] Initializing an LLM engine (v0.6.5.dev0+g02dbf30e.d20241126) with config: model='ibm-granite/granite-3.0-2b-instruct', speculative_config=None, tokenizer='ibm-granite/granite-3.0-2b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=ibm-granite/granite-3.0-2b-instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 11-26 09:22:47 selector.py:261] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-26 09:22:47 selector.py:144] Using XFormers backend.
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/site-packages/xformers/ops/swiglu_op.py:107: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(cls, ctx, x, w1, b1, w2, b2, w3, b3):
/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/site-packages/xformers/ops/swiglu_op.py:128: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(cls, ctx, dx5):
ERROR 11-26 09:22:48 engine.py:366]
Traceback (most recent call last):
File "/home/yajuvendra/vllm/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/home/yajuvendra/vllm/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
return cls(ipc_path=ipc_path,
File "/home/yajuvendra/vllm/vllm/engine/multiprocessing/engine.py", line 71, in init
self.engine = LLMEngine(*args, **kwargs)
File "/home/yajuvendra/vllm/vllm/engine/llm_engine.py", line 347, in init
self.model_executor = executor_class(vllm_config=vllm_config, )
File "/home/yajuvendra/vllm/vllm/executor/executor_base.py", line 36, in init
self._init_executor()
File "/home/yajuvendra/vllm/vllm/executor/gpu_executor.py", line 39, in _init_executor
self.driver_worker.init_device()
File "/home/yajuvendra/vllm/vllm/worker/worker.py", line 137, in init_device
_check_if_gpu_supports_dtype(self.model_config.dtype)
File "/home/yajuvendra/vllm/vllm/worker/worker.py", line 471, in _check_if_gpu_supports_dtype
gpu_name = current_platform.get_device_name()
File "/home/yajuvendra/vllm/vllm/platforms/interface.py", line 103, in get_device_name
raise NotImplementedError
NotImplementedError
Process SpawnProcess-1:
Traceback (most recent call last):
File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/yajuvendra/vllm/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
raise e
File "/home/yajuvendra/vllm/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/home/yajuvendra/vllm/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
return cls(ipc_path=ipc_path,
File "/home/yajuvendra/vllm/vllm/engine/multiprocessing/engine.py", line 71, in init
self.engine = LLMEngine(*args, **kwargs)
File "/home/yajuvendra/vllm/vllm/engine/llm_engine.py", line 347, in init
self.model_executor = executor_class(vllm_config=vllm_config, )
File "/home/yajuvendra/vllm/vllm/executor/executor_base.py", line 36, in init
self._init_executor()
File "/home/yajuvendra/vllm/vllm/executor/gpu_executor.py", line 39, in _init_executor
self.driver_worker.init_device()
File "/home/yajuvendra/vllm/vllm/worker/worker.py", line 137, in init_device
_check_if_gpu_supports_dtype(self.model_config.dtype)
File "/home/yajuvendra/vllm/vllm/worker/worker.py", line 471, in _check_if_gpu_supports_dtype
gpu_name = current_platform.get_device_name()
File "/home/yajuvendra/vllm/vllm/platforms/interface.py", line 103, in get_device_name
raise NotImplementedError
NotImplementedError
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /home/yajuvendra/vllm/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/home/yajuvendra/vllm/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/site-packages/zmq/_future.py", line 400, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
File "/home/yajuvendra/anaconda3/envs/myenv/bin/vllm", line 8, in
sys.exit(main())
File "/home/yajuvendra/vllm/vllm/scripts.py", line 195, in main
args.dispatch_function(args)
File "/home/yajuvendra/vllm/vllm/scripts.py", line 41, in serve
uvloop.run(run_server(args))
File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/site-packages/uvloop/init.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/site-packages/uvloop/init.py", line 61, in wrapper
return await main
File "/home/yajuvendra/vllm/vllm/entrypoints/openai/api_server.py", line 609, in run_server
async with build_async_engine_client(args) as engine_client:
File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/home/yajuvendra/vllm/vllm/entrypoints/openai/api_server.py", line 113, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/home/yajuvendra/anaconda3/envs/myenv/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/home/yajuvendra/vllm/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

br..
Yaju

@youkaichao
Copy link
Member

@yajuvendrarawat I think you need #9735, which just landed.
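
If it helps to confirm what platform vLLM resolved to (the NotImplementedError above is raised by the generic fallback in vllm/platforms/interface.py), a quick check is sketched below; it only assumes current_platform is importable the way the traceback shows:

$ python -c "from vllm.platforms import current_platform; print(type(current_platform))"

If that prints a generic/unspecified platform class rather than a CUDA one, the GPU platform was not detected, which would line up with the traceback above.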

@alexchenyu
Copy link

If anyone with an MGX or GH200 wants to test the latest version with different models, let me know if any supported models fail.

Uploaded the latest (v0.6.4.post1) to Docker Hub: https://hub.docker.com/r/drikster80/vllm-gh200-openai/tags
docker pull drikster80/vllm-gh200-openai:latest

The new version (PR #10499) is unique in that it no longer uses the NVIDIA PyTorch container and matches the standard vLLM container more closely. It uses the nightly build of PyTorch (which now supports aarch64) and compiles the modules that don't publish an aarch64 wheel (e.g. bitsandbytes, flashinfer, triton, mamba, causal-conv1d, etc.). So it should be considered experimental, but I've run some tests on a couple of models and haven't had any issues.

Hi @drikster80, I tried this command:

sudo docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=xxx" \
    --env NCCL_TIMEOUT=600 \
    -p 8000:8000 \
    --ipc=host \
    --name vllm \
    drikster80/vllm-gh200-openai:latest \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --max-num-seqs 1 \
    --tensor-parallel-size 1  \
    --max-model-len 65536 \
    --api-key  eyJhIjoiYmI5ZW \
    --trust-remote-code \
    --gpu-memory-utilization 0.85

But it reports an error like this:

INFO 11-26 10:36:50 model_runner.py:1072] Starting to load model meta-llama/Meta-Llama-3.1-70B-Instruct...
ERROR 11-26 10:36:58 engine.py:366] CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 94.50 GiB of which 767.00 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 93.12 GiB is allocated by PyTorch, and 224.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR 11-26 10:36:58 engine.py:366] Traceback (most recent call last):
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-26 10:36:58 engine.py:366]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 11-26 10:36:58 engine.py:366]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 11-26 10:36:58 engine.py:366]     return cls(ipc_path=ipc_path,
ERROR 11-26 10:36:58 engine.py:366]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
ERROR 11-26 10:36:58 engine.py:366]     self.engine = LLMEngine(*args, **kwargs)
ERROR 11-26 10:36:58 engine.py:366]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 345, in __init__
ERROR 11-26 10:36:58 engine.py:366]     self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 11-26 10:36:58 engine.py:366]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 36, in __init__
ERROR 11-26 10:36:58 engine.py:366]     self._init_executor()
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
ERROR 11-26 10:36:58 engine.py:366]     self.driver_worker.load_model()
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 152, in load_model
ERROR 11-26 10:36:58 engine.py:366]     self.model_runner.load_model()
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1074, in load_model
ERROR 11-26 10:36:58 engine.py:366]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 11-26 10:36:58 engine.py:366]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
ERROR 11-26 10:36:58 engine.py:366]     return loader.load_model(vllm_config=vllm_config)
ERROR 11-26 10:36:58 engine.py:366]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 337, in load_model
ERROR 11-26 10:36:58 engine.py:366]     model = _initialize_model(vllm_config=vllm_config)
ERROR 11-26 10:36:58 engine.py:366]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
ERROR 11-26 10:36:58 engine.py:366]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 11-26 10:36:58 engine.py:366]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 507, in __init__
ERROR 11-26 10:36:58 engine.py:366]     self.model = LlamaModel(vllm_config=vllm_config,
ERROR 11-26 10:36:58 engine.py:366]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 124, in __init__
ERROR 11-26 10:36:58 engine.py:366]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 298, in __init__
ERROR 11-26 10:36:58 engine.py:366]     self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 11-26 10:36:58 engine.py:366]                                                     ^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 511, in make_layers
ERROR 11-26 10:36:58 engine.py:366]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 11-26 10:36:58 engine.py:366]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 300, in <lambda>
ERROR 11-26 10:36:58 engine.py:366]     lambda prefix: LlamaDecoderLayer(config=config,
ERROR 11-26 10:36:58 engine.py:366]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 231, in __init__
ERROR 11-26 10:36:58 engine.py:366]     self.mlp = LlamaMLP(
ERROR 11-26 10:36:58 engine.py:366]                ^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 73, in __init__
ERROR 11-26 10:36:58 engine.py:366]     self.gate_up_proj = MergedColumnParallelLinear(
ERROR 11-26 10:36:58 engine.py:366]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 424, in __init__
ERROR 11-26 10:36:58 engine.py:366]     super().__init__(input_size=input_size,
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 304, in __init__
ERROR 11-26 10:36:58 engine.py:366]     self.quant_method.create_weights(
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 122, in create_weights
ERROR 11-26 10:36:58 engine.py:366]     weight = Parameter(torch.empty(sum(output_partition_sizes),
ERROR 11-26 10:36:58 engine.py:366]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 106, in __torch_function__
ERROR 11-26 10:36:58 engine.py:366]     return func(*args, **kwargs)
ERROR 11-26 10:36:58 engine.py:366]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 11-26 10:36:58 engine.py:366] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 94.50 GiB of which 767.00 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 93.12 GiB is allocated by PyTorch, and 224.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Process SpawnProcess-1:
Traceback (most recent call last):
 File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
   self.run()
 File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
   self._target(*self._args, **self._kwargs)
 File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
   raise e
 File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
   engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
   return cls(ipc_path=ipc_path,
          ^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
   self.engine = LLMEngine(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 345, in __init__
   self.model_executor = executor_class(vllm_config=vllm_config, )
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 36, in __init__
   self._init_executor()
 File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
   self.driver_worker.load_model()
 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 152, in load_model
   self.model_runner.load_model()
 File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1074, in load_model
   self.model = get_model(vllm_config=self.vllm_config)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
   return loader.load_model(vllm_config=vllm_config)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 337, in load_model
   model = _initialize_model(vllm_config=vllm_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
   return model_class(vllm_config=vllm_config, prefix=prefix)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 507, in __init__
   self.model = LlamaModel(vllm_config=vllm_config,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 124, in __init__
   old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 298, in __init__
   self.start_layer, self.end_layer, self.layers = make_layers(
                                                   ^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 511, in make_layers
   maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 300, in <lambda>
   lambda prefix: LlamaDecoderLayer(config=config,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 231, in __init__
   self.mlp = LlamaMLP(
              ^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 73, in __init__
   self.gate_up_proj = MergedColumnParallelLinear(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 424, in __init__
   super().__init__(input_size=input_size,
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 304, in __init__
   self.quant_method.create_weights(
 File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 122, in create_weights
   weight = Parameter(torch.empty(sum(output_partition_sizes),
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 106, in __torch_function__
   return func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 94.50 GiB of which 767.00 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 93.12 GiB is allocated by PyTorch, and 224.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W1126 10:36:58.745447229 ProcessGroupNCCL.cpp:1432] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
 File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
   while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/zmq/_future.py", line 400, in poll
   raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
 File "<frozen runpy>", line 198, in _run_module_as_main
 File "<frozen runpy>", line 88, in _run_code
 File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 650, in <module>
   uvloop.run(run_server(args))
 File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
   return __asyncio.run(
          ^^^^^^^^^^^^^^
 File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
   return runner.run(main)
          ^^^^^^^^^^^^^^^^
 File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
   return self._loop.run_until_complete(task)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
 File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
   return await main
          ^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 616, in run_server
   async with build_async_engine_client(args) as engine_client:
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
   return await anext(self.gen)
          ^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 114, in build_async_engine_client
   async with build_async_engine_client_from_engine_args(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
   return await anext(self.gen)
          ^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 211, in build_async_engine_client_from_engine_args
   raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

Could you please check why?
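
In case it helps narrow this down: the error text itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, and roughly 140 GB of bf16 weights for a 70B model will not fit in the 94.50 GiB the error reports, so part of the weights would have to be offloaded. A hedged variation of the command (illustrative values only, not a verified configuration for this image) might look like:

sudo docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=xxx" \
    --env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    -p 8000:8000 \
    --ipc=host \
    drikster80/vllm-gh200-openai:latest \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --max-num-seqs 1 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85 \
    --cpu-offload-gb 64 \
    --trust-remote-code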

@yajuvendrarawat
Copy link

@yajuvendrarawat I think you need #9735, which just landed.

Thanks a lot @youkaichao, it moved ahead but I have hit another error; any other pointers? PyTorch is the CUDA build:
torch 2.6.0.dev20241125+cu124

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
CUDA Driver Version / Runtime Version 12.2 / 12.6
CUDA Capability Major/Minor version number: 8.7

(myenv1) [yajuvendra@llmhost ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Oct_30_00:08:18_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
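
(Whether the installed torch wheel actually ships kernels for this GPU can be checked directly with stock PyTorch APIs; a small sketch:)

$ python -c "import torch; print(torch.cuda.get_device_capability(0), torch.cuda.get_arch_list())"

If sm_87 (or a compatible PTX entry) does not show up in that list, it would line up with the "no kernel image is available" error below.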

(myenv1) [yajuvendra@llmhost vllm]$ vllm serve --device cuda ibm-granite/granite-3.0-2b-instruct
INFO 11-27 15:31:08 api_server.py:625] vLLM API server version 0.6.4.post2.dev157+g2f0a0a17.d20241127
INFO 11-27 15:31:08 api_server.py:626] args: Namespace(subparser='serve', model_tag='ibm-granite/granite-3.0-2b-instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='ibm-granite/granite-3.0-2b-instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='cuda', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0xfffed2e43b50>)
INFO 11-27 15:31:08 init.py:42] No plugins found.
INFO 11-27 15:31:08 api_server.py:178] Multiprocessing frontend to use ipc:///tmp/5e516545-b2fa-46a8-8b66-7f6bf9d6587f for IPC Path.
INFO 11-27 15:31:08 api_server.py:197] Started engine process with PID 23247
INFO 11-27 15:31:18 init.py:42] No plugins found.
WARNING 11-27 15:31:28 arg_utils.py:1119] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-27 15:31:28 config.py:820] Possibly too large swap space. 4.00 GiB out of the 7.12 GiB total CPU memory is allocated for the swap space.
WARNING 11-27 15:31:36 arg_utils.py:1119] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-27 15:31:36 config.py:820] Possibly too large swap space. 4.00 GiB out of the 7.12 GiB total CPU memory is allocated for the swap space.
INFO 11-27 15:31:36 llm_engine.py:248] Initializing an LLM engine (v0.6.4.post2.dev157+g2f0a0a17.d20241127) with config: model='ibm-granite/granite-3.0-2b-instruct', speculative_config=None, tokenizer='ibm-granite/granite-3.0-2b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=ibm-granite/granite-3.0-2b-instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None, pooler_config=None,compilation_config=CompilationConfig(level=0, backend='', custom_ops=[], splitting_ops=['vllm.unified_attention', 'vllm.unified_v1_flash_attention'], use_inductor=True, inductor_specialize_for_cudagraph_no_more_than=None, inductor_compile_sizes={}, inductor_compile_config={}, inductor_passes={}, use_cudagraph=False, cudagraph_num_of_warmups=0, cudagraph_capture_sizes=None, cudagraph_copy_inputs=False, pass_config=PassConfig(dump_graph_stages=[], dump_graph_dir=PosixPath('.'), enable_fusion=True, enable_reshape=True), compile_sizes=<function PrivateAttr at 0xfffefb26f250>, capture_sizes=<function PrivateAttr at 0xfffefb26f250>, enabled_custom_ops=Counter(), disabled_custom_ops=Counter(), static_forward_context={})
INFO 11-27 15:31:40 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 11-27 15:31:41 selector.py:120] Using Flash Attention backend.
INFO 11-27 15:31:42 model_runner.py:1100] Starting to load model ibm-granite/granite-3.0-2b-instruct...
ERROR 11-27 15:31:42 engine.py:366] CUDA error: no kernel image is available for execution on the device
ERROR 11-27 15:31:42 engine.py:366] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 11-27 15:31:42 engine.py:366] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 11-27 15:31:42 engine.py:366] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ERROR 11-27 15:31:42 engine.py:366] Traceback (most recent call last):
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-27 15:31:42 engine.py:366] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 11-27 15:31:42 engine.py:366] return cls(ipc_path=ipc_path,
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/engine/multiprocessing/engine.py", line 71, in init
ERROR 11-27 15:31:42 engine.py:366] self.engine = LLMEngine(*args, **kwargs)
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/engine/llm_engine.py", line 335, in init
ERROR 11-27 15:31:42 engine.py:366] self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/executor/executor_base.py", line 36, in init
ERROR 11-27 15:31:42 engine.py:366] self._init_executor()
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/executor/gpu_executor.py", line 35, in _init_executor
ERROR 11-27 15:31:42 engine.py:366] self.driver_worker.load_model()
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/worker/worker.py", line 153, in load_model
ERROR 11-27 15:31:42 engine.py:366] self.model_runner.load_model()
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/worker/model_runner.py", line 1102, in load_model
ERROR 11-27 15:31:42 engine.py:366] self.model = get_model(vllm_config=self.vllm_config)
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/model_loader/init.py", line 12, in get_model
ERROR 11-27 15:31:42 engine.py:366] return loader.load_model(vllm_config=vllm_config)
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/model_loader/loader.py", line 339, in load_model
ERROR 11-27 15:31:42 engine.py:366] model = _initialize_model(vllm_config=vllm_config)
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/model_loader/loader.py", line 106, in _initialize_model
ERROR 11-27 15:31:42 engine.py:366] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/granite.py", line 383, in init
ERROR 11-27 15:31:42 engine.py:366] self.model = GraniteModel(vllm_config=vllm_config,
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/compilation/decorators.py", line 125, in init
ERROR 11-27 15:31:42 engine.py:366] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/granite.py", line 286, in init
ERROR 11-27 15:31:42 engine.py:366] self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/utils.py", line 507, in make_layers
ERROR 11-27 15:31:42 engine.py:366] [PPMissingLayer() for _ in range(start_layer)] + [
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/utils.py", line 508, in
ERROR 11-27 15:31:42 engine.py:366] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/granite.py", line 288, in
ERROR 11-27 15:31:42 engine.py:366] lambda prefix: GraniteDecoderLayer(config=config,
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/granite.py", line 206, in init
ERROR 11-27 15:31:42 engine.py:366] self.self_attn = GraniteAttention(
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/granite.py", line 152, in init
ERROR 11-27 15:31:42 engine.py:366] self.rotary_emb = get_rope(
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/layers/rotary_embedding.py", line 976, in get_rope
ERROR 11-27 15:31:42 engine.py:366] rotary_emb = RotaryEmbedding(head_size, rotary_dim, max_position, base,
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/layers/rotary_embedding.py", line 95, in init
ERROR 11-27 15:31:42 engine.py:366] cache = self._compute_cos_sin_cache()
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/layers/rotary_embedding.py", line 112, in _compute_cos_sin_cache
ERROR 11-27 15:31:42 engine.py:366] inv_freq = self._compute_inv_freq(self.base)
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/layers/rotary_embedding.py", line 106, in _compute_inv_freq
ERROR 11-27 15:31:42 engine.py:366] inv_freq = 1.0 / (base**(torch.arange(
ERROR 11-27 15:31:42 engine.py:366] File "/home/yajuvendra/anaconda3/envs/myenv1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in torch_function
ERROR 11-27 15:31:42 engine.py:366] return func(*args, **kwargs)
ERROR 11-27 15:31:42 engine.py:366] RuntimeError: CUDA error: no kernel image is available for execution on the device
ERROR 11-27 15:31:42 engine.py:366] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 11-27 15:31:42 engine.py:366] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 11-27 15:31:42 engine.py:366] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ERROR 11-27 15:31:42 engine.py:366]
Process SpawnProcess-1:
Traceback (most recent call last):
File "/home/yajuvendra/anaconda3/envs/myenv1/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/yajuvendra/anaconda3/envs/myenv1/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
raise e
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
return cls(ipc_path=ipc_path,
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/engine/multiprocessing/engine.py", line 71, in init
self.engine = LLMEngine(*args, **kwargs)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/engine/llm_engine.py", line 335, in init
self.model_executor = executor_class(vllm_config=vllm_config, )
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/executor/executor_base.py", line 36, in init
self._init_executor()
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/executor/gpu_executor.py", line 35, in _init_executor
self.driver_worker.load_model()
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/worker/worker.py", line 153, in load_model
self.model_runner.load_model()
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/worker/model_runner.py", line 1102, in load_model
self.model = get_model(vllm_config=self.vllm_config)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/model_loader/init.py", line 12, in get_model
return loader.load_model(vllm_config=vllm_config)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/model_loader/loader.py", line 339, in load_model
model = _initialize_model(vllm_config=vllm_config)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/model_loader/loader.py", line 106, in _initialize_model
return model_class(vllm_config=vllm_config, prefix=prefix)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/granite.py", line 383, in init
self.model = GraniteModel(vllm_config=vllm_config,
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/compilation/decorators.py", line 125, in init
old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/granite.py", line 286, in init
self.start_layer, self.end_layer, self.layers = make_layers(
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/utils.py", line 507, in make_layers
[PPMissingLayer() for _ in range(start_layer)] + [
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/utils.py", line 508, in
maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/granite.py", line 288, in
lambda prefix: GraniteDecoderLayer(config=config,
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/granite.py", line 206, in init
self.self_attn = GraniteAttention(
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/models/granite.py", line 152, in init
self.rotary_emb = get_rope(
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/layers/rotary_embedding.py", line 976, in get_rope
rotary_emb = RotaryEmbedding(head_size, rotary_dim, max_position, base,
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/layers/rotary_embedding.py", line 95, in init
cache = self._compute_cos_sin_cache()
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/layers/rotary_embedding.py", line 112, in _compute_cos_sin_cache
inv_freq = self._compute_inv_freq(self.base)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/model_executor/layers/rotary_embedding.py", line 106, in _compute_inv_freq
inv_freq = 1.0 / (base**(torch.arange(
File "/home/yajuvendra/anaconda3/envs/myenv1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in torch_function
return func(*args, **kwargs)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank0]:[W1127 15:31:43.495876646 ProcessGroupNCCL.cpp:1427] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /home/yajuvendra/vllm/vllm1/vllm/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
File "/home/yajuvendra/anaconda3/envs/myenv1/lib/python3.10/site-packages/zmq/_future.py", line 400, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
File "/home/yajuvendra/anaconda3/envs/myenv1/bin/vllm", line 8, in
sys.exit(main())
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/scripts.py", line 201, in main
args.dispatch_function(args)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/scripts.py", line 42, in serve
uvloop.run(run_server(args))
File "/home/yajuvendra/anaconda3/envs/myenv1/lib/python3.10/site-packages/uvloop/init.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/yajuvendra/anaconda3/envs/myenv1/lib/python3.10/site-packages/uvloop/init.py", line 61, in wrapper
return await main
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/entrypoints/openai/api_server.py", line 649, in run_server
async with build_async_engine_client(args) as engine_client:
File "/home/yajuvendra/anaconda3/envs/myenv1/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/entrypoints/openai/api_server.py", line 116, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/home/yajuvendra/anaconda3/envs/myenv1/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/home/yajuvendra/vllm/vllm1/vllm/vllm/entrypoints/openai/api_server.py", line 213, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
(myenv1) [yajuvendra@llmhost vllm]$

@youkaichao
Copy link
Member

@yajuvendrarawat how did you install vLLM? It seems you don't have the kernels compiled for your CUDA architecture.

@yajuvendrarawat
Copy link

yajuvendrarawat commented Nov 28, 2024

@youkaichao

My environment is RHEL 9 on a Jetson Orin Nano. I created a conda environment the way it's explained in the vLLM documentation and then ran the commands below (an arch-pinned variant is sketched right after them).

$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py
$ pip install -r requirements-build.txt
$ pip install -vvv -e . --no-build-isolation
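
(A variant of the last steps that explicitly pins the Orin's compute capability, 8.7 per the deviceQuery output above, is sketched here; this is an untested guess rather than a verified recipe:)

$ export TORCH_CUDA_ARCH_LIST="8.7"
$ pip install -r requirements-build.txt
$ pip install -vvv -e . --no-build-isolation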

@youkaichao
Copy link
Member

I don't have a Jetson machine to try it on; you can contact @conroy-cheers, the author of #9735.

@yajuvendrarawat
Copy link

I don't have a Jetson machine to try it on; you can contact @conroy-cheers, the author of #9735.

Hello @youkaichao ,

The issue was that my PyTorch did not support the sm_87 arch, which I have now fixed, but I am facing another issue on my Jetson Orin Nano.

The error is:
AttributeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241216-082749.pkl): '_OpNamespace' 'vllm' object has no attribute 'unified_attention_with_output'

attached is the error file

error.txt

Can you please help me see why I am getting this error?

br..
Yaju

@youkaichao
Copy link
Member

@yajuvendrarawat It means something is wrong with this line:

direct_register_custom_op(

You can try to debug here.
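
A quick way to see whether that registration actually ran is to import the module and check for the op on torch.ops.vllm; a sketch (the module path vllm.attention.layer is an assumption about where the unified attention op is registered in this version):

$ python -c "import torch, vllm.attention.layer; print(hasattr(torch.ops.vllm, 'unified_attention_with_output'))"

If that prints False, the op never got registered at import time, which points at direct_register_custom_op failing rather than at model execution.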
