
vllm multi gpu #24

Open
matbee-eth opened this issue Oct 11, 2024 · 14 comments

@matbee-eth

Any help on getting multi gpu support running? vLLM fails to load with tensor_parallel_size=2
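For reference, here is a minimal sketch of the kind of multi-GPU launch being attempted, assuming the out-of-tree registration lines from the repo's inference docs (they also appear further down in this thread); the exact LLM arguments are illustrative, not a confirmed working configuration:

from vllm import LLM, ModelRegistry
from vllm.model_executor.models import _MULTIMODAL_MODELS

from aria.vllm.aria import AriaForConditionalGeneration

# Register Aria out-of-tree, as documented in the repo's inference.md.
ModelRegistry.register_model(
    "AriaForConditionalGeneration", AriaForConditionalGeneration
)
_MULTIMODAL_MODELS["AriaForConditionalGeneration"] = (
    "aria",
    "AriaForConditionalGeneration",
)

llm = LLM(
    model="rhymes-ai/Aria",
    tokenizer="rhymes-ai/Aria",
    tokenizer_mode="slow",
    dtype="bfloat16",
    trust_remote_code=True,
    tensor_parallel_size=2,  # shard the model across 2 GPUs
)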

@xffxff
Collaborator

xffxff commented Oct 11, 2024

@matbee-eth Can you include more details about the error to help us debug it?

@davanstrien

Possibly a different issue from @matbee-eth's, but I run into OOM errors using vLLM on a machine with 4x L4 GPUs (96GB VRAM).

This code:

import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

import torch
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, ModelRegistry, SamplingParams
from vllm.model_executor.models import _MULTIMODAL_MODELS

from aria.vllm.aria import AriaForConditionalGeneration

torch.cuda.empty_cache()

ModelRegistry.register_model(
    "AriaForConditionalGeneration", AriaForConditionalGeneration
)
_MULTIMODAL_MODELS["AriaForConditionalGeneration"] = (
    "aria",
    "AriaForConditionalGeneration",
)


def main():
    llm = LLM(
        model="rhymes-ai/Aria",
        tokenizer="rhymes-ai/Aria",
        tokenizer_mode="slow",
        dtype="bfloat16",
        limit_mm_per_prompt={"image": 256},
        enforce_eager=True,
        trust_remote_code=True,
        gpu_memory_utilization=0.6,
        tensor_parallel_size=4,
        max_seq_len_to_capture=1024,
    )

    tokenizer = AutoTokenizer.from_pretrained(
        "rhymes-ai/Aria", trust_remote_code=True, use_fast=False
    )

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Compare Image 1 and image 2, tell me about the differences between image 1 and image 2.\nImage 1\n",
                },
                {"type": "image"},
                {"type": "text", "text": "\nImage 2\n"},
                {"type": "image"},
            ],
        }
    ]

    message = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    outputs = llm.generate(
        {
            "prompt_token_ids": message,
            "multi_modal_data": {
                "image": [
                    Image.open("Screenshot 2024-10-07 at 17.01.14.png"),
                ],
                "max_image_size": 200,  # [Optional] The max image patch size, default `980`
                "split_image": False,  # [Optional] whether to split the images, default `False`
            },
        },
        sampling_params=SamplingParams(max_tokens=200, top_k=1, stop=["<|im_end|>"]),
    )

    for o in outputs:
        generated_tokens = o.outputs[0].token_ids
        print(tokenizer.decode(generated_tokens))


if __name__ == "__main__":
    main()

Results in:


INFO 10-14 12:32:42 model_runner.py:1025] Loading model weights took 12.4393 GB
(VllmWorkerProcess pid=2421) INFO 10-14 12:32:45 model_runner.py:1025] Loading model weights took 12.4393 GB
(VllmWorkerProcess pid=2420) INFO 10-14 12:32:45 model_runner.py:1025] Loading model weights took 12.4393 GB
(VllmWorkerProcess pid=2419) INFO 10-14 12:32:45 model_runner.py:1025] Loading model weights took 12.4393 GB
(VllmWorkerProcess pid=2420) INFO 10-14 12:32:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl...
(VllmWorkerProcess pid=2419) INFO 10-14 12:32:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl...
(VllmWorkerProcess pid=2421) INFO 10-14 12:32:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl...
INFO 10-14 12:32:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl...
(VllmWorkerProcess pid=2420) INFO 10-14 12:32:47 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl.
(VllmWorkerProcess pid=2419) INFO 10-14 12:32:47 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl.
(VllmWorkerProcess pid=2421) INFO 10-14 12:32:47 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl.
INFO 10-14 12:32:47 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl.
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: Error in model execution (input dumped to /tmp/err_execute_model_input_20241014-123247.pkl): CUDA out of memory. Tried to allocate 1.88 GiB. GPU 3 has a total capacity of 21.96 GiB of which 1.29 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.02 GiB is allocated by PyTorch, and 1.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), Traceback (most recent call last):
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1590, in execute_model
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/data/Aria/aria/vllm/aria.py", line 1155, in forward
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     hidden_states = self.language_model(
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]                     ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 329, in forward
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     hidden_states, residual = layer(
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]                               ^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 261, in forward
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]                     ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/data/Aria/aria/vllm/aria.py", line 627, in forward
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     output = self.token_dispatcher.token_unpermutation(expert_output, scores)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/data/Aria/aria/vllm/aria.py", line 404, in token_unpermutation
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     unpermuted_tokens = unpermuted_tokens * scores.unsqueeze(-1)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]                         ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.88 GiB. GPU 3 has a total capacity of 21.96 GiB of which 1.29 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.02 GiB is allocated by PyTorch, and 1.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] 
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] 
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] Traceback (most recent call last):
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     self.model_runner.profile_run()
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1236, in profile_run
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]     raise type(err)(
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] torch.OutOfMemoryError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241014-123247.pkl): CUDA out of memory. Tried to allocate 1.88 GiB. GPU 3 has a total capacity of 21.96 GiB of which 1.29 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.02 GiB is allocated by PyTorch, and 1.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] 
(VllmWorkerProcess pid=2420) and (VllmWorkerProcess pid=2419) emit the same OutOfMemoryError traceback, reporting GPU 2 and GPU 1 respectively; the repeated log output is omitted here.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1590, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:                                     ^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/Aria/aria/vllm/aria.py", line 1155, in forward
[rank0]:     hidden_states = self.language_model(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 329, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:                               ^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 261, in forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/Aria/aria/vllm/aria.py", line 627, in forward
[rank0]:     output = self.token_dispatcher.token_unpermutation(expert_output, scores)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/Aria/aria/vllm/aria.py", line 404, in token_unpermutation
[rank0]:     unpermuted_tokens = unpermuted_tokens * scores.unsqueeze(-1)
[rank0]:                         ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.88 GiB. GPU 0 has a total capacity of 21.96 GiB of which 1.29 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.02 GiB is allocated by PyTorch, and 1.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/llm.py", line 77, in <module>
[rank0]:     main()
[rank0]:   File "/data/llm.py", line 21, in main
[rank0]:     llm = LLM(
[rank0]:           ^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 214, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 564, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 339, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 474, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1236, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
[rank0]:     raise type(err)(
[rank0]: torch.OutOfMemoryError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241014-123247.pkl): CUDA out of memory. Tried to allocate 1.88 GiB. GPU 0 has a total capacity of 21.96 GiB of which 1.29 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.02 GiB is allocated by PyTorch, and 1.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO 10-14 12:32:48 multiproc_worker_utils.py:124] Killing local vLLM worker processes
Exception in thread Thread-1
/home/user/miniconda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I have tried reducing the context length, memory utilization, etc., with no success. I have also tried 4x L40S (192GB VRAM) and got OOM errors there as well. The transformers version runs without OOM.
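A minimal sketch of the fragmentation mitigation suggested by the OOM message itself (untested for this issue); the environment variables need to be set before the vLLM workers are spawned:

import os

# Suggested by the CUDA OOM message to reduce allocator fragmentation;
# set before vLLM is imported so the spawned worker processes inherit it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM  # noqa: E402  (imported after the env vars are set)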

@matbee-eth
Author

matbee-eth commented Oct 14, 2024 via email

@aria-hacker
Collaborator

@davanstrien
Thank you for your issue report. Could you please provide some additional details to help us better understand the situation?

  1. What version of vLLM are you currently using?
  2. Have you checked if there are any other processes that might be consuming VRAM?
  3. At what stage does the Out-of-Memory (OOM) error occur? For example, does the OOM error happen after the model is successfully loaded, or during the inference process?

Your detailed response will help us investigate and resolve the issue more efficiently. Thank you for your cooperation!

I'm not sure whether it will work in your environment, but here are some common settings we use to reduce memory usage; you can set them when loading the model with LLM(...) (a sketch applying them follows this list).

  1. Increase gpu_memory_utilization.
  2. Reduce max_model_len to a lower value such as 4096.
  3. Decrease max_num_seqs to 1.
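A minimal sketch applying those settings together (the values are illustrative, not a guaranteed fix):

llm = LLM(
    model="rhymes-ai/Aria",
    tokenizer="rhymes-ai/Aria",
    tokenizer_mode="slow",
    dtype="bfloat16",
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,  # 1. give vLLM a larger share of each GPU
    max_model_len=4096,          # 2. cap the context length
    max_num_seqs=1,              # 3. profile and schedule one sequence at a time
)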

@nithingovindugari

I also got a different error, shown below. It works fine when I use a single GPU, but it fails when I try to increase tensor_parallel_size. I am using vLLM 0.6.2 and 2x A100 GPUs with 80 GB of memory each.

ValueError: Model architectures ['AriaForConditionalGeneration'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'Qwen2VLForConditionalGeneration', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'ArcticForCausalLM', 'XverseForCausalLM', 'Phi3SmallForCausalLM', 'MedusaModel', 'EAGLEModel', 'MLPSpeculatorPreTrainedModel', 'JambaForCausalLM', 'GraniteForCausalLM', 'MistralModel', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'UltravoxModel', 'MllamaForConditionalGeneration', 'BartModel', 'BartForConditionalGeneration']

@aria-hacker
Collaborator

@nithingovindugari Because Aria is not yet an officially supported model in vLLM, we currently run it via out-of-tree (OOT) model registration; that is why we put these lines at the top of inference.md. The error message you posted indicates that the model architecture was never registered.
Please make sure you add the following lines at the top of your file to register the model architecture with vLLM.

from vllm import LLM, ModelRegistry, SamplingParams
from vllm.model_executor.models import _MULTIMODAL_MODELS

from aria.vllm.aria import AriaForConditionalGeneration

ModelRegistry.register_model(
    "AriaForConditionalGeneration", AriaForConditionalGeneration
)
_MULTIMODAL_MODELS["AriaForConditionalGeneration"] = (
    "aria",
    "AriaForConditionalGeneration",
)
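As a sanity check (assuming your vLLM version exposes ModelRegistry.get_supported_archs()), you can verify the registration took effect before constructing the LLM:

# Must run in the same Python process, after the registration lines above
# and before LLM(...) is constructed.
assert "AriaForConditionalGeneration" in ModelRegistry.get_supported_archs(), (
    "Aria is not registered; make sure the registration code runs at the top "
    "of the same file that creates the LLM."
)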

@davanstrien

@aria-hacker I got this working on 4x L40S (192GB VRAM). I couldn't get it working on the 4x L4 (96GB VRAM) even after adjusting the GPU utilisation, sequence length, etc. I assume the vLLM implementation just needs more memory for now.

At what stage does the Out-of-Memory (OOM) error occur? For example, does the OOM error happen after the model is successfully loaded, or during the inference process?
The model seems to load successfully, but it hits OOM when passing even one prompt.

Package versions I'm using:

accelerate==0.34.1
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
anaconda-anon-usage @ file:///croot/anaconda-anon-usage_1710965072196/work
annotated-types==0.7.0
anyio==4.6.2.post1
archspec @ file:///croot/archspec_1709217642129/work
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
aria @ file:///data/Aria
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
attrs==24.2.0
babel==2.16.0
beautifulsoup4==4.12.3
bleach==6.1.0
boltons @ file:///work/ci_py311/boltons_1677685195580/work
brotli @ file:///croot/brotli-split_1714483155106/work
certifi @ file:///croot/certifi_1720453481653/work/certifi
cffi @ file:///croot/cffi_1714483155441/work
charset-normalizer @ file:///croot/charset-normalizer_1721748349566/work
click==8.1.7
cloudpickle==3.1.0
comm==0.2.2
conda @ file:///croot/conda_1722004606466/work
conda-content-trust @ file:///croot/conda-content-trust_1714483159009/work
conda-libmamba-solver @ file:///croot/conda-libmamba-solver_1721662679737/work/src
conda-package-handling @ file:///croot/conda-package-handling_1718138267740/work
conda-package-streaming @ file:///croot/conda-package-streaming_1718136078615/work
contourpy==1.3.0
cryptography @ file:///croot/cryptography_1714660666131/work
cycler==0.12.1
datasets==2.14.4
debugpy==1.8.7
decorator==5.1.1
decord==0.6.0
deepspeed==0.15.0
defusedxml==0.7.1
dill==0.3.7
diskcache==5.6.3
distro @ file:///croot/distro_1714488253808/work
docker-pycreds==0.4.0
docstring-parser==0.16
einops==0.8.0
executing==2.1.0
fastapi==0.115.2
fastjsonschema==2.20.0
filelock==3.16.1
flash-attention==1.0.0
flash-attn==2.6.3
fonttools==4.54.1
fqdn==1.5.1
frozendict @ file:///croot/frozendict_1713194832637/work
frozenlist==1.4.1
fsspec==2024.9.0
gguf==0.10.0
gitdb==4.0.11
gitpython==3.1.43
grouped-gemm==0.1.6
h11==0.14.0
hjson==3.1.0
httpcore==1.0.6
httptools==0.6.2
httpx==0.27.2
huggingface-hub==0.25.2
idna @ file:///croot/idna_1714398848350/work
importlib-metadata==8.5.0
interegular==0.3.3
ipykernel==6.29.5
ipython==8.28.0
ipywidgets==8.1.5
isoduration==20.11.0
jedi==0.19.1
jinja2==3.1.4
jiter==0.6.1
json5==0.9.25
jsonpatch @ file:///croot/jsonpatch_1714483231291/work
jsonpointer==2.1
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter-client==8.6.3
jupyter-core==5.7.2
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter-server==2.14.2
jupyter-server-terminals==0.5.3
jupyterlab==4.2.5
jupyterlab-pygments==0.3.0
jupyterlab-server==2.27.3
jupyterlab-widgets==3.0.13
kiwisolver==1.4.7
lark==1.2.2
libmambapy @ file:///croot/mamba-split_1714483352891/work/libmambapy
llvmlite==0.43.0
lm-format-enforcer==0.10.6
markdown-it-py==3.0.0
markupsafe==3.0.1
matplotlib==3.9.2
matplotlib-inline==0.1.7
mdurl==0.1.2
menuinst @ file:///croot/menuinst_1723567589013/work
mistral-common==1.4.4
mistune==3.0.2
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.18.6
multidict==6.1.0
multiprocess==0.70.15
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.4.1
ninja==1.11.1.1
notebook-shim==0.2.4
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.77
nvidia-nvtx-cu12==12.1.105
openai==1.51.2
outlines==0.0.46
overrides==7.7.0
packaging @ file:///croot/packaging_1720101850331/work
pandas==2.2.2
pandocfilters==1.5.1
parso==0.8.4
partial-json-parser==0.2.1.1.post4
peft==0.12.0
pexpect==4.9.0
pillow==10.4.0
pip @ file:///croot/pip_1723484598856/work
platformdirs @ file:///croot/platformdirs_1692205439124/work
pluggy @ file:///work/ci_py311/pluggy_1676822818071/work
prometheus-client==0.21.0
prometheus-fastapi-instrumentator==7.0.0
prompt-toolkit==3.0.48
propcache==0.2.0
protobuf==5.28.2
psutil==6.0.0
ptyprocess==0.7.0
pure-eval==0.2.3
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==17.0.0
pycosat @ file:///croot/pycosat_1714510623388/work
pycountry==24.6.1
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==2.9.2
pydantic-core==2.23.4
pygments==2.18.0
pyparsing==3.2.0
pysocks @ file:///work/ci_py311/pysocks_1676822712504/work
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
pytz==2024.2
pyyaml==6.0.2
pyzmq==26.2.0
ray==2.37.0
referencing==0.35.1
regex==2024.9.11
requests @ file:///croot/requests_1721410876868/work
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.9.2
rpds-py==0.20.0
ruamel-yaml @ file:///work/ci_py311/ruamel.yaml_1676838772170/work
safetensors==0.4.5
send2trash==1.8.3
sentencepiece==0.2.0
sentry-sdk==2.16.0
setproctitle==1.3.3
setuptools==72.1.0
shtab==1.7.1
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.6
stack-data==0.6.3
starlette==0.40.0
sympy==1.13.3
terminado==0.18.1
tiktoken==0.7.0
tinycss2==1.3.0
tokenizers==0.20.1
torch==2.4.0
torchvision==0.19.0
tornado==6.2
tqdm==4.66.5
traitlets==5.14.3
transformers==4.45.0
triton==3.0.0
trl==0.9.6
truststore @ file:///croot/truststore_1695244293384/work
types-python-dateutil==2.9.0.20241003
typing-extensions==4.12.2
tyro==0.8.12
tzdata==2024.2
uri-template==1.3.0
urllib3 @ file:///croot/urllib3_1718912636303/work
uv==0.4.22
uvicorn==0.32.0
uvloop==0.21.0
vllm==0.6.2
wandb==0.18.1
watchfiles==0.24.0
wcwidth==0.2.13
webcolors==24.8.0
webencodings==0.5.1
websocket-client==1.8.0
websockets==13.1
wheel==0.43.0
widgetsnbextension==4.0.13
xformers==0.0.27.post2
xxhash==3.5.0
yarl==1.15.3
zipp==3.20.2
zstandard @ file:///croot/zstandard_1714677652653/work

@nithingovindugari
Copy link

@aria-hacker does vLLM need to be installed in editable mode? I am trying to run it from a Python file and, despite including the model registry code, I am still getting that error. Can anyone give an example of the Aria model running with vLLM independently? Could it be that it works for you because you are executing it in a notebook, while I am running a Python file?

@aria-hacker
Copy link
Collaborator

aria-hacker commented Oct 18, 2024

@nithingovindugari It shouldn't need to be installed in editable mode for the vLLM installation. It also isn't limited to notebooks: I tested it with a Python file and it works fine in my environment. For the code, you can refer to the inference doc.
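
For a standalone script, one structure that mirrors the inference doc is to do the registration at module import time and keep the engine construction under a __main__ guard. A rough sketch, assuming the aria package from this repo is installed and exposes aria.vllm.aria:

from vllm import LLM, ModelRegistry
from vllm.model_executor.models import _MULTIMODAL_MODELS
from aria.vllm.aria import AriaForConditionalGeneration

# Register Aria at module level, before any LLM engine is built; without this,
# vLLM only knows its built-in architectures and rejects Aria checkpoints.
ModelRegistry.register_model("AriaForConditionalGeneration", AriaForConditionalGeneration)
_MULTIMODAL_MODELS["AriaForConditionalGeneration"] = ("aria", "AriaForConditionalGeneration")

def main():
    llm = LLM(model="rhymes-ai/Aria", tokenizer_mode="slow", dtype="bfloat16", trust_remote_code=True)
    # build prompts and call llm.generate(...) as shown in the inference doc

if __name__ == "__main__":
    main()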

@joel-simp
Copy link

Facing the same issue when loading the model for multi-GPU. I tried loading another model, 'facebook/opt-13b', for multi-GPU and it worked perfectly fine; however, tensor_parallel doesn't seem to work with Aria. Getting the following error:

INFO 11-08 05:06:23 config.py:1652] Downcasting torch.float32 to torch.bfloat16.
INFO 11-08 05:06:23 config.py:899] Defaulting to use mp for distributed inference
WARNING 11-08 05:06:23 arg_utils.py:940] The model has a long context length (65536). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 11-08 05:06:23 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 11-08 05:06:23 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/home/images/vlm/models/hub/models--rhymes-ai--Aria/snapshots/c347f2e1f19affd047295f93b8b2adc06231f496', speculative_config=None, tokenizer='/home/images/vlm/models/hub/models--rhymes-ai--Aria/snapshots/c347f2e1f19affd047295f93b8b2adc06231f496', skip_tokenizer_init=False, tokenizer_mode=slow, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/images/vlm/models/hub/models--rhymes-ai--Aria/snapshots/c347f2e1f19affd047295f93b8b2adc06231f496, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 11-08 05:06:23 tokenizer.py:156] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
WARNING 11-08 05:06:23 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-08 05:06:23 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=22194) INFO 11-08 05:06:23 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=22194) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=22194) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
(VllmWorkerProcess pid=22194) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=22194) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=22195) (VllmWorkerProcess pid=22194) INFO 11-08 05:06:23 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=22194) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=22195) (VllmWorkerProcess pid=22194) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=22195) (VllmWorkerProcess pid=22194) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=22195) (VllmWorkerProcess pid=22194) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] output = executor(*args, **kwargs)
ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] raise RuntimeError(
(VllmWorkerProcess pid=22195) (VllmWorkerProcess pid=22194) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/vllm/worker/worker.py", line 166, in init_device
ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=22195) (VllmWorkerProcess pid=22194) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] torch.cuda.set_device(self.device)
ERROR 11-08 05:06:23 multiproc_worker_utils.py:233]
(VllmWorkerProcess pid=22195) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=22195) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=22195) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=22195) (VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] raise RuntimeError(
INFO 11-08 05:06:23 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=22195) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=22195) (VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233]
ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
(VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] File "/home/images/anaconda3/envs/aria/lib/python3.9/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] raise RuntimeError(
(VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=22196) ERROR 11-08 05:06:23 multiproc_worker_utils.py:233]

@xffxff xffxff added the vllm label Nov 11, 2024
@xffxff
Copy link
Collaborator

xffxff commented Nov 11, 2024

Hey @joel-simp,

You need to set VLLM_WORKER_MULTIPROC_METHOD="spawn" for this to work correctly.
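
For example, setting it at the very top of the script, before vLLM spins up its worker processes (exporting it in the shell before launching works just as well):

import os
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"  # must be set before the engine creates its workers

# ... then import vllm and build the LLM as usual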

@joel-simp
Copy link

Hey @xffxff, VLLM_WORKER_MULTIPROC_METHOD="spawn" worked; however, I encountered another error:
ValueError: Model architectures ['AriaForConditionalGeneration'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'Qwen2VLForConditionalGeneration', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'ArcticForCausalLM', 'XverseForCausalLM', 'Phi3SmallForCausalLM', 'MedusaModel', 'EAGLEModel', 'MLPSpeculatorPreTrainedModel', 'JambaForCausalLM', 'GraniteForCausalLM', 'MistralModel', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'UltravoxModel', 'MllamaForConditionalGeneration', 'BartModel', 'BartForConditionalGeneration']

I have added the model registry as shown in inference.md, yet I'm facing the same issue.
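
One quick sanity check (a hedged suggestion, assuming ModelRegistry.get_supported_archs() is available in your vLLM build): run this in the same process, immediately before constructing the LLM, to confirm the registration actually took effect.

from vllm import ModelRegistry

# If this prints False, the register_model(...) call did not run before the
# engine was built in this process, and vLLM reports the architecture as unsupported.
print("AriaForConditionalGeneration" in ModelRegistry.get_supported_archs())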

@xffxff
Copy link
Collaborator

xffxff commented Nov 11, 2024

Hi @joel-simp,
I tried using vllm==0.6.2 installed via pip install -e .[vllm], and I didn't encounter the error you're describing. Could you confirm the exact version you're using?
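
For comparing environments, the installed version can be printed directly:

import vllm
print(vllm.__version__)  # 0.6.2 is the version reported as working here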

@joel-simp
Copy link

Hi @xffxff
I did install vllm via pip install -e .[vllm], but I'm getting version 0.6.1.dev238+ge2c6e0a82 and still seeing the same error.
