vllm multi gpu #24
Comments
@matbee-eth Can you include more details about the error to help us debug it?
Possibly a different issue to @matbee-eth, but I run into OOM errors using VLLM, running on a machine with 4xL4 (96GB VRAM). This code:

import os

os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'

from PIL import Image
from transformers import AutoTokenizer

from vllm import LLM, ModelRegistry, SamplingParams
from vllm.model_executor.models import _MULTIMODAL_MODELS

from aria.vllm.aria import AriaForConditionalGeneration

import torch

torch.cuda.empty_cache()

ModelRegistry.register_model(
    "AriaForConditionalGeneration", AriaForConditionalGeneration
)
_MULTIMODAL_MODELS["AriaForConditionalGeneration"] = (
    "aria",
    "AriaForConditionalGeneration",
)


def main():
    llm = LLM(
        model="rhymes-ai/Aria",
        tokenizer="rhymes-ai/Aria",
        tokenizer_mode="slow",
        dtype="bfloat16",
        limit_mm_per_prompt={"image": 256},
        enforce_eager=True,
        trust_remote_code=True,
        gpu_memory_utilization=0.6,
        tensor_parallel_size=4,
        max_seq_len_to_capture=1024
    )

    tokenizer = AutoTokenizer.from_pretrained(
        "rhymes-ai/Aria", trust_remote_code=True, use_fast=False
    )

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Compare Image 1 and image 2, tell me about the differences between image 1 and image 2.\nImage 1\n",
                },
                {"type": "image"},
                {"type": "text", "text": "\nImage 2\n"},
                {"type": "image"},
            ],
        }
    ]

    message = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    outputs = llm.generate(
        {
            "prompt_token_ids": message,
            "multi_modal_data": {
                "image": [
                    Image.open("Screenshot 2024-10-07 at 17.01.14.png"),
                ],
                "max_image_size": 200,  # [Optional] The max image patch size, default `980`
                "split_image": False,  # [Optional] whether to split the images, default `False`
            },
        },
        sampling_params=SamplingParams(max_tokens=200, top_k=1, stop=["<|im_end|>"]),
    )

    for o in outputs:
        generated_tokens = o.outputs[0].token_ids
        print(tokenizer.decode(generated_tokens))


if __name__ == "__main__":
    main()

Results in:
I have tried reducing context length, memory utilization, etc., with no success. I have also tried with 4x L40S (192GB VRAM) and also got OOM errors. The transformers version runs without OOM.
Yes, that's the same issue I'm receiving; it doesn't properly run multi-GPU on vLLM.
…On Mon, Oct 14, 2024 at 7:09 AM, Daniel van Strien wrote:
Possibly a different issue to @matbee-eth (https://github.com/matbee-eth), but I run into OOM errors using VLLM, running on a machine with 4xL4 (96GB VRAM).
This code: (same snippet as quoted above)
Results in:
INFO 10-14 12:32:42 model_runner.py:1025] Loading model weights took 12.4393 GB
(VllmWorkerProcess pid=2421) INFO 10-14 12:32:45 model_runner.py:1025] Loading model weights took 12.4393 GB
(VllmWorkerProcess pid=2420) INFO 10-14 12:32:45 model_runner.py:1025] Loading model weights took 12.4393 GB
(VllmWorkerProcess pid=2419) INFO 10-14 12:32:45 model_runner.py:1025] Loading model weights took 12.4393 GB
(VllmWorkerProcess pid=2420) INFO 10-14 12:32:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl...
(VllmWorkerProcess pid=2419) INFO 10-14 12:32:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl...
(VllmWorkerProcess pid=2421) INFO 10-14 12:32:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl...
INFO 10-14 12:32:47 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl...
(VllmWorkerProcess pid=2420) INFO 10-14 12:32:47 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl.
(VllmWorkerProcess pid=2419) INFO 10-14 12:32:47 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl.
(VllmWorkerProcess pid=2421) INFO 10-14 12:32:47 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl.
INFO 10-14 12:32:47 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-123247.pkl.
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: Error in model execution (input dumped to /tmp/err_execute_model_input_20241014-123247.pkl): CUDA out of memory. Tried to allocate 1.88 GiB. GPU 3 has a total capacity of 21.96 GiB of which 1.29 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.02 GiB is allocated by PyTorch, and 1.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), Traceback (most recent call last):
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return func(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1590, in execute_model
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/data/Aria/aria/vllm/aria.py", line 1155, in forward
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] hidden_states = self.language_model(
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 329, in forward
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] hidden_states, residual = layer(
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 261, in forward
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/data/Aria/aria/vllm/aria.py", line 627, in forward
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] output = self.token_dispatcher.token_unpermutation(expert_output, scores)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/data/Aria/aria/vllm/aria.py", line 404, in token_unpermutation
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] unpermuted_tokens = unpermuted_tokens * scores.unsqueeze(-1)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.88 GiB. GPU 3 has a total capacity of 21.96 GiB of which 1.29 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.02 GiB is allocated by PyTorch, and 1.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] Traceback (most recent call last):
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return func(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] self.model_runner.profile_run()
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return func(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1236, in profile_run
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] return func(*args, **kwargs)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] raise type(err)(
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233] torch.OutOfMemoryError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241014-123247.pkl): CUDA out of memory. Tried to allocate 1.88 GiB. GPU 3 has a total capacity of 21.96 GiB of which 1.29 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.02 GiB is allocated by PyTorch, and 1.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=2421) ERROR 10-14 12:32:47 multiproc_worker_utils.py:233]
[Workers pid=2420 (GPU 2) and pid=2419 (GPU 1) emit the same torch.OutOfMemoryError traceback as above.]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1590, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/Aria/aria/vllm/aria.py", line 1155, in forward
[rank0]: hidden_states = self.language_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 329, in forward
[rank0]: hidden_states, residual = layer(
[rank0]: ^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 261, in forward
[rank0]: hidden_states = self.mlp(hidden_states)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/Aria/aria/vllm/aria.py", line 627, in forward
[rank0]: output = self.token_dispatcher.token_unpermutation(expert_output, scores)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/Aria/aria/vllm/aria.py", line 404, in token_unpermutation
[rank0]: unpermuted_tokens = unpermuted_tokens * scores.unsqueeze(-1)
[rank0]: ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.88 GiB. GPU 0 has a total capacity of 21.96 GiB of which 1.29 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.02 GiB is allocated by PyTorch, and 1.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/llm.py", line 77, in <module>
[rank0]: main()
[rank0]: File "/data/llm.py", line 21, in main
[rank0]: llm = LLM(
[rank0]: ^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 214, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 564, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 339, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 474, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
[rank0]: num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1236, in profile_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
[rank0]: raise type(err)(
[rank0]: torch.OutOfMemoryError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241014-123247.pkl): CUDA out of memory. Tried to allocate 1.88 GiB. GPU 0 has a total capacity of 21.96 GiB of which 1.29 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.02 GiB is allocated by PyTorch, and 1.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO 10-14 12:32:48 multiproc_worker_utils.py:124] Killing local vLLM worker processes
Exception in thread Thread-1
/home/user/miniconda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
I have tried reducing context length, memory utilization, etc, with no success. I have also tried with 4x40LS (192GB VRAM) and also got OOM errors. The transformers version runs without OOM.
@davanstrien
Your detailed response will help us investigate and resolve the issue more efficiently. Thank you for your cooperation! I'm not sure whether it will work in your environment, but here are some common settings we have tried to reduce memory usage; you can set them when loading the model with LLM(...).
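The specific settings did not come through in this thread, so the snippet below is only an illustrative sketch of the memory-related knobs discussed here (the spawn method, gpu_memory_utilization, enforce_eager, limit_mm_per_prompt) plus a couple of standard vLLM options (max_model_len, max_num_seqs); the values are placeholders, not a verified configuration:

import os

# Set before importing vLLM so worker processes use the spawn start method.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

# Placeholder values -- tune for your hardware.
llm = LLM(
    model="rhymes-ai/Aria",
    tokenizer_mode="slow",
    dtype="bfloat16",
    trust_remote_code=True,
    tensor_parallel_size=4,            # shard the model across the available GPUs
    enforce_eager=True,                # skip CUDA graph capture to save memory
    gpu_memory_utilization=0.8,        # fraction of each GPU vLLM is allowed to use
    max_model_len=4096,                # cap the context length
    max_num_seqs=8,                    # limit concurrent sequences during profiling
    limit_mm_per_prompt={"image": 2},  # match the number of images actually sent
)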
I also got a different error, stated below. It works fine when I use a single GPU, but it doesn't work when I try to increase tensor_parallel_size. I am using vLLM 0.6.2 and 2 A100 GPUs with 80 GB memory.
ValueError: Model architectures ['AriaForConditionalGeneration'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'Qwen2VLForConditionalGeneration', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'ArcticForCausalLM', 'XverseForCausalLM', 'Phi3SmallForCausalLM', 'MedusaModel', 'EAGLEModel', 'MLPSpeculatorPreTrainedModel', 'JambaForCausalLM', 'GraniteForCausalLM', 'MistralModel', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'UltravoxModel', 'MllamaForConditionalGeneration', 'BartModel', 'BartForConditionalGeneration']
@nithingovindugari Because Aria is currently not an in-tree supported model, for now we run it with vLLM via out-of-tree (OOT) model registration. That's why we add these lines at the top of inference.md. The error message you provided looks like the model architecture failed to register:

from vllm import LLM, ModelRegistry, SamplingParams
from vllm.model_executor.models import _MULTIMODAL_MODELS

from aria.vllm.aria import AriaForConditionalGeneration

ModelRegistry.register_model(
    "AriaForConditionalGeneration", AriaForConditionalGeneration
)
_MULTIMODAL_MODELS["AriaForConditionalGeneration"] = (
    "aria",
    "AriaForConditionalGeneration",
)
@aria-hacker I got this working using 4x L40S (192GB VRAM). I couldn't get it working on the 4x L4 (96GB VRAM) even after adjusting the GPU utilisation, sequence length, etc. I assume the vLLM implementation just needs more memory for now.
Package versions I'm using:
@aria-hacker Does vLLM need to be installed in editable mode? I am trying to run it from a Python file and, despite including the model registry code, I am still getting that error. Can anyone give an example of the Aria model running with vLLM independently? Is it because you are executing in a notebook while I am doing it in a Python file?
@nithingovindugari The vLLM installation shouldn't need to be in editable mode. And it doesn't only work in a notebook; I tested it with a Python file and it works fine in my environment. For the code, you can refer to the inference doc.
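For anyone looking for a self-contained example, below is a minimal sketch of a standalone Python file based on the snippets earlier in this thread (the registration lines from inference.md plus the spawn setting); the prompt, tensor_parallel_size value, and sampling settings are placeholders, not the official example:

import os

# Must be set before vLLM spawns its worker processes (needed for tensor parallelism).
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, ModelRegistry, SamplingParams
from vllm.model_executor.models import _MULTIMODAL_MODELS
from aria.vllm.aria import AriaForConditionalGeneration

# Out-of-tree registration, as in inference.md, before constructing LLM(...).
ModelRegistry.register_model(
    "AriaForConditionalGeneration", AriaForConditionalGeneration
)
_MULTIMODAL_MODELS["AriaForConditionalGeneration"] = (
    "aria",
    "AriaForConditionalGeneration",
)

if __name__ == "__main__":
    llm = LLM(
        model="rhymes-ai/Aria",
        tokenizer_mode="slow",
        dtype="bfloat16",
        trust_remote_code=True,
        tensor_parallel_size=2,  # placeholder: number of GPUs to shard across
        enforce_eager=True,
    )
    # Text-only smoke test; see the snippet at the top of the thread for multimodal input.
    outputs = llm.generate(
        "Describe the Aria model in one sentence.",
        sampling_params=SamplingParams(max_tokens=64, stop=["<|im_end|>"]),
    )
    print(outputs[0].outputs[0].text)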
Facing the same issue when loading the model for multi-GPU. I tried loading another model, 'facebook/opt-13b', for multi-GPU and it worked perfectly fine; however, tensor_parallel doesn't seem to work with Aria. Getting the following error:
INFO 11-08 05:06:23 config.py:1652] Downcasting torch.float32 to torch.bfloat16.
Hey @joel-simp, you need to set VLLM_WORKER_MULTIPROC_METHOD="spawn" before loading the model.
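In code this mirrors the top of the earlier snippet: the variable has to be set before vLLM starts its worker processes, for example:

import os

# Must be set before vLLM creates its worker processes.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM  # import after the environment variable is set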
Hey @xffxff, VLLM_WORKER_MULTIPROC_METHOD="spawn" worked. However, I encountered another error. I have added the model registry code as shown in inference.md, yet I'm facing the same issue.
Hi @joel-simp, |
Hi @xffxff |
Any help on getting multi-GPU support running? vLLM fails to load with tensor_parallel_size=2.