Unable to run qwen2-vl inference on Intel integrated GPU. Works fine with CPU #2740

Open

paks100 opened this issue Feb 11, 2025 · 12 comments

@paks100 commented Feb 11, 2025

Hi,

I'm unable to run inference with the qwen2-vl-2B model on the Intel integrated GPU. It works fine if I select CPU as the device.

The exception is raised with error code "-5".

The stack trace:

File "ov_qwen2_vl.py", line 763, in forward
self.request.wait()
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245:
Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
Caught exception: Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_stream.cpp:365:
[GPU] clFlush, error code: -5

System details:

OS:

Operating System: Ubuntu 22.04.5 LTS
Kernel: Linux 6.8.0-52-generic

GPU:
description: VGA compatible controller
product: HD Graphics 630
vendor: Intel Corporation

import openvino as ov
core = ov.Core()
print(core.available_devices)

['CPU', 'GPU']

Code snapshot

from pathlib import Path
import requests

from ov_qwen2_vl import model_selector

model_id = model_selector()

print(f"Selected {model_id.value}")
pt_model_id = model_id.value
model_dir = Path(pt_model_id.split("/")[-1])

from ov_qwen2_vl import convert_qwen2vl_model

# Uncomment the line below to see the model conversion code
# convert_qwen2vl_model??

import nncf

compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}

convert_qwen2vl_model(pt_model_id, model_dir, compression_configuration)

from ov_qwen2_vl import OVQwen2VLModel

# Uncomment the line below to see the model inference class code
# OVQwen2VLModel??

from notebook_utils import device_widget

device = device_widget(default="AUTO", exclude=["NPU"])

model = OVQwen2VLModel(model_dir, device.value)

print(device.value)

from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info
from transformers import TextStreamer

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

if processor.chat_template is None:
    tok = AutoTokenizer.from_pretrained(model_dir)
    processor.chat_template = tok.chat_template

example_image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
example_image_path = Path("demo.jpeg")

if not example_image_path.exists():
    Image.open(requests.get(example_image_url, stream=True).raw).save(example_image_path)

image = Image.open(example_image_path)
question = "Describe this image."

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": f"file://{example_image_path}",
            },
            {"type": "text", "text": question},
        ],
    }
]

# Preparation for inference

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

#display(image)
print("Question:")
print(question)
print("Answer:")

generated_ids = model.generate(**inputs, max_new_tokens=100, streamer=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True))

@brmarkus

From an internet search for OpenCL error codes I found, for instance, "https://streamhpc.com/blog/2013-04-28/opencl-error-codes/", indicating that "-5" could mean:

-5 | CL_OUT_OF_RESOURCES | failure to allocate resources required by the OpenCL implementation on the device

Could you provide more information about your system, please? Which exact CPU, and how much total system memory?
When you start the code, could you monitor the system memory usage, please? Does the used system memory approach the total system memory?
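
If it really is CL_OUT_OF_RESOURCES, it could also help to check how much memory the GPU plugin believes it can use. A minimal sketch with the OpenVINO Python API (the property name "GPU_DEVICE_TOTAL_MEM_SIZE" is my assumption of the current spelling and may differ between OpenVINO versions):

import openvino as ov

core = ov.Core()
print(core.get_property("GPU", "FULL_DEVICE_NAME"))
# Memory the GPU plugin reports as usable, in bytes. On an iGPU such as the HD Graphics 630
# this is carved out of system RAM, so it shrinks when the system is already low on memory.
print(core.get_property("GPU", "GPU_DEVICE_TOTAL_MEM_SIZE"))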

@paks100 (Author) commented Feb 12, 2025

I monitored memory usage with the CPU and GPU options. Here are the "htop" screenshots. As the integrated Intel HD Graphics 630 GPU doesn't have its own memory and uses system memory, I used htop to monitor the system memory.

Inference on GPU: the system memory and swap look to be hitting their limits.

Image

Inference on CPU: system memory and swap look fine.

Image

So I'm not sure why memory usage increases for GPU inference (with the same model)? Looks like a bug.

Also, when I ran GPU inference multiple times, it failed a couple of times with a different error. Here is the stack trace:

res = self.image_embed_merger([hidden_states, causal_mask, rotary_pos_emb])[0]

File "/usr/local/lib/python3.10/dist-packages/openvino/_ov_api.py", line 427, in call
return self._infer_request.infer(
File "/usr/local/lib/python3.10/dist-packages/openvino/_ov_api.py", line 171, in infer
return OVDict(super().infer(_data_dispatch(
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:223:
Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_event.cpp:56:
[GPU] clWaitForEvents, error code: -14

Here are the CPU and memory details of my system:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz
CPU family: 6
Model: 158
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 9
CPU max MHz: 3300.0000
CPU min MHz: 800.0000
BogoMIPS: 5399.81

Memory:

RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000dfffffff 3.5G online yes 0-27
0x0000000100000000-0x000000021fffffff 4.5G online yes 32-67

Memory block size: 128M
Total online memory: 8G
Total offline memory: 0B

@brmarkus

Do you run the code (the snapshot shown above) as a Python script or as a Jupyter notebook (in the browser)? If it is based on a Jupyter notebook from this repo, have you changed anything in the code?
Have you used "https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb" or "https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-audio/qwen2-audio.ipynb" as a base?

Can you provide more details about the compression and conversion settings, please?

To process the model on the GPU, the model is first compiled into OpenCL kernels (including optimizations and, where needed, different operations) - so after OpenCL compilation the model can behave differently and consume a different amount of memory than the model used by the CPU plugin.

For GPU the model is typically used in FP16 precision - whereas for CPU, depending on the model and generation, INT8, INT4, FP16 or FP32 "just works".
You might need to delete the local models and run conversion and compression again, but with FP16 for running on GPU - and in parallel you could convert and compress the model in additional precisions for the CPU.
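
To illustrate what "with FP16" means at the plain OpenVINO level (this is only a toy sketch of the generic API, not the notebook's convert_qwen2vl_model helper): a converted ov.Model saved with compress_to_fp16=True ends up with FP16 weights on disk, and no nncf INT4/INT8 weight compression is applied.

import numpy as np
import openvino as ov
from openvino.runtime import opset13 as ops

# Toy stand-in for one of the converted qwen2-vl sub-models, just to show the API call.
x = ops.parameter([1, 4], dtype=np.float32, name="x")
w = ops.constant(np.ones((4, 4), dtype=np.float32))
toy = ov.Model([ops.matmul(x, w, False, False)], [x], "toy")

# FP16 IR on disk; the GPU plugin runs FP16 natively.
ov.save_model(toy, "toy_fp16.xml", compress_to_fp16=True)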

What is the memory consumption before starting the code? Could you try closing other applications to free memory and try again?

Do the errors/exceptions only occur during inference, or also during conversion and compression?

@paks100 (Author) commented Feb 12, 2025

Thank you for your detailed answer. I used the methods in https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/ov_qwen2_vl.py for compression and conversion of the models for GPU/CPU, i.e.:

===
compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}

convert_qwen2vl_model(pt_model_id, model_dir, compression_configuration)
....
device = device_widget(default="AUTO", exclude=["NPU"])

model = OVQwen2VLModel(model_dir, device.value)

=====
You can look at the code snapshot I sent earlier for the detailed code.

I ran this in a Python script and not via a Jupyter notebook. I ran into issues with the code at the link you sent: "optimum_cli" hung on my system (I didn't debug this further).

Are you suggesting that I'm using the wrong code? I'm new to LLM code, so any help would be great!

NOTE: I had stopped all other applications and had seen that CPU and memory usage were pretty low before starting my script.

I'm not sure whether I have provided all the information you need. Let me know if you need more clarification to help further :-)

@paks100 (Author) commented Feb 12, 2025

The error happens only during inference.

@brmarkus

Which of the two links have you tried, visual-language or audio?

You probably haven't seen optimum_cli hang; it was just taking a very long time. Conversion, quantization, and compression can take a very, very long time, especially when running short on system memory during the process (and starting to swap to HDD/SSD).

For GPU, try using FP16 or FP32 instead of INT4 or INT8.
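
Independent of how the weights are stored on disk, the GPU plugin's execution precision can also be forced to FP32 through a runtime hint. A minimal sketch (my code, not the notebook's; note that OVQwen2VLModel may create its own ov.Core internally, in which case the hint has to be applied wherever that Core compiles the models):

import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
# The GPU plugin executes in FP16 by default; this hint asks it to run in FP32 instead.
core.set_property("GPU", {hints.inference_precision: ov.Type.f32})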

@paks100 (Author) commented Feb 12, 2025

visual-language.

Should I use the current code (snapshot attached) and change it to FP16, or do you want me to try the optimum_cli code?

Thanks

@brmarkus

I now used a new environment (deleted my previous Python virtual environment), re-synchronized openvino_notebooks, reinstalled the requirements, and started the Jupyter notebook "https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb".
Using a laptop, MS-Win11, Python 3.12.4.
Using the default values in the drop-down fields ("Qwen/Qwen2-VL-2B-Instruct"), including INT4.
Downloading the model and 4-bit compression took really long; system memory consumption was really high.
Changed the inference device from AUTO to GPU.
And using the "cat.png" I successfully get this result: "In the image, there is a cat lying inside a cardboard box. The cat has a fluffy coat and is lying on its back with its paws up. The box is placed on a light-colored carpet, and the background shows a portion of a white couch and a window with curtains. The lighting in the room is bright, suggesting it is daytime. The cat appears to be relaxed and comfortable in the box."

Image

(Selecting "Computer" in MS-Win TaskManager)
Image
(If you use e.g. intel_gpu_top under Linux to measure the GPU load, you would need to look at the "Render/3D" section, which shows execution-unit load from the OpenCL kernels doing the inference.)

Do you have a chance to test the original Jupyter notebook as a consistency check?

At some point the OpenVINO dev team needs to jump in - they might ask you for lower-level driver information like the OpenCL version, in case your SoC with integrated iGPU needs a specific version; your SoC "Intel(R) Core(TM) i5-7500T" is an older model, whereas I used a newer one (Intel Core Ultra 7 155H).

@paks100 (Author) commented Feb 13, 2025

Good to see it's working on MS-Win11. I'm running this on Ubuntu 22.04.

I tried "https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb" this with a python script on a 16GB ubuntu system and it failed with memory issues.. Looks like I need to run on a server with larger memory.

@brmarkus

If there is enough HDD/SSD storage, the operating system should start swapping when running short on memory (if there is something to swap, other than the memory content needed for conversion/compression/inference).
It could also be a driver issue.

Have you tried running other notebooks/scripts that do inference on the GPU - just to check that the environment and drivers are consistent?
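
As a driver sanity check that does not involve the large model at all, a tiny graph compiled and executed on the GPU plugin already exercises the OpenCL runtime. A minimal sketch (assuming a recent OpenVINO Python API):

import numpy as np
import openvino as ov
from openvino.runtime import opset13 as ops

core = ov.Core()

# One-node graph (a ReLU) compiled on the GPU plugin: if this already fails or hangs,
# the problem is in the OpenCL driver/runtime rather than in the qwen2-vl model.
x = ops.parameter([1, 8], dtype=np.float32, name="x")
smoke = ov.Model([ops.relu(x)], [x], "gpu_smoke_test")
compiled = core.compile_model(smoke, "GPU")

result = compiled(np.random.rand(1, 8).astype(np.float32))
print(next(iter(result.values())))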

@paks100 (Author) commented Feb 13, 2025

I think there are two problems I'm facing, with two different code paths for running qwen2-vl inference with OpenVINO.

  1. First, the code that I have attached at the beginning of this ticket, which uses the APIs in ov_qwen2_vl.py.
    a. The initial issue was the "CL resource error -5" on an 8GB RAM system (during inference). This happened only with the GPU device; inference on CPU worked fine.
    b. On a 16GB system, I don't see memory as an issue when I run this code. But the script hangs/gets stuck at "model.generate(**inputs, max_new_tokens=128)" forever.

  2. The second problem is with the code you had pointed to, "https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb".
    a. The model compression code fails due to low memory (on both the 8GB and 16GB systems).

So, the first answer I'm looking for is, which is the right code to pursue further.

Second answer I'm looking for is, has anyone tried any of the above code on Ubuntu 22.04 running on 16GB servers?

Third answer I'm looking for is, what is the recommended hardware configuration for running this on Ubuntu 22.04 servers?

I want to thank you for all the clarifications I have got so far. Hopefully the above answers will help me to progress and debug this issue further for closure. Thanks!

@brmarkus

> I think there are two problems I'm facing, with two different code paths for running qwen2-vl inference with OpenVINO.
>
> 1. First, the code that I have attached at the beginning of this ticket, which uses the APIs in ov_qwen2_vl.py.
>     a. The initial issue was the "CL resource error -5" on an 8GB RAM system (during inference). This happened only with the GPU device; inference on CPU worked fine.

This is difficult to comment on. It could be too little memory, a driver-version conflict, a too-old SoC, or the source code of the script (which behaves differently than the Jupyter notebook you mentioned under 2a).

> b. On a 16GB system, I don't see memory as an issue when I run this code. But the script hangs/gets stuck at "model.generate(**inputs, max_new_tokens=128)" forever.

It can "normally" take very, very long - even several minutes.
The Jupyter notebook uses config.max_new_tokens = 100; can you try values smaller than 128 (something extreme like 5, or 10, 50, 100)?
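
As a quick probe (my sketch, reusing the model and inputs objects from your script above), timing a very short generation should tell you whether the GPU pipeline is merely slow or actually stuck:

import time

start = time.perf_counter()
# If even 5 new tokens never return, the pipeline is stuck rather than just slow.
short_ids = model.generate(**inputs, max_new_tokens=5)
print(f"5 new tokens took {time.perf_counter() - start:.1f} s")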

> 2. The second problem is with the code you had pointed to, "https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb".
>     a. The model compression code fails due to low memory (on both the 8GB and 16GB systems).

Which error messages do you get?
Does it fail during "compression"? Could you skip the compression and only convert to FP16/FP32?

> So, the first answer I'm looking for is, which is the right code to pursue further.
>
> Second answer I'm looking for is, has anyone tried any of the above code on Ubuntu 22.04 running on 16GB servers?

Could you find another machine, compress/convert/quantize the models on it, and copy them over to your resource-constrained machine?

> Third answer I'm looking for is, what is the recommended hardware configuration for running this on Ubuntu 22.04 servers?
>
> I want to thank you for all the clarifications I have got so far. Hopefully the above answers will help me to progress and debug this issue further for closure. Thanks!
