Unable to run qwen2-vl inference on Intel integrated GPU. Works fine with CPU #2740
Comments
From an internet search for OpenCL error codes I found, for instance, https://streamhpc.com/blog/2013-04-28/opencl-error-codes/, indicating that "-5" could mean CL_OUT_OF_RESOURCES.
Could you provide more information about your system, please? Which exact CPU, and how much total system memory?
I monitored memory usage with the CPU and GPU options. Here is the screenshot of htop. As the integrated Intel HD Graphics 630 GPU doesn't have its own memory and uses system memory, I used htop to monitor system memory.
Inference on GPU: system memory and swap look to be hitting their limits.
Inference on CPU: system memory and swap look fine.
So I'm not sure why memory usage increases for GPU inference (with the same model)? Looks like a bug. Also, when I ran GPU inference multiple times, it failed with a different error a couple of times. Here is the stack trace:
File "/usr/local/lib/python3.10/dist-packages/openvino/_ov_api.py", line 427, in call Here is the CPU and Memory details of my system:Architecture: x86_64 Memory:RANGE SIZE STATE REMOVABLE BLOCK Memory block size: 128M |
Do you run the code (the snapshot shown above) as a Python script or as a Jupyter notebook (in the browser)? If it is based on a Jupyter notebook from this repo, have you changed anything in the code? Can you provide more details about the compression and conversion settings, please?
For processing the model on the GPU, the model is first converted into OpenCL kernels (including optimizations, introducing different operations where needed), so after OpenCL compilation the model can behave differently and consume a different amount of memory compared to the model used by the CPU plugin. On GPU the model is typically executed in FP16 precision, whereas on CPU, depending on the model and generation, INT8, INT4, FP16 or FP32 "just works".
What is the memory consumption before starting the code? Could you try closing other applications to free memory and try again? Do the errors/exceptions occur only during inference, or also during conversion and compression?
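To illustrate the FP16-by-default point: the GPU plugin's execution precision can be overridden with a hint at compile time. The sketch below is illustrative only; it assumes a recent OpenVINO release (2023.1+) and uses a placeholder IR filename rather than anything the notebook produces by name.

import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
# The GPU plugin executes in FP16 by default; requesting FP32 execution can help
# rule out precision-related issues, at the cost of more memory and slower inference.
compiled = core.compile_model(
    "model.xml",  # placeholder: path to one of the converted IR files in model_dir
    "GPU",
    {hints.inference_precision: ov.Type.f32},
)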
Thank you for your detailed answer. I used the methods in https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/ov_qwen2_vl.py for compression and conversion of the models for GPU/CPU, i.e.:
convert_qwen2vl_model(pt_model_id, model_dir, compression_configuration)
model = OVQwen2VLModel(model_dir, device.value)
I ran this in a Python script, not via a Jupyter notebook. I ran into issues with the code link you sent: "optimum_cli" hung on my system (I didn't debug this further). Are you saying I'm using the wrong code? I'm new to LLM code, so any help will be great!
NOTE: I had stopped all other applications and confirmed that CPU and memory usage were pretty low before starting my script. I'm not sure whether I have provided all the information you need; let me know if you need more clarification :-)
The error happens only during inference.
Which of the two links have you tried, visual-language or audio? You probably haven't seen this note: for GPU, try using FP16 or FP32 instead of INT4 or INT8.
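Following up on that suggestion, below is a sketch of a less aggressive weight-compression setting. It assumes that convert_qwen2vl_model (from ov_qwen2_vl.py) forwards the dictionary to nncf.compress_weights the same way the INT4 configuration in the code snapshot further down does, and that the model id matches the 2B model named in this issue; group_size is omitted because grouped quantization is an INT4-mode feature.

from pathlib import Path
import nncf
from ov_qwen2_vl import convert_qwen2vl_model  # helper from the qwen2-vl notebook

pt_model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumed; normally chosen via model_selector()
model_dir = Path(pt_model_id.split("/")[-1])

# INT8 asymmetric weight compression instead of INT4_ASYM; uncompressed weights
# stay in FP16 after conversion.
compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT8_ASYM,
}
convert_qwen2vl_model(pt_model_id, model_dir, compression_configuration)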
Visual-language. Should I use the current code (snapshot attached) and change it to FP16, or do you want me to try the optimum_cli code? Thanks
Now using a new environment (I deleted my previous Python virtual environment), synchronized openvino_notebooks again, reinstalled the requirements and started the Jupyter notebook https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb. (Selecting "Computer" in the MS-Windows Task Manager.) Do you have a chance to test the original Jupyter notebook as a consistency check? At some point the OpenVINO dev team needs to jump in - they might ask you for more low-level driver information, like the OpenCL version, in case your SoC with integrated iGPU needs a specific version; your SoC "Intel(R) Core(TM) i5-7500T" is an older model, whereas I used a newer one (Intel Core Ultra 7 155H).
Good to see it's working on MS-Win11. I'm running this on Ubuntu 22.04. I tried https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb as a Python script on a 16GB Ubuntu system and it failed with memory issues. Looks like I need to run it on a server with more memory.
If there is enough HDD/SSD storage, the operating system should start swapping when running short on memory (if there is something to swap other than the memory needed for conversion/compression/inference). Have you tried running other notebooks/scripts that do inference on the GPU, just to check that the environment and drivers are consistent?
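To make that environment/driver check concrete, a minimal GPU smoke test is sketched below (not from the notebook; the opset import path may differ slightly between OpenVINO releases). If even this tiny model fails to compile or run on "GPU", the problem is in the driver/runtime setup rather than in qwen2-vl.

import numpy as np
import openvino as ov
import openvino.runtime.opset13 as ops

# Build a one-operation model (ReLU) and run it on the GPU plugin.
x = ops.parameter([1, 64], ov.Type.f32, name="x")
model = ov.Model([ops.relu(x)], [x], "gpu_smoke_test")

core = ov.Core()
compiled = core.compile_model(model, "GPU")
result = compiled(np.random.rand(1, 64).astype(np.float32))
print(next(iter(result.values())).shape)  # expected: (1, 64)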
I think there are two problems I'm facing, with two different code paths for running qwen2-vl inference with OpenVINO.
So, the answers I'm looking for are:
1. Which is the right code to pursue further?
2. Has anyone tried any of the above code on Ubuntu 22.04 running on 16GB servers?
3. What is the recommended hardware configuration for running this on Ubuntu 22.04 servers?
I want to thank you for all the clarifications I have received so far. Hopefully answers to the above will help me progress further and debug this issue to closure. Thanks!
This is difficult to comment on. It could be too little memory, a driver-version conflict, too old an SoC, or the source code of the script (which behaves differently from the Jupyter notebook you mentioned under 2a).
It could "normally" take very, very, really very long - even several minutes.
Which error messages do you get?
Could you find another machine, compress, convert and quantize the models on it, and copy them to your resource-constrained machine?
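On the receiving side, that could look roughly like the sketch below, assuming the directory produced by convert_qwen2vl_model on the larger machine is copied over unchanged; the directory name is a placeholder based on the model mentioned in this issue.

from pathlib import Path
from ov_qwen2_vl import OVQwen2VLModel

# Directory copied from the machine that ran conversion/compression.
model_dir = Path("Qwen2-VL-2B-Instruct")  # placeholder name

# Sanity check that the OpenVINO IR files are actually present before loading.
assert any(model_dir.glob("*.xml")), "copy the converted model directory here first"

# Loading the already-converted IR skips the memory-hungry conversion/compression step.
model = OVQwen2VLModel(model_dir, "GPU")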
Hi,
I'm unable to run inference on the Intel integrated GPU with the qwen2-vl-2B model. It works fine if I select CPU as the device.
The exception happens with error code "-5".
The stack trace:
File "ov_qwen2_vl.py", line 763, in forward
self.request.wait()
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245:
Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
Caught exception: Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_stream.cpp:365:
[GPU] clFlush, error code: -5
System details:
OS:
Operating System: Ubuntu 22.04.5 LTS
Kernel: Linux 6.8.0-52-generic
GPU:
description: VGA compatible controller
product: HD Graphics 630
vendor: Intel Corporation
import openvino as ov
core = ov.Core()
print(core.available_devices)
['CPU', 'GPU']
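As an additional check, the GPU plugin can report which physical device it resolved to; FULL_DEVICE_NAME is a standard OpenVINO device property.

import openvino as ov

core = ov.Core()
# On this system it should report the HD Graphics 630 iGPU.
print(core.get_property("GPU", "FULL_DEVICE_NAME"))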
Code snapshot
from pathlib import Path
import requests
from ov_qwen2_vl import model_selector
model_id = model_selector()
print(f"Selected {model_id.value}")
pt_model_id = model_id.value
model_dir = Path(pt_model_id.split("/")[-1])
from ov_qwen2_vl import convert_qwen2vl_model
# uncomment these lines to see model conversion code
# convert_qwen2vl_model??
import nncf
compression_configuration = {
"mode": nncf.CompressWeightsMode.INT4_ASYM,
"group_size": 128,
"ratio": 1.0,
}
convert_qwen2vl_model(pt_model_id, model_dir, compression_configuration)
from ov_qwen2_vl import OVQwen2VLModel
# Uncomment the lines below to see the model inference class code
# OVQwen2VLModel??
from notebook_utils import device_widget
device = device_widget(default="AUTO", exclude=["NPU"])
model = OVQwen2VLModel(model_dir, device.value)
print(device.value)
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info
from transformers import TextStreamer
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)
if processor.chat_template is None:
tok = AutoTokenizer.from_pretrained(model_dir)
processor.chat_template = tok.chat_template
example_image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
example_image_path = Path("demo.jpeg")
if not example_image_path.exists():
Image.open(requests.get(example_image_url, stream=True).raw).save(example_image_path)
image = Image.open(example_image_path)
question = "Describe this image."
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": f"file://{example_image_path}",
},
{"type": "text", "text": question},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
#display(image)
print("Question:")
print(question)
print("Answer:")
generated_ids = model.generate(**inputs, max_new_tokens=100, streamer=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True))