
Feature Request: Qwen 2.5 VL #11483

Open · bold84 opened this issue Jan 29, 2025 · 62 comments
Labels: enhancement (New feature or request)

Comments

@bold84 commented Jan 29, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Is anybody implementing this?

If not, I may give it a go. But it will take some time as I am new to the source side of llama.cpp/ggml.

Motivation

Well, it's not currently working. :-)

Possible Implementation

Based on the existing Qwen 2 VL implementation.

bold84 added the enhancement (New feature or request) label Jan 29, 2025
@HimariO (Contributor) commented Jan 29, 2025

I'm currently looking into Transformers' Qwen2.5VL implementation and waiting for the paper to drop so I can better assess the differences between Qwen2VL and Qwen2.5VL. 👀

@3unnycheung

cool

@samkoesnadi (Contributor)

I support this!

@Shyryp commented Feb 2, 2025

Our world definitely needs this!

@peter-ch

Any progress on this? Who added support for Qwen 2 VL?

@pszemraj commented Feb 20, 2025

qwen2.5-vl report is up! https://huggingface.co/papers/2502.13923

edit: official codebase here: https://github.com/QwenLM/Qwen2.5-VL

@vladislavdonchev

I can start working on this if no one else is already.

@vladislavdonchev commented Feb 22, 2025

OK then!

First order of business would be to build the GGUF file(s). There seems to be an issue with that on the latest official Transformers:

python convert_hf_to_gguf.py .\build\bin\Release\Qwen2.5-VL-7B-Instruct\
INFO:hf-to-gguf:Loading model: Qwen2.5-VL-7B-Instruct
ERROR:hf-to-gguf:Model Qwen2_5_VLForConditionalGeneration is not supported

These upstream issues are being actively discussed:
huggingface/transformers#36292
QwenLM/Qwen2.5-VL#723

It appears a temporary workaround is to use the old Qwen2 templates. People are reporting that this works, so I'll post an update in a bit.

@vladislavdonchev commented Feb 22, 2025

Right, so this one is a bit of a rabbit hole...

I. Reverting the Qwen2.5 config files to:

"processor_class": "Qwen2VLProcessor"

and

  "architectures": [
    "Qwen2VLForConditionalGeneration"
  ]

Produces a (seemingly) working model! We've started testing and quantizing it here:
https://huggingface.co/IAILabs/Qwen2.5-VL-7b-Instruct-GGUF/tree/main
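For anyone following along, here is a minimal sketch of the architecture override described above, applied before running the converter (the model path is hypothetical, and the "processor_class" change goes into the corresponding processor config file in the same folder):

import json

# Point the config at the Qwen2-VL class names so the unmodified converter accepts the model.
cfg_path = "Qwen2.5-VL-7B-Instruct/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["architectures"] = ["Qwen2VLForConditionalGeneration"]
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)

# Afterwards: python convert_hf_to_gguf.py Qwen2.5-VL-7B-Instruct/ --outtype f16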


II. In order to get a usable experience, you need to make sure CLIP is running with hardware acceleration. This currently requires you to revert this commit:
#10896

For more information refer to:
#11322

The following PR seems to correct (at least) some of the issues that led to disabling hardware acceleration in the first place:
#11902

So, it is now up to us to prove that everything is working properly.

I'll start a stress / perf eval test alongside the quantization process, so we have a better idea about what's going on.

@vladislavdonchev commented Feb 23, 2025

UPDATE: A few 4-bit quants have been uploaded, including two that support online auto-repacking.

The latest main looks stable with Vulkan CLIP and any model thrown at it so far. Some preliminary insights:

  • 1200x1200 is the maximum you can encode with 16GB of VRAM. clip.cpp does not seem to support multi-GPU Vulkan yet.
  • A 4060Ti-class GPU delivers 20-30 t/s with the Q8_0 and double that on Q4 @ 16-32K context.
  • Batching (multiple images) in a single cli call seems to be working fine:
    llama-qwen2vl-cli --ctx-size 16000 -n 16000 -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --n_gpu_layers 9999 -p "Describe the image in detail. Extract all textual information from it. Output as detailed JSON." -p "Analyze the image." --image ~/Pictures/test_small.png --image ~/Pictures/test_small.png

Output quality looks very promising! We'll release all of the benchmark code when ready, so the process can be streamlined for other models.

@hvico commented Feb 24, 2025

Hi! Excellent news, thank you very much for this!

I was able to run the model using code from git main on a 4 x Radeon 7900 XTX 24 GB workstation, but with CLIP on CPU. I tried to enable Vulkan acceleration for CLIP by uncommenting the lines in clip.cpp under examples, but in that case I get OOM. I tried this with the FP16, Q4K_M and IQ4_XS models. Telling the CLI to use just one Vulkan device does not help with the OOM / CLIP-on-GPU issue either.

@vladislavdonchev commented Feb 24, 2025

[quoting @hvico above]

Hi, could you please confirm what the resolution of your input images is?

EDIT: As per Qwen2.5 docs:
min_pixels = 256x28x28
max_pixels = 1280x28x28

A RTFM moment for me...
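If it helps anyone else, here is a small pre-processing sketch using Pillow to keep an image within that max_pixels budget before handing it to the CLI (file names are made up):

import math
from PIL import Image

MAX_PIXELS = 1280 * 28 * 28  # from the Qwen2.5-VL docs quoted above

img = Image.open("test.png")
if img.width * img.height > MAX_PIXELS:
    # Scale both sides by the same factor so width*height lands under the budget.
    scale = math.sqrt(MAX_PIXELS / (img.width * img.height))
    img = img.resize((int(img.width * scale), int(img.height * scale)), Image.LANCZOS)
img.save("test_small.png")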

@hvico commented Feb 24, 2025

[quoting @vladislavdonchev:] Hi, could you please confirm what the resolution of your input images is? With 24G VRAM, you can expect an OOM with images >1400x1400 pixels, so you need to make sure the files are pre-processed correctly. Thanks.

My image was 1475x1062. I was able to run inference successfully using a 1077x671 sample, without OOM. Would it be possible to run CLIP and the language model on separate GPUs? Thanks again.

@zrrraa commented Feb 25, 2025

[quoting @vladislavdonchev's earlier comment]

Thank you very much for your research and sharing! How do you get the mmproj from the Qwen2.5-VL model? The original qwen2_vl_surgery.py used for Qwen2-VL doesn't seem to work. Could you share your method? Thank you very much!

@vladislavdonchev

[quoting @zrrraa's question above]

Get it from our HF:
https://huggingface.co/IAILabs/Qwen2.5-VL-7b-Instruct-GGUF
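If you want to pull the files locally, something like this should work (the local directory name is arbitrary):

from huggingface_hub import snapshot_download

# Downloads the LM and mmproj GGUFs from the repo linked above.
snapshot_download(repo_id="IAILabs/Qwen2.5-VL-7b-Instruct-GGUF", local_dir="Qwen2.5-VL-7b-Instruct-GGUF")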

@ChmHsm commented Feb 27, 2025

Thank you for the effort, a lot of people really need this.

Any updates on the progress? Will this take a few more days, or is it more like weeks or months?

Thanks a lot again, we appreciate you guys a lot!

@samkoesnadi (Contributor)

@vladislavdonchev Great work! Have you done the 3B version? I can also do it myself if you provide the conversion script :)

@vladislavdonchev

@vladislavdonchev Great work! Have you done the 3B version? I can also do it myself if you provide the conversion script :)

Working on it as we speak, along with a quantization tool:


https://github.com/Independent-AI-Labs/local-super-agents/tree/feat/additional-output-formats/quantbench

@vladislavdonchev

UPDATE:

Opened a draft PR here: #12119

Long story short, I'll need some help debugging the vision models and llama-qwen2vl-cli, as we're unable to produce a reliably working vision model.

In addition, this still isn't resolved:
#11322

I've also asked the Qwen folks for help:
QwenLM/Qwen2.5-VL#869

@ChmHsm commented Feb 28, 2025

Thanks @vladislavdonchev for the effort and the update.

I took a look at the issue you opened with the Qwen team. Is it only affecting the 3B model? Can we at least expect progress to continue on the 7B?

Thank you!

@vladislavdonchev commented Feb 28, 2025

[quoting @ChmHsm above]

Unfortunately, we're unable to reliably produce a working vision model from either the 7B or the 3B. I'm not sure how the one in the repo was exported, but it seems to be working, so it's either some weird coincidence or a mistake. I've verified the LM part, including the quants, and it appears to match what you'd expect from Qwen2.5 (the parameters in the .gguf look correct, responses are OK).

@David33706

[quoting @vladislavdonchev's earlier comment]

I am getting the following error while trying to use Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf on Apple Silicon:

./llama-qwen2vl-cli -m "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --mmproj "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --n_gpu_layers 0 --image "wilma-7_oval.jpg" --image "wilma-7_oval.jpg" -p "Describe the image."

key general.description not found in file
libc++abi: terminating due to uncaught exception of type std::runtime_error: Missing required key: general.description
zsh: abort      ./llama-qwen2vl-cli -m  --mmproj  --n_gpu_layers 0 --image  --image  -p

Could somebody please help out?

@tomjpalamattam

[quoting @David33706's question above]

Did you figure this out?

@David33706

[quoting the exchange above]

Nope

@vladislavdonchev commented Mar 3, 2025

Please stop spamming this thread. Qwen2.5 is still a WIP!

Regarding the issue above:
./llama-qwen2vl-cli -m "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --mmproj "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --n_gpu_layers 0 --image "wilma-7_oval.jpg" --image "wilma-7_oval.jpg" -p "Describe the image."
You cannot use the language model GGUF as the vision model (mmproj); in your command you are passing the same file twice.

Please wait until the implementation has been finalized.
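For reference, a working invocation needs two different files, the language model GGUF for -m and the vision/mmproj GGUF for --mmproj, roughly like this (file names follow the ones shared earlier in this thread):

./llama-qwen2vl-cli -m Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf --mmproj mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --image wilma-7_oval.jpg -p "Describe the image."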

@HimariO (Contributor) commented Mar 20, 2025

@jfernandrezj Merging the LoRA parameters into the base model is the way to do it, since neither qwen2_vl_surgery.py nor clip.cpp recognizes LoRA parameter names.
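In case it helps others hitting the same thing, a minimal merge sketch using PEFT (the adapter path and output folder are hypothetical, and it assumes a Transformers build that already ships Qwen2_5_VLForConditionalGeneration):

from transformers import Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

# Load the base model, attach the LoRA adapter, then bake the adapter weights in.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()
merged.save_pretrained("Qwen2.5-VL-7B-merged")

# The merged folder can then go through qwen2_vl_surgery.py / convert_hf_to_gguf.py as usual.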

@jfernandrezj commented Mar 20, 2025

Thank you very much @HimariO for your response. I merged the base model and the LoRA, and ran the surgery script to get a ~2.5 GB GGUF.

@Hrayo712

@HimariO Thanks for the amazing work! I will try this shortly.

Can you comment on the status of the 7B variant? Also, have you tested any quantization of the model?

@green-s commented Mar 20, 2025

[quoting @Hrayo712 above]

I haven't tried 7B but I uploaded a conversion/quant of 72B to HuggingFace here. Only a couple quant levels right now since they take a while to upload but I can make others if you want. It seems to work well.

@Hrayo712

Thanks for the prompt response @green-s!

Would you foresee any issues with applying the same procedure you followed for the 72B to the 7B (e.g., at 4 bits)?

It would be great if you could share pointers for running the conversion and quantization.

@green-s commented Mar 20, 2025

[quoting @Hrayo712 above]

I used the exact same commands provided by @HimariO here. I'm not sure if I was supposed to do something differently but it seemed to work fine, so I presume 7B should too. I could convert 7B also if you like.

@Hrayo712

[quoting the exchange above]

That would be amazing! Thanks :)

@green-s commented Mar 21, 2025

@Hrayo712 Just uploaded Q4_K_M here. Uploading a few more variants now.

@RicksThread commented Mar 21, 2025

Is it possible to make sure the model gets loaded only once? Would an implementation similar to llama-server improve inference time?

@HimariO (Contributor) commented Mar 24, 2025

@RicksThread Yes, it is possible. But with the vision API refactoring (and the re-enabling of multimodal support in llama-server) on the horizon, I have no plans to add more functionality to llama-qwen2vl-cli.

@CKwasd commented Mar 25, 2025

Works great with green-s' Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf and HimariO's llama.cpp qwen25-vl branch.

@iqddd commented Mar 25, 2025

They just dropped the 32B VL version.

@HimariO (Contributor) commented Mar 25, 2025

I just ran a simple test using the 32B variant of the model with a smaller sample image (500x300 pixels, to be specific). It still took around 20 minutes to generate a single caption on my setup with the CPU backend, but the result looked pretty decent.

I've uploaded the GGUF files to the Hugging Face Hub so that others with better hardware can give it a try.

[input image omitted]

Output: The image shows a serene and heartwrenching scene of a young woman sitting on a sandy beach, enjoying a moment of connection with her dog. Here are the details:
1. Location: The setting is a sandy beach with gentle, rolling surf in the background. The beach is calm, and the sand appears smooth, indicating a tranquil environment.
2. Time of Day: The warm, golden light suggests that it is either sunrise or, more likely, late afternoon, as the warm hues are consistent with the golden hour before the sun sets.
3. The Person: A young woman is sitting on the sand with her back slightly angled toward the camera. She has dark, long hair that is pulled back, and she is wearing a plaid flannel top and dark-colored bottoms. She has a calm, content expression on her face, looking at her dog.
4. The Animal: A large, light-colored Labrador Retriever is sitting close to the woman. The dog has a blue collar and is looking attentively at the woman. The dog’s body language appears calm and friendly, and it seems to be enjoying the moment.
5. Interactions: The woman is playfully reaching out toward the dog, as if offering or receiving something
llama-qwen2vl-cli -m qwen25-vl-32b-instruct-vision-00001-of-00002.gguf --mmproj qwen-qwen2.5-vl-32b-instruct-vision.gguf -p "Describe this image." --image demo_small.jpg --threads 24

@green-s commented Mar 25, 2025

@HimariO The llama-llava-clip-quantize-cli command doesn't seem to be working with the vision GGUFs (I get no output and it just immediately exits), and that prevents the 32B at 4-bit from easily fitting on one 24GB GPU. Any chance you could fix that?

@panda44312

Also, consider supporting Qwen2.5-Omni?

@jfernandrezj

What is the right way of running a batch of 4 images? When I include several --image arguments it just seems to run them sequentially.

@jfernandrezj

[quoting @vladislavdonchev's update above]

@vladislavdonchev When you say batching, it does not really batch, right? It seems to load the model and run inference for each image sequentially. Am I missing something?

@HimariO (Contributor) commented Mar 30, 2025

@green-s According to my testing, the 7B and 32B models' vision encoder GGUFs work fine with the clip-quantize-cli tool when quantized to Q4; only the 3B variant fails the conversion because its hidden state size (channels) is not divisible by the block size.
Could you provide the 32B vision gguf file that is causing the problem?

@green-s commented Mar 30, 2025

[quoting @HimariO above]

Ah sorry I only tried running it on Windows. Tried it on Linux and it worked fine.

@Kreijstal

please add Qwen 2.5 omni support.

@ER-EPR commented Apr 1, 2025

When will llama.cpp officially support Qwen 2.5 VL?

@HimariO (Contributor) commented Apr 4, 2025

@ER-EPR, I've just wrapped up the PR (#12402) for Qwen 2.5 VL support today. It should be integrated once the review process is complete.

@green-s commented Apr 4, 2025

@HimariO Will existing conversions/quants work after those changes or do they have to be redone?

@soldivelot

I got an error with this build: https://github.com/HimariO/llama.cpp.qwen2.5vl/releases/tag/b5043
and this model: https://huggingface.com/samgreen/Qwen2.5-VL-32B-Instruct-GGUF

command:

.\bin-qvl\llama-qwen2vl-cli `
-m .\models\samgreen\Qwen25-VL-32B-Instruct-Q4_K_M.gguf --mmproj .\models\samgreen\qwen2.5-vl-32b-instruct-vision-f16.gguf --image .\image.png `
-p "describe this image" `
-t 16 -ngl 32

output:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
build: 5043 (c262bedd) with MSVC 19.29.30158.0 for
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4070) - 11094 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 771 tensors from .\models\samgreen\Qwen25-VL-32B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 VL 32B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-VL
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,2]       = ["multimodal", "image-text-to-text"]
llama_model_loader: - kv   8:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   9:                        qwen2vl.block_count u32              = 64
llama_model_loader: - kv  10:                     qwen2vl.context_length u32              = 128000
llama_model_loader: - kv  11:                   qwen2vl.embedding_length u32              = 5120
llama_model_loader: - kv  12:                qwen2vl.feed_forward_length u32              = 27648
llama_model_loader: - kv  13:               qwen2vl.attention.head_count u32              = 40
llama_model_loader: - kv  14:            qwen2vl.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  16:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  17:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.48 GiB (4.85 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 128000
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 27648
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 128000
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen2.5 VL 32B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloaded 32/65 layers to GPU
load_tensors:        CUDA0 model buffer size =  8949.62 MiB
load_tensors:   CPU_Mapped model buffer size =  9976.38 MiB
.................................................................................................
clip_init: model name:   Qwen2.5-VL-32B-Instruct
clip_init: description:  image encoder for Qwen2VL
clip_init: GGUF version: 3
clip_init: alignment:    32
clip_init: n_tensors:    520
clip_init: n_kv:         24
clip_init: ftype:        f16

clip_init: loaded meta data with 24 key-value pairs and 520 tensors from .\models\samgreen\qwen2.5-vl-32b-instruct-vision-f16.gguf
clip_init: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_init: - kv   0:                       general.architecture str              = clip
clip_init: - kv   1:                        general.description str              = image encoder for Qwen2VL
clip_init: - kv   2:                          general.file_type u32              = 1
clip_init: - kv   3:                      clip.has_text_encoder bool             = false
clip_init: - kv   4:                    clip.has_vision_encoder bool             = true
clip_init: - kv   5:                    clip.has_qwen2vl_merger bool             = true
clip_init: - kv   6:                        clip.projector_type str              = qwen2vl_merger
clip_init: - kv   7:                              clip.use_silu bool             = true
clip_init: - kv   8:                              clip.use_gelu bool             = false
clip_init: - kv   9:                           clip.use_glu_mlp bool             = true
clip_init: - kv  10:                          clip.use_rms_norm bool             = true
clip_init: - kv  11:          clip.vision.fullatt_block_indexes arr[i32,4]       = [7, 15, 23, 31]
clip_init: - kv  12:                    clip.vision.window_size u32              = 112
clip_init: - kv  13:               clip.vision.embedding_length u32              = 1280
clip_init: - kv  14:                 clip.vision.projection_dim u32              = 5120
clip_init: - kv  15:                     clip.vision.patch_size u32              = 14
clip_init: - kv  16:                     clip.vision.image_size u32              = 560
clip_init: - kv  17:           clip.vision.attention.head_count u32              = 16
clip_init: - kv  18:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_init: - kv  19:                    clip.vision.block_count u32              = 32
clip_init: - kv  20:            clip.vision.feed_forward_length u32              = 0
clip_init: - kv  21:                               general.name str              = Qwen2.5-VL-32B-Instruct
clip_init: - kv  22:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_init: - kv  23:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_init: - type  f32:  292 tensors
clip_init: - type  f16:  228 tensors
clip_ctx: CLIP using CUDA0 backend
clip_init: text_encoder:   0
clip_init: vision_encoder: 1
clip_init: llava_projector:  0
clip_init: minicpmv_projector:  0
clip_init: minicpmv_version:  2
clip_init: glm_projector:  0
clip_init: model size:     1314.85 MB
clip_init: metadata size:  0.18 MB
clip_init: params backend buffer size =  1314.85 MB (520 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.feature_layer not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file

@HimariO (Contributor) commented Apr 7, 2025

@green-s Previously converted gguf should work fine with recent changes; nothing has changed in the vision encoder gguf format.

@soldivelot Those "errors" are normal, since they are raised by non-essential parameters that Qwen2.5VL doesn't use.

@ColumbusAI

How are you guys using Qwen VL models to have a conversation about images?

The llama.cpp binaries I'm finding only provide zero-shot prompting, not a true conversation or an OpenAI-compatible endpoint that I can use with Open-WebUI to incorporate images into my text conversations.

Appreciate the insight!

@zrrraa commented Apr 9, 2025

[quoting @HimariO above]

@HimariO Then how do you quantize the vision encoder of the 3B variant? I also failed to quantize the vision encoders for the 7B and 32B. May I know which version of the code you are using?

@Melon-Bread commented Apr 11, 2025

[quoting @ColumbusAI above]

Koboldcpp is the only way I know of at this moment if you want to do it locally.
