
Feature Request: Qwen 2.5 VL #11483

Open · bold84 opened this issue Jan 29, 2025 · 62 comments
Labels: enhancement (New feature or request)

Comments

@bold84 commented Jan 29, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Is anybody implementing this?

If not, I may give it a go. But it will take some time as I am new to the source side of llama.cpp/ggml.

Motivation

Well, it's not currently working. :-)

Possible Implementation

Based on the existing Qwen 2 VL implementation.

bold84 added the enhancement (New feature or request) label Jan 29, 2025
@HimariO (Contributor) commented Jan 29, 2025

I'm currently looking into Transformers' Qwen2.5VL implementation and waiting for the paper to drop so I can better assess the differences between Qwen2VL and Qwen2.5VL. 👀

@3unnycheung

cool

@samkoesnadi (Contributor)

I support this!

@Shyryp commented Feb 2, 2025

Our world definitely needs this!

@peter-ch

Any progress on this? Who added support for Qwen 2 VL?

@pszemraj commented Feb 20, 2025

qwen2.5-vl report is up! https://huggingface.co/papers/2502.13923

edit: official codebase here: https://github.com/QwenLM/Qwen2.5-VL

@vladislavdonchev

I can start working on this if no one else is already.

@vladislavdonchev commented Feb 22, 2025

OK then!

First order of business would be to build the GGUF file(s). There seems to be an issue with that on the latest official Transformers:

python convert_hf_to_gguf.py .\build\bin\Release\Qwen2.5-VL-7B-Instruct\
INFO:hf-to-gguf:Loading model: Qwen2.5-VL-7B-Instruct
ERROR:hf-to-gguf:Model Qwen2_5_VLForConditionalGeneration is not supported

These upstream issues are being actively discussed:
huggingface/transformers#36292
QwenLM/Qwen2.5-VL#723

It appears a temporary workaround is to use the old Qwen2 templates. People are reporting that this works, so I'll post an update in a bit.

@vladislavdonchev commented Feb 22, 2025

Right, so this one is a bit of a rabbit hole...

I. Reverting the Qwen2.5 config files to:

"processor_class": "Qwen2VLProcessor"

and

  "architectures": [
    "Qwen2VLForConditionalGeneration"
  ]

Produces a (seemingly) working model! We've started testing and quantizing it here:
https://huggingface.co/IAILabs/Qwen2.5-VL-7b-Instruct-GGUF/tree/main
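For anyone following along, here is a minimal sketch of the architecture override described above, applied before running the converter (the model path is hypothetical, and the "processor_class" change goes into the corresponding processor config file in the same folder):

import json

# Point the config at the Qwen2-VL class names so the unmodified converter accepts the model.
cfg_path = "Qwen2.5-VL-7B-Instruct/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["architectures"] = ["Qwen2VLForConditionalGeneration"]
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)

# Afterwards: python convert_hf_to_gguf.py Qwen2.5-VL-7B-Instruct/ --outtype f16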


II. In order to get a usable experience, you need to make sure CLIP is running with hardware acceleration. This currently requires you to revert this commit:
#10896

For more information refer to:
#11322

The following PR seems to correct (at least) some of the issues that led to disabling hardware acceleration in the first place:
#11902

So, it is now up to us to prove that everything is working properly.

I'll start a stress / perf eval test alongside the quantization process, so we have a better idea about what's going on.

@vladislavdonchev commented Feb 23, 2025

UPDATE: A few 4-bit quants have been uploaded, including two that support online auto-repacking.

The latest main looks stable with Vulkan CLIP and any model thrown at it so far. Some preliminary insights:

  • 1200x1200 is the maximum you can encode with 16GB of VRAM. clip.cpp does not seem to support multi-GPU Vulkan yet.
  • A 4060Ti-class GPU delivers 20-30 t/s with the Q8_0 and double that on Q4 @ 16-32K context.
  • Batching (multiple images) in a single cli call seems to be working fine:
    llama-qwen2vl-cli --ctx-size 16000 -n 16000 -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --n_gpu_layers 9999 -p "Describe the image in detail. Extract all textual information from it. Output as detailed JSON." -p "Analyze the image." --image ~/Pictures/test_small.png --image ~/Pictures/test_small.png

Output quality looks very promising! We'll release all of the benchmark code when ready, so the process can be streamlined for other models.

@hvico commented Feb 24, 2025

Hi! Excellent news, thank you very much for this!

I was able to run the model using code from git main on a 4 x Radeon 7900 XTX 24 GB workstation, but with CLIP on CPU. I tried to enable Vulkan acceleration for CLIP by uncommenting the lines in clip.cpp under examples, but in that case I get OOM. I tried this with the FP16, Q4K_M and IQ4_XS models. Telling the CLI to use just one Vulkan device does not help with the OOM / CLIP-on-GPU issue either.

@vladislavdonchev commented Feb 24, 2025

[quoting @hvico above]

Hi, could you please confirm what the resolution of your input images is?

EDIT: As per Qwen2.5 docs:
min_pixels = 256x28x28
max_pixels = 1280x28x28

A RTFM moment for me...
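If it helps anyone else, here is a small pre-processing sketch using Pillow to keep an image within that max_pixels budget before handing it to the CLI (file names are made up):

import math
from PIL import Image

MAX_PIXELS = 1280 * 28 * 28  # from the Qwen2.5-VL docs quoted above

img = Image.open("test.png")
if img.width * img.height > MAX_PIXELS:
    # Scale both sides by the same factor so width*height lands under the budget.
    scale = math.sqrt(MAX_PIXELS / (img.width * img.height))
    img = img.resize((int(img.width * scale), int(img.height * scale)), Image.LANCZOS)
img.save("test_small.png")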

@hvico commented Feb 24, 2025

[quoting @vladislavdonchev:] Hi, could you please confirm what the resolution of your input images is? With 24G VRAM, you can expect an OOM with images >1400x1400 pixels, so you need to make sure the files are pre-processed correctly. Thanks.

My image was 1475x1062. I was able to run inference successfully using a 1077x671 sample, without OOM. Would it be possible to run CLIP and the language model on separate GPUs? Thanks again.

@zrrraa commented Feb 25, 2025

[quoting @vladislavdonchev's earlier comment]

Thank you very much for your research and sharing! How do you get the mmproj from the Qwen2.5-VL model? The original qwen2_vl_surgery.py used for Qwen2-VL doesn't seem to work. Could you share your method? Thank you very much!

@vladislavdonchev

[quoting @zrrraa's question above]

Get it from our HF:
https://huggingface.co/IAILabs/Qwen2.5-VL-7b-Instruct-GGUF
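If you want to pull the files locally, something like this should work (the local directory name is arbitrary):

from huggingface_hub import snapshot_download

# Downloads the LM and mmproj GGUFs from the repo linked above.
snapshot_download(repo_id="IAILabs/Qwen2.5-VL-7b-Instruct-GGUF", local_dir="Qwen2.5-VL-7b-Instruct-GGUF")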

@ChmHsm commented Feb 27, 2025

Thank you for the effort, a lot of people really need this.

Any updates on the progress? Will this take a few more days, or is it more like weeks or months?

Thanks a lot again, we appreciate you guys a lot!

@samkoesnadi (Contributor)

@vladislavdonchev Great work! Have you done the 3B version? I can also do it myself if you provide the conversion script :)

@vladislavdonchev

@vladislavdonchev Great work! Have you done the 3B version? I can also do it myself if you provide the conversion script :)

Working on it as we speak, along with a quantization tool:


https://github.com/Independent-AI-Labs/local-super-agents/tree/feat/additional-output-formats/quantbench

@vladislavdonchev

UPDATE:

Opened a draft PR here: #12119

Long story short, I'll need some help debugging the vision models and llama-qwen2vl-cli, as we're unable to produce a reliably working vision model.

In addition, this still isn't resolved:
#11322

I've also asked the Qwen folks for help:
QwenLM/Qwen2.5-VL#869

@ChmHsm commented Feb 28, 2025

Thanks @vladislavdonchev for the effort and the update.

I took a look at the issue you opened with the Qwen team. Is it only affecting the 3B model? Can we at least expect progress to continue on the 7B?

Thank you!

@vladislavdonchev commented Feb 28, 2025

[quoting @ChmHsm above]

Unfortunately, we're unable to reliably produce a working vision model from either the 7B or the 3B. I'm not sure how the one in the repo was exported, but it seems to be working, so it's either some weird coincidence or a mistake. I've verified the LM part, including the quants, and it appears to match what you'd expect from Qwen2.5 (the parameters in the .gguf look correct, responses are OK).

@David33706

[quoting @vladislavdonchev's earlier comment]

I am getting the following error while trying to use Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf on Apple Silicon:

./llama-qwen2vl-cli -m "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --mmproj "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --n_gpu_layers 0 --image "wilma-7_oval.jpg" --image "wilma-7_oval.jpg" -p "Describe the image."

key general.description not found in file
libc++abi: terminating due to uncaught exception of type std::runtime_error: Missing required key: general.description
zsh: abort      ./llama-qwen2vl-cli -m  --mmproj  --n_gpu_layers 0 --image  --image  -p

Could somebody please help out?

@tomjpalamattam

[quoting @David33706's question above]

Did you figure this out?

@David33706

[quoting the exchange above]

Nope

@vladislavdonchev commented Mar 3, 2025

Please stop spamming this thread. Qwen2.5 is still a WIP!

Regarding the issue above:
./llama-qwen2vl-cli -m "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --mmproj "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --n_gpu_layers 0 --image "wilma-7_oval.jpg" --image "wilma-7_oval.jpg" -p "Describe the image."
You cannot use the language model GGUF as the vision model (mmproj); in your command you are passing the same file twice.

Please wait until the implementation has been finalized.
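For reference, a working invocation needs two different files, the language model GGUF for -m and the vision/mmproj GGUF for --mmproj, roughly like this (file names follow the ones shared earlier in this thread):

./llama-qwen2vl-cli -m Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf --mmproj mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --image wilma-7_oval.jpg -p "Describe the image."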

@HimariO (Contributor) commented Mar 20, 2025

@jfernandrezj Merging the LoRA parameters into the base model is the way to do it, since neither qwen2_vl_surgery.py nor clip.cpp recognizes LoRA parameter names.
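In case it helps others hitting the same thing, a minimal merge sketch using PEFT (the adapter path and output folder are hypothetical, and it assumes a Transformers build that already ships Qwen2_5_VLForConditionalGeneration):

from transformers import Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

# Load the base model, attach the LoRA adapter, then bake the adapter weights in.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()
merged.save_pretrained("Qwen2.5-VL-7B-merged")

# The merged folder can then go through qwen2_vl_surgery.py / convert_hf_to_gguf.py as usual.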

@jfernandrezj commented Mar 20, 2025

Thank you very much @HimariO for your response. I merged the base model and the LoRA, and ran the surgery script to get a ~2.5 GB GGUF.

@Hrayo712

@HimariO Thanks for the amazing work! I will try this shortly.

Can you comment on the status of the 7B variant? Also, have you tested any quantization of the model?

@green-s commented Mar 20, 2025

[quoting @Hrayo712 above]

I haven't tried 7B but I uploaded a conversion/quant of 72B to HuggingFace here. Only a couple quant levels right now since they take a while to upload but I can make others if you want. It seems to work well.

@Hrayo712

Thanks for the prompt response @green-s!

Would you foresee any issues with applying the same procedure you followed for the 72B to the 7B (e.g., at 4 bits)?

It would be great if you could share pointers for running the conversion and quantization.

@green-s commented Mar 20, 2025

[quoting @Hrayo712 above]

I used the exact same commands provided by @HimariO here. I'm not sure if I was supposed to do something differently but it seemed to work fine, so I presume 7B should too. I could convert 7B also if you like.

@Hrayo712

[quoting the exchange above]

That would be amazing! Thanks :)

@green-s commented Mar 21, 2025

@Hrayo712 Just uploaded Q4_K_M here. Uploading a few more variants now.

@RicksThread commented Mar 21, 2025

Is it possible to make sure the model gets loaded only once? Would an implementation similar to llama-server improve inference time?

@HimariO (Contributor) commented Mar 24, 2025

@RicksThread Yes, it is possible. But with the vision API refactoring (and the re-enabling of multimodal support in llama-server) on the horizon, I have no plans to add more functionality to llama-qwen2vl-cli.

@CKwasd commented Mar 25, 2025

Works great with green-s' Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf and HimariO's llama.cpp qwen25-vl branch.

@iqddd commented Mar 25, 2025

They just dropped the 32B VL version.

@HimariO (Contributor) commented Mar 25, 2025

I just ran a simple test using the 32B variant of the model with a smaller sample image (500x300 pixels, to be specific). It still took around 20 minutes to generate a single caption on my setup with the CPU backend, but the result looked pretty decent.

I've uploaded the GGUF files to the Hugging Face Hub so that others with better hardware can give it a try.

[input image omitted]

Output: The image shows a serene and heartwrenching scene of a young woman sitting on a sandy beach, enjoying a moment of connection with her dog. Here are the details:
1. Location: The setting is a sandy beach with gentle, rolling surf in the background. The beach is calm, and the sand appears smooth, indicating a tranquil environment.
2. Time of Day: The warm, golden light suggests that it is either sunrise or, more likely, late afternoon, as the warm hues are consistent with the golden hour before the sun sets.
3. The Person: A young woman is sitting on the sand with her back slightly angled toward the camera. She has dark, long hair that is pulled back, and she is wearing a plaid flannel top and dark-colored bottoms. She has a calm, content expression on her face, looking at her dog.
4. The Animal: A large, light-colored Labrador Retriever is sitting close to the woman. The dog has a blue collar and is looking attentively at the woman. The dog’s body language appears calm and friendly, and it seems to be enjoying the moment.
5. Interactions: The woman is playfully reaching out toward the dog, as if offering or receiving something
llama-qwen2vl-cli -m qwen25-vl-32b-instruct-vision-00001-of-00002.gguf --mmproj qwen-qwen2.5-vl-32b-instruct-vision.gguf -p "Describe this image." --image demo_small.jpg --threads 24

@green-s commented Mar 25, 2025

@HimariO The llama-llava-clip-quantize-cli command doesn't seem to be working with the vision GGUFs (I get no output and it just immediately exits), and that prevents the 32B at 4-bit from easily fitting on one 24GB GPU. Any chance you could fix that?

@panda44312

Also, consider supporting Qwen2.5-Omni?

@jfernandrezj

What is the right way of running a batch of 4 images? When I include several --image arguments it just seems to run them sequentially.

@jfernandrezj

[quoting @vladislavdonchev's update above]

@vladislavdonchev When you say batching, it does not really batch, right? It seems to load the model and run inference for each image sequentially. Am I missing something?

@HimariO (Contributor) commented Mar 30, 2025

@green-s According to my testing, the 7B and 32B models' vision encoder GGUFs work fine with the clip-quantize-cli tool when quantized to Q4; only the 3B variant fails the conversion because its hidden state size (channels) is not divisible by the block size.
Could you provide the 32B vision gguf file that is causing the problem?

@green-s commented Mar 30, 2025

[quoting @HimariO above]

Ah sorry I only tried running it on Windows. Tried it on Linux and it worked fine.

@Kreijstal

please add Qwen 2.5 omni support.

@ER-EPR commented Apr 1, 2025

When will llama.cpp officially support Qwen 2.5 VL?

@HimariO (Contributor) commented Apr 4, 2025

@ER-EPR, I've just wrapped up the PR (#12402) for Qwen 2.5 VL support today. It should be integrated once the review process is complete.

@green-s commented Apr 4, 2025

@HimariO Will existing conversions/quants work after those changes or do they have to be redone?

@soldivelot

I got an error with this build: https://github.com/HimariO/llama.cpp.qwen2.5vl/releases/tag/b5043
and this model: https://huggingface.com/samgreen/Qwen2.5-VL-32B-Instruct-GGUF

command:

.\bin-qvl\llama-qwen2vl-cli `
-m .\models\samgreen\Qwen25-VL-32B-Instruct-Q4_K_M.gguf --mmproj .\models\samgreen\qwen2.5-vl-32b-instruct-vision-f16.gguf --image .\image.png `
-p "describe this image" `
-t 16 -ngl 32

output:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
build: 5043 (c262bedd) with MSVC 19.29.30158.0 for
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4070) - 11094 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 771 tensors from .\models\samgreen\Qwen25-VL-32B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 VL 32B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-VL
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,2]       = ["multimodal", "image-text-to-text"]
llama_model_loader: - kv   8:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   9:                        qwen2vl.block_count u32              = 64
llama_model_loader: - kv  10:                     qwen2vl.context_length u32              = 128000
llama_model_loader: - kv  11:                   qwen2vl.embedding_length u32              = 5120
llama_model_loader: - kv  12:                qwen2vl.feed_forward_length u32              = 27648
llama_model_loader: - kv  13:               qwen2vl.attention.head_count u32              = 40
llama_model_loader: - kv  14:            qwen2vl.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  16:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  17:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.48 GiB (4.85 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 128000
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 27648
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 128000
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen2.5 VL 32B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloaded 32/65 layers to GPU
load_tensors:        CUDA0 model buffer size =  8949.62 MiB
load_tensors:   CPU_Mapped model buffer size =  9976.38 MiB
.................................................................................................
clip_init: model name:   Qwen2.5-VL-32B-Instruct
clip_init: description:  image encoder for Qwen2VL
clip_init: GGUF version: 3
clip_init: alignment:    32
clip_init: n_tensors:    520
clip_init: n_kv:         24
clip_init: ftype:        f16

clip_init: loaded meta data with 24 key-value pairs and 520 tensors from .\models\samgreen\qwen2.5-vl-32b-instruct-vision-f16.gguf
clip_init: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_init: - kv   0:                       general.architecture str              = clip
clip_init: - kv   1:                        general.description str              = image encoder for Qwen2VL
clip_init: - kv   2:                          general.file_type u32              = 1
clip_init: - kv   3:                      clip.has_text_encoder bool             = false
clip_init: - kv   4:                    clip.has_vision_encoder bool             = true
clip_init: - kv   5:                    clip.has_qwen2vl_merger bool             = true
clip_init: - kv   6:                        clip.projector_type str              = qwen2vl_merger
clip_init: - kv   7:                              clip.use_silu bool             = true
clip_init: - kv   8:                              clip.use_gelu bool             = false
clip_init: - kv   9:                           clip.use_glu_mlp bool             = true
clip_init: - kv  10:                          clip.use_rms_norm bool             = true
clip_init: - kv  11:          clip.vision.fullatt_block_indexes arr[i32,4]       = [7, 15, 23, 31]
clip_init: - kv  12:                    clip.vision.window_size u32              = 112
clip_init: - kv  13:               clip.vision.embedding_length u32              = 1280
clip_init: - kv  14:                 clip.vision.projection_dim u32              = 5120
clip_init: - kv  15:                     clip.vision.patch_size u32              = 14
clip_init: - kv  16:                     clip.vision.image_size u32              = 560
clip_init: - kv  17:           clip.vision.attention.head_count u32              = 16
clip_init: - kv  18:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_init: - kv  19:                    clip.vision.block_count u32              = 32
clip_init: - kv  20:            clip.vision.feed_forward_length u32              = 0
clip_init: - kv  21:                               general.name str              = Qwen2.5-VL-32B-Instruct
clip_init: - kv  22:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_init: - kv  23:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_init: - type  f32:  292 tensors
clip_init: - type  f16:  228 tensors
clip_ctx: CLIP using CUDA0 backend
clip_init: text_encoder:   0
clip_init: vision_encoder: 1
clip_init: llava_projector:  0
clip_init: minicpmv_projector:  0
clip_init: minicpmv_version:  2
clip_init: glm_projector:  0
clip_init: model size:     1314.85 MB
clip_init: metadata size:  0.18 MB
clip_init: params backend buffer size =  1314.85 MB (520 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.feature_layer not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file

@HimariO (Contributor) commented Apr 7, 2025

@green-s Previously converted gguf should work fine with recent changes; nothing has changed in the vision encoder gguf format.

@soldivelot Those "errors" are normal, since they are raised by non-essential parameters that Qwen2.5VL doesn't use.

@ColumbusAI

How are you guys using Qwen VL models to have a conversation about images?

The llama.cpp binaries I'm finding only provide zero-shot prompting, not a true conversation or an OpenAI-compatible endpoint that I can use with Open-WebUI to incorporate images into my text conversations.

Appreciate the insight!

@zrrraa commented Apr 9, 2025

[quoting @HimariO above]

@HimariO Then how do you quantize the vision encoder of the 3B variant? I also failed to quantize the vision encoders for the 7B and 32B. May I know which version of the code you are using?

@Melon-Bread commented Apr 11, 2025

[quoting @ColumbusAI above]

Koboldcpp is the only way I know of at this moment if you want to do it locally.
