
The server does not run inference and responds immediately with null (sometimes) #40

Open
hvico opened this issue Feb 27, 2025 · 5 comments

Comments

@hvico

hvico commented Feb 27, 2025

Hi!

First of all, thanks for this! I am testing generation with Qwen 2.5 VL 7B on the HIP backend (Linux with 4 x 7900 XTX), and it mostly works, but sometimes the server just fails to run the inference.

It seems random: sometimes I send a request to the server and get a response, then I immediately call it again with the same request and it just responds with something like this:

{"choices":[{"finish_reason":null,"index":0,"logprobs":null,"message":{"content":"","role":"assistant"}}],"created":1740670032,"id":"chatcmpl-5GBBtfR6HnG2L023sBRwLrvzsd2Sr9Vq","model":"qwen2.5-vl","object":"chat.completion","usage":null}

On the server side, even with verbose output enabled, it doesn't log anything in those cases.

But when it does run, it works perfectly well, logging the whole inference and giving a good response. It is as if the call is sometimes just ignored and the server immediately responds with that empty payload without running the model.
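
For reference, a minimal reproduction sketch of the kind of request involved; the host, port, and endpoint path are assumptions here (llama-box exposes an OpenAI-compatible chat-completions API), and the actual requests may also include an image:

```python
# Hypothetical reproduction sketch: send the same chat-completion request twice
# and flag responses whose message content comes back empty, as in the JSON above.
# Host, port, and endpoint are assumptions; adjust them to the local llama-box setup.
import requests

URL = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "qwen2.5-vl",
    "messages": [{"role": "user", "content": "Describe the attached image."}],
}

for attempt in range(2):
    resp = requests.post(URL, json=payload, timeout=600).json()
    content = resp["choices"][0]["message"]["content"]
    if not content:
        # Matches the reported behavior: empty content, null finish_reason and usage.
        print(f"attempt {attempt}: empty response -> {resp}")
    else:
        print(f"attempt {attempt}: ok ({len(content)} chars)")
```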

Could you give me some guidance on this? Thanks!

@thxCode
Collaborator

thxCode commented Feb 28, 2025

Hi, can you provide the source of the model? I haven't tested with Qwen2.5 VL 7B; it should work with Qwen2 VL.

@hvico
Author

hvico commented Feb 28, 2025

Hi!

I got the models from here: https://huggingface.co/IAILabs/Qwen2.5-VL-7B-Instruct-GGUF, which is referenced in the main feature request on llama.cpp:

ggml-org/llama.cpp#11483

But I must say that after setting the context to 32K (it was 16K) and also setting GGML_CANN_DISABLE_BUF_POOL_CLEAN=1, the inference runs stably!
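
For reference, a hypothetical launch sketch mirroring those settings; the flags follow llama.cpp conventions and the file paths are illustrative, so whether llama-box accepts exactly these options is an assumption:

```python
# Hypothetical launcher mirroring the settings above: 32K context instead of 16K,
# plus the GGML_CANN_DISABLE_BUF_POOL_CLEAN environment variable.
# Binary name, flags, and file paths are illustrative assumptions.
import os
import subprocess

env = dict(os.environ, GGML_CANN_DISABLE_BUF_POOL_CLEAN="1")
subprocess.run(
    [
        "llama-box",
        "-m", "Qwen2.5-VL-7B-Instruct-Q8_0.gguf",          # model path (illustrative)
        "--mmproj", "mmproj-Qwen2.5-VL-7B-Instruct.gguf",   # vision projector (illustrative)
        "-c", "32768",                                      # 32K context window
    ],
    env=env,
    check=True,
)
```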

This is great. I think you should mention that this tool can serve Qwen 2.5 VL, since neither the llama.cpp server nor Ollama currently can. And this model is really, really good.

Thanks again!

@vladislavdonchev

Hey, thanks for testing the quants on llama-box. We will add parameter variations and the BUF_POOL_CLEAN option to our benchmarking to cover cases like this!

I am also working on Qwen2.5 VL 3B, so stay tuned.

@thxCode
Collaborator

thxCode commented Feb 28, 2025

@hvico, GGML_CANN_DISABLE_BUF_POOL_CLEAN is for the CANN backend, but I saw your original case is on HIP, so is there anything I missed?

@hvico
Author

hvico commented Feb 28, 2025

Wow, so probably that wasn't the fix. Maybe it was the increased context I configured (which is the original ctx of the model). But there is probably some missing logging code, because in that scenario the server replied with null yet nothing was logged by llama-box, even with maximum verbosity set.

Anyway the model is working fine! Nice tool!
