
The server does not run inference and responds immediately with null (sometimes) #40

Open
hvico opened this issue Feb 27, 2025 · 5 comments

Comments

@hvico

hvico commented Feb 27, 2025

Hi!

First of all, thanks for this! I am testing generation with Qwen 2.5 VL 7B on the HIP backend (Linux with 4 x 7900 XTX), and it mostly works, but sometimes the server just fails to run the inference.

It seems random: sometimes I send a request to the server and get a response, then I immediately call it again with the same request and it just responds with something like this:

{"choices":[{"finish_reason":null,"index":0,"logprobs":null,"message":{"content":"","role":"assistant"}}],"created":1740670032,"id":"chatcmpl-5GBBtfR6HnG2L023sBRwLrvzsd2Sr9Vq","model":"qwen2.5-vl","object":"chat.completion","usage":null}

On the server side, even with verbose output enabled, it doesn't log anything in those cases.

But when it does run, it works perfectly well, logging the whole inference and giving a good response. It is as if the call is sometimes just ignored and the server immediately responds with that empty payload without running the model.
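
For reference, a minimal reproduction sketch of the kind of request involved; the host, port, and endpoint path are assumptions here (llama-box exposes an OpenAI-compatible chat-completions API), and the actual requests may also include an image:

```python
# Hypothetical reproduction sketch: send the same chat-completion request twice
# and flag responses whose message content comes back empty, as in the JSON above.
# Host, port, and endpoint are assumptions; adjust them to the local llama-box setup.
import requests

URL = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "qwen2.5-vl",
    "messages": [{"role": "user", "content": "Describe the attached image."}],
}

for attempt in range(2):
    resp = requests.post(URL, json=payload, timeout=600).json()
    content = resp["choices"][0]["message"]["content"]
    if not content:
        # Matches the reported behavior: empty content, null finish_reason and usage.
        print(f"attempt {attempt}: empty response -> {resp}")
    else:
        print(f"attempt {attempt}: ok ({len(content)} chars)")
```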

Could you give me some guidance on this? Thanks!

@thxCode
Collaborator

thxCode commented Feb 28, 2025

Hi, can you provide the source of the model? I haven't tested with Qwen2.5 VL 7B; it should work with Qwen2 VL.

@hvico
Author

hvico commented Feb 28, 2025

Hi!

I got the models from here: https://huggingface.co/IAILabs/Qwen2.5-VL-7B-Instruct-GGUF, which is referenced in the main feature request on llama.cpp:

ggml-org/llama.cpp#11483

But I must say that after setting the context to 32K (it was 16K) and also setting GGML_CANN_DISABLE_BUF_POOL_CLEAN=1, the inference runs stably!
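
For reference, a hypothetical launch sketch mirroring those settings; the flags follow llama.cpp conventions and the file paths are illustrative, so whether llama-box accepts exactly these options is an assumption:

```python
# Hypothetical launcher mirroring the settings above: 32K context instead of 16K,
# plus the GGML_CANN_DISABLE_BUF_POOL_CLEAN environment variable.
# Binary name, flags, and file paths are illustrative assumptions.
import os
import subprocess

env = dict(os.environ, GGML_CANN_DISABLE_BUF_POOL_CLEAN="1")
subprocess.run(
    [
        "llama-box",
        "-m", "Qwen2.5-VL-7B-Instruct-Q8_0.gguf",          # model path (illustrative)
        "--mmproj", "mmproj-Qwen2.5-VL-7B-Instruct.gguf",   # vision projector (illustrative)
        "-c", "32768",                                      # 32K context window
    ],
    env=env,
    check=True,
)
```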

This is great. I think you should mention that this tool can serve Qwen 2.5 VL, since neither the llama.cpp server nor Ollama currently can. And this model is really, really good.

Thanks again!

@vladislavdonchev

Hey, thanks for testing the quants on llama-box. We will add parameter variations and the BUF_POOL_CLEAN option to our benchmarking to cover cases like this!

I am also working on Qwen2.5 VL 3B, so stay tuned.

@thxCode
Collaborator

thxCode commented Feb 28, 2025

@hvico, GGML_CANN_DISABLE_BUF_POOL_CLEAN is for the CANN backend, but I saw your original case is on HIP, so is there anything I missed?

@hvico
Author

hvico commented Feb 28, 2025

Wow, so probably that wasn't the fix. Maybe it was the increased context I configured (which is the original ctx of the model). But there is probably some missing logging code, because in that scenario the server replied with null yet nothing was logged by llama-box, even with maximum verbosity set.

Anyway the model is working fine! Nice tool!
