The server sometimes fails to run inference and responds immediately with null #40
Comments
Hi, can you provide the source of the model? I haven't tested with Qwen2.5 VL 7B; it should work with Qwen2 VL.
Hi! I got the models from here: https://huggingface.co/IAILabs/Qwen2.5-VL-7B-Instruct-GGUF (they are also referenced in the main llama.cpp feature request). But I must say that after setting the context to 32K (it was 16K) and also setting GGML_CANN_DISABLE_BUF_POOL_CLEAN=1, the inference runs stably! This is great. I think you should mention that this tool can serve Qwen2.5 VL, as neither the llama.cpp server nor Ollama currently can. And this model is really, really good. Thanks again!
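For anyone hitting the same thing, here is a minimal sketch of applying both workarounds when launching the server. The model filename is a hypothetical placeholder and the flag names are assumptions based on llama.cpp-style options, so check `llama-box --help` for the exact options in your build:

```python
import os
import subprocess

# Workaround reported above: set the buffer-pool env var in the
# server's environment before starting it.
env = os.environ.copy()
env["GGML_CANN_DISABLE_BUF_POOL_CLEAN"] = "1"

# Launch llama-box with the context window raised to 32K.
# Flag names and the model path below are illustrative, not verified.
subprocess.run(
    [
        "llama-box",
        "-m", "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf",  # hypothetical filename
        "-c", "32768",                               # 32K context
        "--port", "8080",
    ],
    env=env,
    check=True,
)
```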
Hey, thanks for testing the quants on llama-box. We will add parameter variations and the BUF_POOL_CLEAN option to our benchmarking to cover cases like this! I am also working on Qwen2.5 VL 3B, so stay tuned.
@hvico,
Whoa, so probably that wasn't the issue. Maybe it was the increased context I configured (which is the model's original context size). But there is probably some missing logging code, because in that scenario the server replied with null yet nothing was logged by llama-box, even at maximum verbosity. Anyway, the model is working fine! Nice tool!
Hi!
First of all, thanks for this! I am testing generation with Qwen2.5 VL 7B on the HIP backend (Linux with 4 x 7900 XTX) and it mostly works, but sometimes the server just fails to execute the inference.
It seems random: sometimes I send a request to the server and get a response; then I immediately call it again with the same request and it just responds with something like this:
{"choices":[{"finish_reason":null,"index":0,"logprobs":null,"message":{"content":"","role":"assistant"}}],"created":1740670032,"id":"chatcmpl-5GBBtfR6HnG2L023sBRwLrvzsd2Sr9Vq","model":"qwen2.5-vl","object":"chat.completion","usage":null}
On the server side, even with verbose output enabled, it doesn't log anything in those cases.
But when it does run, it works perfectly well, logging the full inference and giving a good response. It is as if the call is sometimes simply ignored and the server immediately responds with that empty payload without running the model.
Could you give me some guidance on this? Thanks!
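To make the intermittent failure easier to reproduce and count, here is a rough client-side sketch. It assumes the server exposes the OpenAI-compatible /v1/chat/completions endpoint implied by the response above; the URL, prompt, and request count are placeholders:

```python
import requests

URL = "http://localhost:8080/v1/chat/completions"  # placeholder address

payload = {
    "model": "qwen2.5-vl",
    "messages": [{"role": "user", "content": "Describe this image."}],
}

empty = 0
for i in range(20):
    resp = requests.post(URL, json=payload, timeout=300)
    msg = resp.json()["choices"][0]["message"]
    # The failure mode reported here: an immediate reply whose content
    # is "" and whose finish_reason/usage are null.
    if not msg["content"]:
        empty += 1
        print(f"request {i}: empty response (no inference run)")
    else:
        print(f"request {i}: ok, {len(msg['content'])} chars")

print(f"{empty}/20 requests came back empty")
```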