server : (experimental) vision support via libmtmd #12898
base: master
Conversation
@qnixsynapse can you capture the raw HTTP request? If the JSON payload is big, you can share it via a gist.
@ngxson Will this be okay? https://gist.github.com/qnixsynapse/a4c61368d05180d3cb6c00f1baedf92c
At minimum I'm asking for this: https://wiki.wireshark.org/hyper_text_transfer_protocol, not the raw IP packet.
I don't have Wireshark installed, unfortunately. But you can still inspect it, for example: …
@qnixsynapse I had a problem with my logic which made it discard the text batch that comes before the image batch. It should be fixed now, could you give it a try?
Btw @ggerganov I'm noting here for visibility: while working on this PR, I realized that there are 2 refactorings which can be done in their own dedicated PRs:
And I also have a question regarding the logic around … Edit: optionally one more refactoring, we should split …
@ngxson Nvm. Ended up using your fork. On further testing, it seems that the llama_batch size is sometimes exceeded in successive requests.
This was useful mainly before the defragmentation support was added. The reason is that with time the KV cache can become highly fragmented, and even if it has enough total capacity, it may not be able to find a contiguous slot for the batch. I'll think about the input chunk question today and let you know if I have any thoughts.
common/arg.cpp (outdated)
```cpp
    params.mmproj.hf_repo = params.model.hf_repo;
}
// TODO @ngxson : this will break non-vision model with -hf, need to fix before merging
common_params_handle_model(params.mmproj, params.hf_token, "", true);
```
@ngxson Is it possible to add a `--no-offload-mmproj` param here to keep the mmproj model on the CPU and the larger text model on the GPU? We can use `mtmd_context_params: use_gpu=false` to keep the projector model on the CPU itself.
llama.cpp/examples/llava/mtmd.h
Lines 52 to 58 in 0019279
```cpp
struct mtmd_context_params {
    bool use_gpu = true;
    bool print_timings = true;
    int n_threads = 4;
    enum ggml_log_level verbosity = GGML_LOG_LEVEL_INFO;
    const char * image_marker = "<__image__>";
};
```
This would be useful where GPU VRAM is limited.
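For illustration, here is a minimal sketch of keeping the projector on the CPU via that struct while the text model is offloaded to the GPU. It assumes the `mtmd_init_from_file()` / `mtmd_free()` API introduced in #12849; the file paths are placeholders:

```cpp
#include "llama.h"
#include "mtmd.h"

int main() {
    // Text model: offload layers to the GPU as usual.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // placeholder path

    // Projector: keep it on the CPU to save VRAM.
    mtmd_context_params mtmd_params;
    mtmd_params.use_gpu = false;
    mtmd_context * ctx_vision = mtmd_init_from_file("mmproj.gguf", model, mtmd_params); // placeholder path

    // ... tokenize/encode images with ctx_vision, decode with a llama context as usual ...

    mtmd_free(ctx_vision);
    llama_model_free(model);
    return 0;
}
```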
It's added in #13093. The flag is `--no-mmproj-offload` instead of `--no-offload-mmproj`, to align with the existing `--no-kv-offload`.
Seems like the batch decoding dies when you send a variety of longer requests.
The easiest way to trigger it is to just wiggle the sequence length around, like with the example code:

```python
import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

for mult in [100, 200]:  # (beinsezii) make sure it has to rebuild some cache the 2nd time
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        stream=True,
        messages=[
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "describe what you see in details\n" * mult },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                        },
                    },
                ],
            }
        ],
    )
    for chunk in response:
        print(chunk.choices[0].delta.content, end="")
    print("\n\n")
```
examples/server/utils.hpp (outdated)
```cpp
void reserve_embd_batch(float * embd, int32_t n_tokens, llama_pos pos_0, llama_seq_id seq_id) {
    GGML_ASSERT(n_tokens <= (int32_t)pos.size());
    seq_ids[n_tokens] = nullptr;
```
@Beinsezii yes, seems like it's due to this single line which shouldn't be here. I'm removing it and will push a commit.
Oh nice, so the server will automatically have everything libmtmd does? Like #13050, post-merge?
I'll try to break the decode again here in a bit.
Yes, everything supported by libmtmd will be available on the server.
Using gemma 27 I've run multiple fat images interspersed with varying text sequences and it seems to be stable now. I'm up to like 5 images in one prompt, so that should be all good now.
I know from experience GLM is broken on ROCm, but I am curious about MiniCPM and SmolVLM so I'll try those at some point.
ibm-research/granite-vision-3.2-2b-GGUF fails to encode when feeding in a 3840x2160 image, but otherwise it's been a positive experience.
Yes, some models require a larger `--batch` for now. This will be fixed soon.
A larger `--batch` is no longer required. For some models, you may want to bump the context size to 8k or 16k because they use a lot more tokens for bigger images.
Yes, but storing it as a hex string is easier for debugging, so it must be converted to a hex string to prevent potential problems with null bytes. This conversion is currently missing in the code.
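For reference, a tiny sketch of such a conversion (a hypothetical helper, not the code in this PR):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: render raw hash bytes as a lowercase hex string so the value
// is printable for debugging and can never contain an embedded null byte.
static std::string bytes_to_hex(const std::vector<uint8_t> & data) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(data.size() * 2);
    for (uint8_t b : data) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}
```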
Significant changes in the last commits:
Curious if Small 3.1 uses the same vision mechanism or if that will need more work as well. Update: seems like Pixtral is broken. It thinks bliss.png is a "blue and green grid", and other images it just interprets as corrupted or noise.
Which backend are you using? Does it give the same result when running via mtmd-cli?
ROCm, and it seems to be temperature dependent? Like at 0.1 temp it will reply … Meanwhile on CPU it always recognizes it as a landscape, even at temp 2.0. ROCm at 2.0 claims there isn't an image at all, lol. I imagine something is wrong, because I don't think temp should swing the results that hard for such a simple prompt. Haven't tried Vulkan yet.

Identical behavior with mtmd-cli. Shall I open an issue?

Slight update: even purely textually, the model just seems really bad on ROCm with a moderate or high temp.
@HAV0X1014 if you're trying CPU, try a clean CPU-only build without HIP compiled at all. For some reason, compiling with HIP but using …
For the problem with Pixtral, please follow: #13065 (comment)
Is there a way to pass images via non-chat completion yet? I see in the server readme at one point:

```
http post http://127.0.0.1:8080/completion --content-type application/json {
    prompt: 'What is in this image?[img-12]',
    "image_data": [{"data": (open /tmp/bliss.png | encode base64), "id": 12}]
}
```

but I don't believe that's functional anymore.
@Beinsezii I don't spend my time adding …
Putting Pixtral aside (because the bug is not related to server support), @Beinsezii would you be interested in testing other models? I think the current PR is in a "working" state, meaning it can already handle the most important use case, the `/chat/completions` endpoint. So it would be nice to know if the current approach works well with these cases:
Things that will not be supported for now:
For Gemma 3 QAT I've done basically all of this quite a bit as of the sha1 commit, by setting up a SillyTavern instance pointed at /v1/chat/completions/ and having Gemma iteratively write short stories using groups of images as inputs. Silly is kind of a pain to configure for this, but it's easier than directly writing OAI requests imo. Pretty fun once you get it going; the most images I had at once was 5, spread out over a few instructions.

The only real annoyance I've had monkeying around so far is the mmproj being locked to the GPU, eating half of the memory I normally use for k/v cache. With checksumming now I think it would make sense to add an … option.

Basically all of the other models that currently work are glorified captioners, so I haven't extensively played with them since I don't train t2i loras. Otherwise I'm not sure what else to monkey with besides having it yap for 20k tokens to see if something eventually breaks.
Is this different from a cache hit, where the first X tokens of a new request match so it doesn't re-process?
Ah, seems like …
There is a recently added …
yes it is
It should be fixed in #13082, but I haven't tested yet. Are you running the latest commit in the current PR? 2df8c1a

Edit: ok, currently it does not work because the tokenizer expects a multimodal …, but that will be an easy fix.
Ah, that's perfect. I missed that on the new commit.

Update: VRAM usage is exactly 0 MiB lower. Given it's processing images in milliseconds and not seconds, I'm going to assume the flag isn't working, at least on ROCm.
Seemed to be doing well last I tried. Prompt processing is slow on my card, so I'd notice if it was missing a lot.
OK, thanks for reporting. `--no-mmproj-offload` should be fixed in the last commit.
Confirmed it works now. I got all my VRAM back, at the cost of waiting a whole minute for all 7 images to encode, lol.

Currently it seems like the image hashing doesn't actually save the tokens; rather it is just a global KV re-use check. I'm on a 7900X, but I can imagine someone with a quad core waiting 30 seconds for their image, realizing their prompt had a typo, then waiting 30 more seconds because the history changed and the cache was evicted. Not sure if it's possible to also cache the image tokens themselves, but it would probably help some people.
Yeah, I was thinking about caching image embeddings too, but it can be a bit tricky because we don't yet have a cache eviction strategy. But this can be added later on.
```cpp
// each chunk can contain either one SINGLE text token or pointer to image
// this is to simplify the logic of KV cache management
struct server_token {
    llama_token txt;
    std::shared_ptr<mtmd_image_tokens> img;
```
In order not to affect the existing functionalities too much, I ended up using this struct. My idea is that each instance of this struct corresponds to exactly 1 token in the KV cache.

For functionalities that rely on `std::vector<llama_token>`, like context shift or speculative decoding, things can continue to function with minimal change. Of course, we make sure to disable these features if mtmd is used, because these features cannot (yet?) handle image tokens.

Also, here I have to use `shared_ptr` because an image can be represented by multiple tokens, so all of these tokens will point to one single image. I do it this way so that `tokens.size()` automatically reflects the token count in the prompt. It may sound ugly, but I think this is the best interim solution before we come up with something cleaner.
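As an illustration of this design (simplified, with hypothetical stand-in types rather than the PR's exact ones): one text token is one entry, while an image that occupies N KV cache cells contributes N entries that all share the same image object, so `tokens.size()` equals the number of KV cache cells used.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

using llama_token = int32_t;
struct image_tokens_stub { size_t n_tokens; }; // stand-in for mtmd_image_tokens

struct server_token {
    llama_token txt = -1;                      // set when this entry is a text token
    std::shared_ptr<image_tokens_stub> img;    // set when this entry belongs to an image
};

int main() {
    std::vector<server_token> tokens;

    // a text token contributes exactly one entry
    tokens.push_back({ 1234, nullptr });

    // an image occupying 4 KV cache cells contributes 4 entries sharing one image object
    auto img = std::make_shared<image_tokens_stub>(image_tokens_stub{ 4 });
    for (size_t i = 0; i < img->n_tokens; i++) {
        tokens.push_back({ -1, img });
    }

    return (int) tokens.size(); // 5, i.e. the token count of the prompt
}
```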
@ggerganov If you have a bit of time, could you do a quick review just to check whether my overall implementation is OK?
Cont #12849
This is my first attempt to bring `libmtmd` to `server.cpp`. For the list of supported models, see: #13012
The current goals of this PR are:
- Make sure `libmtmd` can be used in a different context than the CLI, so that I can adapt it progressively in upcoming PRs

There are still quite a lot of problems:
Implementation
The core idea of this implementation is to migrate the input from using a `std::vector<llama_token>` to a `std::vector<server_inp_chunk>`.
There was an API called `mtmd_input_chunk` introduced in #12849, and the difference between `mtmd_input_chunk` and `server_inp_chunk` is that `server_inp_chunk` only stores one single token in the case of text; in the case of an image, it stores a pointer to the `mtmd_image_tokens`.
The reason I did this is that keeping track of the KV cache this way seems easier (i.e. the code is easier to write). Here we mostly care about the individual tokens; we never need to look into individual image tokens anyway. If an image is different from the last one in the cache, their embeddings will be completely different, so we simply throw away the whole image.
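To make the cache-reuse rule above concrete, here is a rough, hypothetical sketch (simplified entries, matched by token id for text and by an image hash for images; the actual server code differs): the reusable prefix never ends in the middle of an image, so a changed image is discarded wholesale.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified prompt entry: img_hash == 0 means a text token, otherwise one cell of an image.
struct prompt_entry {
    int32_t  txt      = -1;
    uint64_t img_hash = 0;
};

// Number of leading entries of the new prompt that can be reused from the cache.
static size_t reusable_prefix(const std::vector<prompt_entry> & cached,
                              const std::vector<prompt_entry> & prompt) {
    size_t i = 0;
    const size_t n = cached.size() < prompt.size() ? cached.size() : prompt.size();
    while (i < n) {
        const bool same_text  = cached[i].img_hash == 0 && prompt[i].img_hash == 0 &&
                                cached[i].txt == prompt[i].txt;
        const bool same_image = cached[i].img_hash != 0 &&
                                cached[i].img_hash == prompt[i].img_hash;
        if (!same_text && !same_image) {
            break;
        }
        i++;
    }
    // never keep a partial image: if the prefix ends inside an image, back up to its start
    while (i > 0 && i < prompt.size() &&
           prompt[i].img_hash != 0 && prompt[i].img_hash == prompt[i - 1].img_hash) {
        i--;
    }
    return i;
}
```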
As `mtmd_image_tokens_ptr` uses `unique_ptr` under the hood, a side effect of this change is that we now eliminate some copies when passing the `task` from one function to another, hence the many `std::move` calls added in this change.

TODOs
- Support `image_url` in addition to `base64`

Demo
The server can be run with this command:
Client code (ONLY `base64` input is supported atm):

With the image:
This will output: