
server : (experimental) vision support via libmtmd #12898

Draft · wants to merge 23 commits into master

Conversation

@ngxson (Collaborator) commented Apr 11, 2025

Continuation of #12849

This is my first attempt at bringing libmtmd into server.cpp.

For the list of supported models, see: #13012

The current goals of this PR are:

  • To see how libmtmd can be used in a different context than the CLI, so that I can adapt it progressively in upcoming PRs
  • To provide a place to test the integration of other vision models

There are still quite a lot of problems:

  • Many features are hard to make compatible, like speculative decoding, context shifting, and slot cache save/load
  • Prompt caching is only half working for now:
    • Missing image hash comparison (to know whether we should remove the cached tokens of an image)
    • Sometimes we get a batch with 0 tokens (for example, when entering the same prompt twice)
  • Batched decoding is disabled for image embedding batches, which will degrade performance when multiple slots are used

Implementation

The core idea of this implementation is to migrate the input from a std::vector<llama_token> to a std::vector<server_inp_chunk>.

An API called mtmd_input_chunk was introduced in #12849. The difference between mtmd_input_chunk and server_inp_chunk is that server_inp_chunk stores only a single token in the text case; in the image case, it stores a pointer to the mtmd_image_tokens:

struct server_inp_chunk {
    llama_token tok_text; // one single token, not a list of tokens
    mtmd_image_tokens_ptr tok_image;
};

The reason I did this is that keeping track of the KV cache this way seems easier (i.e. the code is easier to write). Here we mostly care about the individual tokens; we never need to look into the individual image tokens anyway. If an image is different from the last one in the cache, its embeddings will be completely different, so we simply throw away the whole image.

As mtmd_image_tokens_ptr uses unique_ptr under the hood, a side effect of this change is that we now eliminate some copies when passing the task from one function to another; hence, many std::move calls were added in this change.
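
To make the cache-reuse idea concrete, here is a minimal sketch of how a common prefix between the cached chunks and an incoming prompt could be computed. The server_inp_chunk_ex type, its img_hash field, and the helper name are assumptions made for illustration; they are not this PR's code.

```cpp
#include <string>
#include <vector>

#include "llama.h" // llama_token
#include "mtmd.h"  // mtmd_image_tokens_ptr

// Hypothetical variant of server_inp_chunk with an image hash attached.
// The hash field is an assumption used only for this sketch.
struct server_inp_chunk_ex {
    llama_token           tok_text;  // valid only when tok_image == nullptr
    mtmd_image_tokens_ptr tok_image; // non-null => this chunk is an image
    std::string           img_hash;  // hash of the image bitmap (assumed)
};

// Returns how many leading chunks of the incoming prompt match the cache.
// Text chunks compare by token id; image chunks compare by hash and are
// treated as atomic: the first mismatch invalidates everything after it.
static size_t common_chunk_prefix(const std::vector<server_inp_chunk_ex> & cached,
                                  const std::vector<server_inp_chunk_ex> & incoming) {
    size_t i = 0;
    for (; i < cached.size() && i < incoming.size(); ++i) {
        const bool a_img = (bool) cached[i].tok_image;
        const bool b_img = (bool) incoming[i].tok_image;
        if (a_img != b_img) {
            break; // text vs image mismatch
        }
        if (a_img ? (cached[i].img_hash != incoming[i].img_hash)
                  : (cached[i].tok_text != incoming[i].tok_text)) {
            break;
        }
    }
    return i;
}
```

Everything in the cache after the returned index (including any partially matching image) would then be discarded, matching the "throw away the whole image" behavior described above.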

TODOs

  • automatically deactivate certain features if vision is enabled; we will work on these features later
  • implement a hash function for images (to keep track of the cache)
  • fix detokenize(server_inp_chunk)
  • add more error handling
  • maybe support remote image_url in addition to base64

Demo

The server can be run with this command:

llama-server -hf ggml-org/gemma-3-4b-it-GGUF

Client code (ONLY base64 input is supported atm):

import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.1,
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "describe what you see in details" },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
)

for chunk in response:
    print(chunk.choices[0].delta.content, end="")

print("\n\n")

With the image: [bliss.png]

This will output: [streamed description of the image; screenshot omitted]

@qnixsynapse (Collaborator)

Awesome work. However, I noticed the model usually ignores the text prompt when it comes first in the conversation!
[screenshot omitted]

@ngxson (Collaborator, Author) commented Apr 12, 2025

@qnixsynapse can you capture the raw HTTP request? If the JSON payload is big, you can share it via a gist.

@qnixsynapse (Collaborator)

@ngxson (Collaborator, Author) commented Apr 12, 2025

At a minimum I'm asking for this: https://wiki.wireshark.org/hyper_text_transfer_protocol

not the raw IP packet

@qnixsynapse (Collaborator) commented Apr 12, 2025

I don't have Wireshark installed, unfortunately. But you can still inspect it; for example:

POST /v1/chat/completions HTTP/1.1
Host: localhost:8080
Authorization: Bearer -key
Content-Type: application/json
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Python/3.11 aiohttp/3.11.11
Content-Length: 615117

{"stream": true, "model": "Gemma", "messages": [{"role": "user", "content": [{"type": "text", "text": "Fact check the content in this image please."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,<base64 png data from line 88>"}}]}], "stream_options": {"include_usage": true}, "temperature": 1.0, "top_p": 0.9}

HTTP/1.1 200 OK
Keep-Alive: timeout=5, max=100
Content-Type: text/event-stream
Server: llama.cpp
Transfer-Encoding: chunked
Access-Control-Allow-Origin: 

@ngxson (Collaborator, Author) commented Apr 13, 2025

@qnixsynapse I had a problem with my logic, which made it discard a text batch that comes before an image batch.

It should be fixed now, could you give it a try?

@ngxson (Collaborator, Author) commented Apr 13, 2025

Btw @ggerganov I'm noting this here for visibility: while working on this PR, I realized there are 2 refactorings which could be done in their own dedicated PRs:

  • The first one is quite simple: currently server_task is passed by copy in some places; we need to add some std::move
  • The second one is a bit more tricky. Currently, we track everything using a std::vector<llama_token>. However, for multimodal, I introduced the notion of "input chunks" along with libmtmd. The server needs to be adapted to work with chunks of tokens / embeddings instead of a simple list of tokens.
    In the current PR, I'm kind of hacking around this by having server_inp_chunk wrap a single text token (so most of the text-related logic is unchanged). But obviously this brings some complications when dealing with both text and image chunks. Do you have any better ideas to handle this?

And I also have a question regarding the logic around batch_view. IIRC, this exists because sometimes the batch is too large for llama_decode to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatches, so I'm wondering if this logic is now obsolete.


Edit: optionally one more refactoring: we should split llama-server into different compilation units; currently it can take up to 20s to compile.

@qnixsynapse (Collaborator) commented Apr 14, 2025

@ngxson Can you please refresh this branch with master?

Nvm. Ended up using your fork .. working great!!! 👍

On further testing, it seems that the llama_batch size is sometimes exceeded on successive requests:

common/common.cpp:1161: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

@ggerganov (Member)

And I also have a question regarding the logic around batch_view. IIRC, this is because sometimes the batch is too large for llama_decode to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatch, so I'm wondering if this logic is now obsolete.

This was useful mainly before defragmentation support was added. The reason is that over time the KV cache can become highly fragmented, and even if it has capacity for n_tokens it won't be able to find a contiguous slot, so attempting to split the batch into smaller chunks was a way to work around this. With defragmentation enabled by default this is now rarely necessary. So yes, this should be simplified in a separate PR.

I'll think about the input chunk question today and let you know if I have any thoughts.
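
For readers following along, the batch_view logic being discussed looks roughly like the sketch below: a simplified version of the splitting pattern used in server.cpp at the time (not a verbatim copy; details may differ). The full batch is decoded in windows, and on failure the window size is halved and the failed window is retried.

```cpp
#include <algorithm>

#include "llama.h"

// Simplified sketch of the batch_view retry loop (not verbatim server.cpp code).
static bool decode_in_windows(llama_context * ctx, const llama_batch & batch) {
    int32_t n_batch = (int32_t) llama_n_batch(ctx);

    for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
        const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

        llama_batch batch_view = {
            n_tokens,
            batch.token    + i,
            nullptr,                 // token batch, no embeddings
            batch.pos      + i,
            batch.n_seq_id + i,
            batch.seq_id   + i,
            batch.logits   + i,
        };

        const int ret = llama_decode(ctx, batch_view);
        if (ret != 0) {
            if (n_batch == 1 || ret < 0) {
                return false; // unrecoverable error
            }
            // likely no contiguous KV slot: retry this window with half the size
            n_batch /= 2;
            i -= n_batch;
        }
    }
    return true;
}
```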

common/arg.cpp Outdated
params.mmproj.hf_repo = params.model.hf_repo;
}
// TODO @ngxson : this will break non-vision model with -hf, need to fix before merging
common_params_handle_model(params.mmproj, params.hf_token, "", true);
Collaborator

@ngxson Is it possible to add a --no-offload-mmproj param here to keep the mmproj model on the CPU and the larger text model on GPU?

We can use mtmd_context_params with use_gpu=false to keep the projector model on the CPU:

struct mtmd_context_params {
    bool use_gpu = true;
    bool print_timings = true;
    int n_threads = 4;
    enum ggml_log_level verbosity = GGML_LOG_LEVEL_INFO;
    const char * image_marker = "<__image__>";
};

It would be useful where GPU VRAM is limited.
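
For illustration, keeping the projector on the CPU would look something like the sketch below. It assumes the mtmd_init_from_file() entry point from mtmd.h and uses a placeholder mmproj filename; check the header for the exact signature before relying on it.

```cpp
#include "mtmd.h"

// Sketch: load the mmproj on the CPU while the text model stays on the GPU.
// Assumes mtmd_init_from_file() from mtmd.h; "mmproj-model.gguf" is a
// placeholder path, and text_model is a llama_model loaded elsewhere.
static mtmd_context * load_projector_on_cpu(const llama_model * text_model) {
    mtmd_context_params mparams; // defaults as shown in the struct above
    mparams.use_gpu = false;     // keep the projector (CLIP) weights on the CPU

    return mtmd_init_from_file("mmproj-model.gguf", text_model, mparams);
}
```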

@ngxson (Collaborator, Author) Apr 25, 2025

It's added in #13093.

The flag is --no-mmproj-offload instead of --no-offload-mmproj, to align with the existing --no-kv-offload.

@Beinsezii

Seems like the batch decoding dies when you send a variety of longer requests.

common/common.cpp:1159: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

The easiest way to trigger it is to just wiggle the sequence length around, like with this variation of the example code:

import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

for mult in [100, 200]:  # (beinsezii) make sure it has to rebuild some cache the 2nd time
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        stream=True,
        messages=[
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "describe what you see in details\n" * mult },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                        },
                    },
                ],
            }
        ],
    )

    for chunk in response:
        print(chunk.choices[0].delta.content, end="")

    print("\n\n")


void reserve_embd_batch(float * embd, int32_t n_tokens, llama_pos pos_0, llama_seq_id seq_id) {
    GGML_ASSERT(n_tokens <= (int32_t)pos.size());
    seq_ids[n_tokens] = nullptr;
@ngxson (Collaborator, Author)

@Beinsezii yes, it seems to be due to this single line, which shouldn't be here. I'm removing it and will push a commit.

@ngxson (Collaborator, Author)

Should be fixed in f8bc466

And btw, now that #13012 is merged, we can test other models (except for qwen2vl, which is not yet supported).

@Beinsezii Apr 21, 2025

Oh nice, so the server will automatically have everything libmtmd does? Like #13050 post-merge?

I'll try to break the decode again here in a bit.

@ngxson (Collaborator, Author)

Yes, everything supported by libmtmd will be available in the server.

@Beinsezii

Using Gemma 27B I've run multiple large images interspersed with varying text sequences, and it seems to be stable now. I'm up to about 5 images in one prompt, so that should all be good now.

I know from experience GLM is broken on ROCm, but I am curious about MiniCPM and SmolVLM so I'll try those at some point.

@Beinsezii Apr 21, 2025

ibm-research/granite-vision-3.2-2b-GGUF fails to encode when fed a 3840x2160 image, but otherwise it's been a positive experience.

@ngxson (Collaborator, Author)

Yes, some models require a larger --batch for now. This will be fixed soon.

@ngxson (Collaborator, Author)

A larger --batch is no longer required. For some models you may want to bump the context size to 8k or 16k, because they use a lot more tokens for bigger images.

@ngxson (Collaborator, Author) commented Apr 23, 2025

A hash can be an arbitrary byte sequence, right? It's not necessarily a valid string.

Yes, but storing it as a hex string is easier for debugging, so it must be converted to a hex string to prevent potential problems with null bytes. This conversion is currently missing in the code.
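
A hypothetical helper for that conversion could look like this (not the PR's code, just an illustration of the hex encoding):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Render raw hash bytes as a lowercase hex string so the resulting cache key
// never contains embedded null bytes.
static std::string bytes_to_hex(const std::vector<uint8_t> & bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (uint8_t b : bytes) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}
```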

@ngxson (Collaborator, Author) commented Apr 23, 2025

Significant changes in the latest commits:

  • bump to latest master; we now support Pixtral 12B
  • using an FNV hash of the image bitmap (NOT the raw file data); see the sketch below
  • support large image batches, so models like granite-vision or minicpm-v won't crash
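
For reference, a minimal sketch of a 64-bit FNV-1a hash over the decoded bitmap bytes; the hash actually used in the PR may differ in width or details.

```cpp
#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a over a byte buffer, e.g. the decoded RGB bitmap of an image.
static uint64_t fnv1a_64(const unsigned char * data, size_t len) {
    uint64_t hash = 0xcbf29ce484222325ULL;   // FNV offset basis
    const uint64_t prime = 0x100000001b3ULL; // FNV prime
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= prime;
    }
    return hash;
}
```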

@Beinsezii commented Apr 23, 2025

bump to latest master, we're now supporting Pixtral 12B

Curious if Mistral Small 3.1 uses the same vision mechanism or if that will need more work as well.

Update: seems like Pixtral is broken. It thinks bliss.png is a "blue and green grid" and other images it just interprets as corrupted or noise.

@ngxson (Collaborator, Author) commented Apr 23, 2025

Update: seems like Pixtral is broken

Which backend are you using? Does it give the same result when running via llama-mtmd-cli?

@Beinsezii commented Apr 24, 2025

Which backend are you using? Does it give the same result when running via llama-mtmd-cli?

ROCm, and it seems to be temperature dependent?

At temp 0.1 it will reply:
"It seems we're starting with an image of a serene landscape featuring a clear blue sky transitioning into lush green fields below."
whereas at temp 1.0 it's:
"it seems that the image you've shared contains a pattern of repeating colors and shapes that might be difficult to describe precisely without more context."

Meanwhile on CPU it always recognizes it as a landscape even at temp 2.0. ROCm at 2.0 claims there isn't an image at all lol. I imagine something is wrong because I don't think temp should swing the results that hard for such a simple prompt.

Haven't tried Vulkan yet. Identical behavior with mtmd-cli. Shall I open an issue?

Slight update: even purely textually, the model just seems really bad on ROCm with a moderate or high temp. I wonder if this is just fp16 vs fp32 compute? Alright, even with CUDA_F16 off and an f32 k/v cache, the whole model is completely unusable on ROCm with even a mild temp lol.

@HAV0X1014

I'm getting wildly incorrect outputs with Pixtral. I'm using the server API and llama-mtmd-cli; the server seems to completely ignore that I've sent an image, while the CLI outputs garbage, mentioning either a mosaic of colors or just outputting complete nonsense. This image in particular made it go nuts, counting up from 2013 until generation stopped.
[image attachment: IMG_20240224_194019]

I'm using a 7900xtx, compiled with ROCm. Running it on CPU and GPU produced different, but still incorrect, results.

@Beinsezii

@HAV0X1014 if you're trying CPU, try a clean CPU-only build without HIP compiled at all. For some reason, compiling with HIP but using --ngl 0 can still break some models. GLM-4 is the same way.

@ngxson (Collaborator, Author) commented Apr 24, 2025

For the problem with pixtral, please follow: #13065 (comment)

@Beinsezii

Is there a way to pass images via the non-chat completion endpoint yet? I see in the server readme that at one point /completion could substitute images like:

http post http://127.0.0.1:8080/completion --content-type application/json {
    prompt: 'What is in this image?[img-12]',
    "image_data": [{"data": (open /tmp/bliss.png | encode base64), "id": 12}]
}

but I don't believe that's functional anymore.

@ngxson (Collaborator, Author) commented Apr 24, 2025

@Beinsezii I'm not spending time on /completions because this PR has already taken me a lot of time.

@ngxson (Collaborator, Author) commented Apr 24, 2025

Putting Pixtral aside (because that bug is not related to server support), @Beinsezii would you be interested in testing other models?

I think the current PR is in a "working" state, meaning it can already handle the most important use case, the /chat/completions endpoint. That's the first phase of my plan. I'm thinking about moving to the second phase: iterating on the current idea and fixing bugs.

So it would be nice to know if the current approach works well with these cases:

  • Chat completion, text-only
  • Chat completion, text and image
  • Chat completion, text and image, but multiple texts and multiple images interleaved in the same message
  • Chat completion, with KV reuse

Things that will not be supported for now:

  • context shift and prompt truncation
  • speculative decoding
  • slot save/load
  • raw /completions (non-chat)
  • /embeddings, /rerank, /infill
  • n_cache_reuse

@Beinsezii

So would be nice to know if the current approach works well with these cases:

  • Chat completion, text-only

  • Chat completion, text and image

  • Chat completion, text and image, but multiple texts and multiple images interleaved in the same message

For Gemma 3 QAT I've done basically all of this quite a bit as of the sha1 commit, by setting up a SillyTavern instance pointed at /v1/chat/completions and having Gemma iteratively write short stories using groups of images as inputs. Silly is kind of a pain to configure for this, but it's easier than directly writing OAI requests imo. Pretty fun once you get it going; the most images I had at once was 5, spread out over a few instructions.

The only real annoyance I've had while monkeying around so far is the mmproj being locked to the GPU, eating half of the memory I normally use for the k/v cache. With checksumming now, I think it would make sense to add an -nglv (or -nglm?) at some point, since the mmproj isn't part of the autoregressive model, as far as I'm aware.

Basically all of the other models that currently work are glorified captioners, so I haven't played with them extensively since I don't train t2i LoRAs.

Otherwise I'm not sure what else to monkey with besides having it yap for 20k tokens to see if something eventually breaks.

Chat completion, with KV reuse

Is this different from a cache hit, where the first X tokens of a new request match so they don't get re-processed?

@Beinsezii

Ah, it seems like -hf might not work properly with models that don't have an mmproj? I did bin/llama-server -hf bartowski/THUDM_GLM-4-32B-0414-GGUF:Q4_K_L and it 404'd repeatedly until I switched to the non-mtmd server branch.

@ngxson (Collaborator, Author) commented Apr 25, 2025

The only real annoyance I've had while monkeying around so far is the mmproj being locked to the GPU, eating half of the memory I normally use for the k/v cache.

There is a recently added --no-mmproj-offload. I didn't add fine control for offloading layers because the mmproj is usually very small compared to the text model, and it would also take me more time to actually implement.

a cache hit, where the first X tokens of a new request match so they don't get re-processed?

yes it is

Ah, it seems like -hf might not work properly with models that don't have an mmproj?

It should be fixed in #13082, but I haven't tested it yet. Are you running the latest commit in the current PR? (2df8c1a)

Edit: ok, currently it does not work because the tokenizer expects a multimodal model, but that will be an easy fix.

@Beinsezii commented Apr 25, 2025

There is a recently added --no-mmproj-offload. I didn't add fine control for offloading layers because the mmproj is usually very small compared to the text model, and it would also take me more time to actually implement.

Ah, that's perfect. I missed that in the new commit.

Update: VRAM usage is exactly 0 MiB lower. Given it's processing images in milliseconds rather than seconds, I'm going to assume the flag isn't working, at least on ROCm.

yes it is

It seemed to be working well last time I tried. Prompt processing is slow on my card, so I'd notice if it was missing a lot.

@ngxson (Collaborator, Author) commented Apr 25, 2025

ok thanks for reporting, --no-mmproj-offload should be fixed in the last commit

@Beinsezii

ok thanks for reporting, --no-mmproj-offload should be fixed in the last commit

Confirmed, it works now. I got all my VRAM back, at the cost of waiting a whole minute for all 7 images to encode lol.

Currently it seems like the image hashing doesn't actually save the tokens; rather, it's just a global KV-reuse check. I'm on a 7900X, but I can imagine someone with a quad core waiting 30 seconds for their image, realizing their prompt had a typo, then waiting 30 more seconds because the history changed and the cache was evicted. Not sure if it's possible to also cache the image tokens themselves, but it would probably help some people.

@ngxson (Collaborator, Author) commented Apr 25, 2025

Yeah, I was thinking about caching image embeddings too, but it can be a bit tricky because we don't yet have a cache eviction strategy. This can be added later on.
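
To illustrate one possible eviction strategy (purely hypothetical; nothing like this exists in the PR), a small least-recently-used cache keyed by the image hash could look like this:

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical LRU cache for image embeddings, keyed by the image hash.
// Capacity and names are illustrative only.
struct embd_cache {
    size_t capacity = 8; // max number of cached images

    std::list<std::string> order; // most recently used at the front
    std::unordered_map<std::string,
        std::pair<std::list<std::string>::iterator, std::vector<float>>> entries;

    const std::vector<float> * get(const std::string & hash) {
        auto it = entries.find(hash);
        if (it == entries.end()) return nullptr;
        order.splice(order.begin(), order, it->second.first); // mark as recently used
        return &it->second.second;
    }

    void put(const std::string & hash, std::vector<float> embd) {
        if (auto it = entries.find(hash); it != entries.end()) {
            it->second.second = std::move(embd);
            order.splice(order.begin(), order, it->second.first);
            return;
        }
        if (entries.size() >= capacity) {
            entries.erase(order.back()); // evict the least recently used image
            order.pop_back();
        }
        order.push_front(hash);
        entries.emplace(hash, std::make_pair(order.begin(), std::move(embd)));
    }
};
```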

Comment on lines +966 to +970
// each chunk can contain either one SINGLE text token or pointer to image
// this is to simplify the logic of KV cache management
struct server_token {
    llama_token txt;
    std::shared_ptr<mtmd_image_tokens> img;
@ngxson (Collaborator, Author) Apr 25, 2025

In order not to affect the existing functionality too much, I ended up using this struct. My idea is that each instance of this struct corresponds to exactly 1 token in the KV cache.

For functionality that relies on std::vector<llama_token>, like context shift or speculative decoding, it can continue to function with minimal changes. Of course, we make sure to disable these features if mtmd is used, because they cannot (yet?) handle image tokens.

Also, here I have to use shared_ptr because an image can be represented by multiple tokens, so all of these tokens point to one single image. I do it this way so that tokens.size() automatically reflects the token count in the prompt. It may sound ugly, but I think this is the best interim solution until we come up with something cleaner.

@ggerganov If you have a bit of time, could you do a quick review just to see if my global implementation is ok?
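
To make the shared_ptr idea above concrete, a rough sketch of how an image chunk could be expanded into per-position tokens that all share one image object; the input_chunk type, its fields, and the helper name are illustrative assumptions, not the PR's exact code.

```cpp
#include <memory>
#include <vector>

#include "llama.h"

// Illustrative sketch: one server_token per KV cache position. All positions
// belonging to the same image hold a copy of the same shared_ptr, so
// tokens.size() equals the number of KV positions the prompt will occupy.
// `input_chunk` and its fields are hypothetical.
struct input_chunk {
    llama_token txt;
    std::shared_ptr<mtmd_image_tokens> img; // non-null => image chunk
    size_t      img_n_pos;                  // KV positions used by the image (assumed)
};

static std::vector<server_token> expand_chunks(const std::vector<input_chunk> & chunks) {
    std::vector<server_token> tokens;
    for (const auto & chunk : chunks) {
        if (chunk.img) {
            for (size_t i = 0; i < chunk.img_n_pos; ++i) {
                tokens.push_back({ LLAMA_TOKEN_NULL, chunk.img }); // placeholder text token
            }
        } else {
            tokens.push_back({ chunk.txt, nullptr });
        }
    }
    return tokens;
}
```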


7 participants