
Enable qwen2vl video #2756

Draft · wants to merge 29 commits into main
Conversation

@drbh (Collaborator) commented Nov 18, 2024

This PR is a work in progress that explores adding support for video inputs with Qwen2-VL. Thank you @mfarre for getting this effort started.

TODOS

  • support video_urls
  • fetch video contents in router
  • update protobufs to support video chunks
  • handle padding video token inputs
  • tokenize video bytes
  • integrate video logic with vision model (update position ids)
  • ensure tokenization process is correct
  • add tests
  • refactor/improve
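As one illustration of the "handle padding video token inputs" item, the sketch below expands a single video placeholder into one pad token per video patch, mirroring how image placeholders are typically padded. The placeholder string and the token-count formula here are hypothetical stand-ins, not Qwen2-VL's actual tokens or math:

```python
# Hypothetical sketch: expand one "<|video_pad|>" placeholder into
# num_frames * patches_per_frame pad tokens. The placeholder name and
# the per-frame patch count are illustrative, not Qwen2-VL's real values.

def expand_video_placeholder(prompt: str, num_frames: int,
                             patches_per_frame: int,
                             pad_token: str = "<|video_pad|>") -> str:
    """Replace a single video placeholder with the full run of pad tokens."""
    total = num_frames * patches_per_frame
    return prompt.replace(pad_token, pad_token * total, 1)

# 4 frames * 3 patches per frame -> 12 pad tokens in the expanded prompt
expanded = expand_video_placeholder("Describe: <|video_pad|>",
                                    num_frames=4, patches_per_frame=3)
```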

Update:

Start the server:

text-generation-launcher \
--model-id Qwen/Qwen2-VL-7B-Instruct \
--max-batch-prefill-tokens 10000 \
--max-input-tokens 10000 \
--max-total-tokens 10001

Send a request:

import requests
import json

def chat_completion(url="http://127.0.0.1:3000", video_url=None, prompt=None):
    messages = [{
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": { 
                    "url": video_url
                }
            },
            {
                "type": "text",
                "text": prompt
            }
        ]
    }]

    payload = {
        "messages": messages,
        "seed": 42,
        "max_tokens": 30
    }

    response = requests.post(
        f"{url}/v1/chat/completions",
        json=payload,
        headers={"Content-Type": "application/json"}
    )
    # Surface HTTP errors (e.g. validation failures) instead of parsing them as JSON
    response.raise_for_status()

    return response.json()

video_url = "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/360/Big_Buck_Bunny_360_10s_1MB.mp4"
result = chat_completion(
    video_url=video_url,
    prompt="Describe this video."
)
print(json.dumps(result, indent=2))
# {
#     "object": "chat.completion",
#     "id": "",
#     "created": 1731964042,
#     "model": "Qwen/Qwen2-VL-7B-Instruct",
#     "system_fingerprint": "2.4.1-dev0-native",
#     "choices": [
#         {
#             "index": 0,
#             "message": {
#                 "role": "assistant",
#                 "content": "The video showcases lush green trees with vibrant shades of green and various shades of yellow and brown, as well as moss-covered stumps and piles of moss",
#             },
#             "logprobs": null,
#             "finish_reason": "length",
#         }
#     ],
#     "usage": {"prompt_tokens": 9593, "completion_tokens": 30, "total_tokens": 9623},
# }
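Given the OpenAI-style chat.completion payload shown above, a small helper can pull out just the generated text (minimal sketch; the sample response below is abbreviated from the output above):

```python
# Extract the assistant's reply from an OpenAI-style chat.completion response
# like the one returned by TGI's /v1/chat/completions endpoint.

def extract_reply(result: dict) -> str:
    """Return the assistant message content of the first choice."""
    return result["choices"][0]["message"]["content"]

reply = extract_reply({
    "choices": [{"index": 0,
                 "message": {"role": "assistant",
                             "content": "A bunny in a forest."},
                 "finish_reason": "length"}],
    "usage": {"prompt_tokens": 9593, "completion_tokens": 30,
              "total_tokens": 9623},
})
# reply == "A bunny in a forest."
```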

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@mfarre commented Nov 20, 2024

I think it would help if at this point we already sample the video at 1 fps and resize the frames to 360x420 if they are bigger.
Sampling at 1 fps forces us to figure out the framerate, which I think is something we definitely want to do. Do you think we can import ffmpeg and call ffprobe for this?

We only convert to tokens the frames that we end up consuming for inference. Qwen has this specific 'smart' frame-selection logic, and I think other models will have different logic. Where do you think is the best place to put Qwen's smart selection logic? I don't find it bad for it to live in validation if we can later launch other video models with Qwen's frame-selection logic; that could actually be interesting.

If we do it that way, once fetch_video selects the number of frames, estimating the number of tokens becomes much simpler and we no longer depend on estimating the framerate.
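The 1 fps sampling described above can be sketched as a pure index-selection step: given a framerate (which real code would probe with ffprobe) and a total frame count, pick one frame index per whole second of video. The helper name and signature here are hypothetical:

```python
# Sketch of 1 fps sampling: choose one frame index per second of video.
# In a real pipeline, fps would come from ffprobe and only these indices
# would be decoded; this hypothetical helper only computes the indices.

def sample_frame_indices(total_frames: int, fps: float) -> list[int]:
    """Return the index of the first frame of each whole second."""
    duration_seconds = int(total_frames / fps)
    return [int(sec * fps) for sec in range(duration_seconds)]

# A 10 s clip at 24 fps (240 frames) yields 10 indices: 0, 24, 48, ...
indices = sample_frame_indices(240, 24.0)
```

Because token estimation then depends only on `len(indices)`, the router no longer needs to re-derive the framerate downstream, matching the simplification described above.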

3 participants