
Enable qwen2vl video #2756

Draft · wants to merge 29 commits into main
Conversation

@drbh (Collaborator) commented Nov 18, 2024

This PR is a work in progress that explores adding support for video inputs with Qwen2-VL. Thank you @mfarre for getting this effort started.

TODOS

  • support video_urls
  • fetch video contents in router
  • update protobufs to support video chunks
  • handle padding video token inputs
  • tokenize video bytes
  • integrate video logic with vision model (update position ids)
  • ensure tokenization process is correct
  • add tests
  • refactor/improve
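As one illustration of the "handle padding video token inputs" item, the sketch below expands a single video placeholder into one pad token per video patch, mirroring how image placeholders are typically padded. The placeholder string and the token-count formula here are hypothetical stand-ins, not Qwen2-VL's actual tokens or math:

```python
# Hypothetical sketch: expand one "<|video_pad|>" placeholder into
# num_frames * patches_per_frame pad tokens. The placeholder name and
# the per-frame patch count are illustrative, not Qwen2-VL's real values.

def expand_video_placeholder(prompt: str, num_frames: int,
                             patches_per_frame: int,
                             pad_token: str = "<|video_pad|>") -> str:
    """Replace a single video placeholder with the full run of pad tokens."""
    total = num_frames * patches_per_frame
    return prompt.replace(pad_token, pad_token * total, 1)

# 4 frames * 3 patches per frame -> 12 pad tokens in the expanded prompt
expanded = expand_video_placeholder("Describe: <|video_pad|>",
                                    num_frames=4, patches_per_frame=3)
```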

Update:

Start the server:

text-generation-launcher \
--model-id Qwen/Qwen2-VL-7B-Instruct \
--max-batch-prefill-tokens 10000 \
--max-input-tokens 10000 \
--max-total-tokens 10001

Send a request:

import requests
import json

def chat_completion(url="http://127.0.0.1:3000", video_url=None, prompt=None):
    messages = [{
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": { 
                    "url": video_url
                }
            },
            {
                "type": "text",
                "text": prompt
            }
        ]
    }]

    payload = {
        "messages": messages,
        "seed": 42,
        "max_tokens": 30
    }

    response = requests.post(
        f"{url}/v1/chat/completions",
        json=payload,
        headers={"Content-Type": "application/json"}
    )
    # Surface HTTP errors (e.g. validation failures) instead of parsing them as JSON
    response.raise_for_status()

    return response.json()

video_url = "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/360/Big_Buck_Bunny_360_10s_1MB.mp4"
result = chat_completion(
    video_url=video_url,
    prompt="Describe this video."
)
print(json.dumps(result, indent=2))
# {
#     "object": "chat.completion",
#     "id": "",
#     "created": 1731964042,
#     "model": "Qwen/Qwen2-VL-7B-Instruct",
#     "system_fingerprint": "2.4.1-dev0-native",
#     "choices": [
#         {
#             "index": 0,
#             "message": {
#                 "role": "assistant",
#                 "content": "The video showcases lush green trees with vibrant shades of green and various shades of yellow and brown, as well as moss-covered stumps and piles of moss",
#             },
#             "logprobs": null,
#             "finish_reason": "length",
#         }
#     ],
#     "usage": {"prompt_tokens": 9593, "completion_tokens": 30, "total_tokens": 9623},
# }
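Given the OpenAI-style chat.completion payload shown above, a small helper can pull out just the generated text (minimal sketch; the sample response below is abbreviated from the output above):

```python
# Extract the assistant's reply from an OpenAI-style chat.completion response
# like the one returned by TGI's /v1/chat/completions endpoint.

def extract_reply(result: dict) -> str:
    """Return the assistant message content of the first choice."""
    return result["choices"][0]["message"]["content"]

reply = extract_reply({
    "choices": [{"index": 0,
                 "message": {"role": "assistant",
                             "content": "A bunny in a forest."},
                 "finish_reason": "length"}],
    "usage": {"prompt_tokens": 9593, "completion_tokens": 30,
              "total_tokens": 9623},
})
# reply == "A bunny in a forest."
```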

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@mfarre commented Nov 20, 2024

I think it would help if at this point we already sample the video at 1 fps and resize the frames to 360x420 if they are bigger.
Sampling at 1 fps forces us to figure out the framerate, which I think is something we definitely want to do. Do you think we can import ffmpeg and call ffprobe for this?

We only convert to tokens the frames that we end up consuming for inference. Qwen has this specific 'smart' frame-selection logic, and I think other models will have different logic. Where do you think is the best place to put Qwen's smart selection logic? I don't find it bad for it to live in validation if we can later launch other video models with Qwen's frame-selection logic; that could actually be interesting.

If we do it that way, once fetch_video selects the number of frames, estimating the number of tokens becomes much simpler and we no longer depend on estimating the framerate.
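The 1 fps sampling described above can be sketched as a pure index-selection step: given a framerate (which real code would probe with ffprobe) and a total frame count, pick one frame index per whole second of video. The helper name and signature here are hypothetical:

```python
# Sketch of 1 fps sampling: choose one frame index per second of video.
# In a real pipeline, fps would come from ffprobe and only these indices
# would be decoded; this hypothetical helper only computes the indices.

def sample_frame_indices(total_frames: int, fps: float) -> list[int]:
    """Return the index of the first frame of each whole second."""
    duration_seconds = int(total_frames / fps)
    return [int(sec * fps) for sec in range(duration_seconds)]

# A 10 s clip at 24 fps (240 frames) yields 10 indices: 0, 24, 48, ...
indices = sample_frame_indices(240, 24.0)
```

Because token estimation then depends only on `len(indices)`, the router no longer needs to re-derive the framerate downstream, matching the simplification described above.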

3 participants