Qwen 2.5 VL grounding mode - coordinates scaling? #676

Open
borisloktev opened this issue Jan 27, 2025 · 5 comments
@borisloktev

Thank you for an amazing release!
I do have one question: will the model output bounding boxes based directly on the image resolution, without scaling them?

@ShuaiBai623
Collaborator

Yes, the model outputs absolute coordinates based on the actual input size of the image. However, it is important to note that this is the size after the processor's resizing. Regarding obtaining positions and visualization, we have prepared many examples in the cookbooks for reference.
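
For illustration, here is a minimal sketch of how predicted boxes (which refer to the resized image produced by the processor) might be mapped back to the original image resolution. The resize logic below is a simplified re-implementation of the "smart resize" behavior, assuming the default 28-pixel factor and the pixel budgets used in the public qwen-vl-utils code; in practice you can take the resized height/width directly from the processor output instead.

```python
import math

def smart_resize_dims(height, width, factor=28,
                      min_pixels=4 * 28 * 28, max_pixels=16384 * 28 * 28):
    """Simplified sketch of the processor's resize logic: dimensions are
    rounded to multiples of `factor` and scaled so the total pixel count
    stays within [min_pixels, max_pixels]."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

def rescale_box(box, orig_wh, resized_hw):
    """Map an (x1, y1, x2, y2) box predicted on the resized image back to
    the original image's pixel coordinates."""
    orig_w, orig_h = orig_wh
    resized_h, resized_w = resized_hw
    sx, sy = orig_w / resized_w, orig_h / resized_h
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)

# Example: a 1920x1080 image and one predicted box on the resized image.
resized_h, resized_w = smart_resize_dims(1080, 1920)
print(rescale_box((100, 50, 400, 300), (1920, 1080), (resized_h, resized_w)))
```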

@fsommers

Yes, congrats on the release! Already loving Qwen 2.5 VL.

One question about text grounding: I noticed that image segment localization works perfectly (following the cookbook examples), but text localization (i.e., OCR) does not always: sometimes the bounding box coordinates are not exact for words. Is this a known issue, or am I just doing something wrong?

@JJJYmmm

JJJYmmm commented Feb 3, 2025

If the model outputs absolute coordinates while processing visual tokens that represent 28x28 pixel blocks, does this mean that there could be an error of up to 28 pixels in the output?

@ShuaiBai623
Collaborator

Although the patch is 28x28, each 28x28 patch is converted into a larger embedding that can encode positional information within that 28x28 area. The model can therefore also predict positions inside the 28x28 region, meaning the theoretical error is equivalent to 1 pixel in the input image.

@GingL

GingL commented Feb 3, 2025

You can try setting different min_pixels and max_pixels to better suit your data. For example, for document parsing it is usually more appropriate to set min_pixels to 1000x800 and max_pixels to 3000x2500.
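
For reference, a minimal sketch of how such a pixel budget could be passed to the processor, following the min_pixels/max_pixels usage shown in the Qwen2.5-VL model card; the checkpoint name and the exact values are illustrative:

```python
from transformers import AutoProcessor

# Pixel budget suggested above for document parsing (illustrative values).
min_pixels = 1000 * 800
max_pixels = 3000 * 2500

# The processor resizes each image so its pixel count stays within
# [min_pixels, max_pixels], keeping height and width multiples of 28.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",  # assumed checkpoint name
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```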
