Qwen 2.5 VL grounding mode - coordinates scaling? #676

Open
borisloktev opened this issue Jan 27, 2025 · 5 comments
@borisloktev

Thank you for an amazing release!
I do have one question: will the model output bounding boxes based directly on the image resolution, without scaling them?

@ShuaiBai623
Collaborator

Yes, the model outputs absolute coordinates based on the actual input size of the image. However, it is important to note that this is the size after the processor's resizing. Regarding obtaining positions and visualization, we have prepared many examples in the cookbooks for reference.
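
For illustration, here is a minimal sketch of how predicted boxes (which refer to the resized image produced by the processor) might be mapped back to the original image resolution. The resize logic below is a simplified re-implementation of the "smart resize" behavior, assuming the default 28-pixel factor and the pixel budgets used in the public qwen-vl-utils code; in practice you can take the resized height/width directly from the processor output instead.

```python
import math

def smart_resize_dims(height, width, factor=28,
                      min_pixels=4 * 28 * 28, max_pixels=16384 * 28 * 28):
    """Simplified sketch of the processor's resize logic: dimensions are
    rounded to multiples of `factor` and scaled so the total pixel count
    stays within [min_pixels, max_pixels]."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

def rescale_box(box, orig_wh, resized_hw):
    """Map an (x1, y1, x2, y2) box predicted on the resized image back to
    the original image's pixel coordinates."""
    orig_w, orig_h = orig_wh
    resized_h, resized_w = resized_hw
    sx, sy = orig_w / resized_w, orig_h / resized_h
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)

# Example: a 1920x1080 image and one predicted box on the resized image.
resized_h, resized_w = smart_resize_dims(1080, 1920)
print(rescale_box((100, 50, 400, 300), (1920, 1080), (resized_h, resized_w)))
```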

@fsommers

Yes, congrats on the release! Already loving Qwen 2.5 VL.

One question about text grounding: I noticed that image segment localization works perfectly (following the cookbook examples), but text localization (i.e., OCR) does not always: sometimes the bounding box coordinates are not exact for words. Is this a known issue, or am I just doing something wrong?

@JJJYmmm

JJJYmmm commented Feb 3, 2025

If the model outputs absolute coordinates while processing visual tokens that represent 28x28 pixel blocks, does this mean that there could be an error of up to 28 pixels in the output?

@ShuaiBai623
Collaborator

Although the patch is 28x28, each 28x28 patch is converted into a larger embedding that can encode positional information within that 28x28 area. The model can therefore also predict positions inside the 28x28 region, meaning the theoretical error is equivalent to 1 pixel in the input image.

@GingL

GingL commented Feb 3, 2025

You can try setting different min_pixels and max_pixels to better suit your data. For example, for document parsing it is usually more appropriate to set min_pixels to 1000x800 and max_pixels to 3000x2500.
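
For reference, a minimal sketch of how such a pixel budget could be passed to the processor, following the min_pixels/max_pixels usage shown in the Qwen2.5-VL model card; the checkpoint name and the exact values are illustrative:

```python
from transformers import AutoProcessor

# Pixel budget suggested above for document parsing (illustrative values).
min_pixels = 1000 * 800
max_pixels = 3000 * 2500

# The processor resizes each image so its pixel count stays within
# [min_pixels, max_pixels], keeping height and width multiples of 28.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",  # assumed checkpoint name
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```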
