Qwen 2.5 VL grounding mode - coordinates scaling? #676
Thank you for an amazing release!
I do have one question: will the model output bounding boxes based directly on the image resolution, without scaling them?

Comments
Yes, we output absolute coordinates based on the actual input size of the image. However, it's important to note that this is the size after the processor. Regarding obtaining positions and visualization, we have prepared many examples in the cookbooks for reference.
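For visualization, the boxes therefore need to be mapped from the processor-resized image back onto the original. Below is a minimal sketch of that rescaling, assuming the boxes are absolute pixels in the resized image and that the resized dimensions can be recovered from the processor's `image_grid_thw` output with a 14-pixel patch stride (an assumption based on the Hugging Face Qwen2.5-VL integration; the cookbooks have the authoritative recipe).

```python
# Minimal sketch: map model-predicted boxes (absolute pixels in the
# processor-resized image) back onto the original image for drawing.
# Assumes `inputs` came from the Hugging Face Qwen2.5-VL processor and
# that `image_grid_thw` holds (t, h, w) counts of 14-px patches
# (assumption -- verify against the cookbooks).

def processed_size(image_grid_thw, patch_size=14):
    """Return (width, height) of the image as the model sees it."""
    _, grid_h, grid_w = image_grid_thw
    return grid_w * patch_size, grid_h * patch_size

def rescale_box(box, proc_wh, orig_wh):
    """Scale (x1, y1, x2, y2) from processed pixels to original pixels."""
    (pw, ph), (ow, oh) = proc_wh, orig_wh
    x1, y1, x2, y2 = box
    return (x1 * ow / pw, y1 * oh / ph, x2 * ow / pw, y2 * oh / ph)

# Hypothetical usage:
# proc_wh = processed_size(inputs["image_grid_thw"][0].tolist())
# box_on_original = rescale_box(predicted_box, proc_wh, image.size)
```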
Yes, congrats on the release! Already loving Qwen 2.5 VL. One question about text grounding: I noticed that image segment localization works perfectly (following the cookbook examples), but text localization (i.e., OCR) does not always: sometimes the bounding box coordinates are not exact for words. Is this a known issue, or am I just doing something wrong?
If the model outputs absolute coordinates while processing visual tokens that represent 28x28 pixel blocks, does this mean that there could be an error of up to 28 pixels in the output?
Although the patch is 28x28, each 28x28 patch is converted into a larger embedding that can encode positional information within the 28x28 area. Therefore, the model can also predict positions inside the 28x28 region, so the theoretical error is on the order of one pixel in the input image.
You can try setting different min_pixels and max_pixels values to better suit your data. For example, for document parsing it is more appropriate to set min_pixels to 1000x800 and max_pixels to 3000x2500.
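For illustration, here is a hedged sketch of how these knobs are wired up: the bounds can be passed to the Hugging Face processor at load time, and a simplified replica of the smart-resize logic (snapping each side to a multiple of 28 while keeping the total area inside [min_pixels, max_pixels]) shows what input size the model will actually ground on. The model ID here is an example, and qwen_vl_utils ships the reference resize implementation.

```python
import math
from transformers import AutoProcessor

# Pixel budget suggested above for document parsing
# (the model ID is an example, not prescribed by this thread).
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=1000 * 800,
    max_pixels=3000 * 2500,
)

def smart_resize(height, width, factor=28,
                 min_pixels=1000 * 800, max_pixels=3000 * 2500):
    """Simplified replica of Qwen's smart resize: snap both sides to
    multiples of `factor`, then rescale so the area fits the budget."""
    h = round(height / factor) * factor
    w = round(width / factor) * factor
    if h * w > max_pixels:  # too large: shrink proportionally
        beta = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:  # too small: enlarge proportionally
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w

# e.g. a 4000x3000 scan -> the size the model will actually ground on
print(smart_resize(3000, 4000))
```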