Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

smol improvements to support more flexible usage #34857

Merged

Conversation

andimarafioti
Copy link
Member

What does this PR do?

This PR improves a few smol issues we had with Idefics 3:

  1. We couldn't use the model with images larger than 5*364. This was the default max_image_size. The method where this was computed took a parameter as input, but it was never used. It would also raise an error if we wanted to resize to a larger size. I changed this for a default value of 4k resolution, as this is already considerably larger than what we trained on, ie, anything larger is pretty outrageous.
  2. We couldn't train with datasets that contained grayscale images since the input_data_format wasn't properly parsed. I fixed this by switching around the processing order. Now, if the images are grayscale, I add a channel to the end or start of the images. Then, the input_data_format can be correctly inferred if it is none.
  3. Finally, when converting to pil_image, we were not passing the input_data_format. For images that have 4 channels, this was breaking the processing. Since we already have the input_data_format in these functions, I added it.

Who can review?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kashif
Copy link
Contributor

kashif commented Nov 21, 2024

thanks @andimarafioti LGTM, I can load the dataset nicely now

Copy link
Member

@qubvel qubvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andimarafioti! Thanks for the update, just a note regarding converting to RGB

Comment on lines +751 to +760
# Extra channel dimension for grayscale images
if input_data_format in [ChannelDimension.LAST, None]:
images_list = [
[np.expand_dims(img, axis=-1) if img.ndim == 2 else img for img in images] for images in images_list
]
elif input_data_format == ChannelDimension.FIRST:
images_list = [
[np.expand_dims(img, axis=0) if img.ndim == 2 else img for img in images] for images in images_list
]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a method that converts PIL images to RGB format (see do_convert_rgb param in other image processors). It's better to use it + extend to numpy arrays if needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I know, the order of the conversion here is actually important as RGB conversion before image_splitting was hurting the model performance quite a bit D:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is why I'm not converting to RGB until later in the pipeline

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I know, the order of the conversion here is actually important as RGB conversion before image_splitting was hurting the model performance quite a bit D:

haha, that's interesting, do you have any clarification for such a behavior? 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My observation is that the order here makes the resulting images are slightly different (if you just plot the differences you see large values on edges in the images). They look the same to me, but clearly not to the model since it was trained with a different pipeline that more closely resembles this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it, thanks for clarifying 🤗

@mfarre
Copy link

mfarre commented Nov 21, 2024

@andimarafioti LGTM: I reran vlmevalkit with the current changes and I can confirm there is no regression

Copy link
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that RGB better be done if do_convert_rbg by updating the helper function from utils, otherwise LGTM

@andimarafioti
Copy link
Member Author

To clarify, here I'm not changing where or if RGB conversion happens.

@qubvel
Copy link
Member

qubvel commented Nov 22, 2024

Yeah, the idea is that we don’t actually need this code because the image will be converted to RGB. However, since maintaining this order is important for quality, it seems fine to me.

@andimarafioti andimarafioti merged commit 861758e into huggingface:main Nov 22, 2024
11 checks passed
@andimarafioti andimarafioti deleted the idefics3-smol-improvements branch November 22, 2024 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants