
Move from PIL to torchvision.io.decode_image #2303

Open
ebsmothers opened this issue Jan 28, 2025 · 12 comments
Labels: best practice (Things we should be doing but aren't) · community help wanted (We would love the community's help completing this issue)

Comments

@ebsmothers (Contributor)

Instead of using PIL.Image.open, @NicolasHug has pointed out that we can move to torchvision.io.decode_image to speed up image processing by doing everything on pure tensors (see also). This would also allow us to drop our explicit PIL requirement. This should entail:

  1. changing load_image to use torchvision.io.decode_image (a sketch of this step follows below)
  2. updating CLIPImageTransform to accept tensors instead of PIL.Image (in the short term we can keep PIL.Image support for backwards compatibility)
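
A minimal sketch of step 1, assuming load_image keeps a path-in, tensor-out shape (the exact signature and error handling here are illustrative, not torchtune's actual implementation):

```python
import torch
from torchvision.io import ImageReadMode, decode_image, read_file

def load_image(image_path: str) -> torch.Tensor:
    """Load an image from disk as a CHW uint8 tensor.

    Unlike PIL.Image.open, the result is already a torch.Tensor,
    so no PIL -> tensor conversion is needed downstream.
    """
    data = read_file(image_path)  # raw encoded bytes as a 1-D uint8 tensor
    # mode=RGB makes the decoder normalize grayscale/palette/RGBA inputs
    # to 3-channel RGB during decoding.
    return decode_image(data, mode=ImageReadMode.RGB)
```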
ebsmothers added the best practice and community help wanted labels on Jan 28, 2025
@Ankur-singh (Contributor)

@ebsmothers Can you please assign this to me?

Thank you 😁

@ebsmothers (Contributor, Author)

@Ankur-singh just assigned it to you, thanks for taking it on!

@NicolasHug (Member)

Feel free to ping me for help / reviews / best practices!

@RdoubleA (Contributor) commented Feb 8, 2025

If we do this, we should update all the documentation on Messages to say that image content is expected to be a tensor instead of a PIL image @ebsmothers @Ankur-singh

@Ankur-singh (Contributor)

@RdoubleA Just torch.Tensor or Union[torch.Tensor, PIL.Image]?

The vision transformer currently supports both.

@pbontrager (Contributor)

PIL was intentionally used as the only input type to enforce a consistent image API. A PIL image contains all the necessary metadata to ensure the right transforms are applied to it (e.g. RGB channels vs BGR), whereas tensors don't carry this information, so we would have to put the responsibility on the user to provide the data in the expected format. If we're able to get that info from torchvision, though, that changes things.
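
To make the concern concrete (an illustrative sketch; the file name is hypothetical): a PIL image is self-describing about its channel layout, while a raw tensor is not.

```python
import torch
from PIL import Image

pil_img = Image.open("photo.jpg")  # hypothetical file
print(pil_img.mode)  # e.g. "RGB", "L", or "CMYK" -- the layout travels with the image

tensor_img = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8)
# Nothing in the tensor itself says whether its channels are RGB or BGR;
# the caller has to know (and honor) the convention.
```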

@Ankur-singh (Contributor)

Keeping backward compatibility in mind, should we support PIL images as well? We can add a deprecation warning and drop PIL image support in the next release.

Or we can drop the support right away and update all the docs to use torch.Tensor.

@NicolasHug (Member)

Torchvision doesn't carry the channel ordering (RGB vs BGR), but that's not a torchvision thing, it's a PyTorch thing: all image tensors in PyTorch are expected to be NCHW RGB images. That's the one and only standard, so there was never a need to explicitly store it. I think it's a widely understood standard, e.g. every model (on HF or other hubs) expects RGB tensors.

I'm happy to dig more with you to better understand your concerns.
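
As a concrete illustration of that convention (a sketch; the model choice and file name are just for illustration), torchvision's bundled preprocessing pipelines assume the decoded tensor is already CHW RGB:

```python
import torch
from torchvision.io import decode_image, read_file
from torchvision.models import ResNet18_Weights, resnet18

weights = ResNet18_Weights.DEFAULT
model = resnet18(weights=weights).eval()
preprocess = weights.transforms()  # resize/crop/normalize; assumes RGB input

img = decode_image(read_file("photo.jpg"))  # CHW uint8, channels in RGB order
batch = preprocess(img).unsqueeze(0)        # NCHW float32, normalized
with torch.no_grad():
    logits = model(batch)
```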

@Ankur-singh (Contributor)

Thanks @NicolasHug.

I was talking about CLIPImageTransform. It accepts both PIL images and torch tensors. With the new PR (#2366), load_image returns a torch tensor, so all built-in multimodal datasets use torch tensors all the way through.

I see a few examples in the docs that use PIL when creating messages (Interleaving images in text and Multimodal Transforms); those are the docs I was talking about. Should we drop the support right away and update all the docs, or add a deprecation warning and drop support in the next release?

@pbontrager (Contributor)

I guess the key issue is whether it's reasonable for load_image to assume the image provided is in RGB format. Yes, PyTorch expects users to convert images to RGB, but that's when they're setting up all the code themselves and are responsible for the dataset. torchtune goes a bit further and says "just point to any HF dataset and this recipe will work". Also, load_image is used in the inference recipe and lets the user point to any random URL.

The PIL API is nice in that it's very consistent, but an alternative approach would be to use torchvision in load_image and have load_image verify that the image is RGB (I think that's the only format information we need to verify?). A user could bypass this safety check by not using load_image, but at least our default out-of-the-box solution would be safe.

@NicolasHug do you know if that verification could be done cheaply without PIL?
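
One possibility (a sketch, assuming torchvision's ImageReadMode is acceptable here): rather than verifying after the fact, the decoder can be asked to convert to RGB while decoding, which sidesteps a separate check entirely.

```python
from torchvision.io import ImageReadMode, decode_image, read_file

# mode=ImageReadMode.RGB makes the decoder convert grayscale, palette,
# and RGBA sources to 3-channel RGB, so no post-hoc verification is needed.
img = decode_image(read_file("any_image.png"), mode=ImageReadMode.RGB)
assert img.shape[0] == 3  # always three channels, in RGB order
```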

@pbontrager (Contributor)

@Ankur-singh I talked to Nicolas and he informed me that torchvision's decoders guarantee RGB ordering in the decoded output. So I have no more issues with this change.

I think it would be fairly safe to drop PIL support right away, since it's mostly used as an intermediary between load_image and the model transform, but it would be better to deprecate it over one release.

@NicolasHug (Member) commented Feb 12, 2025

One thought for the transforms (not the decoders): if you think it's worth it in terms of UX, you might still be able to transparently support both tensors and PIL images without necessarily having a direct dependency on PIL. As long as you only use the torchvision transforms (which support both PIL and tensors), you should be fine.
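
For example (a minimal sketch assuming a recent torchvision with the transforms.v2 API; the pipeline is illustrative, not CLIPImageTransform's actual one):

```python
import torch
from torchvision.transforms import v2

# The same v2 pipeline accepts both CHW tensors and PIL Images,
# so the calling code needs no direct PIL import.
transform = v2.Compose([
    v2.Resize((224, 224)),
    v2.ToImage(),                           # converts PIL Images (or tensors) to tensor images
    v2.ToDtype(torch.float32, scale=True),  # uint8 [0, 255] -> float32 [0, 1]
])

chw = torch.randint(0, 256, (3, 333, 500), dtype=torch.uint8)
out = transform(chw)  # works on a tensor...
# ...and transform(pil_image) would work identically on a PIL.Image.Image.
```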
