Mapping from spatial features to summary feature #81

Open

OasisArtisan opened this issue Aug 8, 2024 · 3 comments

Comments

@OasisArtisan

Hello, I was wondering if there is a way to map the spatial features (or a crop of them) to the summary feature?

I see that the released 2.5 models use CLS tokens for the summary, as opposed to pooling the spatial features, so there is no direct mapping between the spatial and summary features.

Why do I care? Because I want to be able to get summary features for many crops of a single image without rerunning the whole encoder for each crop.

Why summary features? Because those are the ones that can be language-aligned after the CLIP summary adapter.
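For reference, this is the kind of single forward pass I mean (a minimal sketch, assuming the `torch.hub` entry point and the `(summary, spatial_features)` return convention from the RADIO README; the version string is just an example):

```python
import torch

# Load a released RADIO model (version string is illustrative).
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-b', progress=True)
model.eval()

x = torch.rand(1, 3, 512, 512)  # B, C, H, W; H and W multiples of the patch size

with torch.no_grad():
    summary, spatial_features = model(x)

# summary:          roughly (B, D_summary) -- from CLS token(s), the part that can be
#                   language-aligned after the CLIP summary adapter
# spatial_features: roughly (B, num_patches, D) -- per-patch tokens, with no direct
#                   mapping back to the summary space
print(summary.shape, spatial_features.shape)
```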

Your thoughts on this are highly appreciated.

@mranzinger
Collaborator

Hi. Unfortunately, there isn't a direct way to "ground" the spatial features so that they live in the same representation space as the summary token(s). I've been looking into doing this for the CLIP head so that we get true grounding (thus unlocking zero-shot semantic segmentation), but I haven't found an approach that works well yet.
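To make "true grounding" concrete: if the spatial tokens did live in the CLIP space, zero-shot semantic segmentation would reduce to a per-token nearest-text-embedding lookup, roughly like the sketch below (the function, shapes, and the assumption that tokens are already projected into the CLIP space are illustrative, not code from this repo):

```python
import torch
import torch.nn.functional as F

def zero_shot_semseg(spatial_tokens, grid_hw, text_embeds):
    """Hypothetical zero-shot segmentation, assuming spatial tokens are
    already grounded in the CLIP embedding space.

    spatial_tokens: (B, H*W, D) tokens projected into the CLIP space
    grid_hw:        (H, W) size of the patch grid
    text_embeds:    (C, D) one embedding per class name from the text encoder
    returns:        (B, H, W) predicted class index per patch
    """
    H, W = grid_hw
    f = F.normalize(spatial_tokens, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = f @ t.t()                        # (B, H*W, C) cosine similarities
    return logits.argmax(dim=-1).reshape(-1, H, W)
```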

@OasisArtisan
Author

I see. Yeah, zero-shot semantic segmentation would be amazing to see; I guess that's the limiting case of taking the crop down to a single pixel.

What if we use the trained RADIO variant that uses global pooling to get the summary feature? That way we should be able to crop the spatial features and then pool them to get a summary, along the lines of the sketch below. I don't think it would work for very tiny crops or single pixels, but it might work for bigger crops?
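Something like this (a minimal sketch, assuming the spatial features come back as a flat `(B, H*W, D)` token grid and using plain average pooling as a stand-in for whatever pooling that variant actually uses):

```python
import torch

def crop_summary(spatial_features, grid_hw, crop_box):
    """Approximate a summary feature for a crop by average-pooling the
    spatial tokens that fall inside the crop.

    spatial_features: (B, H*W, D) patch tokens from a single forward pass
    grid_hw:          (H, W) size of the patch grid
    crop_box:         (top, left, bottom, right) in patch-grid coordinates
    """
    B, N, D = spatial_features.shape
    H, W = grid_hw
    assert N == H * W
    grid = spatial_features.reshape(B, H, W, D)
    t, l, b, r = crop_box
    crop = grid[:, t:b, l:r, :]      # (B, b-t, r-l, D) tokens inside the crop
    return crop.mean(dim=(1, 2))     # (B, D) pooled "summary" for the crop

# Example: pool the top-left quadrant of a 32x32 patch grid.
feats = torch.rand(1, 32 * 32, 768)
approx_summary = crop_summary(feats, (32, 32), (0, 0, 16, 16))
print(approx_summary.shape)  # torch.Size([1, 768])
```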

Do you have plans to release the weights of the pooling version of RADIO mentioned in the paper?

I can run tests to see the feasibility.

@mranzinger
Collaborator

We don't currently have any plans to release the pooled model. It was developed on some quite old code, so it wouldn't really be compatible with what we've done since, particularly w.r.t. fixing mode switching.

I definitely understand the desire to have the summary and spatial features live in a shared manifold, because it makes this form of zero/few-shot grounding significantly easier. I've found that when naively training against the CLIP model with average pooling, attention pooling, or "linear attention pooling" (the weights are attention-like, but the "values" head is the identity instead of an affine projection), we get zero-shot semantic segmentation on ADE20k somewhere between 6 and 8 mIoU, which is far worse than our linear-probe mIoU. Weirder yet, our ViT-B model does better than our ViT-H model, as the latter seems to map most/all spatial features to a single value for the summary, as opposed to learning to combine distinct concepts via the feature summation.
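To clarify what I mean by "linear attention pool", here is a minimal sketch (my own naming and shapes, not the actual training code): the pooling weights are computed like attention scores against a learned query, but the pooled output is a weighted sum of the raw tokens, i.e. the values head is the identity.

```python
import torch
import torch.nn as nn

class LinearAttentionPool(nn.Module):
    """Attention-like pooling: a learned query and a key projection produce
    the weights, but the 'values' are the raw tokens (identity, no affine)."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim) / dim ** 0.5)  # learned pooling query
        self.key = nn.Linear(dim, dim)                            # key projection
        self.scale = dim ** -0.5

    def forward(self, tokens):                        # tokens: (B, N, D) spatial features
        k = self.key(tokens)                          # (B, N, D)
        logits = (k @ self.query) * self.scale        # (B, N) attention-like scores
        w = logits.softmax(dim=-1)                    # (B, N) pooling weights
        return (w.unsqueeze(-1) * tokens).sum(dim=1)  # (B, D) pooled summary

pool = LinearAttentionPool(dim=768)
summary = pool(torch.rand(2, 196, 768))
print(summary.shape)  # torch.Size([2, 768])
```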

Take for example the attached image (Left: Input, Middle: Prediction, Right: GT) for zero-shot semseg. While some structure exists in the predictions, it's mostly just pulling a few key concepts and arbitrarily placing them.

We're actively looking into ways to rectify this bad behavior.
[Attached image: bad_semseg]
