
why the dimension of per-frame image conditioning is 16? #119

Open · huge123 opened this issue Aug 1, 2024 · 3 comments

huge123 commented Aug 1, 2024

if l_context == 77 + t*16: ## !!! HARD CODE here

I think this code rearranges the per-frame condition embeddings from (t l) to t, l, but why is the dimension of the image condition 16? I think the embedding dimension after img_proj_model should be 256.

img_emb = model.image_proj_model(img_emb)

I think it should be if l_context == 77 + t*256, right?
Or am I missing something?

lzhangbj commented Aug 1, 2024

I was also confused here but figured it out.

In the config file, video_length is set to 16 and num_queries is set to 16 for image_proj_stage_config, which means each conditioning image is projected to 16*16 = 256 tokens. The first 77 tokens of l_context are text tokens, and the 256 tokens after them are image tokens. Unlike the text tokens, which are repeated for each of the 16 frames, the image tokens already carry a temporal dimension (video_length) and are rearranged to (16, 16), i.e. 16 tokens per frame. A minimal sketch of this token layout is shown below.
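To make the layout concrete, here is a minimal sketch of how the 77 + t*16 context tokens can be split per frame (the variable names and shapes are illustrative assumptions for this comment, not the repo's exact code):

```python
import torch
from einops import rearrange, repeat

b, t, n, d = 1, 16, 16, 1024   # assumed: batch, video_length, num_queries, context dim

text_emb = torch.randn(b, 77, d)       # CLIP text tokens
img_emb = torch.randn(b, t * n, d)     # projected image tokens: 16 * 16 = 256

context = torch.cat([text_emb, img_emb], dim=1)
l_context = context.shape[1]

if l_context == 77 + t * 16:  # the hard-coded check quoted in the issue
    text_part, img_part = context[:, :77], context[:, 77:]
    # text tokens are shared across time, so repeat them for every frame
    text_per_frame = repeat(text_part, 'b l d -> (b t) l d', t=t)
    # image tokens already carry the temporal axis: split 256 -> (t=16, n=16)
    img_per_frame = rearrange(img_part, 'b (t n) d -> (b t) n d', t=t)
    per_frame_context = torch.cat([text_per_frame, img_per_frame], dim=1)
    print(per_frame_context.shape)  # torch.Size([16, 93, 1024]): 77 + 16 tokens per frame
```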

So essentially, the image projector projects the conditioning image's CLIP features into video features. It was trained to predict what kind of spatio-temporal patterns can arise from a single image, so there is no need to repeat it by t again.

This can be inferred from here: Point 2 of this issue.

Having said that, I am also wondering why the author did not project into image-only features (say, 16 queries) and then repeat them along the temporal dimension. Does it make a difference? @Doubiiu
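For reference, here is a hypothetical sketch of the two options being compared (again with assumed shapes and names, not the repo's code): per-frame tokens emitted by the projector vs. a single set of image tokens tiled along time.

```python
import torch
from einops import rearrange, repeat

b, t, n, d = 1, 16, 16, 1024  # assumed: batch, video_length, num_queries, context dim

# (a) current design: the projector emits t*n = 256 tokens, 16 per frame
video_tokens = torch.randn(b, t * n, d)
per_frame_a = rearrange(video_tokens, 'b (t n) d -> (b t) n d', t=t)

# (b) alternative: the projector emits only n = 16 image-level tokens, repeated per frame
image_tokens = torch.randn(b, n, d)
per_frame_b = repeat(image_tokens, 'b n d -> (b t) n d', t=t)

print(per_frame_a.shape, per_frame_b.shape)  # both torch.Size([16, 16, 1024])
```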

Doubiiu (Owner) commented Aug 8, 2024

Hi @huge123 and @lzhangbj, yes, what @lzhangbj said is correct. We intended to give the model room to learn some temporal variations in a video. However, due to the limited training compute and temporal coherence quality (e.g. the maximum video length we can hold), it only made a slight difference and improvement (which is why we didn't emphasize this architecture in the main paper and only show it in the supplementary document). We hope this insight can inspire further research to some extent.

lzhangbj commented Aug 8, 2024

Thank you for the answer! It helped a lot.
