
why the dimension of per-frame image conditioning is 16? #119

Open · huge123 opened this issue Aug 1, 2024 · 3 comments

huge123 commented Aug 1, 2024

if l_context == 77 + t*16: ## !!! HARD CODE here

I think this code rearranges the per-frame condition embeddings from (t l) to t, l, but why is the dimension of the image condition 16? I think the embedding dimension after img_proj_model should be 256.

img_emb = model.image_proj_model(img_emb)

I think it should be if l_context == 77 + t*256, right?
Or am I missing something?

lzhangbj commented Aug 1, 2024

I was also confused here but figured it out.

In the config file, video_length is set to 16 and num_queries is set to 16 for image_proj_stage_config, which means each conditioning image is projected to 16*16 = 256 tokens. The first 77 tokens of l_context are text tokens, and the 256 tokens after them are image tokens. Unlike the text tokens, which are repeated for each of the 16 frames, the image tokens already carry a temporal dimension (video_length) and are rearranged to (16, 16), i.e. 16 tokens per frame. A minimal sketch of this token layout is shown below.
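To make the layout concrete, here is a minimal sketch of how the 77 + t*16 context tokens can be split per frame (the variable names and shapes are illustrative assumptions for this comment, not the repo's exact code):

```python
import torch
from einops import rearrange, repeat

b, t, n, d = 1, 16, 16, 1024   # assumed: batch, video_length, num_queries, context dim

text_emb = torch.randn(b, 77, d)       # CLIP text tokens
img_emb = torch.randn(b, t * n, d)     # projected image tokens: 16 * 16 = 256

context = torch.cat([text_emb, img_emb], dim=1)
l_context = context.shape[1]

if l_context == 77 + t * 16:  # the hard-coded check quoted in the issue
    text_part, img_part = context[:, :77], context[:, 77:]
    # text tokens are shared across time, so repeat them for every frame
    text_per_frame = repeat(text_part, 'b l d -> (b t) l d', t=t)
    # image tokens already carry the temporal axis: split 256 -> (t=16, n=16)
    img_per_frame = rearrange(img_part, 'b (t n) d -> (b t) n d', t=t)
    per_frame_context = torch.cat([text_per_frame, img_per_frame], dim=1)
    print(per_frame_context.shape)  # torch.Size([16, 93, 1024]): 77 + 16 tokens per frame
```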

So essentially, the image projector projects the conditioning image's CLIP features into video features. It was trained to predict what kind of spatio-temporal patterns can arise from a single image, so there is no need to repeat it by t again.

This can be inferred from here: Point 2 of this issue.

Having said that, I am also wondering why the author did not project into image-only features (say, 16 queries) and then repeat them along the temporal dimension. Does it make a difference? @Doubiiu
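For reference, here is a hypothetical sketch of the two options being compared (again with assumed shapes and names, not the repo's code): per-frame tokens emitted by the projector vs. a single set of image tokens tiled along time.

```python
import torch
from einops import rearrange, repeat

b, t, n, d = 1, 16, 16, 1024  # assumed: batch, video_length, num_queries, context dim

# (a) current design: the projector emits t*n = 256 tokens, 16 per frame
video_tokens = torch.randn(b, t * n, d)
per_frame_a = rearrange(video_tokens, 'b (t n) d -> (b t) n d', t=t)

# (b) alternative: the projector emits only n = 16 image-level tokens, repeated per frame
image_tokens = torch.randn(b, n, d)
per_frame_b = repeat(image_tokens, 'b n d -> (b t) n d', t=t)

print(per_frame_a.shape, per_frame_b.shape)  # both torch.Size([16, 16, 1024])
```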

Doubiiu (Owner) commented Aug 8, 2024

Hi @huge123 and @lzhangbj, yes, what @lzhangbj said is correct. We intended to give the model room to learn some temporal variations in a video. However, due to the limited training compute and temporal coherence quality (e.g. the maximum video length we can hold), it only made a slight difference and improvement (which is why we didn't emphasize this architecture in the main paper and only show it in the supplementary document). We hope this insight can inspire further research to some extent.

lzhangbj commented Aug 8, 2024

Thank you for the answer! It helped a lot.
