Why is the dimension of the per-frame image conditioning 16? #119
Comments
I was also confused here but figured it out. In the config file, video_length is set to 16 and num_queries is set to 16 for image_proj_stage_config, which means each conditioning image is projected to 16*16 = 256 tokens. The first 77 tokens of l_context are then text tokens, and the 256 tokens after them are image tokens. In contrast to the text tokens, which are repeated for each of the 16 frames, the image tokens already carry the temporal dimension (video_length) and are rearranged to (16, 16), i.e. 16 tokens per frame. So essentially, the image projector is projecting the conditioning image's CLIP features into video features: it was trained to predict what kinds of spatio-temporal patterns can arise from a single image, so there is no need to repeat them t times. This can be inferred from here: point 2 of this issue. Having said that, I am also wondering why the author did not project into image features only (say, 16 queries) and then repeat them along the temporal dimension. Does it make a difference? @Doubiiu
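A minimal sketch of this token layout, assuming video_length = 16 and num_queries = 16 as described above; the tensor names and the 1024-dim embedding width are illustrative, not the repo's exact code:

```python
# Illustrative sketch (not DynamiCrafter's actual code) of splitting the
# context into text tokens (repeated per frame) and image tokens (already
# temporal, 16 per frame after the rearrange).
import torch
from einops import rearrange, repeat

t = 16                            # video_length
n_text = 77                       # CLIP text tokens
n_img = 16                        # image tokens per frame (num_queries)

# 77 text tokens + 16*16 = 256 image tokens -> l_context = 333
context = torch.randn(1, n_text + t * n_img, 1024)

text_tokens = context[:, :n_text]     # (1, 77, 1024), shared across frames
img_tokens = context[:, n_text:]      # (1, 256, 1024), carries the temporal axis

text_per_frame = repeat(text_tokens, 'b l c -> (b t) l c', t=t)        # (16, 77, 1024)
img_per_frame = rearrange(img_tokens, 'b (t l) c -> (b t) l c', t=t)   # (16, 16, 1024)

per_frame_context = torch.cat([text_per_frame, img_per_frame], dim=1)  # (16, 93, 1024)
```

With this layout, each frame attends to the same 77 text tokens but to its own 16 image tokens.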
Hi @huge123 and @lzhangbj, yeah, what @lzhangbj said is correct. We intended to give the model room to learn some temporal variations within a video. However, given our limited training compute and the temporal-coherence constraints (e.g., the maximum video length we could hold), it made only a slight difference and improvement, which is why we didn't emphasize this architecture in the main paper and only showed it in the supplementary document. We hope this insight can inspire further research to some extent.
Thank you for the answer! It helped a lot. |
DynamiCrafter/lvdm/modules/networks/openaimodel3d.py, line 556 (commit c453369):
I think this code rearranges the per-frame condition embeddings from (t l) to t l. But why is the dimension of the image condition 16? I think the embedding dimension after img_proj_model should be 256.
DynamiCrafter/scripts/evaluation/inference.py, line 177 (commit c453369):
I think it should be if l_context == 77 + t*256, right? Or am I missing something?
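For what it's worth, a quick arithmetic check under the assumptions from the answer above (video_length = 16, 16 image tokens per frame; variable names are illustrative): the projector emits 256 image tokens in total, but only 16 of them belong to each frame, so a check of the form l_context == 77 + t*16 is consistent with that layout, while 77 + t*256 would instead imply 256 image tokens per frame.

```python
# Token-count arithmetic implied by the answer above (illustrative names only).
t = 16                                  # video_length
tokens_per_frame = 16                   # num_queries, per-frame image tokens

total_img_tokens = t * tokens_per_frame # 256 tokens out of img_proj_model
l_context = 77 + total_img_tokens       # 333 = 77 text tokens + 256 image tokens

assert l_context == 77 + t * 16         # consistent with 16 image tokens per frame
assert l_context != 77 + t * 256        # t*256 would mean 256 tokens per frame
```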