# [WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens](https://arxiv.org/abs/2401.09985)

_March 2024_

tl;dr: Multimodal world model via masked token prediction.

#### Overall impression
The model takes in a variety of modalities such as image/video, text, and actions, and generates videos conditioned on these multimodal prompts.

World models hold great promise for learning motion and physics in the general world, which is essential for coherent and reasonable video generation.

The paper seems unfinished and rushed to release on arXiv, without much comparison with contemporary work. The paper is also heavily inspired by MaskGIT, especially its masked token prediction and parallel decoding.

> During training, MaskGIT is trained on a similar proxy task to the mask prediction in BERT. At inference time, MaskGIT adopts a novel non-autoregressive decoding method to synthesize an image in constant number of steps.
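
A minimal sketch of this BERT-style masked-token training objective, in PyTorch. The `transformer` interface, tensor shapes, and the cosine masking schedule are assumptions borrowed from MaskGIT, not details confirmed by the paper:

```python
import math
import torch
import torch.nn.functional as F

def masked_token_training_step(transformer, visual_tokens, cond_embeds, mask_id):
    """One BERT-like proxy-task step: replace a random fraction of visual
    tokens with [MASK] and train the model to reconstruct them, conditioned
    on (possibly missing) text/action embeddings."""
    B, N = visual_tokens.shape
    # Sample a per-example masking ratio from MaskGIT's cosine schedule.
    ratio = torch.cos(0.5 * math.pi * torch.rand(B, device=visual_tokens.device))
    num_mask = (ratio * N).long().clamp(min=1)
    # Randomly choose num_mask[i] positions per row to mask.
    ranks = torch.rand(B, N, device=visual_tokens.device).argsort(dim=1).argsort(dim=1)
    mask = ranks < num_mask[:, None]
    inputs = visual_tokens.masked_fill(mask, mask_id)
    logits = transformer(inputs, cond_embeds)  # (B, N, vocab_size)
    # Cross-entropy only on the masked positions, as in BERT.
    return F.cross_entropy(logits[mask], visual_tokens[mask])
```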

#### Key ideas
- Architecture
    - Encoder
        - Vision: VQ-GAN, vocab = 8192
        - Text: pretrained T5, similar to [GAIA-1](gaia_1.md).
        - Action: MLP
        - Text and action embeddings can be missing.
    - Masked prediction
    - Decoder: parallel decoding
- Training with masks.
    - Dataset: triplets of (visual, text, action), but the model also supports data with missing modalities.
- Inference: parallel decoding (see the sketch after this list)
    - Diffusion: requires ~30 denoising steps.
    - Autoregressive: needs ~200 steps to iteratively predict the next token.
    - Parallel decoding: video generation in ~10 steps.
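
As a concrete illustration of the ~10-step regime, here is a minimal sketch of MaskGIT-style parallel decoding in PyTorch. The confidence-based re-masking rule and cosine schedule follow MaskGIT; the `transformer` interface and all names are illustrative assumptions:

```python
import math
import torch

@torch.no_grad()
def parallel_decode(transformer, cond_embeds, num_tokens, mask_id, steps=10):
    """Iterative non-autoregressive decoding: start from all-masked tokens,
    and at each step keep the most confident predictions and re-mask the rest."""
    device = cond_embeds.device
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long, device=device)
    for t in range(steps):
        logits = transformer(tokens, cond_embeds)   # (1, N, vocab_size)
        probs = logits.softmax(dim=-1)
        confidence, pred = probs.max(dim=-1)        # per-token confidence
        # Cosine schedule: fraction of tokens that stay masked after this step.
        num_masked = int(math.cos(0.5 * math.pi * (t + 1) / steps) * num_tokens)
        # Already-fixed tokens get infinite confidence so they are never re-masked.
        confidence = confidence.masked_fill(tokens != mask_id, float("inf"))
        tokens = torch.where(tokens == mask_id, pred, tokens)
        if num_masked > 0:
            # Re-mask the num_masked least confident positions.
            idx = confidence.topk(num_masked, largest=False).indices
            tokens.scatter_(1, idx, mask_id)
    return tokens
```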

#### Technical details
- The key assumption underlying the effectiveness of parallel decoding is a Markovian property: many tokens are conditionally independent given the other tokens (from [MaskGIT](https://masked-generative-image-transformer.github.io/) and Muse). See the factorization below.
- [PySceneDetect](https://github.com/Breakthrough/PySceneDetect) is used to detect scene switches (minimal usage sketch after this list).
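
Stated as a factorization (notation mine): with $\mathcal{M}$ the set of masked positions and $x_{\bar{\mathcal{M}}}$ the visible tokens, parallel decoding assumes

$$
p\left(x_{\mathcal{M}} \mid x_{\bar{\mathcal{M}}}\right) \approx \prod_{i \in \mathcal{M}} p\left(x_i \mid x_{\bar{\mathcal{M}}}\right),
$$

so all masked tokens can be predicted jointly in a single forward pass, and the iterative re-masking only corrects positions where this approximation is poor.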
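
For reference, a minimal sketch of scene-cut detection with PySceneDetect's high-level v0.6 API; the threshold and file name are illustrative:

```python
# pip install scenedetect[opencv]
from scenedetect import detect, ContentDetector

# Flag a cut whenever frame-to-frame content change exceeds the threshold.
scenes = detect("training_clip.mp4", ContentDetector(threshold=27.0))

for i, (start, end) in enumerate(scenes):
    # Each scene is a (start, end) pair of FrameTimecode objects.
    print(f"scene {i}: {start.get_timecode()} -> {end.get_timecode()}")
```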

#### Notes
- Questions and notes on how to improve/revise the current work