# [DriveGAN: Towards a Controllable High-Quality Neural Simulation](https://arxiv.org/abs/2104.15060)

_March 2024_

tl;dr: A neural simulator with a disentangled latent space, based on a GAN-trained encoder and an RNN-style dynamics model.

#### Overall impression
DriveGAN uses a VAE to map pixels into a latent space. GAN-style adversarial training is used to train the VAE, hence the name DriveGAN.

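A minimal sketch of this encoder/decoder setup, assuming PyTorch; the module sizes, loss weights, and the `patch_disc` discriminator below are illustrative placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    """Toy VAE: image -> (mu, logvar) -> sampled latent -> reconstruction."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Flatten())
        self.to_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.to_logvar = nn.Linear(64 * 16 * 16, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

patch_disc = nn.Sequential(  # tiny patch discriminator for the adversarial term
    nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 4, 2, 1))

def generator_loss(vae, x, w_kl=1.0, w_adv=0.1):
    """VAE losses plus a GAN term pushing reconstructions toward realism.
    The discriminator has its own (omitted) loss on real vs. reconstructed."""
    recon, mu, logvar = vae(x)
    rec = F.l1_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    logits = patch_disc(recon)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return rec + w_kl * kl + w_adv * adv
```
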
The proposed architecture is a very general one for a world model, and is actually very similar to more recent works such as [GAIA-1](gaia_1.md) and [Genie](genie.md). The original World Model by Schmidhuber is also based on a VAE and an RNN. Over the years, the encoder/decoder evolved from VAE + GAN to VQ-VAE + diffusion model, and the dynamics model evolved from RNN to Transformer-based, GPT-like next-token prediction. It is interesting to see how new techniques shine within this relatively old (although only two-year-old) framework. Two advances stand out: more powerful and scalable modules, and much, much more data.

The main innovation of the paper is the disentanglement of the latent representation into a spatial-agnostic **theme** and spatial-aware **content** at the encoding stage, and the further disentanglement of **content** into action-**dependent** and action-**independent** parts in the dynamics engine.

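A sketch of how the first split can be enforced purely structurally (the shapes, channel counts, and pooling choices below are my assumptions; the paper's actual latent totals 1152 dimensions):

```python
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """Split the latent into a spatial-agnostic theme vector and a
    spatially arranged content grid. Sizes are illustrative only."""
    def __init__(self, theme_dim=128, content_ch=64, grid=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(grid))              # (B, 64, 4, 4)
        # Theme head: global pooling destroys spatial structure -> one vector.
        self.theme_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, theme_dim))
        # Content head: keeps the grid, so each cell describes one region.
        self.content_head = nn.Conv2d(64, content_ch, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.theme_head(h), self.content_head(h)

enc = DisentangledEncoder()
theme, content = enc(torch.randn(2, 3, 64, 64))
print(theme.shape, content.shape)  # (2, 128) and (2, 64, 4, 4)
```

Because the theme latent carries no spatial dimensions, swapping it between two frames changes global attributes (weather, background color) while the content grid keeps the spatial layout intact.
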
The controllability of DriveGAN is achieved via careful architecture design. More modern approaches instead rely on scale and more unified interfaces (e.g., natural language).

The action of the agent can be recovered by training a separate model, and then used to reproduce the scene. This is similar to the idea of [VPT](vpt.md). In a way, it verifies the controllability and geometric consistency of the simulation.

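A minimal sketch of such an action-recovery probe (the two-frame CNN and the continuous `(steering, speed)` action space are assumptions for illustration):

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Predict the action that took frame t to frame t+1 (VPT-style probe)."""
    def __init__(self, action_dim=2):  # e.g. (steering, speed); assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, 2, 1), nn.ReLU(),   # two RGB frames stacked
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, action_dim))

    def forward(self, frame_t, frame_tp1):
        return self.net(torch.cat([frame_t, frame_tp1], dim=1))

# Train on real (frame_t, frame_t+1, action) triplets; then run the probe on
# simulated rollouts generated with known actions. Agreement between the
# probe's predictions and the conditioning actions measures how faithfully
# the simulator executes the requested actions.
```
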
The paper has a very nice discussion of what a neural simulator should look like: first, the generations have to be realistic, and second, they need to be faithful to the action sequence used to produce them.

#### Key ideas
- Controllability via a disentangled latent representation
    - 1152-dimensional latent space
    - Theme: weather, background color
    - Content: spatial info
        - Action-dependent: layout
        - Action-independent: object types
- Latent representation
    - Uses a VAE and a GAN to learn a latent space of images, enabling high resolution and fidelity in frame synthesis conditioned on agent actions. The GAN comes into play at the decoding stage. --> This is upgraded to diffusion models in more recent works.
    - The **disentanglement** is achieved by forcing the theme latent to be one single vector without spatial dimensions.
- Dynamics engine based on RNNs (see the sketch after this list)
    - The disentanglement is achieved by using two RNNs, with one conditioned on the action and the other not.
    - The neural network seems to learn meaningful representations well aligned with the intention of the architecture design, which is remarkable given that SGD will cheat and use shortcuts whenever possible.
- Multistage training
    - Encoder/decoder training
    - Dynamics engine training with the encoder/decoder frozen
- Differentiable simulation: varying the disentangled latent vectors to reproduce or generate new contents in a controllable way
- Eval
    - FVD
    - **Action consistency**: the action of the agent can be predicted by feeding two frames from real videos into a dedicated model. This model can then be deployed on simulated frames (generated with GT actions) to verify action consistency.

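A sketch of the two-RNN dynamics idea from the list above (the GRU cells, sizes, and fusion layer are placeholders; DriveGAN's actual engine differs in detail):

```python
import torch
import torch.nn as nn

class TwoBranchDynamics(nn.Module):
    """One recurrent branch sees the action, the other does not, which
    encourages a split into action-dependent and action-independent content."""
    def __init__(self, content_dim=256, action_dim=2, hidden=256):
        super().__init__()
        self.action_rnn = nn.GRUCell(content_dim + action_dim, hidden)
        self.free_rnn = nn.GRUCell(content_dim, hidden)  # no action input
        self.fuse = nn.Linear(2 * hidden, content_dim)   # next content latent

    def forward(self, content, action, h_act, h_free):
        h_act = self.action_rnn(torch.cat([content, action], dim=-1), h_act)
        h_free = self.free_rnn(content, h_free)
        return self.fuse(torch.cat([h_act, h_free], dim=-1)), h_act, h_free

# Stage-2 training (encoder/decoder frozen): roll the engine forward in latent
# space and supervise the predicted next content against the frozen encoder's
# latent for the actual next frame.
dyn = TwoBranchDynamics()
content, action = torch.randn(2, 256), torch.randn(2, 2)
h_act, h_free = torch.zeros(2, 256), torch.zeros(2, 256)
next_content, h_act, h_free = dyn(content, action, h_act, h_free)
```
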
#### Technical details
- Summary of technical details

#### Notes
- Questions and notes on how to improve/revise the current work