Add DriveGAN by Nvidia

patrick-llgc committed Mar 4, 2024
1 parent ae68ed1 commit 1732e7b
Showing 2 changed files with 47 additions and 2 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -34,11 +34,12 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
- [Multimodal Regression](https://towardsdatascience.com/anchors-and-multi-bin-loss-for-multi-modal-target-regression-647ea1974617)
- [Paper Reading in 2019](https://towardsdatascience.com/the-200-deep-learning-papers-i-read-in-2019-7fb7034f05f7?source=friends_link&sk=7628c5be39f876b2c05e43c13d0b48a3)

-## 2024-03 (4)
+## 2024-03 (5)
- [Genie: Generative Interactive Environments](https://arxiv.org/abs/2402.15391) [[Notes](paper_notes/genie.md)] [DeepMind, World Model]
- [DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving](https://arxiv.org/abs/2309.09777) [[Notes](paper_notes/drive_dreamer.md)] [Jiwen Lu, World Model]
- [WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens](https://arxiv.org/abs/2401.09985) [Jiwen Lu, World Model]
- [Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2304.08818) [[Notes](paper_notes/video_ldm.md)] <kbd>CVPR 2023</kbd> [Sanja, Nvidia, VideoLDM, Video prediction]
- [Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos](https://arxiv.org/abs/2206.11795) <kbd>NeurIPS 2022</kbd> [OpenAI]
+- [DriveGAN: Towards a Controllable High-Quality Neural Simulation](https://arxiv.org/abs/2104.15060) [[Notes](paper_notes/drive_gan.md)] <kbd>CVPR 2021 oral</kbd> [Nvidia, Sanja]
- [VideoGPT: Video Generation using VQ-VAE and Transformers](https://arxiv.org/abs/2104.10157) [[Notes](paper_notes/videogpt.md)] [Pieter Abbeel]
- [OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving](https://arxiv.org/abs/2311.16038) [Jiwen Lu, World Model]
@@ -1479,7 +1480,6 @@ Feature Extraction](https://arxiv.org/abs/2010.02893) [monodepth, semantics, Nav
- [Interactive Prediction and Planning for Autonomous Driving: from Algorithms to Fundamental Aspects](https://escholarship.org/uc/item/0vf4q2x1) [PhD thesis of Wei Zhan, 2019]
- [Lyft1001: One Thousand and One Hours: Self-driving Motion Prediction Dataset](https://arxiv.org/abs/2006.14480) [Lyft Level 5, prediction dataset]
- [PCAccumulation: Dynamic 3D Scene Analysis by Point Cloud Accumulation](https://arxiv.org/abs/2207.12394) <kbd>ECCV 2022</kbd>
-- [Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos](https://arxiv.org/abs/2206.11795) <kbd>NeurIPS 2022</kbd>
- [UniSim: A Neural Closed-Loop Sensor Simulator](https://openaccess.thecvf.com/content/CVPR2023/papers/Yang_UniSim_A_Neural_Closed-Loop_Sensor_Simulator_CVPR_2023_paper.pdf) <kbd>CVPR 2023</kbd> [simulation, Raquel]
- [GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self-Driving](https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_GeoSim_Realistic_Video_Simulation_via_Geometry-Aware_Composition_for_Self-Driving_CVPR_2021_paper.pdf) <kbd>CVPR 2021</kbd>
45 changes: 45 additions & 0 deletions paper_notes/drive_gan.md
@@ -0,0 +1,45 @@
# [DriveGAN: Towards a Controllable High-Quality Neural Simulation](https://arxiv.org/abs/2104.15060)

_March 2024_

tl;dr: A neural simulator with a disentangled latent space, based on an adversarially trained VAE encoder and an RNN-style dynamics model.

#### Overall impression
DriveGAN uses a VAE to map pixels into a latent space. GAN-style adversarial training is used to train the VAE, hence the name DriveGAN.

The proposed architecture is a very general one for a world model, and is actually very similar to more recent works such as [GAIA-1](gaia_1.md) and [Genie](genie.md). The original World Model by Schmidhuber is also based on a VAE and an RNN. Over the years, the encoder/decoder has evolved from VAE + GAN to VQ-VAE + diffusion models, and the dynamics model has evolved from RNN to Transformer-based, GPT-like next-token prediction. It is interesting to see how new techniques shine in this relatively old (although only a few years old) framework. Two advances make the difference: more powerful and scalable modules, and much more data.

The main innovation of the paper is the disentanglement of the latent representation into a spatial-agnostic **theme** and a spatial-aware **content** at the encoding stage, and the further disentanglement of **content** into action-**dependent** and action-**independent** parts in the dynamics engine.
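
To make the encoding-stage disentanglement concrete, here is a minimal PyTorch sketch of an encoder that emits a spatial-agnostic theme vector and a spatial-aware content grid. All module names, layer sizes, and the global-pooling trick are my illustrative assumptions, not the paper's exact architecture (the paper reports an 1152-dimensional latent overall; the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """Toy VAE-style encoder splitting the latent into a spatial-agnostic
    theme vector and a spatial-aware content grid (illustrative only)."""
    def __init__(self, theme_dim=128, content_dim=64, grid=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(grid),  # content keeps a grid x grid spatial layout
        )
        # Theme is forced to be one single vector: all spatial dims are pooled away.
        self.theme_head = nn.Linear(128, 2 * theme_dim)          # mu and logvar
        self.content_head = nn.Conv2d(128, 2 * content_dim, 1)  # per-cell mu and logvar

    def forward(self, x):
        h = self.backbone(x)  # (B, 128, grid, grid)
        theme_mu, theme_logvar = self.theme_head(h.mean(dim=(2, 3))).chunk(2, dim=-1)
        content_mu, content_logvar = self.content_head(h).chunk(2, dim=1)
        # Standard VAE reparameterization for both latents.
        theme = theme_mu + torch.randn_like(theme_mu) * (0.5 * theme_logvar).exp()
        content = content_mu + torch.randn_like(content_mu) * (0.5 * content_logvar).exp()
        return theme, content  # (B, theme_dim) and (B, content_dim, grid, grid)
```

Swapping the theme vector between two encoded frames while keeping their content grids would then be the kind of controllable knob (weather, background style) described above.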

The controllability of DriveGAN is achieved via careful architecture design. More recent approaches instead rely on scale and more unified interfaces (e.g., natural language).

The action of the agent is recovered by training another model, and the recovered actions are then used to reproduce the scene. This is similar to the idea of [VPT](vpt.md). In a way, it verifies the controllability and geometric consistency of the simulation.
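
A minimal sketch of such an action-recovery model, assuming a simple two-frame regressor; the names, shapes, and the 3-dim action space are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    """Predicts the action taken between two consecutive frames,
    in the spirit of an inverse dynamics model (illustrative only)."""
    def __init__(self, action_dim=3):  # e.g. steering, throttle, brake (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        # Stack the two frames along channels and regress the action between them.
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

# Action-consistency check: drive the simulator with ground-truth actions,
# then verify the predictor recovers them from the generated frames, e.g.
# error = (predictor(sim[:-1], sim[1:]) - gt_actions).pow(2).mean()
```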

The paper has a very nice discussion of what a neural simulator should look like. First, the generations have to be realistic; second, they need to be faithful to the action sequence used to produce them.

#### Key ideas
- Controllability via a disentangled latent representation
  - 1152-dimensional latent space
  - Theme: weather, background color
  - Content: spatial info
    - Action-dependent: layout
    - Action-independent: object types
- Latent representation
  - Uses a VAE and a GAN to learn a latent space of images, enabling high-resolution, high-fidelity frame synthesis conditioned on agent actions. The GAN comes into play in the decoding stage. --> This is upgraded to diffusion models in more recent works.
  - The **disentanglement** is achieved by forcing the theme latent to be one single vector without spatial dimensions.
- Dynamics engine based on RNN
  - The disentanglement is achieved by using two RNNs, forcing one to be conditioned on the action and the other not (see the sketch after this list).
  - The neural network seems to learn meaningful representations well aligned with the intention of the architecture design, which is notable given that SGD will cheat and use shortcuts whenever possible.
- Multistage Training
  - Encoder/decoder training
  - Dynamics engine training with the encoder/decoder frozen.
- Differentiable simulation: varying the disentangled latent vectors to reproduce or generate new content in a controllable way
- Eval
  - FVD
  - **Action consistency**: the action of the agent can be predicted by feeding two images from real videos into a dedicated model (cf. the action-predictor sketch above). This model can then be run on simulated images (generated with ground-truth actions) to verify action consistency.
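
To make the two-RNN disentanglement in the dynamics engine concrete, here is a minimal sketch; the GRU cells, dimensions, and the concat-of-halves output are my assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class DynamicsEngine(nn.Module):
    """Two recurrent cells over the content latent: one sees the action
    (action-dependent, layout-like), the other never does
    (action-independent, object-type-like). Illustrative only."""
    def __init__(self, content_dim=1024, action_dim=3, hidden=512):
        super().__init__()
        self.dep_rnn = nn.GRUCell(content_dim + action_dim, hidden)  # conditioned on action
        self.indep_rnn = nn.GRUCell(content_dim, hidden)             # never sees the action
        self.to_dep = nn.Linear(hidden, content_dim // 2)
        self.to_indep = nn.Linear(hidden, content_dim // 2)

    def step(self, z_content, action, h_dep, h_indep):
        h_dep = self.dep_rnn(torch.cat([z_content, action], dim=-1), h_dep)
        h_indep = self.indep_rnn(z_content, h_indep)
        # Next content latent = concat of the action-dependent and -independent halves.
        z_next = torch.cat([self.to_dep(h_dep), self.to_indep(h_indep)], dim=-1)
        return z_next, h_dep, h_indep
```

Rolling this out autoregressively and decoding each predicted latent gives the neural simulation; per the multistage training above, the encoder/decoder would be frozen while this module is trained.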

#### Technical details
- Summary of technical details

#### Notes
- Questions and notes on how to improve/revise the current work
