# [Drive-WM: Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving](https://arxiv.org/abs/2311.17918)

_February 2024_

tl;dr: First consistent, controllable multiview video generation for autonomous driving.

#### Overall impression
The main contributions of the paper are **multiview**-consistent video generation and the application of this world model to planning, through **tree search** and **OOD planning recovery**.

Drive-WM generates future videos conditioned on past videos, text, actions, and vectorized perception results, x_{t+1} ~ f(x_t, a_t). It does NOT predict actions. In this way it is very similar to [GAIA-1](gaia_1.md), but extends GAIA-1 with multi-camera video generation. It is also conditioned on vectorized perception output, like [DriveDreamer](drive_dreamer.md).

[Def] In a broad sense, a world model is a model that learns a general representation of the world and predicts the future world state resulting from a sequence of actions. In autonomous driving and robotics, a world model is a video prediction model conditioned on past video and actions. In this sense, Sora generates videos conditioned on text and video input; since qualitative actions can be expressed as text, Sora can also qualify as a world model. The usage of a world model is twofold: to act as a neural simulator (for closed-loop training), and to act as a strong feature extractor for policy finetuning.
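
A minimal interface sketch of what x_{t+1} ~ f(x_t, a_t) implies for such a model, assuming a generic action-conditioned video predictor; the class and method names below are hypothetical, not from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

import numpy as np


@dataclass
class WorldState:
    """Observation at time t: here simply the stack of past multiview frames."""
    frames: np.ndarray  # (T, num_cams, H, W, 3)


class WorldModel:
    """Hypothetical interface for an action-conditioned video world model.

    It predicts future observations from past observations and an action;
    it does NOT predict the action itself (that is the planner's job).
    """

    def predict(self, state: WorldState, action: np.ndarray,
                text: str = "", layout: Optional[np.ndarray] = None) -> WorldState:
        """Return x_{t+1} ~ f(x_t, a_t), optionally conditioned on text and a BEV layout."""
        raise NotImplementedError

    def rollout(self, state: WorldState, actions: Sequence[np.ndarray]) -> List[WorldState]:
        """Repeated prediction, as used when the model serves as a neural simulator."""
        states = []
        for action in actions:
            state = self.predict(state, action)
            states.append(state)
        return states
```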

Video prediction can be regarded as a special form of video generation, conditioned on past observation. If the video prediction can be controlled by actions (or by their qualitative form as text), then the video prediction model is a world model.

For a (world) model that does not predict actions, it may act as a neural simulator, but it may not learn a representation rich enough to be finetuned for policy prediction.

It seems that the world model depends heavily on external crutches such as view factorization and BEV layout. It does NOT learn geometric consistency through large-scale model training like [GAIA-1](gaia_1.md) or Sora.

#### Key ideas
- Multiview Video Diffusion Model
  - Image diffusion model, trained first. Initialized from a Stable Diffusion ckpt.
  - Temporal encoding layers (as in VideoLDM) and multiview encoding layers. These two are trained later with the image diffusion model frozen.
  - Principle: to ensure consistency across one dimension (temporal or cross-camera) there must be info exchange along that dim (see the layer sketch after this list).
- **Factorization** of joint multiview modeling
  - Divide all camera views into reference (anchor) views and stitched (interpolated) views. Views belonging to the same type do not overlap with each other and can be generated independently (see the two-stage sampling sketch after this list). --> Sort of; actually some cars will span across more than 2 cameras.
  - The factorization is a good engineering trick to ensure cross-camera consistency and only applies to the multicam config of autonomous driving cars. --> It is NOT general enough to guarantee geometric consistency in a wider sense.
  - The factorization significantly boosts multiview consistency in terms of point matching.
- Unified conditional generation
  - Initial context frames, text, ego action, 3D bboxes, and BEV maps are used to condition/control the generation of multiview videos. All of this info is converted to d-dim features and concatenated (the condition cross-attention appears in the layer sketch after this list).
  - The BEV layout is first projected to image space, with each class rendered in a different color. --> The BEV layout is very important to the model's high performance. This means that the video model does not necessarily learn the physical rules of the world, such as that cars cannot hit the curb, etc.
  - The generation of future frames is not conditioned on previously generated frames. --> **The video generation is NOT autoregressive.** Maybe due to training limitations.
- WM for planning via tree-based rollout with actions (see the rollout sketch after this list).
  - For each sampled action, a future frame is generated.
  - Based on the generated frame, perception is performed and the results are evaluated in terms of a map reward (stay away from the curb and stick to the centerline) and an object reward (stay away from other objects).
  - The action with max reward is selected, and the planning tree steps forward and plans the subsequent trajectory.
  - There are 3 commands used to explore the planning tree: turn left, turn right and go straight. --> This is a bit too coarse.
- OOD recovery by finetuning the planner with generated OOD videos, supervised by the trajectory in which the ego drives back to the lane.
- Data curation
  - The training dataset is rebalanced by re-sampling rare ego actions. Speed and steering angle are divided into multiple bins, and the data are sampled accordingly (see the rebalancing sketch after this list).
- Eval
  - Multicam consistency: keypoint matching score.
  - Video quality: FID (Fréchet Inception Distance) and FVD.
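
Below is a rough sketch of one denoiser block along the lines of the first and third bullets: frozen image-diffusion (spatial) layers, newly added temporal and multiview attention, and cross-attention to the concatenated d-dim condition tokens. Module choices, tensor layout, and dimensions are my own assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class MultiviewVideoBlock(nn.Module):
    """Hypothetical denoiser block: frozen spatial (image diffusion) layer, plus
    trainable temporal and multiview attention so information is exchanged along
    exactly the dimension (time or camera) that must stay consistent, and
    cross-attention to the concatenated condition tokens."""

    def __init__(self, d: int = 320, heads: int = 8):
        super().__init__()
        self.spatial = nn.Conv2d(d, d, 3, padding=1)  # stand-in for the frozen SD layers
        for p in self.spatial.parameters():           # image backbone stays frozen
            p.requires_grad_(False)
        self.temporal_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.multiview_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cond_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, K cameras, T frames, C, H, W), cond: (B, N condition tokens, C)
        B, K, T, C, H, W = x.shape
        x = self.spatial(x.reshape(B * K * T, C, H, W)).reshape(B, K, T, C, H, W)

        # Temporal attention: tokens attend over the T axis, per camera and pixel.
        t = x.permute(0, 1, 4, 5, 2, 3).reshape(B * K * H * W, T, C)
        t = t + self.temporal_attn(t, t, t)[0]

        # Multiview attention: tokens attend over the K axis, per frame and pixel.
        v = t.reshape(B, K, H, W, T, C).permute(0, 2, 3, 4, 1, 5).reshape(B * H * W * T, K, C)
        v = v + self.multiview_attn(v, v, v)[0]

        # Cross-attention to the unified conditions (text, ego action, 3D boxes,
        # BEV layout rendered to image space), each already embedded to d dims.
        q = v.reshape(B, H * W * T * K, C)
        q = q + self.cond_attn(q, cond, cond)[0]

        return q.reshape(B, H, W, T, K, C).permute(0, 4, 3, 5, 1, 2)  # back to (B, K, T, C, H, W)
```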
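
The factorization bullet amounts to a two-stage sampling order, sketched below. The split of a 6-camera rig into reference and stitched views is illustrative (not necessarily the paper's exact grouping), and the samplers are placeholders for the actual diffusion model.

```python
# Illustrative split of a 6-camera rig into non-overlapping reference (anchor)
# views and stitched (interpolated) views; the exact grouping may differ.
REFERENCE_VIEWS = ["CAM_FRONT", "CAM_BACK_LEFT", "CAM_BACK_RIGHT"]
STITCHED_VIEWS = {
    "CAM_FRONT_LEFT":  ("CAM_BACK_LEFT", "CAM_FRONT"),
    "CAM_FRONT_RIGHT": ("CAM_FRONT", "CAM_BACK_RIGHT"),
    "CAM_BACK":        ("CAM_BACK_RIGHT", "CAM_BACK_LEFT"),
}


def generate_multiview(sample_joint, sample_between, conditions):
    """sample_joint and sample_between are placeholder diffusion samplers."""
    # Stage 1: reference views do not overlap, so they can be denoised jointly
    # without fighting over shared image content. Returns {view_name: video}.
    videos = sample_joint(views=REFERENCE_VIEWS, cond=conditions)

    # Stage 2: each stitched view is denoised conditioned on its two neighbouring,
    # already-generated reference views, which pins down the overlapping regions
    # and gives cross-camera consistency.
    for view, (left_ref, right_ref) in STITCHED_VIEWS.items():
        videos[view] = sample_between(
            view=view,
            neighbors=(videos[left_ref], videos[right_ref]),
            cond=conditions,
        )
    return videos
```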
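
The planning loop is a greedy tree expansion. A simplified sketch, assuming a `world_model.predict` and a `perception` callable; the reward terms below are placeholder implementations of the map and object rewards mentioned in the bullets.

```python
COMMANDS = ["turn_left", "go_straight", "turn_right"]  # the 3 coarse commands


def map_reward(det: dict) -> float:
    """Placeholder: reward staying on the drivable area and near the centerline."""
    return float(det.get("on_drivable", 0)) - float(det.get("dist_to_centerline_m", 0.0))


def object_reward(det: dict) -> float:
    """Placeholder: reward keeping distance to other agents (capped)."""
    return min(float(det.get("min_agent_distance_m", 10.0)), 10.0)


def plan_with_world_model(state, world_model, perception, horizon=4):
    """At each depth, imagine one future per command, score it with perception-based
    rewards on the imagined frame, and keep only the best-scoring branch."""
    trajectory = []
    for _ in range(horizon):
        best = None
        for cmd in COMMANDS:
            imagined = world_model.predict(state, action=cmd)  # imagined next multiview frame
            det = perception(imagined)                         # 3D boxes + map elements on the imagined frame
            reward = map_reward(det) + object_reward(det)
            if best is None or reward > best[0]:
                best = (reward, cmd, imagined)
        _, cmd, state = best  # commit the best command and step the tree forward
        trajectory.append(cmd)
    return trajectory
```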
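
A small sketch of the data-rebalancing step, assuming clips are annotated with ego speed and steering angle; the bin edges and the upsample-to-largest-bucket rule are my assumptions.

```python
import random
from collections import defaultdict


def rebalance(clips, speed_edges=(2, 5, 10, 20), steer_edges=(-0.3, -0.05, 0.05, 0.3)):
    """clips: list of dicts with 'speed' (m/s) and 'steer' (steering angle).
    Rare ego actions (sharp turns, hard braking) are upsampled so every
    (speed bin, steer bin) bucket is as common as the largest one."""
    def bucket(value, edges):
        return sum(value >= e for e in edges)

    groups = defaultdict(list)
    for clip in clips:
        groups[(bucket(clip["speed"], speed_edges),
                bucket(clip["steer"], steer_edges))].append(clip)

    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(random.choices(g, k=target - len(g)))
    random.shuffle(balanced)
    return balanced
```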

#### Technical details
- It shows an example of how to drive on a road with puddles, very similar to the [tesla FSD V12 demo](https://x.com/AIDRIVR/status/1760841783708418094).

#### Notes
- The paper includes a nice summary of the Dreamer series of papers on P3.
- The code builds on that of [VideoLDM](video_ldm.md) (Align Your Latents).
- The cross-view consistency is very nice (showcasing the effectiveness of factorized multiview modeling), but the temporal consistency is not that great, with the appearance of vehicles changing throughout the video. This may be related to the fact that future frame generation is only conditioned on the first frame and not on previously generated frames.
- Q: I wonder how much of the controllability attributed to the action actually comes from the BEV vectorized results. --> The BEV layout is given as a static result and will not change with different actions. So indeed the video generation is conditioned on the action. Yet the action is very hard to learn, and can only be learned once the video diffusion model has converged.