Code for the paper Video Occupancy Models. The repo includes three ways of quantizing the input video frames: `vae`, which uses a VQ-VAE; `dino`, which uses quantized DINO; and `musik`, which uses quantized multi-step inverse dynamics.
This is a PyTorch/GPU implementation of the paper Video Occupancy Models:
```bibtex
@Article{VideoOccupancyModels2024,
  author  = {Manan Tomar and Philippe Hansen-Estruch and Philip Bachman and Alex Lamb and John Langford and Matthew E. Taylor and Sergey Levine},
  journal = {arXiv:2407.09533},
  title   = {Video Occupancy Models},
  year    = {2024},
}
```
The main packages are listed in the requirements.txt file. This code has been tested in a virtual environment with Python 3.8, using the package versions listed in that file.
The following table provides the pre-trained model checkpoints and datasets used in the paper:
| | Cheetah | Walker |
|---|---|---|
| VQ-VAE fine-tuned model checkpoint | download | download |
| DINO latent datasets | link | |
| VQ-VAE latent datasets | link | link |
You will need to download the contents of this folder and place them one directory above where this repo is located. The folder contains model descriptions for using a VQ-VAE model from the taming-transformers codebase.
Run `train_vq_vae_voc.py` to train a VOC model on stored VQ-VAE latents. If you want to train both the VQ-VAE and the VOC model on pixel data, run `train_pixel_vq_vae_voc.py`. If you want to create your own latents by training a VQ-VAE on a custom dataset, use the `collect_latents()` and `train_vq_latents()` methods in `save_vq_codes.py` (a rough usage sketch follows).
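For reference, here is a minimal usage sketch; the argument names below (dataset path and output directory) are assumptions for illustration only, so check the actual signatures in `save_vq_codes.py` before running.

```python
# Hypothetical usage sketch -- the keyword arguments below are assumptions,
# not the actual API; see save_vq_codes.py for the real signatures.
from save_vq_codes import collect_latents, train_vq_latents

# Train / fine-tune the VQ-VAE on frames from a custom dataset (placeholder path).
train_vq_latents(data_dir="path/to/custom_frames")

# Encode the dataset into discrete VQ-VAE codes and store them for VOC training.
collect_latents(data_dir="path/to/custom_frames", save_dir="path/to/vq_latents")
```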
We use a quantized version of DINO from BEiT-v2. You will need to download this DINO model file and place it one directory above where this repo is located.
Run `train_vq_dino_voc.py` to train a VOC model on stored DINO latents. Again, if you want to create your own latents by running a quantized version of DINO on a custom dataset, use the `collect_latents()` method in `save_dino_codes.py` (see the sketch below).
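As above, a minimal sketch is given here; the keyword arguments are assumptions, so consult `save_dino_codes.py` for the real signature.

```python
# Hypothetical usage sketch -- argument names are assumptions; see save_dino_codes.py.
from save_dino_codes import collect_latents

# Encode a custom frame dataset into quantized DINO codes for VOC training.
collect_latents(data_dir="path/to/custom_frames", save_dir="path/to/dino_latents")
```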
If action data is also available, we use a quantized multi-step inverse kinematics (MUSIK) objective to train the representation. Run `train_vq_musik_voc.py` to train a VOC model along with the MUSIK objective on pixel data.
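For intuition, below is a minimal PyTorch sketch of a multi-step inverse objective: the encoder is trained so that the pair of latents (z_t, z_{t+k}) predicts the first action a_t. This is an illustrative sketch, not the implementation in `train_vq_musik_voc.py`; the module names, hidden sizes, and discrete-action assumption are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MusikHead(nn.Module):
    """Sketch of a multi-step inverse predictor (illustrative, not the repo's code)."""

    def __init__(self, latent_dim: int, num_actions: int):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.Linear(2 * latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, z_t: torch.Tensor, z_tk: torch.Tensor) -> torch.Tensor:
        # Predict logits over the first action from the current and k-step-ahead latents.
        return self.predictor(torch.cat([z_t, z_tk], dim=-1))

def musik_loss(head: MusikHead, z_t: torch.Tensor, z_tk: torch.Tensor, a_t: torch.Tensor) -> torch.Tensor:
    # Cross-entropy between the predicted and the actually taken first action.
    return F.cross_entropy(head(z_t, z_tk), a_t)
```

Gradients from this loss flow back into the encoder producing `z_t` and `z_tk`, which is what shapes the representation when actions are available.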