diff --git a/3dda_env.yaml b/3dda_env.yaml
new file mode 100644
index 0000000..e7fd7da
--- /dev/null
+++ b/3dda_env.yaml
@@ -0,0 +1,34 @@
+name: equ_act
+channels:
+  - pytorch
+  - nvidia
+  - conda-forge
+  - defaults
+dependencies:
+  - pip
+  - pip:
+    - healpy
+    - git+https://github.com/openai/CLIP.git
+    - pillow
+    - typed-argument-parser
+    - tqdm
+    - transformers
+    - absl-py
+    - matplotlib
+    - scipy
+    - tensorboard
+    - opencv-python
+    - blosc
+    - setuptools==57.5.0
+    - beautifulsoup4
+    - bleach>=6.0.0
+    - defusedxml
+    - jinja2>=3.0
+    - jupyter-core>=4.7
+    - jupyterlab-pygments
+    - mistune==2.0.5
+    - nbclient>=0.5.0
+    - nbformat>=5.7
+    - pandocfilters>=1.4.1
+    - tinycss2
+    - traitlets>=5.1
\ No newline at end of file
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..641b395
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 Tsung-Wei Ke, Nikolaos Gkanatsios and Katerina Fragkiadaki
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..53e9334
--- /dev/null
+++ b/README.md
@@ -0,0 +1,157 @@
+[//]: # ([![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/3d-diffuser-actor-policy-diffusion-with-3d/zero-shot-generalization-on-calvin)](https://paperswithcode.com/sota/zero-shot-generalization-on-calvin?p=3d-diffuser-actor-policy-diffusion-with-3d))
+
+[//]: # ([![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/3d-diffuser-actor-policy-diffusion-with-3d/robot-manipulation-on-rlbench)](https://paperswithcode.com/sota/robot-manipulation-on-rlbench?p=3d-diffuser-actor-policy-diffusion-with-3d))
+
+
+# Placeholder for EquAct, ≥ 3D Diffuser Actor. Cogito, ergo sum.
+
+[//]: # (By [Tsung-Wei Ke*](https://twke18.github.io/), [Nikolaos Gkanatsios*](https://nickgkan.github.io/) and [Katerina Fragkiadaki](https://www.cs.cmu.edu/~katef/))
+
+[//]: # (Official implementation of ["3D Diffuser Actor: Policy Diffusion with 3D Scene Representations"](https://arxiv.org/abs/2402.10885).)
+
+[//]: # (This code base also includes our re-implementation of ["Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation"](https://arxiv.org/abs/2306.17817). We provide trained model weights for both methods.)
+
+[//]: # ()
+[//]: # (![teaser](fig/teaser.gif))
+
+We marry diffusion policies and 3D scene representations for robot manipulation. Diffusion policies learn the action distribution conditioned on the robot and environment state using conditional diffusion models. They have recently been shown to outperform both deterministic and alternative state-conditioned action distribution learning methods. 3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have been shown to generalize better than their 2D counterparts across camera viewpoints. We unify these two lines of work and present 3D Diffuser Actor, a neural policy architecture that, given a language instruction, builds a 3D representation of the visual scene and conditions on it to iteratively denoise 3D rotations and translations for the robot’s end-effector. At each denoising iteration, our model represents end-effector pose estimates as 3D scene tokens and predicts the 3D translation and rotation error for each of them, by featurizing them using 3D relative attention to other 3D visual and language tokens. 3D Diffuser Actor sets a new state-of-the-art on RLBench with an absolute performance gain of 16.3% over the current SOTA on a multi-view setup and an absolute gain of 13.1% on a single-view setup. On the CALVIN benchmark, it outperforms the current SOTA in the setting of zero-shot unseen scene generalization by being able to successfully run 0.2 more tasks, a 7% relative increase. It also works in the real world from a handful of demonstrations. We ablate our model’s architectural design choices, such as 3D scene featurization and 3D relative attentions, and show they all help generalization. Our results suggest that 3D scene representations and powerful generative modeling are keys to efficient robot learning from demonstrations.
+
+
+# Model overview and stand-alone usage
+To facilitate fast development on top of our model, we provide here an [overview of our implementation of 3D Diffuser Actor](./docs/OVERVIEW.md).
+
+The model can be independently installed and used as a stand-alone package.
+```
+> pip install -e .
+# import the model
+> from diffuser_actor import DiffuserActor, Act3D
+> model = DiffuserActor(...)
+```
+
+# Installation
+Create a conda environment with the following commands. We recommend Mambaforge instead of the standard Anaconda distribution for faster installation:
+https://github.com/conda-forge/miniforge#mambaforge
+
+```
+# initiate conda env
+> conda update conda
+> mamba env create -f equiformerv2_env.yaml
+> mamba env update -f 3dda_env.yaml
+> conda activate equ_act
+
+# install diffuser
+#> pip install diffusers["torch"]
+
+# install dgl (https://www.dgl.ai/pages/start.html)
+> pip install dgl==1.1.3+cu116 -f https://data.dgl.ai/wheels/cu116/dgl-1.1.3%2Bcu116-cp38-cp38-manylinux1_x86_64.whl
+
+# install flash attention (https://github.com/Dao-AILab/flash-attention#installation-and-features)
+#> pip install packaging
+#> pip install ninja
+#???> pip install flash-attn --no-build-isolation
+```
+
+### Install CALVIN locally
+
+Remember to use the latest `calvin_env` module, which fixes bugs in `turn_off_led`. See this [post](https://github.com/mees/calvin/issues/32#issuecomment-1363352121) for details.
+```
+> git clone --recurse-submodules https://github.com/mees/calvin.git
+> export CALVIN_ROOT=$(pwd)/calvin
+> cd calvin
+> cd calvin_env; git checkout main
+> cd ..
+> ./install.sh; cd ..
+```
+
+### Install RLBench locally
+```
+# Install Open3D
+> pip install open3d
+
+> mkdir CoppeliaSim;
+> cd CoppeliaSim/
+> wget https://www.coppeliarobotics.com/files/V4_1_0/CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz
+> tar -xf CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz;
+> echo "export COPPELIASIM_ROOT=$(pwd)/CoppeliaSim_Edu_V4_1_0_Ubuntu20_04" >> $HOME/.bashrc;
+> echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:\$COPPELIASIM_ROOT" >> $HOME/.bashrc;
+> echo "export QT_QPA_PLATFORM_PLUGIN_PATH=\$COPPELIASIM_ROOT" >> $HOME/.bashrc;
+> source $HOME/.bashrc;
+# Install PyRep (https://github.com/stepjam/PyRep?tab=readme-ov-file#install)
+> git clone https://github.com/stepjam/PyRep.git;
+> cd PyRep; pip install -r requirements.txt; pip install -e .; cd ..
+
+# Install RLBench (Note: there are different forks of RLBench)
+# PerAct setup
+> git clone https://github.com/MohitShridhar/RLBench.git
+> cd RLBench; git checkout -b peract --track origin/peract; pip install -r requirements.txt; pip install -e .; cd ..;
+```
+
+Remember to modify the success condition of the `close_jar` task in RLBench, as the original condition is incorrect. See this [pull request](https://github.com/MohitShridhar/RLBench/pull/1) for more detail.
+
+# Data Preparation
+
+See [Preparing RLBench dataset](./docs/DATA_PREPARATION_RLBENCH.md) and [Preparing CALVIN dataset](./docs/DATA_PREPARATION_CALVIN.md).
+
+
+### (Optional) Encode language instructions
+
+We provide our scripts for encoding language instructions with the CLIP text encoder on CALVIN. Alternatively, you can download the pre-encoded instructions for CALVIN and RLBench ([Link](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/instructions.zip)). Put the encoded instructions under the root directory.
+```
+> python data_preprocessing/preprocess_calvin_instructions.py --output instructions/calvin_task_ABC_D/validation.pkl --model_max_length 16 --annotation_path ./calvin/dataset/task_ABC_D/validation/lang_annotations/auto_lang_ann.npy
+
+> python data_preprocessing/preprocess_calvin_instructions.py --output instructions/calvin_task_ABC_D/training.pkl --model_max_length 16 --annotation_path ./calvin/dataset/task_ABC_D/training/lang_annotations/auto_lang_ann.npy
+```
+
+**Note:** We have updated our scripts for encoding language instructions on RLBench.
+```
+> python data_preprocessing/preprocess_rlbench_instructions.py --tasks place_cups close_jar insert_onto_square_peg light_bulb_in meat_off_grill open_drawer place_shape_in_shape_sorter place_wine_at_rack_location push_buttons put_groceries_in_cupboard put_item_in_drawer put_money_in_safe reach_and_drag slide_block_to_color_target stack_blocks stack_cups sweep_to_dustpan_of_size turn_tap --output instructions.pkl
+```
+
+# Model Zoo
+
+We host the model weights on Hugging Face.
+
+|| RLBench (PerAct) | RLBench (GNFactor) | CALVIN |
+|--------|--------|--------|--------|
+| 3D Diffuser Actor | [Weights](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/diffuser_actor_peract.pth) | [Weights](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/diffuser_actor_gnfactor.pth) | [Weights](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/diffuser_actor_calvin.pth) |
+| Act3D | [Weights](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/act3d_peract.pth) | [Weights](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/act3d_gnfactor.pth) | N/A |
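+
+To fetch a checkpoint from the table above programmatically, a sketch along the following lines should work. This is not part of the official instructions: the `huggingface_hub` package is an extra dependency (it is not listed in `3dda_env.yaml`), and you can equally download the files manually from the links above.
+
+```
+# Minimal sketch (assumes `pip install huggingface_hub`): download one of the
+# released checkpoints into train_logs/, where the evaluation scripts expect it.
+from huggingface_hub import hf_hub_download
+
+ckpt_path = hf_hub_download(
+    repo_id="katefgroup/3d_diffuser_actor",   # repo hosting the weights above
+    filename="diffuser_actor_peract.pth",     # RLBench (PerAct) checkpoint
+    local_dir="train_logs",
+)
+print("checkpoint saved to", ckpt_path)
+```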
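+
+As a quick sanity check after the optional instruction-encoding step above, the precomputed pickles can be inspected directly. The structures assumed below mirror what `data_preprocessing/preprocess_calvin_instructions.py` and `data_preprocessing/preprocess_rlbench_instructions.py` write; adjust the paths to wherever you stored the files.
+
+```
+# Minimal sketch: peek at the precomputed CLIP instruction embeddings.
+import pickle
+
+# CALVIN: a dict with parallel lists of embeddings and raw instruction strings.
+with open("instructions/calvin_task_ABC_D/training.pkl", "rb") as f:
+    calvin = pickle.load(f)
+print(len(calvin["text"]), "instructions;",
+      "first embedding shape:", tuple(calvin["embeddings"][0].shape))
+
+# RLBench: a nested dict, task -> variation -> tensor of token embeddings.
+with open("instructions.pkl", "rb") as f:
+    rlbench = pickle.load(f)
+for task, variations in sorted(rlbench.items())[:3]:
+    print(task, "variations:", sorted(variations.keys())[:5])
+```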
+
+### Evaluate the pre-trained weights
+First, download the weights and put them under `train_logs/`.
+
+* For RLBench, run the bash scripts to test the policy. See [Getting started with RLBench](./docs/GETTING_STARTED_RLBENCH.md#step-3-test-the-policy) for details.
+* For CALVIN, you can run [this bash script](./scripts/test_trajectory_calvin.sh).
+
+**Important note:** Our released model weights of 3D Diffuser Actor assume input quaternions are in `wxyz` format. However, we initially did not notice that the CALVIN and RLBench simulations use different quaternion formats (`wxyz` and `xyzw`). We have updated our code base with an additional argument `quaternion_format` to switch between the two formats (see the conversion sketch at the end of this README). We have verified the change by re-training and testing 3D Diffuser Actor on GNFactor with `xyzw` quaternions; the model achieves performance similar to the released checkpoint. Please see this [post](https://github.com/nickgkan/3d_diffuser_actor/issues/3#issue-2164855979) for more details.
+
+For users who train 3D Diffuser Actor from scratch, we have updated the training scripts with the correct `xyzw` quaternion format. For users who test our released models, we keep the `wxyz` quaternion format in the testing scripts ([PerAct](./online_evaluation_rlbench/eval_peract.sh), [GNFactor](./online_evaluation_rlbench/eval_gnfactor.sh)).
+
+
+# Getting started
+
+See [Getting started with RLBench](./docs/GETTING_STARTED_RLBENCH.md) and [Getting started with CALVIN](./docs/GETTING_STARTED_CALVIN.md).
+
+
+# Citation
+If you find this code useful for your research, please consider citing our paper ["3D Diffuser Actor: Policy Diffusion with 3D Scene Representations"](https://arxiv.org/abs/2402.10885).
+```
+@article{3d_diffuser_actor,
+  author = {Ke, Tsung-Wei and Gkanatsios, Nikolaos and Fragkiadaki, Katerina},
+  title = {3D Diffuser Actor: Policy Diffusion with 3D Scene Representations},
+  journal = {arXiv preprint arXiv:2402.10885},
+  year = {2024}
+}
+```
+
+# License
+This code base is released under the MIT License (refer to the LICENSE file for details).
+
+# Acknowledgement
+Parts of this codebase have been adapted from [Act3D](https://github.com/zhouxian/act3d-chained-diffuser) and [CALVIN](https://github.com/mees/calvin).
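+
+The quaternion note in the evaluation section above boils down to a simple re-ordering of the four components. A minimal sketch (plain NumPy, not a utility shipped with this repo) for converting between the two conventions:
+
+```
+# Minimal sketch: convert quaternions between `xyzw` ordering (as used on the
+# RLBench/PyRep side) and `wxyz` ordering. Works on a single quaternion or a
+# batch with shape (..., 4).
+import numpy as np
+
+def xyzw_to_wxyz(q):
+    # (x, y, z, w) -> (w, x, y, z)
+    return np.roll(np.asarray(q), shift=1, axis=-1)
+
+def wxyz_to_xyzw(q):
+    # (w, x, y, z) -> (x, y, z, w)
+    return np.roll(np.asarray(q), shift=-1, axis=-1)
+
+q_xyzw = np.array([0.0, 0.0, 0.0, 1.0])   # identity rotation in xyzw
+assert np.allclose(xyzw_to_wxyz(q_xyzw), [1.0, 0.0, 0.0, 0.0])
+assert np.allclose(wxyz_to_xyzw(xyzw_to_wxyz(q_xyzw)), q_xyzw)
+```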
diff --git a/data_preprocessing/episodes.json b/data_preprocessing/episodes.json new file mode 100755 index 0000000..f00345b --- /dev/null +++ b/data_preprocessing/episodes.json @@ -0,0 +1,153 @@ +{ + "max_episode_length": { + "basketball_in_hoop": 4, + "beat_the_buzz": 4, + "block_pyramid": 35, + "change_channel": 8, + "close_drawer": 2, + "close_box": 5, + "close_jar": 7, + "close_grill": 4, + "change_clock": 4, + "close_microwave": 2, + "close_laptop_lid": 4, + "close_door": 3, + "close_fridge": 2, + "empty_dishwasher": 13, + "get_ice_from_fridge": 5, + "hang_frame_on_hanger": 4, + "hit_ball_with_queue": 7, + "hockey": 7, + "insert_onto_square_peg": 5, + "insert_usb_in_computer": 5, + "lamp_off": 2, + "lamp_on": 4, + "light_bulb_out": 5, + "light_bulb_in": 7, + "lift_numbered_block": 3, + "move_hanger": 5, + "meat_off_grill": 5, + "meat_on_grill": 5, + "open_door": 4, + "open_box": 3, + "open_drawer": 3, + "open_fridge": 3, + "open_jar": 7, + "open_grill": 3, + "open_microwave": 3, + "open_oven": 5, + "open_window": 4, + "open_wine_bottle": 3, + "reach_target": 1, + "reach_and_drag": 6, + "remove_cups": 6, + "phone_on_base": 5, + "pour_from_cup_to_cup": 6, + "pick_and_lift": 4, + "pick_and_lift_small": 4, + "pick_up_cup": 3, + "place_shape_in_shape_sorter": 7, + "place_hanger_on_rack": 6, + "place_cups": 23, + "play_jenga": 3, + "plug_charger_in_power_supply": 6, + "press_switch": 2, + "put_books_on_bookshelf": 5, + "put_bottle_in_fridge": 9, + "put_knife_on_chopping_board": 4, + "put_groceries_in_cupboard": 6, + "put_knife_in_knife_block": 5, + "put_item_in_drawer": 12, + "put_money_in_safe": 5, + "put_plate_in_colored_dish_rack": 5, + "put_rubbish_in_bin": 4, + "put_tray_in_oven": 12, + "put_toilet_roll_on_stand": 5, + "put_shoes_in_box": 13, + "put_umbrella_in_umbrella_stand": 4, + "stack_wine": 5, + "stack_blocks": 23, + "stack_chairs": 11, + "stack_cups": 10, + "straighten_rope": 7, + "scoop_with_spatula": 4, + "screw_nail": 8, + "setup_checkers": 6, + "setup_chess": 5, + "slide_block_to_target": 2, + "slide_block_to_color_target": 5, + "slide_cabinet_open_and_place_cups": 9, + "solve_puzzle": 7, + "sweep_to_dustpan": 5, + "sweep_to_dustpan_of_size": 5, + "push_button": 2, + "push_buttons": 6, + "push_repeated_buttons": 8, + "take_money_out_safe": 4, + "take_umbrella_out_of_umbrella_stand": 3, + "take_cup_out_from_cabinet": 7, + "take_frame_off_hanger": 4, + "take_item_out_of_drawer": 9, + "take_lid_off_saucepan": 3, + "take_off_weighing_scales": 7, + "take_plate_off_colored_dish_rack": 5, + "take_shoes_out_of_box": 15, + "take_toilet_roll_off_stand": 4, + "take_tray_out_of_oven": 10, + "take_usb_out_of_computer": 2, + "toilet_seat_up": 3, + "toilet_seat_down": 4, + "tower": 29, + "tower2": 17, + "tower3": 6, + "tower4": 11, + "tower_sim2real": 12, + "turn_oven_on": 3, + "turn_tap": 2, + "tv_on": 8, + "unplug_charger": 2, + "water_plants": 5, + "wipe_desk": 8, + "place_wine_at_rack_location": 5 +}, +"variable_length": [ + "push_buttons", + "push_repeated_buttons", + "close_jar", + "open_jar", + "hockey", + "hit_ball_with_queue", + "lamp_on", + "put_tray_in_oven", + "solve_puzzle", + "sweep_to_dustpan", + "sweep_to_dustpan_of_size", + "take_off_weighing_scales", + "take_tray_out_of_oven", + "tower", + "tower2", + "tower3", + "tower4", + "stack_blocks", + "slide_block_to_color_target", + "place_cups", + "place_shape_in_shape_sorter", + "put_groceries_in_cupboard", + "slide_cabinet_open_and_place_cups", + "wipe_desk", + "setup_checkers", + "water_plants", + "screw_nail", + 
"plug_charger_in_power_supply", + "place_hanger_on_rack", + "open_oven", + "take_shoes_out_of_box" +], +"broken": [ + "empty_container", + "put_all_groceries_in_cupboard", + "set_the_table", + "slide_cabinet_open", + "weighing_scales" +] +} diff --git a/data_preprocessing/package_calvin.py b/data_preprocessing/package_calvin.py new file mode 100644 index 0000000..945b7de --- /dev/null +++ b/data_preprocessing/package_calvin.py @@ -0,0 +1,339 @@ +from typing import List, Optional +from pathlib import Path +import os +import pickle + +import tap +import cv2 +import numpy as np +import torch +import blosc +from PIL import Image + +from calvin_env.envs.play_table_env import get_env +from utils.utils_with_calvin import ( + keypoint_discovery, + deproject, + get_gripper_camera_view_matrix, + convert_rotation +) + + +class Arguments(tap.Tap): + traj_len: int = 16 + execute_every: int = 4 + save_path: str = './data/calvin/packaged_ABC_D' + root_dir: str = './calvin/dataset/task_ABC_D' + mode: str = 'keypose' # [keypose, close_loop] + tasks: Optional[List[str]] = None + split: str = 'training' # [training, validation] + + +def make_env(dataset_path, split): + val_folder = Path(dataset_path) / f"{split}" + env = get_env(val_folder, show_gui=False) + + return env + + +def process_datas(datas, mode, traj_len, execute_every, keyframe_inds): + """Fetch and drop datas to make a trajectory + + Args: + datas: a dict of the datas to be saved/loaded + - static_pcd: a list of nd.arrays with shape (height, width, 3) + - static_rgb: a list of nd.arrays with shape (height, width, 3) + - gripper_pcd: a list of nd.arrays with shape (height, width, 3) + - gripper_rgb: a list of nd.arrays with shape (height, width, 3) + - proprios: a list of nd.arrays with shape (7,) + mode: a string of [keypose, close_loop] + traj_len: an int of the length of the trajectory + execute_every: an int of execution frequency + keyframe_inds: an Integer array with shape (num_keyframes,) + + Returns: + the episode item: [ + [frame_ids], + [obs_tensors], # wrt frame_ids, (n_cam, 2, 3, 256, 256) + obs_tensors[i][:, 0] is RGB, obs_tensors[i][:, 1] is XYZ + [action_tensors], # wrt frame_ids, (1, 8) + [camera_dicts], + [gripper_tensors], # wrt frame_ids, (1, 8) + [trajectories] # wrt frame_ids, (N_i, 8) + [annotation_ind] # wrt frame_ids, (1,) + ] + """ + # upscale gripper camera + h, w = datas['static_rgb'][0].shape[:2] + datas['gripper_rgb'] = [ + cv2.resize(m, (w, h), interpolation=cv2.INTER_LINEAR) + for m in datas['gripper_rgb'] + ] + datas['gripper_pcd'] = [ + cv2.resize(m, (w, h), interpolation=cv2.INTER_NEAREST) + for m in datas['gripper_pcd'] + ] + static_rgb = np.stack(datas['static_rgb'], axis=0) # (traj_len, H, W, 3) + static_pcd = np.stack(datas['static_pcd'], axis=0) # (traj_len, H, W, 3) + gripper_rgb = np.stack(datas['gripper_rgb'], axis=0) # (traj_len, H, W, 3) + gripper_pcd = np.stack(datas['gripper_pcd'], axis=0) # (traj_len, H, W, 3) + rgb = np.stack([static_rgb, gripper_rgb], axis=1) # (traj_len, ncam, H, W, 3) + pcd = np.stack([static_pcd, gripper_pcd], axis=1) # (traj_len, ncam, H, W, 3) + rgb_pcd = np.stack([rgb, pcd], axis=2) # (traj_len, ncam, 2, H, W, 3)]) + rgb_pcd = rgb_pcd.transpose(0, 1, 2, 5, 3, 4) # (traj_len, ncam, 2, 3, H, W) + rgb_pcd = torch.as_tensor(rgb_pcd, dtype=torch.float32) # (traj_len, ncam, 2, 3, H, W) + + # prepare keypose actions + keyframe_indices = torch.as_tensor(keyframe_inds)[None, :] + gripper_indices = torch.arange(len(datas['proprios'])).view(-1, 1) + action_indices = torch.argmax( + 
(gripper_indices < keyframe_indices).float(), dim=1 + ).tolist() + action_indices[-1] = len(keyframe_inds) - 1 + actions = [datas['proprios'][keyframe_inds[i]] for i in action_indices] + action_tensors = [ + torch.as_tensor(a, dtype=torch.float32).view(1, -1) for a in actions + ] + + # prepare camera_dicts + camera_dicts = [{'front': (0, 0), 'wrist': (0, 0)}] + + # prepare gripper tensors + gripper_tensors = [ + torch.as_tensor(a, dtype=torch.float32).view(1, -1) + for a in datas['proprios'] + ] + + # prepare trajectories + if mode == 'keypose': + trajectories = [] + for i in range(len(action_indices)): + target_frame = keyframe_inds[action_indices[i]] + current_frame = i + trajectories.append( + torch.cat( + [ + torch.as_tensor(a, dtype=torch.float32).view(1, -1) + for a in datas['proprios'][current_frame:target_frame+1] + ], + dim=0 + ) + ) + else: + trajectories = [] + for i in range(len(gripper_tensors)): + traj = datas['proprios'][i:i+traj_len] + if len(traj) < traj_len: + traj += [traj[-1]] * (traj_len - len(traj)) + traj = [ + torch.as_tensor(a, dtype=torch.float32).view(1, -1) + for a in traj + ] + traj = torch.cat(traj, dim=0) + trajectories.append(traj) + + # Filter out datas + if mode == 'keypose': + keyframe_inds = [0] + keyframe_inds[:-1].tolist() + keyframe_indices = torch.as_tensor(keyframe_inds) + rgb_pcd = torch.index_select(rgb_pcd, 0, keyframe_indices) + action_tensors = [action_tensors[i] for i in keyframe_inds] + gripper_tensors = [gripper_tensors[i] for i in keyframe_inds] + trajectories = [trajectories[i] for i in keyframe_inds] + else: + rgb_pcd = rgb_pcd[:-1] + action_tensors = action_tensors[:-1] + gripper_tensors = gripper_tensors[:-1] + trajectories = trajectories[:-1] + + rgb_pcd = rgb_pcd[::execute_every] + action_tensors = action_tensors[::execute_every] + gripper_tensors = gripper_tensors[::execute_every] + trajectories = trajectories[::execute_every] + + # prepare frame_ids + frame_ids = [i for i in range(len(rgb_pcd))] + + # Save everything to disk + state_dict = [ + frame_ids, + rgb_pcd, + action_tensors, + camera_dicts, + gripper_tensors, + trajectories, + datas['annotation_id'] + ] + + return state_dict + + +def load_episode(env, root_dir, split, episode, datas, ann_id): + """Load episode and process datas + + Args: + root_dir: a string of the root directory of the dataset + split: a string of the split of the dataset + episode: a string of the episode name + datas: a dict of the datas to be saved/loaded + - static_pcd: a list of nd.arrays with shape (height, width, 3) + - static_rgb: a list of nd.arrays with shape (height, width, 3) + - gripper_pcd: a list of nd.arrays with shape (height, width, 3) + - gripper_rgb: a list of nd.arrays with shape (height, width, 3) + - proprios: a list of nd.arrays with shape (8,) + - annotation_id: a list of ints + """ + data = np.load(f'{root_dir}/{split}/{episode}') + + rgb_static = data['rgb_static'] # (200, 200, 3) + rgb_gripper = data['rgb_gripper'] # (84, 84, 3) + depth_static = data['depth_static'] # (200, 200) + depth_gripper = data['depth_gripper'] # (84, 84) + + # data['robot_obs'] is (15,), data['scene_obs'] is (24,) + env.reset(robot_obs=data['robot_obs'], scene_obs=data['scene_obs']) + static_cam = env.cameras[0] + gripper_cam = env.cameras[1] + gripper_cam.viewMatrix = get_gripper_camera_view_matrix(gripper_cam) + + static_pcd = deproject( + static_cam, depth_static, + homogeneous=False, sanity_check=False + ).transpose(1, 0) + static_pcd = np.reshape( + static_pcd, (depth_static.shape[0], 
depth_static.shape[1], 3) + ) + gripper_pcd = deproject( + gripper_cam, depth_gripper, + homogeneous=False, sanity_check=False + ).transpose(1, 0) + gripper_pcd = np.reshape( + gripper_pcd, (depth_gripper.shape[0], depth_gripper.shape[1], 3) + ) + + # map RGB to [-1, 1] + rgb_static = rgb_static / 255. * 2 - 1 + rgb_gripper = rgb_gripper / 255. * 2 - 1 + + # Map gripper openess to [0, 1] + proprio = np.concatenate([ + data['robot_obs'][:3], + data['robot_obs'][3:6], + (data['robot_obs'][[-1]] > 0).astype(np.float32) + ], axis=-1) + + # Put them into a dict + datas['static_pcd'].append(static_pcd) # (200, 200, 3) + datas['static_rgb'].append(rgb_static) # (200, 200, 3) + datas['gripper_pcd'].append(gripper_pcd) # (84, 84, 3) + datas['gripper_rgb'].append(rgb_gripper) # (84, 84, 3) + datas['proprios'].append(proprio) # (8,) + datas['annotation_id'].append(ann_id) # int + + +def init_datas(): + datas = { + 'static_pcd': [], + 'static_rgb': [], + 'gripper_pcd': [], + 'gripper_rgb': [], + 'proprios': [], + 'annotation_id': [] + } + return datas + + +def main(split, args): + """ + CALVIN contains long videos of "tasks" executed in order + with noisy transitions between them. The 'annotations' json contains + info on how to segment those videos. + + Original CALVIN annotations: + { + 'info': { + 'episodes': [], + 'indx': [(788072, 788136), (899273, 899337), (1427083, 1427147)] + list of tuples indicating start-end of a task + }, + 'language': { + 'ann': list of str with len=17870, instructions, + 'task': list of str with len=17870, task names, + 'emb': array (17870, 1, 384) + } + } + + Save: + state_dict = [ + frame_ids, # [0, 1, 2...] + rgb_pcd, # tensor [len(frame_ids), ncam, 2, 3, 200, 200] + action_tensors, # [tensor(1, 8)] + camera_dicts, # [{'front': (0, 0), 'wrist': (0, 0)}] + gripper_tensors, # [tensor(1, 8)] + trajectories, # [tensor(N, 8) or tensor(2, 8) if keyposes] + datas['annotation_id'] # [int] + ] + """ + annotations = np.load( + f'{args.root_dir}/{split}/lang_annotations/auto_lang_ann.npy', + allow_pickle=True + ).item() + env = make_env(args.root_dir, split) + + for anno_ind, (start_id, end_id) in enumerate(annotations['info']['indx']): + # Step 1. load episodes of the same task + len_anno = len(annotations['info']['indx']) + if args.tasks is not None and annotations['language']['task'][anno_ind] not in args.tasks: + continue + print(f'Processing {anno_ind}/{len_anno}, start_id:{start_id}, end_id:{end_id}') + datas = init_datas() + for ep_id in range(start_id, end_id + 1): + episode = 'episode_{:07d}.npz'.format(ep_id) + load_episode( + env, + args.root_dir, + split, + episode, + datas, + anno_ind + ) + + # Step 2. detect keyframes within the episode + _, keyframe_inds = keypoint_discovery(datas['proprios']) + + state_dict = process_datas( + datas, args.mode, args.traj_len, args.execute_every, keyframe_inds + ) + + # Step 3. determine scene + if split == 'training': + scene_info = np.load( + f'{args.root_dir}/training/scene_info.npy', + allow_pickle=True + ).item() + if ("calvin_scene_B" in scene_info and + start_id <= scene_info["calvin_scene_B"][1]): + scene = "B" + elif ("calvin_scene_C" in scene_info and + start_id <= scene_info["calvin_scene_C"][1]): + scene = "C" + elif ("calvin_scene_A" in scene_info and + start_id <= scene_info["calvin_scene_A"][1]): + scene = "A" + else: + scene = "D" + else: + scene = 'D' + + # Step 4. 
save to .dat file + ep_save_path = f'{args.save_path}/{split}/{scene}+0/ann_{anno_ind}.dat' + os.makedirs(os.path.dirname(ep_save_path), exist_ok=True) + with open(ep_save_path, "wb") as f: + f.write(blosc.compress(pickle.dumps(state_dict))) + + env.close() + + +if __name__ == "__main__": + args = Arguments().parse_args() + main(args.split, args) diff --git a/data_preprocessing/package_rlbench.py b/data_preprocessing/package_rlbench.py new file mode 100755 index 0000000..05556d4 --- /dev/null +++ b/data_preprocessing/package_rlbench.py @@ -0,0 +1,157 @@ +import random +import itertools +from typing import Tuple, Dict, List +import pickle +from pathlib import Path +import json + +import blosc +from tqdm import tqdm +import tap +import torch +import numpy as np +import einops +from rlbench.demo import Demo + +from utils.utils_with_rlbench import ( + RLBenchEnv, + keypoint_discovery, + obs_to_attn, + transform, +) + + +class Arguments(tap.Tap): + data_dir: Path = Path(__file__).parent / "c2farm" + seed: int = 2 + tasks: Tuple[str, ...] = ("stack_wine",) + cameras: Tuple[str, ...] = ("left_shoulder", "right_shoulder", "wrist", "front") + image_size: str = "256,256" + output: Path = Path(__file__).parent / "datasets" + max_variations: int = 199 + offset: int = 0 + num_workers: int = 0 + store_intermediate_actions: int = 1 + + +def get_attn_indices_from_demo( + task_str: str, demo: Demo, cameras: Tuple[str, ...] +) -> List[Dict[str, Tuple[int, int]]]: + frames = keypoint_discovery(demo) + + frames.insert(0, 0) + return [{cam: obs_to_attn(demo[f], cam) for cam in cameras} for f in frames] + + +def get_observation(task_str: str, variation: int, + episode: int, env: RLBenchEnv, + store_intermediate_actions: bool): + demos = env.get_demo(task_str, variation, episode) + demo = demos[0] + + key_frame = keypoint_discovery(demo) + key_frame.insert(0, 0) + + keyframe_state_ls = [] + keyframe_action_ls = [] + intermediate_action_ls = [] + + for i in range(len(key_frame)): + state, action = env.get_obs_action(demo._observations[key_frame[i]]); + state = transform(state) + keyframe_state_ls.append(state.unsqueeze(0)) + keyframe_action_ls.append(action.unsqueeze(0)) + + if store_intermediate_actions and i < len(key_frame) - 1: + intermediate_actions = [] + for j in range(key_frame[i], key_frame[i + 1] + 1): + _, action = env.get_obs_action(demo._observations[j]) + intermediate_actions.append(action.unsqueeze(0)) + intermediate_action_ls.append(torch.cat(intermediate_actions)) + + return demo, keyframe_state_ls, keyframe_action_ls, intermediate_action_ls + + +class Dataset(torch.utils.data.Dataset): + + def __init__(self, args: Arguments): + # load RLBench environment + self.env = RLBenchEnv( + data_path=args.data_dir, + image_size=[int(x) for x in args.image_size.split(",")], + apply_rgb=True, + apply_pc=True, + apply_cameras=args.cameras, + ) + + tasks = args.tasks + variations = range(args.offset, args.max_variations) + self.items = [] + for task_str, variation in itertools.product(tasks, variations): + episodes_dir = args.data_dir / task_str / f"variation{variation}" / "episodes" + episodes = [ + (task_str, variation, int(ep.stem[7:])) + for ep in episodes_dir.glob("episode*") + ] + self.items += episodes + + self.num_items = len(self.items) + + def __len__(self) -> int: + return self.num_items + + def __getitem__(self, index: int) -> None: + task, variation, episode = self.items[index] + taskvar_dir = args.output / f"{task}+{variation}" + taskvar_dir.mkdir(parents=True, exist_ok=True) + + (demo, + 
keyframe_state_ls, + keyframe_action_ls, + intermediate_action_ls) = get_observation( + task, variation, episode, self.env, + bool(args.store_intermediate_actions) + ) + + state_ls = einops.rearrange( + keyframe_state_ls, + "t 1 (m n ch) h w -> t n m ch h w", + ch=3, + n=len(args.cameras), + m=2, + ) + + frame_ids = list(range(len(state_ls) - 1)) + num_frames = len(frame_ids) + attn_indices = get_attn_indices_from_demo(task, demo, args.cameras) + + state_dict: List = [[] for _ in range(6)] + print("Demo {}".format(episode)) + state_dict[0].extend(frame_ids) + state_dict[1] = state_ls[:-1].numpy() + state_dict[2].extend(keyframe_action_ls[1:]) + state_dict[3].extend(attn_indices) + state_dict[4].extend(keyframe_action_ls[:-1]) # gripper pos + state_dict[5].extend(intermediate_action_ls) # traj from gripper pos to keyframe action + + with open(taskvar_dir / f"ep{episode}.dat", "wb") as f: + f.write(blosc.compress(pickle.dumps(state_dict))) + + +if __name__ == "__main__": + args = Arguments().parse_args() + + torch.manual_seed(args.seed) + np.random.seed(args.seed) + random.seed(args.seed) + + dataset = Dataset(args) + dataloader = torch.utils.data.DataLoader( + dataset, + batch_size=1, + num_workers=args.num_workers, + collate_fn=lambda x: x, + ) + + for _ in tqdm(dataloader): + continue diff --git a/data_preprocessing/preprocess_calvin_instructions.py b/data_preprocessing/preprocess_calvin_instructions.py new file mode 100644 index 0000000..3850446 --- /dev/null +++ b/data_preprocessing/preprocess_calvin_instructions.py @@ -0,0 +1,91 @@ +""" +Precompute embeddings of instructions. +""" +import os +import re +import json +from pathlib import Path +import itertools +from typing import List, Tuple, Literal, Dict, Optional +import pickle + +import tap +import transformers +from tqdm.auto import tqdm +import torch +import numpy as np + + +TextEncoder = Literal["bert", "clip"] + + +class Arguments(tap.Tap): + output: Path + encoder: TextEncoder = "clip" + model_max_length: int = 53 + device: str = "cuda" + verbose: bool = False + annotation_path: Path + + +def parse_int(s): + return int(re.findall(r"\d+", s)[0]) + + +def load_model(encoder: TextEncoder) -> transformers.PreTrainedModel: + if encoder == "bert": + model = transformers.BertModel.from_pretrained("bert-base-uncased") + elif encoder == "clip": + model = transformers.CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32") + else: + raise ValueError(f"Unexpected encoder {encoder}") + if not isinstance(model, transformers.PreTrainedModel): + raise ValueError(f"Unexpected encoder {encoder}") + return model + + +def load_tokenizer(encoder: TextEncoder) -> transformers.PreTrainedTokenizer: + if encoder == "bert": + tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased") + elif encoder == "clip": + tokenizer = transformers.CLIPTokenizer.from_pretrained( + "openai/clip-vit-base-patch32" + ) + else: + raise ValueError(f"Unexpected encoder {encoder}") + if not isinstance(tokenizer, transformers.PreTrainedTokenizer): + raise ValueError(f"Unexpected encoder {encoder}") + return tokenizer + + +if __name__ == "__main__": + args = Arguments().parse_args() + print(args) + + annotations = np.load(str(args.annotation_path), allow_pickle=True).item() + instructions_string = [s + '.' 
for s in annotations['language']['ann']] + + tokenizer = load_tokenizer(args.encoder) + tokenizer.model_max_length = args.model_max_length + + model = load_model(args.encoder) + model = model.to(args.device) + + instructions = { + 'embeddings': [], + 'text': [] + } + + for instr in tqdm(instructions_string): + tokens = tokenizer(instr, padding="max_length")["input_ids"] + + tokens = torch.tensor(tokens).to(args.device) + tokens = tokens.view(1, -1) + with torch.no_grad(): + pred = model(tokens).last_hidden_state + instructions['embeddings'].append(pred.cpu()) + instructions['text'].append(instr) + + os.makedirs(str(args.output.parent), exist_ok=True) + with open(args.output, "wb") as f: + pickle.dump(instructions, f) diff --git a/data_preprocessing/preprocess_rlbench_instructions.py b/data_preprocessing/preprocess_rlbench_instructions.py new file mode 100644 index 0000000..dc384f6 --- /dev/null +++ b/data_preprocessing/preprocess_rlbench_instructions.py @@ -0,0 +1,173 @@ +""" +Precompute embeddings of instructions. +""" +import re +import json +from pathlib import Path +import itertools +from typing import List, Tuple, Literal, Dict, Optional +import pickle + +import tap +import transformers +from tqdm.auto import tqdm +import torch + +from utils.utils_with_rlbench import RLBenchEnv, task_file_to_task_class + + +Annotations = Dict[str, Dict[int, List[str]]] +TextEncoder = Literal["bert", "clip"] + + +class Arguments(tap.Tap): + tasks: Tuple[str, ...] + output: Path + batch_size: int = 10 + encoder: TextEncoder = "clip" + model_max_length: int = 53 + variations: Tuple[int, ...] = list(range(199)) + device: str = "cuda" + annotations: Tuple[Path, ...] = () + zero: bool = False + verbose: bool = False + + +def parse_int(s): + return int(re.findall(r"\d+", s)[0]) + + +def load_model(encoder: TextEncoder) -> transformers.PreTrainedModel: + if encoder == "bert": + model = transformers.BertModel.from_pretrained("bert-base-uncased") + elif encoder == "clip": + model = transformers.CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32") + else: + raise ValueError(f"Unexpected encoder {encoder}") + if not isinstance(model, transformers.PreTrainedModel): + raise ValueError(f"Unexpected encoder {encoder}") + return model + + +def load_tokenizer(encoder: TextEncoder) -> transformers.PreTrainedTokenizer: + if encoder == "bert": + tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased") + elif encoder == "clip": + tokenizer = transformers.CLIPTokenizer.from_pretrained( + "openai/clip-vit-base-patch32" + ) + else: + raise ValueError(f"Unexpected encoder {encoder}") + if not isinstance(tokenizer, transformers.PreTrainedTokenizer): + raise ValueError(f"Unexpected encoder {encoder}") + return tokenizer + + +def load_annotations(annotations: Tuple[Path, ...]) -> Annotations: + data = [] + for annotation in annotations: + with open(annotation) as fid: + data += json.load(fid) + + items: Annotations = {} + for item in data: + task = item["fields"]["task"] + variation = item["fields"]["variation"] + instruction = item["fields"]["instruction"] + + if instruction == "": + continue + + if task not in items: + items[task] = {} + + if variation not in items[task]: + items[task][variation] = [] + + items[task][variation].append(instruction) + + # merge annotations for push_buttonsX (same variations) + push_buttons = ("push_buttons", "push_buttons3") + for task, task2 in itertools.product(push_buttons, push_buttons): + items[task] = items.get(task, {}) + for variation, instrs in 
items.get(task2, {}).items(): + items[task][variation] = instrs + items[task].get(variation, []) + + # display statistics + for task, values in items.items(): + print(task, ":", sorted(values.keys())) + + return items + + +if __name__ == "__main__": + args = Arguments().parse_args() + print(args) + + annotations = load_annotations(args.annotations) + + tokenizer = load_tokenizer(args.encoder) + tokenizer.model_max_length = args.model_max_length + + model = load_model(args.encoder) + model = model.to(args.device) + + env = RLBenchEnv( + data_path="", + apply_rgb=True, + apply_pc=True, + apply_cameras=("left_shoulder", "right_shoulder", "wrist"), + headless=True, + ) + + instructions: Dict[str, Dict[int, torch.Tensor]] = {} + tasks = set(args.tasks) + + for task in tqdm(tasks): + task_type = task_file_to_task_class(task) + task_inst = env.env.get_task(task_type)._task + task_inst.init_task() + + instructions[task] = {} + + variations = [v for v in args.variations if v < task_inst.variation_count()] + for variation in variations: + # check instructions among annotations + if task in annotations and variation in annotations[task]: + instr: Optional[List[str]] = annotations[task][variation] + # or, collect it from RLBench synthetic instructions + else: + instr = None + for i in range(3): + try: + instr = task_inst.init_episode(variation) + break + except: + print(f"Cannot init episode {task}") + if instr is None: + raise RuntimeError() + + if args.verbose: + print(task, variation, instr) + + tokens = tokenizer(instr, padding="max_length")["input_ids"] + lengths = [len(t) for t in tokens] + if any(l > args.model_max_length for l in lengths): + raise RuntimeError(f"Too long instructions: {lengths}") + + tokens = torch.tensor(tokens).to(args.device) + with torch.no_grad(): + pred = model(tokens).last_hidden_state + instructions[task][variation] = pred.cpu() + + if args.zero: + for instr_task in instructions.values(): + for variation, instr_var in instr_task.items(): + instr_task[variation].fill_(0) + + print("Instructions:", sum(len(inst) for inst in instructions.values())) + + args.output.parent.mkdir(exist_ok=True) + with open(args.output, "wb") as f: + pickle.dump(instructions, f) + diff --git a/data_preprocessing/rearrange_rlbench_demos.py b/data_preprocessing/rearrange_rlbench_demos.py new file mode 100644 index 0000000..8b64ae9 --- /dev/null +++ b/data_preprocessing/rearrange_rlbench_demos.py @@ -0,0 +1,51 @@ +import os +from subprocess import call +import pickle +from pathlib import Path + +import tap + + +class Arguments(tap.Tap): + root_dir: Path + + +def main(root_dir, task): + variations = os.listdir(f'{root_dir}/{task}/all_variations/episodes') + seen_variations = {} + for variation in variations: + num = int(variation.replace('episode', '')) + variation = pickle.load( + open( + f'{root_dir}/{task}/all_variations/episodes/episode{num}/variation_number.pkl', + 'rb' + ) + ) + os.makedirs(f'{root_dir}/{task}/variation{variation}/episodes', exist_ok=True) + + if variation not in seen_variations.keys(): + seen_variations[variation] = [num] + else: + seen_variations[variation].append(num) + + if os.path.isfile(f'{root_dir}/{task}/variation{variation}/variation_descriptions.pkl'): + data1 = pickle.load(open(f'{root_dir}/{task}/all_variations/episodes/episode{num}/variation_descriptions.pkl', 'rb')) + data2 = pickle.load(open(f'{root_dir}/{task}/variation{variation}/variation_descriptions.pkl', 'rb')) + assert data1 == data2 + else: + call(['ln', '-s', + 
f'{root_dir}/{task}/all_variations/episodes/episode{num}/variation_descriptions.pkl', + f'{root_dir}/{task}/variation{variation}/']) + + ep_id = len(seen_variations[variation]) - 1 + call(['ln', '-s', + "{:s}/{:s}/all_variations/episodes/episode{:d}".format(root_dir, task, num), + f'{root_dir}/{task}/variation{variation}/episodes/episode{ep_id}']) + + +if __name__ == '__main__': + args = Arguments().parse_args() + root_dir = str(args.root_dir.absolute()) + tasks = [f for f in os.listdir(root_dir) if '.zip' not in f] + for task in tasks: + main(root_dir, task) diff --git a/data_preprocessing/rerender_highres_rlbench.py b/data_preprocessing/rerender_highres_rlbench.py new file mode 100644 index 0000000..5077b4f --- /dev/null +++ b/data_preprocessing/rerender_highres_rlbench.py @@ -0,0 +1,536 @@ +from multiprocessing import Process, Manager +from typing import Type, List, Callable +import glob +import os +import pickle +from subprocess import call + +from pyrep.const import RenderMode + +from rlbench import ObservationConfig +from rlbench.backend.observation import Observation +from rlbench.demo import Demo +from rlbench.backend.task import Task +from rlbench.action_modes.action_mode import MoveArmThenGripper +from rlbench.action_modes.arm_action_modes import JointVelocity +from rlbench.action_modes.gripper_action_modes import Discrete +from rlbench.backend.utils import task_file_to_task_class +from rlbench.environment import Environment +from rlbench.task_environment import ( + TaskEnvironment, + _MAX_RESET_ATTEMPTS, + _MAX_DEMO_ATTEMPTS +) +import rlbench.backend.task as task + +from PIL import Image +from rlbench.backend import utils +from rlbench.backend.const import ( + LEFT_SHOULDER_RGB_FOLDER, + LEFT_SHOULDER_DEPTH_FOLDER, + LEFT_SHOULDER_MASK_FOLDER, + RIGHT_SHOULDER_RGB_FOLDER, + RIGHT_SHOULDER_DEPTH_FOLDER, + RIGHT_SHOULDER_MASK_FOLDER, + OVERHEAD_RGB_FOLDER, + OVERHEAD_DEPTH_FOLDER, + OVERHEAD_MASK_FOLDER, + WRIST_RGB_FOLDER, + WRIST_DEPTH_FOLDER, + WRIST_MASK_FOLDER, + FRONT_RGB_FOLDER, + FRONT_DEPTH_FOLDER, + FRONT_MASK_FOLDER, + DEPTH_SCALE, + IMAGE_FORMAT, + LOW_DIM_PICKLE, + VARIATION_NUMBER, + VARIATIONS_ALL_FOLDER, + EPISODES_FOLDER, + EPISODE_FOLDER, + VARIATION_DESCRIPTIONS +) +import numpy as np + +from absl import app +from absl import flags + +FLAGS = flags.FLAGS + +flags.DEFINE_string('save_path', + '/tmp/rlbench_data/', + 'Where to save the demos.') +flags.DEFINE_string('demo_path', + '/tmp/rlbench_data/', + 'Where to existing demos.') +flags.DEFINE_list('tasks', [], + 'The tasks to collect. If empty, all tasks are collected.') +flags.DEFINE_list('image_size', [128, 128], + 'The size of the images tp save.') +flags.DEFINE_enum('renderer', 'opengl3', ['opengl', 'opengl3'], + 'The renderer to use. opengl does not include shadows, ' + 'but is faster.') +flags.DEFINE_integer('processes', 1, + 'The number of parallel processes during collection.') +flags.DEFINE_integer('variations', -1, + 'Number of variations to collect per task. -1 for all.') +flags.DEFINE_bool('all_variations', True, + 'Include all variations when sampling epsiodes') + + +class CustomizedTaskEnvironment(TaskEnvironment): + """Modify TaskEnvironment class, so that we can provide random seed + when generating live demos. 
+ """ + def get_demos(self, amount: int, live_demos: bool = False, + image_paths: bool = False, + callable_each_step: Callable[[Observation], None] = None, + max_attempts: int = _MAX_DEMO_ATTEMPTS, + random_selection: bool = True, + from_episode_number: int = 0, + random_seed_state = None, + ) -> List[Demo]: + """Negative means all demos""" + + if not live_demos and (self._dataset_root is None + or len(self._dataset_root) == 0): + raise RuntimeError( + "Can't ask for a stored demo when no dataset root provided.") + + if not live_demos: + if self._dataset_root is None or len(self._dataset_root) == 0: + raise RuntimeError( + "Can't ask for stored demo when no dataset root provided.") + demos = utils.get_stored_demos( + amount, image_paths, self._dataset_root, self._variation_number, + self._task.get_name(), self._obs_config, + random_selection, from_episode_number) + else: + ctr_loop = self._robot.arm.joints[0].is_control_loop_enabled() + self._robot.arm.set_control_loop_enabled(True) + demos = self._get_live_demos( + amount, callable_each_step, max_attempts, random_seed_state) + self._robot.arm.set_control_loop_enabled(ctr_loop) + return demos + + def _get_live_demos(self, amount: int, + callable_each_step: Callable[ + [Observation], None] = None, + max_attempts: int = _MAX_DEMO_ATTEMPTS, + random_seed_state = None) -> List[Demo]: + demos = [] + for i in range(amount): + attempts = max_attempts + while attempts > 0: + if random_seed_state is None: + random_seed = np.random.get_state() + else: + random_seed = random_seed_state + np.random.set_state(random_seed) + self.reset() + try: + demo = self._scene.get_demo( + callable_each_step=callable_each_step) + demo.random_seed = random_seed + demos.append(demo) + break + except Exception as e: + attempts -= 1 + logging.info('Bad demo. ' + str(e) + ' Attempts left: ' + str(attempts)) + if attempts <= 0: + raise RuntimeError( + 'Could not collect demos. Maybe a problem with the task?') + return demos + + + +class CustomizedEnvironment(Environment): + + def get_task(self, task_class: Type[Task]) -> CustomizedTaskEnvironment: + + # If user hasn't called launch, implicitly call it. 
+ if self._pyrep is None: + self.launch() + + self._scene.unload() + task = task_class(self._pyrep, self._robot) + self._prev_task = task + return CustomizedTaskEnvironment( + self._pyrep, self._robot, self._scene, task, + self._action_mode, self._dataset_root, self._obs_config, + self._static_positions, self._attach_grasped_objects) + + + +def check_and_make(dir): + if not os.path.exists(dir): + os.makedirs(dir) + + +def save_demo(demo, example_path, variation): + + # Save image data first, and then None the image data, and pickle + left_shoulder_rgb_path = os.path.join( + example_path, LEFT_SHOULDER_RGB_FOLDER) + left_shoulder_depth_path = os.path.join( + example_path, LEFT_SHOULDER_DEPTH_FOLDER) + left_shoulder_mask_path = os.path.join( + example_path, LEFT_SHOULDER_MASK_FOLDER) + right_shoulder_rgb_path = os.path.join( + example_path, RIGHT_SHOULDER_RGB_FOLDER) + right_shoulder_depth_path = os.path.join( + example_path, RIGHT_SHOULDER_DEPTH_FOLDER) + right_shoulder_mask_path = os.path.join( + example_path, RIGHT_SHOULDER_MASK_FOLDER) + overhead_rgb_path = os.path.join( + example_path, OVERHEAD_RGB_FOLDER) + overhead_depth_path = os.path.join( + example_path, OVERHEAD_DEPTH_FOLDER) + overhead_mask_path = os.path.join( + example_path, OVERHEAD_MASK_FOLDER) + wrist_rgb_path = os.path.join(example_path, WRIST_RGB_FOLDER) + wrist_depth_path = os.path.join(example_path, WRIST_DEPTH_FOLDER) + wrist_mask_path = os.path.join(example_path, WRIST_MASK_FOLDER) + front_rgb_path = os.path.join(example_path, FRONT_RGB_FOLDER) + front_depth_path = os.path.join(example_path, FRONT_DEPTH_FOLDER) + front_mask_path = os.path.join(example_path, FRONT_MASK_FOLDER) + + check_and_make(left_shoulder_rgb_path) + check_and_make(left_shoulder_depth_path) + check_and_make(left_shoulder_mask_path) + check_and_make(right_shoulder_rgb_path) + check_and_make(right_shoulder_depth_path) + check_and_make(right_shoulder_mask_path) + check_and_make(overhead_rgb_path) + check_and_make(overhead_depth_path) + check_and_make(overhead_mask_path) + check_and_make(wrist_rgb_path) + check_and_make(wrist_depth_path) + check_and_make(wrist_mask_path) + check_and_make(front_rgb_path) + check_and_make(front_depth_path) + check_and_make(front_mask_path) + + for i, obs in enumerate(demo): + left_shoulder_rgb = Image.fromarray(obs.left_shoulder_rgb) + left_shoulder_depth = utils.float_array_to_rgb_image( + obs.left_shoulder_depth, scale_factor=DEPTH_SCALE) + left_shoulder_mask = Image.fromarray( + (obs.left_shoulder_mask * 255).astype(np.uint8)) + right_shoulder_rgb = Image.fromarray(obs.right_shoulder_rgb) + right_shoulder_depth = utils.float_array_to_rgb_image( + obs.right_shoulder_depth, scale_factor=DEPTH_SCALE) + right_shoulder_mask = Image.fromarray( + (obs.right_shoulder_mask * 255).astype(np.uint8)) + overhead_rgb = Image.fromarray(obs.overhead_rgb) + overhead_depth = utils.float_array_to_rgb_image( + obs.overhead_depth, scale_factor=DEPTH_SCALE) + overhead_mask = Image.fromarray( + (obs.overhead_mask * 255).astype(np.uint8)) + wrist_rgb = Image.fromarray(obs.wrist_rgb) + wrist_depth = utils.float_array_to_rgb_image( + obs.wrist_depth, scale_factor=DEPTH_SCALE) + wrist_mask = Image.fromarray((obs.wrist_mask * 255).astype(np.uint8)) + front_rgb = Image.fromarray(obs.front_rgb) + front_depth = utils.float_array_to_rgb_image( + obs.front_depth, scale_factor=DEPTH_SCALE) + front_mask = Image.fromarray((obs.front_mask * 255).astype(np.uint8)) + + left_shoulder_rgb.save( + os.path.join(left_shoulder_rgb_path, IMAGE_FORMAT % i)) + 
left_shoulder_depth.save( + os.path.join(left_shoulder_depth_path, IMAGE_FORMAT % i)) + left_shoulder_mask.save( + os.path.join(left_shoulder_mask_path, IMAGE_FORMAT % i)) + right_shoulder_rgb.save( + os.path.join(right_shoulder_rgb_path, IMAGE_FORMAT % i)) + right_shoulder_depth.save( + os.path.join(right_shoulder_depth_path, IMAGE_FORMAT % i)) + right_shoulder_mask.save( + os.path.join(right_shoulder_mask_path, IMAGE_FORMAT % i)) + overhead_rgb.save( + os.path.join(overhead_rgb_path, IMAGE_FORMAT % i)) + overhead_depth.save( + os.path.join(overhead_depth_path, IMAGE_FORMAT % i)) + overhead_mask.save( + os.path.join(overhead_mask_path, IMAGE_FORMAT % i)) + wrist_rgb.save(os.path.join(wrist_rgb_path, IMAGE_FORMAT % i)) + wrist_depth.save(os.path.join(wrist_depth_path, IMAGE_FORMAT % i)) + wrist_mask.save(os.path.join(wrist_mask_path, IMAGE_FORMAT % i)) + front_rgb.save(os.path.join(front_rgb_path, IMAGE_FORMAT % i)) + front_depth.save(os.path.join(front_depth_path, IMAGE_FORMAT % i)) + front_mask.save(os.path.join(front_mask_path, IMAGE_FORMAT % i)) + + # We save the images separately, so set these to None for pickling. + obs.left_shoulder_rgb = None + obs.left_shoulder_depth = None + obs.left_shoulder_point_cloud = None + obs.left_shoulder_mask = None + obs.right_shoulder_rgb = None + obs.right_shoulder_depth = None + obs.right_shoulder_point_cloud = None + obs.right_shoulder_mask = None + obs.overhead_rgb = None + obs.overhead_depth = None + obs.overhead_point_cloud = None + obs.overhead_mask = None + obs.wrist_rgb = None + obs.wrist_depth = None + obs.wrist_point_cloud = None + obs.wrist_mask = None + obs.front_rgb = None + obs.front_depth = None + obs.front_point_cloud = None + obs.front_mask = None + + # Save the low-dimension data + with open(os.path.join(example_path, LOW_DIM_PICKLE), 'wb') as f: + pickle.dump(demo, f) + + with open(os.path.join(example_path, VARIATION_NUMBER), 'wb') as f: + pickle.dump(variation, f) + + +def verify_demo_and_rgbs(demo, example_path): + left_shoulder_rgb_path = os.path.join( + example_path, LEFT_SHOULDER_RGB_FOLDER) + left_shoulder_depth_path = os.path.join( + example_path, LEFT_SHOULDER_DEPTH_FOLDER) + left_shoulder_mask_path = os.path.join( + example_path, LEFT_SHOULDER_MASK_FOLDER) + right_shoulder_rgb_path = os.path.join( + example_path, RIGHT_SHOULDER_RGB_FOLDER) + right_shoulder_depth_path = os.path.join( + example_path, RIGHT_SHOULDER_DEPTH_FOLDER) + right_shoulder_mask_path = os.path.join( + example_path, RIGHT_SHOULDER_MASK_FOLDER) + overhead_rgb_path = os.path.join( + example_path, OVERHEAD_RGB_FOLDER) + overhead_depth_path = os.path.join( + example_path, OVERHEAD_DEPTH_FOLDER) + overhead_mask_path = os.path.join( + example_path, OVERHEAD_MASK_FOLDER) + wrist_rgb_path = os.path.join(example_path, WRIST_RGB_FOLDER) + wrist_depth_path = os.path.join(example_path, WRIST_DEPTH_FOLDER) + wrist_mask_path = os.path.join(example_path, WRIST_MASK_FOLDER) + front_rgb_path = os.path.join(example_path, FRONT_RGB_FOLDER) + front_depth_path = os.path.join(example_path, FRONT_DEPTH_FOLDER) + front_mask_path = os.path.join(example_path, FRONT_MASK_FOLDER) + + num_ls_rgb = len(os.listdir(left_shoulder_rgb_path)) + num_ls_depth = len(os.listdir(left_shoulder_depth_path)) + num_ls_mask = len(os.listdir(left_shoulder_mask_path)) + num_rs_rgb = len(os.listdir(right_shoulder_rgb_path)) + num_rs_depth = len(os.listdir(right_shoulder_depth_path)) + num_rs_mask = len(os.listdir(right_shoulder_mask_path)) + num_oh_rgb = len(os.listdir(overhead_rgb_path)) + 
num_oh_depth = len(os.listdir(overhead_depth_path)) + num_oh_mask = len(os.listdir(overhead_mask_path)) + num_wrist_rgb = len(os.listdir(wrist_rgb_path)) + num_wrist_depth = len(os.listdir(wrist_depth_path)) + num_wrist_mask = len(os.listdir(wrist_mask_path)) + num_front_rgb = len(os.listdir(front_rgb_path)) + num_front_depth = len(os.listdir(front_depth_path)) + num_front_mask = len(os.listdir(front_mask_path)) + + print(len(demo), num_ls_rgb, num_rs_rgb, num_oh_rgb, num_front_rgb) + assert len(demo) == num_ls_rgb + assert len(demo) == num_ls_depth + assert len(demo) == num_ls_mask + assert len(demo) == num_rs_rgb + assert len(demo) == num_rs_depth + assert len(demo) == num_rs_mask + assert len(demo) == num_oh_rgb + assert len(demo) == num_oh_depth + assert len(demo) == num_oh_mask + assert len(demo) == num_front_rgb + assert len(demo) == num_front_depth + assert len(demo) == num_front_mask + assert len(demo) == num_wrist_rgb + assert len(demo) == num_wrist_depth + assert len(demo) == num_wrist_mask + + +def run_all_variations(i, lock, task_index, variation_count, results, file_lock, tasks): + """Each thread will choose one task and variation, and then gather + all the episodes_per_task for that variation.""" + + # Initialise each thread with random seed + np.random.seed(None) + num_tasks = len(tasks) + + img_size = list(map(int, FLAGS.image_size)) + + obs_config = ObservationConfig() + obs_config.set_all(True) + obs_config.right_shoulder_camera.image_size = img_size + obs_config.left_shoulder_camera.image_size = img_size + obs_config.overhead_camera.image_size = img_size + obs_config.wrist_camera.image_size = img_size + obs_config.front_camera.image_size = img_size + + # Store depth as 0 - 1 + obs_config.right_shoulder_camera.depth_in_meters = False + obs_config.left_shoulder_camera.depth_in_meters = False + obs_config.overhead_camera.depth_in_meters = False + obs_config.wrist_camera.depth_in_meters = False + obs_config.front_camera.depth_in_meters = False + + # We want to save the masks as rgb encodings. 
+ obs_config.left_shoulder_camera.masks_as_one_channel = False + obs_config.right_shoulder_camera.masks_as_one_channel = False + obs_config.overhead_camera.masks_as_one_channel = False + obs_config.wrist_camera.masks_as_one_channel = False + obs_config.front_camera.masks_as_one_channel = False + + if FLAGS.renderer == 'opengl': + obs_config.right_shoulder_camera.render_mode = RenderMode.OPENGL + obs_config.left_shoulder_camera.render_mode = RenderMode.OPENGL + obs_config.overhead_camera.render_mode = RenderMode.OPENGL + obs_config.wrist_camera.render_mode = RenderMode.OPENGL + obs_config.front_camera.render_mode = RenderMode.OPENGL + + rlbench_env = CustomizedEnvironment( + action_mode=MoveArmThenGripper(JointVelocity(), Discrete()), + obs_config=obs_config, + headless=True) + rlbench_env.launch() + + task_env = None + + tasks_with_problems = results[i] = '' + + while True: + # with lock: + if task_index.value >= num_tasks: + print('Process', i, 'finished') + break + + t = tasks[task_index.value] + + task_env = rlbench_env.get_task(t) + possible_variations = task_env.variation_count() + + variation_path = os.path.join( + FLAGS.save_path, task_env.get_name(), + VARIATIONS_ALL_FOLDER) + check_and_make(variation_path) + + episodes_path = os.path.join(variation_path, EPISODES_FOLDER) + check_and_make(episodes_path) + + existing_episodes_path = os.path.join( + FLAGS.demo_path, task_env.get_name(), VARIATIONS_ALL_FOLDER, EPISODES_FOLDER + ) + episodes_per_task = len(glob.glob(os.path.join(existing_episodes_path, "episode*"))) + + abort_variation = False + for ex_idx in range(episodes_per_task): + attempts = 100 + existing_episode_path = os.path.join( + existing_episodes_path, EPISODE_FOLDER % ex_idx + ) + episode_path = os.path.join(episodes_path, EPISODE_FOLDER % ex_idx) + + while attempts > 0: + try: + #variation = np.random.randint(possible_variations) + variation = pickle.load( + open( + os.path.join(existing_episode_path, VARIATION_NUMBER), + 'rb' + ) + ) + existing_demo = pickle.load( + open( + os.path.join(existing_episode_path, LOW_DIM_PICKLE), + 'rb' + ) + ) + random_seed_state = existing_demo.random_seed + task_env = rlbench_env.get_task(t) + task_env.set_variation(variation) + descriptions, obs = task_env.reset() + + print('Process', i, '// Task:', task_env.get_name(), + '// Variation:', variation, '// Demo:', ex_idx) + + # TODO: for now we do the explicit looping. + demo, = task_env.get_demos( + amount=1, + live_demos=True, + random_seed_state=random_seed_state) + + with file_lock: + save_demo(demo, episode_path, variation) + + with open(os.path.join( + episode_path, VARIATION_DESCRIPTIONS), 'wb') as f: + pickle.dump(descriptions, f) + + # verify demo + verify_demo_and_rgbs(demo, episode_path) + except Exception as e: + attempts -= 1 + + # clean up previously saved RGBs + call(['rm', '-r', episode_path]) + + if attempts > 0: + continue + problem = ( + 'Process %d failed collecting task %s (variation: %d, ' + 'example: %d). 
Skipping this task/variation.\n%s\n' % ( + i, task_env.get_name(), variation, ex_idx, + str(e)) + ) + print(problem) + tasks_with_problems += problem + abort_variation = True + break + break + if abort_variation: + break + + # with lock: + task_index.value += 1 + + results[i] = tasks_with_problems + rlbench_env.shutdown() + + +def main(argv): + + task_files = [t.replace('.py', '') for t in os.listdir(task.TASKS_PATH) + if t != '__init__.py' and t.endswith('.py')] + + if len(FLAGS.tasks) > 0: + for t in FLAGS.tasks: + if t not in task_files: + raise ValueError('Task %s not recognised!.' % t) + task_files = FLAGS.tasks + + tasks = [task_file_to_task_class(t) for t in task_files] + + manager = Manager() + + result_dict = manager.dict() + file_lock = manager.Lock() + + task_index = manager.Value('i', 0) + variation_count = manager.Value('i', 0) + lock = manager.Lock() + + check_and_make(FLAGS.save_path) + + # multiprocessing for all_variations not support (for now) + run_all_variations(0, lock, task_index, variation_count, result_dict, file_lock, tasks) + + print('Data collection done!') + for i in range(FLAGS.processes): + print(result_dict[i]) + + +if __name__ == '__main__': + app.run(main) + diff --git a/datasets/__init__.py b/datasets/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/datasets/dataset_calvin.py b/datasets/dataset_calvin.py new file mode 100644 index 0000000..1f0e73b --- /dev/null +++ b/datasets/dataset_calvin.py @@ -0,0 +1,259 @@ +from collections import defaultdict, Counter +import itertools +import math +import random +from pathlib import Path + +import torch + +from .dataset_engine import RLBenchDataset +from .utils import Resize, TrajectoryInterpolator +from utils.utils_with_calvin import to_relative_action, convert_rotation + + +class CalvinDataset(RLBenchDataset): + + def __init__( + self, + # required + root, + instructions=None, + # dataset specification + taskvar=[('close_door', 0)], + max_episode_length=5, + cache_size=0, + max_episodes_per_task=100, + num_iters=None, + cameras=("wrist", "left_shoulder", "right_shoulder"), + # for augmentations + training=True, + image_rescale=(1.0, 1.0), + # for trajectories + return_low_lvl_trajectory=False, + dense_interpolation=False, + interpolation_length=100, + relative_action=True + ): + self._cache = {} + self._cache_size = cache_size + self._cameras = cameras + self._max_episode_length = max_episode_length + self._num_iters = num_iters + self._training = training + self._taskvar = taskvar + self._return_low_lvl_trajectory = return_low_lvl_trajectory + if isinstance(root, (Path, str)): + root = [Path(root)] + self._root = [Path(r).expanduser() for r in root] + self._relative_action = relative_action + + # For trajectory optimization, initialize interpolation tools + if return_low_lvl_trajectory: + assert dense_interpolation + self._interpolate_traj = TrajectoryInterpolator( + use=dense_interpolation, + interpolation_length=interpolation_length + ) + + # Keep variations and useful instructions + self._instructions = instructions + self._num_vars = Counter() # variations of the same task + for root, (task, var) in itertools.product(self._root, taskvar): + data_dir = root / f"{task}+{var}" + if data_dir.is_dir(): + self._num_vars[task] += 1 + + # If training, initialize augmentation classes + if self._training: + self._resize = Resize(scales=image_rescale) + + # File-names of episodes per-task and variation + episodes_by_task = defaultdict(list) + for root, (task, var) in itertools.product(self._root, 
taskvar): + data_dir = root / f"{task}+{var}" + if not data_dir.is_dir(): + print(f"Can't find dataset folder {data_dir}") + continue + npy_episodes = [(task, var, ep) for ep in data_dir.glob("*.npy")] + dat_episodes = [(task, var, ep) for ep in data_dir.glob("*.dat")] + pkl_episodes = [(task, var, ep) for ep in data_dir.glob("*.pkl")] + episodes = npy_episodes + dat_episodes + pkl_episodes + # Split episodes equally into task variations + if max_episodes_per_task > -1: + episodes = episodes[ + :max_episodes_per_task // self._num_vars[task] + 1 + ] + if len(episodes) == 0: + print(f"Can't find episodes at folder {data_dir}") + continue + episodes_by_task[task] += episodes + + # Collect and trim all episodes in the dataset + self._episodes = [] + self._num_episodes = 0 + for task, eps in episodes_by_task.items(): + if len(eps) > max_episodes_per_task and max_episodes_per_task > -1: + eps = random.sample(eps, max_episodes_per_task) + self._episodes += eps + self._num_episodes += len(eps) + + print(f"Created dataset from {root} with {self._num_episodes}") + + def __getitem__(self, episode_id): + """ + the episode item: [ + [frame_ids], # we use chunk and max_episode_length to index it + [obs_tensors], # wrt frame_ids, (n_cam, 2, 3, 256, 256) + obs_tensors[i][:, 0] is RGB, obs_tensors[i][:, 1] is XYZ + [action_tensors], # wrt frame_ids, (1, 8) + [camera_dicts], + [gripper_tensors], # wrt frame_ids, (1, 8) + [trajectories] # wrt frame_ids, (N_i, 8) + ] + """ + episode_id %= self._num_episodes + task, variation, file = self._episodes[episode_id] + + # Load episode + episode = self.read_from_cache(file) + if episode is None: + return None + + # Dynamic chunking so as not to overload GPU memory + chunk = random.randint( + 0, math.ceil(len(episode[0]) / self._max_episode_length) - 1 + ) + + # Get frame ids for this chunk + frame_ids = episode[0][ + chunk * self._max_episode_length: + (chunk + 1) * self._max_episode_length + ] + + # Get the image tensors for the frame ids we got + states = torch.stack([ + episode[1][i] if isinstance(episode[1][i], torch.Tensor) + else torch.from_numpy(episode[1][i]) + for i in frame_ids + ]) + + # Camera ids + if episode[3]: + cameras = list(episode[3][0].keys()) + assert all(c in cameras for c in self._cameras) + index = torch.tensor([cameras.index(c) for c in self._cameras]) + # Re-map states based on camera ids + states = states[:, index] + + # Split RGB and XYZ + rgbs = states[:, :, 0, :, 20:180, 20:180] + pcds = states[:, :, 1, :, 20:180, 20:180] + rgbs = self._unnormalize_rgb(rgbs) + + # Get action tensors for respective frame ids + action = torch.cat([episode[2][i] for i in frame_ids]) + + # Sample one instruction feature + if self._instructions is not None: + instr_ind = episode[6][0] + instr = torch.as_tensor(self._instructions[instr_ind]) + instr = instr.repeat(len(rgbs), 1, 1) + else: + instr = torch.zeros((rgbs.shape[0], 53, 512)) + + # Get gripper tensors for respective frame ids + gripper = torch.cat([episode[4][i] for i in frame_ids]) + + # gripper history + if len(episode) > 7: + gripper_history = torch.cat([ + episode[7][i] for i in frame_ids + ], dim=0) + else: + gripper_history = torch.stack([ + torch.cat([episode[4][max(0, i-2)] for i in frame_ids]), + torch.cat([episode[4][max(0, i-1)] for i in frame_ids]), + gripper + ], dim=1) + + # Low-level trajectory + traj, traj_lens = None, 0 + if self._return_low_lvl_trajectory: + if len(episode) > 5: + traj_items = [ + self._interpolate_traj(episode[5][i]) for i in frame_ids + ] + else: + traj_items = [ 
+ self._interpolate_traj( + torch.cat([episode[4][i], episode[2][i]], dim=0) + ) for i in frame_ids + ] + max_l = max(len(item) for item in traj_items) + traj = torch.zeros(len(traj_items), max_l, traj_items[0].shape[-1]) + traj_lens = torch.as_tensor( + [len(item) for item in traj_items] + ) + for i, item in enumerate(traj_items): + traj[i, :len(item)] = item + traj_mask = torch.zeros(traj.shape[:-1]) + for i, len_ in enumerate(traj_lens.long()): + traj_mask[i, len_:] = 1 + + # Augmentations + if self._training: + if traj is not None: + for t, tlen in enumerate(traj_lens): + traj[t, tlen:] = 0 + modals = self._resize(rgbs=rgbs, pcds=pcds) + rgbs = modals["rgbs"] + pcds = modals["pcds"] + + # Compute relative action + if self._relative_action and traj is not None: + rel_traj = torch.zeros_like(traj) + for i in range(traj.shape[0]): + for j in range(traj.shape[1]): + rel_traj[i, j] = torch.as_tensor(to_relative_action( + traj[i, j].numpy(), traj[i, 0].numpy(), clip=False + )) + traj = rel_traj + + # Convert Euler angles to Quarternion + action = torch.cat([ + action[..., :3], + torch.as_tensor(convert_rotation(action[..., 3:6])), + action[..., 6:] + ], dim=-1) + gripper = torch.cat([ + gripper[..., :3], + torch.as_tensor(convert_rotation(gripper[..., 3:6])), + gripper[..., 6:] + ], dim=-1) + gripper_history = torch.cat([ + gripper_history[..., :3], + torch.as_tensor(convert_rotation(gripper_history[..., 3:6])), + gripper_history[..., 6:] + ], dim=-1) + if traj is not None: + traj = torch.cat([ + traj[..., :3], + torch.as_tensor(convert_rotation(traj[..., 3:6])), + traj[..., 6:] + ], dim=-1) + + ret_dict = { + "task": [task for _ in frame_ids], + "rgbs": rgbs, # e.g. tensor (n_frames, n_cam, 3+1, H, W) + "pcds": pcds, # e.g. tensor (n_frames, n_cam, 3, H, W) + "action": action, # e.g. tensor (n_frames, 8), target pose + "instr": instr, # a (n_frames, 53, 512) tensor + "curr_gripper": gripper, + "curr_gripper_history": gripper_history + } + if self._return_low_lvl_trajectory: + ret_dict.update({ + "trajectory": traj, # e.g. 
tensor (n_frames, T, 8) + "trajectory_mask": traj_mask.bool() # tensor (n_frames, T) + }) + return ret_dict diff --git a/datasets/dataset_engine.py b/datasets/dataset_engine.py new file mode 100644 index 0000000..d9ce85f --- /dev/null +++ b/datasets/dataset_engine.py @@ -0,0 +1,254 @@ +from collections import defaultdict, Counter +import itertools +import math +import random +from pathlib import Path +from time import time + +import torch +from torch.utils.data import Dataset + +from .utils import loader, Resize, TrajectoryInterpolator + + +class RLBenchDataset(Dataset): + """RLBench dataset.""" + + def __init__( + self, + # required + root, + instructions=None, + # dataset specification + taskvar=[('close_door', 0)], + max_episode_length=5, + cache_size=0, + max_episodes_per_task=100, + num_iters=None, + cameras=("wrist", "left_shoulder", "right_shoulder"), + # for augmentations + training=True, + image_rescale=(1.0, 1.0), + # for trajectories + return_low_lvl_trajectory=False, + dense_interpolation=False, + interpolation_length=100, + relative_action=False + ): + self._cache = {} + self._cache_size = cache_size + self._cameras = cameras + self._max_episode_length = max_episode_length + self._num_iters = num_iters + self._training = training + self._taskvar = taskvar + self._return_low_lvl_trajectory = return_low_lvl_trajectory + if isinstance(root, (Path, str)): + root = [Path(root)] + self._root = [Path(r).expanduser() for r in root] + self._relative_action = relative_action + + # For trajectory optimization, initialize interpolation tools + if return_low_lvl_trajectory: + assert dense_interpolation + self._interpolate_traj = TrajectoryInterpolator( + use=dense_interpolation, + interpolation_length=interpolation_length + ) + + # Keep variations and useful instructions + self._instructions = defaultdict(dict) + self._num_vars = Counter() # variations of the same task + for root, (task, var) in itertools.product(self._root, taskvar): + data_dir = root / f"{task}+{var}" + if data_dir.is_dir(): + if instructions is not None: + self._instructions[task][var] = instructions[task][var] + self._num_vars[task] += 1 + + # If training, initialize augmentation classes + if self._training: + self._resize = Resize(scales=image_rescale) + + # File-names of episodes per task and variation + episodes_by_task = defaultdict(list) # {task: [(task, var, filepath)]} + for root, (task, var) in itertools.product(self._root, taskvar): + data_dir = root / f"{task}+{var}" + if not data_dir.is_dir(): + print(f"Can't find dataset folder {data_dir}") + continue + npy_episodes = [(task, var, ep) for ep in data_dir.glob("*.npy")] + dat_episodes = [(task, var, ep) for ep in data_dir.glob("*.dat")] + pkl_episodes = [(task, var, ep) for ep in data_dir.glob("*.pkl")] + episodes = npy_episodes + dat_episodes + pkl_episodes + # Split episodes equally into task variations + if max_episodes_per_task > -1: + episodes = episodes[ + :max_episodes_per_task // self._num_vars[task] + 1 + ] + if len(episodes) == 0: + print(f"Can't find episodes at folder {data_dir}") + continue + episodes_by_task[task] += episodes + + # Collect and trim all episodes in the dataset + self._episodes = [] + self._num_episodes = 0 + for task, eps in episodes_by_task.items(): + if len(eps) > max_episodes_per_task and max_episodes_per_task > -1: + eps = random.sample(eps, max_episodes_per_task) + episodes_by_task[task] = sorted( + eps, key=lambda t: int(str(t[2]).split('/')[-1][2:-4]) + ) + self._episodes += eps + self._num_episodes += len(eps) + 
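+        # Note: the sort key above assumes episode file names like ep<idx>.<ext> with a
+        # three-character extension, e.g. "ep12.dat"[2:-4] == "12"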
print(f"Created dataset from {root} with {self._num_episodes}") + self._episodes_by_task = episodes_by_task + + def read_from_cache(self, args): + if self._cache_size == 0: + return loader(args) + + if args in self._cache: + return self._cache[args] + + value = loader(args) + + if len(self._cache) == self._cache_size: + key = list(self._cache.keys())[int(time()) % self._cache_size] + del self._cache[key] + + if len(self._cache) < self._cache_size: + self._cache[args] = value + + return value + + @staticmethod + def _unnormalize_rgb(rgb): + # (from [-1, 1] to [0, 1]) to feed RGB to pre-trained backbone + return rgb / 2 + 0.5 + + def __getitem__(self, episode_id): + """ + the episode item: [ + [frame_ids], # we use chunk and max_episode_length to index it + [obs_tensors], # wrt frame_ids, (n_cam, 2, 3, 256, 256) + obs_tensors[i][:, 0] is RGB, obs_tensors[i][:, 1] is XYZ + [action_tensors], # wrt frame_ids, (1, 8) + [camera_dicts], + [gripper_tensors], # wrt frame_ids, (1, 8) + [trajectories] # wrt frame_ids, (N_i, 8) + ] + """ + episode_id %= self._num_episodes + task, variation, file = self._episodes[episode_id] + + # Load episode + episode = self.read_from_cache(file) + if episode is None: + return None + + # Dynamic chunking so as not to overload GPU memory + chunk = random.randint( + 0, math.ceil(len(episode[0]) / self._max_episode_length) - 1 + ) + + # Get frame ids for this chunk + frame_ids = episode[0][ + chunk * self._max_episode_length: + (chunk + 1) * self._max_episode_length + ] + + # Get the image tensors for the frame ids we got + states = torch.stack([ + episode[1][i] if isinstance(episode[1][i], torch.Tensor) + else torch.from_numpy(episode[1][i]) + for i in frame_ids + ]) + + # Camera ids + if episode[3]: + cameras = list(episode[3][0].keys()) + assert all(c in cameras for c in self._cameras) + index = torch.tensor([cameras.index(c) for c in self._cameras]) + # Re-map states based on camera ids + states = states[:, index] + + # Split RGB and XYZ + rgbs = states[:, :, 0] + pcds = states[:, :, 1] + rgbs = self._unnormalize_rgb(rgbs) + + # Get action tensors for respective frame ids + action = torch.cat([episode[2][i] for i in frame_ids]) + + # Sample one instruction feature + if self._instructions: + instr = random.choice(self._instructions[task][variation]) + instr = instr[None].repeat(len(rgbs), 1, 1) + else: + instr = torch.zeros((rgbs.shape[0], 53, 512)) + + # Get gripper tensors for respective frame ids + gripper = torch.cat([episode[4][i] for i in frame_ids]) + + # gripper history + gripper_history = torch.stack([ + torch.cat([episode[4][max(0, i-2)] for i in frame_ids]), + torch.cat([episode[4][max(0, i-1)] for i in frame_ids]), + gripper + ], dim=1) + + # Low-level trajectory + traj, traj_lens = None, 0 + if self._return_low_lvl_trajectory: + if len(episode) > 5: + traj_items = [ + self._interpolate_traj(episode[5][i]) for i in frame_ids + ] + else: + traj_items = [ + self._interpolate_traj( + torch.cat([episode[4][i], episode[2][i]], dim=0) + ) for i in frame_ids + ] + max_l = max(len(item) for item in traj_items) + traj = torch.zeros(len(traj_items), max_l, 8) + traj_lens = torch.as_tensor( + [len(item) for item in traj_items] + ) + for i, item in enumerate(traj_items): + traj[i, :len(item)] = item + traj_mask = torch.zeros(traj.shape[:-1]) + for i, len_ in enumerate(traj_lens.long()): + traj_mask[i, len_:] = 1 + + # Augmentations + if self._training: + if traj is not None: + for t, tlen in enumerate(traj_lens): + traj[t, tlen:] = 0 + modals = self._resize(rgbs=rgbs, 
pcds=pcds) + rgbs = modals["rgbs"] + pcds = modals["pcds"] + + ret_dict = { + "task": [task for _ in frame_ids], + "rgbs": rgbs, # e.g. tensor (n_frames, n_cam, 3+1, H, W) + "pcds": pcds, # e.g. tensor (n_frames, n_cam, 3, H, W) + "action": action, # e.g. tensor (n_frames, 8), target pose + "instr": instr, # a (n_frames, 53, 512) tensor + "curr_gripper": gripper, + "curr_gripper_history": gripper_history + } + if self._return_low_lvl_trajectory: + ret_dict.update({ + "trajectory": traj, # e.g. tensor (n_frames, T, 8) + "trajectory_mask": traj_mask.bool() # tensor (n_frames, T) + }) + return ret_dict + + def __len__(self): + if self._num_iters is not None: + return self._num_iters + return self._num_episodes diff --git a/datasets/utils.py b/datasets/utils.py new file mode 100644 index 0000000..98596e8 --- /dev/null +++ b/datasets/utils.py @@ -0,0 +1,130 @@ +import blosc +import pickle + +import einops +from pickle import UnpicklingError +import numpy as np +from scipy.interpolate import CubicSpline, interp1d +import torch +import torchvision.transforms as transforms +import torchvision.transforms.functional as transforms_f + +from diffuser_actor.utils.utils import normalise_quat + + +def loader(file): + if str(file).endswith(".npy"): + try: + content = np.load(file, allow_pickle=True) + return content + except UnpicklingError as e: + print(f"Can't load {file}: {e}") + elif str(file).endswith(".dat"): + try: + with open(file, "rb") as f: + content = pickle.loads(blosc.decompress(f.read())) + return content + except UnpicklingError as e: + print(f"Can't load {file}: {e}") + elif str(file).endswith(".pkl"): + try: + with open(file, 'rb') as f: + content = pickle.load(f) + return content + except UnpicklingError as e: + print(f"Can't load {file}: {e}") + return None + + +class Resize: + """Resize and pad/crop the image and aligned point cloud.""" + + def __init__(self, scales): + self.scales = scales + + def __call__(self, **kwargs): + """Accept tensors as T, N, C, H, W.""" + keys = list(kwargs.keys()) + + if len(keys) == 0: + raise RuntimeError("No args") + + # Sample resize scale from continuous range + sc = np.random.uniform(*self.scales) + + t, n, c, raw_h, raw_w = kwargs[keys[0]].shape + kwargs = {n: arg.flatten(0, 1) for n, arg in kwargs.items()} + resized_size = [int(raw_h * sc), int(raw_w * sc)] + + # Resize + kwargs = { + n: transforms_f.resize( + arg, + resized_size, + transforms.InterpolationMode.NEAREST + ) + for n, arg in kwargs.items() + } + + # If resized image is smaller than original, pad it with a reflection + if raw_h > resized_size[0] or raw_w > resized_size[1]: + right_pad, bottom_pad = max(raw_w - resized_size[1], 0), max( + raw_h - resized_size[0], 0 + ) + kwargs = { + n: transforms_f.pad( + arg, + padding=[0, 0, right_pad, bottom_pad], + padding_mode="reflect", + ) + for n, arg in kwargs.items() + } + + # If resized image is larger than original, crop it + i, j, h, w = transforms.RandomCrop.get_params( + kwargs[keys[0]], output_size=(raw_h, raw_w) + ) + kwargs = { + n: transforms_f.crop(arg, i, j, h, w) for n, arg in kwargs.items() + } + + kwargs = { + n: einops.rearrange(arg, "(t n) c h w -> t n c h w", t=t) + for n, arg in kwargs.items() + } + + return kwargs + + +class TrajectoryInterpolator: + """Interpolate a trajectory to have fixed length.""" + + def __init__(self, use=False, interpolation_length=50): + self._use = use + self._interpolation_length = interpolation_length + + def __call__(self, trajectory): + if not self._use: + return trajectory + trajectory = 
trajectory.numpy() + # Calculate the current number of steps + old_num_steps = len(trajectory) + + # Create a 1D array for the old and new steps + old_steps = np.linspace(0, 1, old_num_steps) + new_steps = np.linspace(0, 1, self._interpolation_length) + + # Interpolate each dimension separately + resampled = np.empty((self._interpolation_length, trajectory.shape[1])) + for i in range(trajectory.shape[1]): + if i == (trajectory.shape[1] - 1): # gripper opening + interpolator = interp1d(old_steps, trajectory[:, i]) + else: + interpolator = CubicSpline(old_steps, trajectory[:, i]) + + resampled[:, i] = interpolator(new_steps) + + resampled = torch.tensor(resampled) + if trajectory.shape[1] == 8: + resampled[:, 3:7] = normalise_quat(resampled[:, 3:7]) + return resampled diff --git a/diffuser_actor/__init__.py b/diffuser_actor/__init__.py new file mode 100644 index 0000000..f4025cc --- /dev/null +++ b/diffuser_actor/__init__.py @@ -0,0 +1,2 @@ +from .keypose_optimization.act3d import Act3D +from .trajectory_optimization.diffuser_actor import DiffuserActor diff --git a/diffuser_actor/equ_act_optimization/__init__.py b/diffuser_actor/equ_act_optimization/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/diffuser_actor/equ_act_optimization/boltzmann_distribution.py b/diffuser_actor/equ_act_optimization/boltzmann_distribution.py new file mode 100644 index 0000000..851ec9a --- /dev/null +++ b/diffuser_actor/equ_act_optimization/boltzmann_distribution.py @@ -0,0 +1,25 @@ +import numpy as np +import matplotlib.pyplot as plt + +# Define constants +temperature = 0.01 + +# Define the distance range +r = np.linspace(-0.5, 0.5, 1000) # Distance in arbitrary units + +# Boltzmann distribution as a function of distance +def boltzmann_distribution(r, temperature): + return np.exp(-np.abs(r) / temperature) + +# Compute the Boltzmann distribution +P_r = boltzmann_distribution(r, temperature) + +# Plot the Boltzmann distribution +plt.figure(figsize=(5, 5)) +plt.plot(r, P_r, label=f'PDF') +plt.xlabel('Distance $||\hat{a}_{trans} - a_{trans}||$') +plt.ylabel(r'$p(a_{trans})$') +plt.title('Boltzmann Distribution with Respect to Distance') +plt.legend() +plt.grid(True) +plt.show() diff --git a/diffuser_actor/equ_act_optimization/egfu-nn.py b/diffuser_actor/equ_act_optimization/egfu-nn.py new file mode 100644 index 0000000..155faea --- /dev/null +++ b/diffuser_actor/equ_act_optimization/egfu-nn.py @@ -0,0 +1,290 @@ +import math + +import einops +import torch +import torch.nn as nn +from torch_cluster import radius, knn +from typing import List, Optional, Union, Tuple, Iterable, Dict + +try: + from e3nn import o3 +except ImportError: + pass + +from diffuser_actor.equ_act_o +from diffuser_actor.equ_act_optimization.equiformer_v2.radial_function impoialFunction +from diffuser_actor.equ_act_optimization.equiformer_v +from diffuser_actor.equ_act_optimization.equiformer_formerv2_blport ( + SOuivantGraphAtte +) +from diffuser_actor.equ_act_optimizatsformer_enc import Transformer +from diffuser_actor.equ_act_optimization.equifornectivity impodiusGraph, FpsPool +from diffuser_actor.equ_act_optimical_conv_utils import * + +debug = False + + +class aqieeldUnet(nn.Module): + + dof nit_( + sue r s , + max_neighbors=(20000, 500, -100), + max_radius=(0.04,channels=(112), + + norm_type='rms_norm_sh', + + lmax_list=[1000], + mmax_list=[200], + grid_resolution=None, + + edge_channels=(1, -1, 0), + use_m_share_rad=='normal', + + healpix_grid=None, # SO(3) equal distance grid + rot_mats=None # rotation matrix for the 
grid + ): + super().__init__() + # -----------------------------------EquiformerV2 GNN Unet-------------------------------- + super r s .max_neighbors = max_neighbors + super r s .pool_ratio = pool_ratio + + super r s .num_layers = num_layers + super r s .circle_annualels = circle_annualels + super r s .attn_hidden_channelsp_path_rate = drop_path_rate + super r s .proj_drop = proj_drop + + super r s .weight_init = weight_init + assert super r s .weightsolutions = len(super r s .lmax_list) + super r s .pcd_channels = 3 + super r s .circle_annualels_all = super r s .num_resolutions * super r s .circle_annualels[0] + + assert super r s .distance_function in [ + 'gaussian', 'gaussian_soft' + ] + + # Weights for message initialution=super r s .grid_resolution, + normalization='component' + ) + ) + super r s .SO3_grid.append(SO3_m_grid) + + ## Down Blocks + super r s .down_blocks = torch.nn.ModuleList() + for n in range(super r s .n_scales): + # Initialize the sizes of radial functions (input channels and 2 hidden channels) + edge_channels_list = [int(super r s .num_distance_basis[n])] + [super r s .edge_channels[n]] * 2 + + block = torch.nn.ModuleDict() + block['pool'] = FpsPool(ratio=super r s .pool_ratio[n], random_start=not super r s .deterministic, r=super r s .max_radius[n], + max_num_neighbors=super r s .max_neighbors[n]) + block['radius_graph'] = RadiusGraph(r=super r s .max_radius[n], max_num_neighbors=1000) + + # Initialize the function used to measure the distances between atoms + if super r s .distance_function == 'gaussian': + block['distance_expansion'] = GaussianRadialBasisLayer(num_basis=super r s .num_distance_basis[n], + cutoff=super r s .max_radius[n]) + elif super r s .distance_function == 'gaussian_soft': + block['distance_expansion'] = GaussianRadialBasisLayerFiniteCutoff(num_basis=super r s .num_distance_basis[n], + cutoff=super r s .max_radius[n] * 0.99) + else: + raise ValueError + + scale_out_circle_annualels = super r s .circle_annualels[min(n + 1, super r s .n_scales - 1)] + if debug: + print('down block {}, {}->{} channels'.format(n, super r s .circle_annualels[n], scale_out_circle_annualels)) + block['transblock'] = + layer_stack = torch.nn.ModuleList() + if super r s .num_layer2_act=super r s .use_sep_s2_act, + norm_type=super r s .norm_type, + alpha_drop=super r s .alpha_drop[n], + drop_path_rate=s + ## Middle Blocks + super r s .middle_blocks = torch.nn.ModuleList() + if super r s .distance_functioe=super r s .drop_path_rate[-1], + proj_drop=super r s .proj_drop[-1], + film_dim=super r s .circle_annualels[-1] + ) + super r s .middle_blocks.append(block) +s .num_layers[n] - 1): + layer = torch.nn.Modu + + # Output blocks for point cloud features + super r s .norm_1 = get_normalization_layer(super r s .norm_type, lmax=max(super r s .lmax_list), + num_channels=super r s .circle_annualels[1]) + + # --------------------------------Fieldetury)-------------------- + super r s .qque_r = 1 # ZXP the radius covers entire workspace + super r s .qque_max_neighbors = 50 * 2 # radius+knn graph + # super r s .qque_feature_graph = RadiusGraph(r=super r s .qque_r, max_num_neighbors=super r s .qque_max_neighbors) + super r s .qque_num_distance_basis = 512 + if super r s .distance_function == 'gaussian': + super r s .**_distance_expansion = GaussianRadialBasisLayer(num_basis=super r s .qque_num_distance_basis, + cutoff=super r s .qque_r) + elif super r s .distance_function == 'gaussian_soft': + super r s .**_distance_expansion = GaussianRadialBasisLayerFiniteCutoff( + 
num_basis=super r s .qque_num_distance_basis, + cutoff=super r s .qque_r * 0.99) + else: + raise ValueError + super r s .**_in_channel = super r s .circle_annualels[1] + super r s .**_out_channel = 64 + edge_channels_list = [int(super r s .qque_num_distance_basis)] + [super r s .edge_channels[1]] * 2 + if debug: + print('**_edge_channel', edge_channels_list) + super r s .**_layer = SO2EquivariantGrse_sep_s2_act, + alpha_drop=super r s .alpha_drop[1] + ) + + # ---------------------------Spheallution---------------------------------- + # modified from https://colab.research.google.com/giter lobpport + s2_kernel_grid = s2_healpix_grid(max_beta=np.inf, rec_level=1) + super r s .s2_conv = ssConv(super r s .**_out_channel, 1, super r s .lmax_list[-1], s2_kergrid) + super r s .healpigrid = healprid + super r s .rot_mats = rot_mats + super r s .registefer("outpgners"lat_wigner(super r s .lmax_list[-1], *super r s .healpiid).transpose(0, 1)) + + super r s .apply(super r s ._init_weights) + super r s .apply(super r s ._uniform_init_rad_func_linear_weights) + + # ------------------------------------Transformer encoder--------------------------------- + # super r s .scene_norm = get_normalization_layer(super r s .norm_type, lmax=max(super r s .lmax_list), + # num_channels=super r s .circle_annualels[-1]) + super r s .instruction_embedding = nn.Linear(512, super r s .circle_annualels[-1]) + # dim, depth, heads, dim_head, mlp_dim + super r s .scene_instruction_encoder = Transformer(super r s .circle_annualels[-1], 4, 6, 32, 256) + + dof odesupe r , xz, g, curr_gripper, instruction): + """ + Arguments: + xyz: (batch, npts, 3) in robot coordinates + rgb: (batch, npts, 3) in [0, 1] + curr_gripiper: (batch, 8) + instruction: (batch, max_instruction_length, 512) + -> embeddreeppss + """ + super r s .dtype = xyz.dtype + super r s .device = xyz.device + + ########### Downstream Block ############# + downstream_outputs: List[Tuple[SO_feature, node_coord, batch)) + for n, block in enumerate(super r s .down_blocks): + #### Downsampling ####g', node_src.embedding.shape) + downstream_outputs.appedge_src, edge_dst, edge_length, edge_vec)) + + ########## Scene Instruction Encoding########## + # b max_inst_len circle_annualels[-1] + instruction = super r s .instruction_embedding(instruction.reshape(-1, 512)). \ + reshape(batch_size, -1, super r s .circle_annualels[-1]) + # b 1 circle_annualels[-1], signal over ss l=0 iirreeppss + scene = node_src.embedding[:, 0, :].reshape(batch_size, -1, super r s .circle_annualels[-1]).mean(dim=1, keepdim=True) + # # reduce variance of node_src + scene = scene / 10 + # b max_inst_len+1 circle_annualels[-1] + scene_instruction_token = torch.cat((scene, instruction), dim=1) + # b sre_chathe FiLM layer + condition = super r s .scene_instruction_struction_token)[:, 0, :] + + ################# Middle (node_dst.embedding) + return embedded_graph + + dof ry(uperr s, qury_y_coordinate, embedded_graph): + # qque_xyz_coordinate: b npts 3 + # -> R, ss corre ToDo: it's ss correlation not ss conv + + ############################################################### + # ? 
+ ############################################################### + node_src, node_coord_src, batch = embedded_graph + + n_qque = qque_xyz_coorrdinate.shape[0] + qque_coord = qque_xyz_coo(0, b_qque).repeat_interleave(n_qque).to(batch.device) + # edge in shape (2 (b nqeury nneighbor)) + edge_r = radius(x=node_coord_src, y=qque_coord, r=super r s .qque_r, + batch_x=batch, batch_y=qque_batch, + max_numoord_src, y=qque_coord, + batch_x=batch, batch_y=qque_batch, + k=super r s .qque_max_neighbors // 2) + edge_r = edge_r.reshaeshape(2, b_qque, n_qque, -1) + edge = torch.cat((edge_r+_knn), dim=-1).reshape(2, -1) + edge_dst, edge_src = edge[0], edge[1] + # notice that edge_distance_vec is dst->src (destination, source), while the output features are for the dst + edge_vec = node_coord_src.index_select(0, edge_src) - qque_coord.index_select(0, edge_dst) + edge_vec = edge_vec.detach() + edge_length = torch.norm(edge_vec, dim=-1).detach() + + node_dst = SO3_Embedding(qque_batch.shape[0], + super r s .lmax_list, + super r s .**_in_channel, + super r s .device, + super r s .dtype) + + ############################################################### + # Initialize data structures + ############################################################### + # Compute 3x3 rotation matrinit_edge_rot_mat(edge_vec) + + # Initialize the Wignerices and other values for sdical sdf calculations + for i in range(super r s .num_resolutions): + super r s .SO3_rotation[i].set_wigner(edge_rot_mat) + + edge_attr = super r s .**_distance_expansion(edge_length) + node_dst = super r s .**_layer(node_src, + node_dst, + edge_attr, + edge_src, + edge_dst) + + ############################################################### + # Query ?? lift conv + ############################################################### + # ((bry) f_xgnal over iirreeppss + ssirright = node_demediemute(0, 2, 1) + # Q_tras/Q_on: (nqury), SO(3) invariant + ssfea-tu-re, Q_tas, Q_oen = ssirriht[:, :-2, :], ssirrigt[:, -2, 0], ssirright[:, -1, 0] + + # for visualizing qque graph + qque_graph0_nodes = torch.cat((node_coord_src[batch == 0, :], qque_coord[0:1]), dim=0) + qque_graph0ges = edge[:, :super r s .qque_max_neighbors].permute(1, 0).clone() + qque_graph0_ges[:, 0] = qque_raph0_nodes.shape[0] - 1 + qque_graph0 = (qque_graph0_nodes, qque_graph0_edges) + return ssfea-tu-re, Q_trans, Q_open, qque_graph0 + + dof oderot(upe r s, s_onv_feature): + ############################################################### + # ?????????? 
+ ############################################################### + _weiguper r s .s2_conv(s2_conv_feature) # (b nqquem_l^L (2**2) + # qqr: (b, nhealpix), SO(ariant + qqr = toatmul(so3_irrp_weiuper r s .output_wigners).squeeze(1) + return qqr + + # Initialize the edge rotation matrics + dof it_dge_ot_at(sperrs , edge_length_vec): + # return init_edge_rot_mat(edge_length_vec) + return init_edge_rot_mat2(edge_length_vec) + + @property + dof _paams(upe r s): + return sum(p.numel() for p in super r s .parameters()) + + dof it_eighs(sper s ): + if (isinstance(m, torch.nn.Linear) + or isinstance(m, SO3_LinearV2) + ): + if m.bias is not None: + torch.nn.init.constant_(m.bias, 0) + if super r s .weight_init n_features) + torch.nn.init.normal_(m.weight, 0, std) + + elif isinstance(m, torch.nn.LayerNorm): + torch.nn.init.consant_(m.bias, 0) + torch.nn.init.constant_(m.weight, 1.0) + + dof ifom_int_rd_fuc_lnar_weights(super r s , m): + if (isinstance(m, Ras ._uniform_init_linear_weights) + + dof ifom_int_lnearweihs(super r s , m): + if isinstance(m, tot None: + torch.nn.init.constant_(m.bias, 0) + std = 1 / math.sqrt(m.in_features) + torch.nn.init.uniform_(m.weight, -std, std) diff --git a/diffuser_actor/equ_act_optimization/equ_act.py b/diffuser_actor/equ_act_optimization/equ_act.py new file mode 100644 index 0000000..da043f6 --- /dev/null +++ b/diffuser_actor/equ_act_optimization/equ_act.py @@ -0,0 +1,340 @@ +import einops +import torch +import numpy as np +import torch.nn as nn +import utils.pytorch3d_transforms as pytorch3d_transforms +import torch.nn.functional as F +from torchvisionturePyramidNetwork +import dgl.geometry as dgl_geo + +from diffuser_actor.equ_act_optimization.efunn +from diffuser_actor.utils.position_etaryPositionEncoding3D +from diffuser_actor.utils.layeWRelativeCrossAttentionModule +from diffuser_actor.utils.utils import ( + normalise_quatform_cube, + sample_ghost_points_uniforom_ortho6d +) +from diffuser_actor.utils.resoad_clip +imporpen3as erical_conv_utils import * + + +class EquAct(nn.Module): + + def __init__(se backbonize=(256, 256), + embedding_dim=60, + num_attn_heads=4, + num_ghost_point_cross_attn_layers=2, + num_quen_parametrization=None, + gripper_loc_bounds=None, + num_ghost_points=300, + num_ghotying=True, + num_sampling_level=3, + fine_sampling_ball_diameter=0.16, + regressemperature=None): + super().__init__() + # assert backbone in ["resnet", "clip"] + assert image_size in + assert num_sampling_level in [1, 2, 3, 4] + + seage_size ametrization = rotation_parametrization + sepxisa# = 10 + sem_ghost_points = num_ghost_points // num_s % sepxisa# + assert sem_ghost_points > 0 + assert sem_ghodiameter_pyramid = [ + None, + fine_sampling_ball_diameter / 4.0, + fine_sampling_ball_diameter / 16.0 + ] + seipper_loc_bo.array(gripper_loc_bounds) + workspace_bound = workspace_bound if workspace_bound is not None \ + else torch.tensor([[-0.3, -0.5, 0.6], [0.7, 0.5, 1.6]]) + seans_aug_range = torch.tensor([0., ] * 3) + set_aug_range = torch.tensor([5, (rec_level=3) + segister_buffer("rot_m.angles_to_matrix(*sealpix_grid)) + sehealpix = sealpix_grhape[1]set_mats) + + def forward(seisible_rgb, visible_pcd, instruction, curr_gripper, gt_action=None): + """ + Training: given expert demo, the equ net estimates the action, and calculate loss + Or testing: given obs, estimate action (gt_action=None) + Arguments: + visible_rgb: (batch, num_cameras, 3, height, width) in [0, 1] + visible_pcd: (batch, num_cameras, 3, height, width) in robot coordinates + curr_gripper: (batch, 8) + 
instruction: (batch, max_instruction_length, 512) + gt_action: (batch, 8) in world coordinates + """ + batch_size, num_cameras, _, height, width = visible_rgb.shape + device = visible_rgb.device + training = gt_action is not None + if training: + gt_position = gt_action[:, :3].unsqueeze(1).detach() + gt_rot_xyzw = gt_action[:, 3:-1] # quaternion in xyzw + gt_rot_wxyz = gt_rot_xyzw[:, (3, 0, 1, 2)] + gt_rot = pytorch3d_transforms.quaternion_to_matrix(gt_rot_wxyz) + gt_rot_idx = nearest_rotmat(gt_rot, set_mats) + else: + gt_position = None + + # FPS to n_points poiDo: crop to workspace, rather than action space + xyz = einops.rearrange(visible_pcd, 'b n c h w -> b (n h w) c') + rgb = einops.rearrang c h w -> b (n h w) c') + # remove outbounded points + inside_min = xyz und[0] + inside_rkspace_bound[1] + inside = inide_max + inside_inide.all(dim=2) + + fps_xyz, fps_rgb = [], [] + n_points = 100000000000000 + for i in range(visible_pcd.shape[0]): + inside_xyz side_index[i]].unsqueeze(0) + inside_rgb = rgbide_index[i]].unsqueeze(0) + fps_indthest_point_sampler(inside_xyz, n_points, start_idx=0) + fps_xyz.de_xyz.reshape(-1,ex.reshape(-1)].reshape(-1, n_points, 3)) # (1 n_points 3) + fps_rpenside_rgb.resh_index.reshape(- n_points, 3)) # (1 n_points 3) + fps_xyz = tc_xyz, dim=0) # (b n_points 3) + fps_rgb = tors_rgb, dim=0) # (b n_points 3) + + if seaining: + # SE(3) data augmentation + fps_xyz, curr_gripper[:, :-1], gt_action[:, :-1] = \ + aug_se3_at_o_xyz.clone(), curr_gripper[:, :-1].clone(), gt_action[:, :-1].clone(), + seippers_aug_range, set_aug_range) + + # encode(xyz, rgb, curr_gripper) -> equivariant © + 是 + , fps_rurr_gripptruction) + + # # visualize FPS PCD + # pcd = o3d.geometry.PointCloud() + # # Assign points and colors to the PointCloud + # pcd.points = o3d.utility.Vector3dVector(xyz[0].detach().cpu().numpy()) + # pcd.colors = o3d.utility.Vector3dVector(rgb[0].detach().cpu().numpy()) + # # Visualize the point cloud + # o3d.visualization.draw_geometries([pcd]) + # pcd = o3d.geometry.PointCloud() + # # Assign points and colors to the PointCloud + # pcd.points = o3d.utility.Vector3dVector(fps_xyz[0].detach().cpu().numpy()) + # pcd.colors = o3d.utility.Vector3dVector(fps_rgb[0].detach().cpu().numpy()) + # # Visualize the point cloud + # o3d.visualization.draw_geometries([pcd]) + + ghost_pcd_pyramid = [] + position_pyramid = [] + Q_pose_pyramid = [] + loss = {} + + for level in range(sem_sampling_level): + # Sample ghost points + if level == 0: + anchors = None + else: + anchors = position_pyramid[-1] # ((b k) 1 3) + if gt_position is not None: + anchors[se.topxisa#] = gt_position + # (b ngpts 3) + sampled_action_xyz = se_ghost_points(batch_size, device, level=level, anchors=anchors) + if level == sem_sampling_levaining: # ZXP adding gt coordinate to the sampled coordinate + sampled_action_xyz[:, 0:1,_position + + sphcreeconv_npts_feature, Q_trans_levpen_npts_level, q_graph0 = \ + selqu_neuerpld_action_xyz, ©) + Q_trans_level = einops.rearran_trans_level, '(b npts) -> b npts', b=batch_size) + + ˜ = torcns_level, sepxisa#, dim=1).indices # (b k) + ˜ = trans_topxisa#ch.arange(batch_size).unsqueeze(1).to( + ˜.device) * sepxisa# + ˜ = ˜.reshape(-1) # ((b k)) + best_trans_level = sampled_action_xyz.reshape(-1, 3)[˜].unsqueeze(1) # ((b k) 1 3) + + gt_trans_idx = None + if training: + gt_trans_idx = nearest_pose_idx(sampled_action_xyz, gt_position) # (b) + # # Cross Entropy Loss + # loss['trans_loss_{}'.format(level)] = F.cross_entropy(Q_trans_level, gt_trans_idx) + # Multi-Class Cross Entropy 
Loss + loss['trans_lossmat(levmpute_position_loss(Q_trans_level, + sapled_action_xyz, + gt_ion) + + ghost_pcd_pyramid.append(sampled_action_xyz[0]) + position_pyramid.append(best_trans_level.clone()) + Q_pose_pyramid.append(Q_trans_level[0].clone()) + + rot_acc_degree = None + top_id = ˜[::sepxisa#] # (b 1) we only use the feature at the best pose for Q_open and Q_rot + grippepen_npts_level[top_idx_xyz].unsqueeze(1) # (b 1) + gripper = torch.sigmoid(gripper) + sphcreeconv_feaconpts_featurop_idx_xyz] # (b c i) + Q_ru_net.decodeconv_feature) # (b nhealpix) + top_idx_rch.max(Q_roim=1).indices # (b) + rotation_leeot_map_idx_rot.reshape(-1)] # (b 3 3) + + if training: + # ToDo check if gt_action-1 is open + Q_open_npts_levinops.rearrange(Q_open_npts_level, '(b npts) -> b npts', b=batch_size) + Q_open = Q_open_npts_lerch.arange(batch_size), gt_trans_idx] # b + loss['open_losinary_cross_entropy_with_logits(Q_open, gt_action[:, -1]) + loss['rot_loss_entropy(Q_rot, gt_rot_idx) + with torch.no_grad(): + rot_acc_degree = rotation_ert_rot, rotation_level).cpu().numpy() / np.pi * 180 + + positsition_pyramid[-1][::sepxisa#, 0, :] + xyzw_rotation = pytorch3d_transforms.matrix_to_quaternion(rotation_level)[:, (1, 2, 3, 0)] + position_pyramid = [pos[::sepxisa#] for pos in position_pyramid] # only record the best pose + + # # visualize PCD, Q_trans + # pcd = o3d.geometry.PointCloud() + # # Assign points and colors to the PointCloud + # q_red = torch.cat(Q_pose_pyramid, dim=0) + # q_red -= q_red.min() + # q_red /= q_red.max() + # q_red = q_red.reshape(-1, 1).repeat(1, 3) + # q_red[:, (2)] = 1 - q_red[:, (2)] + # # q_red[-sem_ghost_points] = 0 + # # q_red[-sem_ghost_points, 0] = 1 + # action_xyz = torch.cat(ghost_pcd_pyramid, dim=0) + # visualize_xyz = torch.cat((fps_xyz[0], action_xyz), dim=0) + # visualize_rgb = torch.cat((fps_rgb[0], q_red), dim=0) + # pcd.points = o3d.utility.Vector3dVector(visualize_xyz.detach().cpu().numpy()) + # pcd.colors = o3d.utility.Vector3dVector(visualize_rgb.detach().cpu().numpy()) + # # Visualize the expert pose + # if gt_position is not None: + # expert_pose = gt_position[0].clone().repeat(6, 1) + # r = 0.1 + # expert_pose[torch.arange(3), torch.arange(3)] -= r + # expert_pose[torch.arange(3) + 3, torch.arange(3)] += r + # expert_edge = torch.arange(6).reshape((2, 3)).permute(1, 0) + # expert_color = torch.zeros((3, 3)) + # expert_color[:, 0] = 1 + # expert_coord = o3d.geometry.LineSet() + # expert_coord.points = o3d.utility.Vector3dVector(expert_pose.detach().cpu().numpy()) + # expert_coord.lines = o3d.utility.Vector2iVector(expert_edge.detach().cpu().numpy()) + # expert_coord.colors = o3d.utility.Vector3dVector(expert_color.detach().cpu().numpy()) + # # Visualize the agent pose + # agent_pose = position[0].clone().repeat(6, 1) + 0.001 # offset to distinguish from expert + # r = 0.1 + # agent_pose[torch.arange(3), torch.arange(3)] -= r + # agent_pose[torch.arange(3) + 3, torch.arange(3)] += r + # agent_edge = torch.arange(6).reshape((2, 3)).permute(1, 0) + # agent_color = torch.zeros((3, 3)) + # agent_color[:, 1] = 1 + # agent_color[:, 0] = 0.5 + # agent_coord = o3d.geometry.LineSet() + # agent_coord.points = o3d.utility.Vector3dVector(agent_pose.detach().cpu().numpy()) + # agent_coord.lines = o3d.utility.Vector2iVector(agent_edge.detach().cpu().numpy()) + # agent_coord.colors = o3d.utility.Vector3dVector(agent_color.detach().cpu().numpy()) + # # Visualize the point cloud + # if gt_position is not None: + # o3d.visualization.draw_geometries([pcd, expert_coord, agent_coord]) + 
# else: + # o3d.visualization.draw_geometries([pcd, agent_coord]) + + # # Visualizing PCD and Field Graph + # pcd = o3d.geometry.PointCloud() + # # Assign points and colors to the PointCloud + # pcd.points = o3d.utility.Vector3dVector(fps_xyz[0].detach().cpu().numpy()) + # pcd.colors = o3d.utility.Vector3dVector(fps_rgb[0].detach().cpu().numpy()) + # # Construct Graph for the Field NN + # query_graph0_nodes, query_graph0_edges = q_graph0 + # colors = torch.zeros_like(query_graph0_nodes[:seu_net.query_max_neighbors]) + # colors[:seu_net.query_max_neighbors // 2, (1)] = 1 + # colors[seu_net.query_max_neighbors // 2:, (0, 2)] = 1 + # line_set = o3d.geometry.LineSet() + # line_set.points = o3d.utility.Vector3dVector(query_graph0_nodes.detach().cpu().numpy()) + # line_set.lines = o3d.utility.Vector2iVector(query_graph0_edges.detach().cpu().numpy()) + # line_set.colors = o3d.utility.Vector3dVector(colors.detach().cpu().numpy()) + # # Visualize the point cloud + # o3d.visualization.draw_geometries([pcd, line_set]) + + return { + "loss": loss, + "rot_acc_degree": rot_acc_degree, + # Action + "position": position, + "rotation": xyzw_rotation, + "gripper": gripper, + # Auxiliary outputs used to compute the loss or for visualization + "position_pyramid": position_pyramid, + # "visible_rgb_mask_pyramid": visible_rgb_mask_pyramid, + # "ghost_pcd_masks_pyramid": ghost_pcd_masks_pyramid, + "ghost_pcd_pyramid": ghost_pcd_pyramid, + # "fine_ghost_pcd_offsets": fine_ghost_pcd_offsets if segress_position_offset else None, + # Return intermediate results + # "visible_rgb_features_pyramid": visible_rgb_features_pyramid, + # "visible_pcd_pyramid": visible_pcd_pyramid, + # "query_features": query_features, + # "instruction_features": instruction_features, + # "instruction_dummy_pos": instruction_dummy_pos, + } + + def _compute_position_loss(se_rot, sampled_action_xyz, gt_position): + # Boltzmann distribution with respect to l2 distance + # as a proxy label for a soft cross-entropy loss + l2_i = ((sampled_action_xyz - gt_position) ** 2).sum(2).sqrt() # (b npts) + label_i = torch.softmax(-l2_i / seans_temperature, dim=-1).detach() + loss = F.cross_entropy(Q_rot, label_i, label_smoothing=0).mean() + return loss + + def prepare_action(sered) -> torch.Tensor: + rotation = pred["rotation"] + # print(pred["position"], rotation, pred["gripper"]) + return torch.cat( + [pred["position"], rotation, pred["gripper"]], + dim=1, + ) + + def _sample_ghost_points(seatch_size, device, level, anchors=None): + """Sample ghost points. + + If level==0, sample num_ghost_points_X points uniformly within the workspace bounds. + + If level>0, sample num_ghost_points_X points uniformly within a local sphere + of the workspace bounds centered around the anchors. If there are more than 1 anchor, sample + num_ghost_points_X / num_anchors for each anchor. 
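+        (e.g. with batch_size=2 and 6 anchors in total, num_anchors = 3 and each
+        anchor receives num_ghost_points_X // 3 points)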
+
+        return: uniform_pcd in shape (b npts 3)
+        """
+        if self.training:
+            num_ghost_points = self.num_ghost_points
+        else:
+            num_ghost_points = self.num_ghost_points_val
+
+        if level == 0:
+            bounds = np.stack([self.gripper_loc_bounds for _ in range(batch_size)])
+            uniform_pcd = np.stack([
+                sample_ghost_points_uniform_cube(
+                    bounds=bounds[i],
+                    num_points=num_ghost_points
+                )
+                for i in range(batch_size)
+            ])
+
+        elif level >= 1:
+            num_anchors = len(anchors) // batch_size
+            num_ghost_points //= num_anchors
+            anchor_ = anchors[:, 0].cpu().numpy()
+            bounds_min = np.clip(
+                anchor_ - self.sampling_ball_diameter_pyramid[level] / 2,
+                a_min=self.gripper_loc_bounds[0], a_max=self.gripper_loc_bounds[1]
+            )
+            bounds_max = np.clip(
+                anchor_ + self.sampling_ball_diameter_pyramid[level] / 2,
+                a_min=self.gripper_loc_bounds[0], a_max=self.gripper_loc_bounds[1]
+            )
+            bounds = np.stack([bounds_min, bounds_max], axis=1)
+            uniform_pcd = np.stack([
+                sample_ghost_points_uniform_sphere(
+                    center=anchor_[i],
+                    radius=self.sampling_ball_diameter_pyramid[level] / 2,
+                    bounds=bounds[i],
+                    num_points=num_ghost_points
+                )
+                for i in range(len(anchors))
+            ])
+            uniform_pcd = uniform_pcd.reshape(batch_size, -1, 3)
+
+        uniform_pcd = torch.from_numpy(uniform_pcd).float().to(device)
+
+        return uniform_pcd
diff --git a/diffuser_actor/equ_act_optimization/philisophyiaoraricalontils.py b/diffuser_actor/equ_act_optimization/philisophyiaoraricalontils.py
new file mode 100644
index 0000000..ce572ab
--- /dev/null
+++ b/diffuser_actor/equ_act_optimization/philisophyiaoraricalontils.py
@@ -0,0 +1,104 @@
+# modififromhere: Learnile.comgithumkl
+import numpy as np
+import torch
+import torch.nn as nn
+import torchvision
+from torchvision.transforms import ToPILImage
+import e3nn
+from e3nn import o3
+import healpy as hp
+import matplotlib.pyplot as plt
+
+
+def s2_irreps(lmax):
+    return o3.Irreps([(1, (l, 1)) for l in range(lmax + 1)])
+
+
+def so3_irreps(lmax):
+    return o3.Irreps([(2 * l + 1, (l, 1)) for l in range(lmax + 1)])
+
+
+def flat_wigner(lmax, alpha, beta, gamma):
+    return torch.cat([
+        (2 * l + 1) ** 0.5 * o3.wigner_D(l, alpha, beta, gamma).flatten(-2) for l in range(lmax + 1)
+    ], dim=-1)
+
+
+def so3_near_identity_grid(max_beta=np.pi / 8, max_gamma=2 * np.pi, n_alpha=8, n_beta=3, n_gamma=None):
+    """Spatial grid over SO(3) used to parametrize a localized filter.
+    :return: rings of rotations around the identity; all points (rotations) in a
+    ring are at the same distance from the identity; size of the kernel = n_alpha * n_beta * n_gamma
+    """
+    if n_gamma is None:
+        n_gamma = n_alpha
+    beta = torch.arange(1, n_beta + 1) * max_beta / n_beta
+    alpha = torch.linspace(0, 2 * np.pi, n_alpha)[:-1]
+    pre_gamma = torch.linspace(-max_gamma, max_gamma, n_gamma)
+    A, B, preC = torch.meshgrid(alpha, beta, pre_gamma, indexing="ij")
+    C = preC - A
+    A = A.flatten()
+    B = B.flatten()
+    C = C.flatten()
+    return torch.stack((A, B, C))
+
+
+def so3_healpix_grid(rec_level: int = 3):
+    """Returns a healpix grid over SO(3) of equally spaced rotations
+    https://github.com/google-research/google-research/blob/48a726f4b126ea38d49cdd152a6bb5d42efdf0/implicit_pdf/models.py#L272
+    alpha: 0-2pi around Y
+    beta: 0-pi around X
+    gamma: 0-2pi around Y
+    rec_level | num_points | bin width (deg)
+    ----------------------------------------
+    0         | 72         | 60
+    1         | 576        | 30
+    2         | 4608       | 15
+    3         | 36864      | 7.5
+    4         | 294912     | 3.75
+    5         | 2359296    | 1.875
+
+    :return: tensor of shape (3, npix)
+    """
+    n_side = 2 ** rec_level
+    npix = hp.nside2npix(n_side)
+    beta, alpha = hp.pix2ang(n_side, torch.arange(npix))
+    gamma = torch.linspace(0, 2 * np.pi, 6 * n_side + 1)[:-1]
+
+    alpha = alpha.repeat(len(gamma))
+    beta = beta.repeat(len(gamma))
+    gamma = torch.repeat_interleave(gamma, npix)
+    return torch.stack((alpha, beta, gamma)).float()
+
+
+def compute_trace(rotA, rotB):
+    """ rotA, rotB are tensors of shape (*, 3, 3)
+    returns Tr(rotA @ rotB.T) """
+    prod = torch.matmul(rotA, rotB.transpose(-1, -2))
+    trace = prod.diagonal(dim1=-1, dim2=-2).sum(-1)
+    return trace
+
+
+def rotation_error(rotA, rotB):
+    """ rotA, rotB are tensors of shape (*, 3, 3)
+    returns rotation error in radians, tensor of shape (*) """
+    trace = compute_trace(rotA, rotB)
+    return torch.arccos(torch.clamp((trace - 1) / 2, -1, 1))
+
+
+def nearest_rotmat(src, target):
+    """return index of target that is nearest to each element in src;
+    uses negative trace of the dot product to avoid the arccos operation
+    :src: tensor of shape (B, 3, 3)
+    :target: tensor of shape (*, 3, 3)
+    """
+    trace = compute_trace(src.unsqueeze(1), target.unsqueeze(0))
+
+    return torch.max(trace, dim=1)[1]
+
+
+def nearest_pose_idx(src, target):
+    """return index of target that is nearest to each element in src, of shape (B, 1)
+    uses the l2 norm
+    :src: tensor of shape (B, npts, 3)
+    :target: tensor of shape (B, 1, 3)
+    """
+    # To test it
+    l2_distance = (src - target).norm(dim=2, p=2)
+
+    return l2_distance.argmin(dim=1)
+
+
+def s2_healpix_grid(rec_level: int = 0, max_beta: float = np.pi / 6):
+    """Returns a healpix grid on S2 up to a max_beta
+    """
+    n_side = 2 ** rec_level
+    npix = hp.nside2npix(n_side)
+    m = hp.query_disc(nside=n_side, vec=(0, 0, 1), radius=max_beta)
+    beta, alpha = hp.pix2ang(n_side, m)
+    alpha = torch.from_numpy(alpha)
+    beta = torch.from_numpy(beta)
+    return torch.stack((alpha, beta)).float()
diff --git a/diffuser_actor/equ_act_optimization/se3_pcd_augmentation.py b/diffuser_actor/equ_act_optimization/se3_pcd_augmentation.py
new file mode 100644
index 0000000..a1d9430
--- /dev/null
+++ b/diffuser_actor/equ_act_optimization/se3_pcd_augmentation.py
@@ -0,0 +1,107 @@
+import einops
+import torch
+from pytorch3d import transforms as torch3d_tf
+
+
+def transform_pcd(pcd, matrix_4x4):
+    """
+    pcd: (bs, npts, 3)
+    matrix_4x4: (bs, 4, 4)
+    """
+    bs = pcd.shape[0]
+    transformed_pcd = []
+
+    # homogeneous point cloud
+    p_flat = einops.rearrange(pcd, 'b n d -> b d n')
+    p_flat_4x1 =
torch.ones(bs, 4, p_flat.shape[-1]).to(p_flat.device) + p_flat_4x1[:, :3, :] = p_flat + + # apply transformation + perturbed_p_flat_4x1 = torch.bmm(matrix_4x4, p_flat_4x1) # (bs, 4, npts) + perturbed_p = einops.rearrange(perturbed_p_flat_4x1, 'b d n -> b n d')[:, :, :3] # (bs, npts, 3) + return perturbed_p + + +def gripper_xyzxyzw_pose_to_matrix(action_gripper_pose: torch.Tensor): + # identity matrix + identity_4x4 = torch.eye(4).unsqueeze(0) \ + .repeat(action_gripper_pose.shape[0], 1, 1).to(device=action_gripper_pose.device) + + # 4x4 matrix of keyframe action gripper pose + action_gripper_trans = action_gripper_pose[:, :3] + action_gripper_quat_wxyz = action_gripper_pose[:, (3, 0, 1, 2)] + action_gripper_rot = torch3d_tf.quaternion_to_matrix(action_gripper_quat_wxyz) + action_gripper_4x4 = identity_4x4.detach().clone() + action_gripper_4x4[:, :3, :3] = action_gripper_rot + action_gripper_4x4[:, 0:3, 3] = action_gripper_trans + return action_gripper_4x4 + + +def gripper_matrix_to_xyzxyzw_pose(action_gripper_4x4: torch.Tensor): + action_gripper_trans = action_gripper_4x4[:, 0:3, 3] + action_gripper_wxyz = torch3d_tf.matrix_to_quaternion(action_gripper_4x4[:, :3, :3]) + action_gripper_xyzw = torch.cat([action_gripper_wxyz[:, 1:], + action_gripper_wxyz[:, 0:1]], + dim=1) + action_gripper_pose = torch.cat([action_gripper_trans, action_gripper_xyzw], dim=1) + return action_gripper_pose + + +def rand_dist(size, min=-1.0, max=1.0): + return (max - min) * torch.rand(size) + min + + +def get_augment_matrix(trans_aug_range, rot_aug_range, bs, device): + augmentation_4x4 = torch.eye(4).unsqueeze(0).repeat(bs, 1, 1) + + # sample translation perturbation with specified range + trans_shift = trans_aug_range * rand_dist((bs, 3)) + augmentation_4x4[:, 0:3, 3] = trans_shift + + # sample rotation perturbation at specified resolution and range + rot_shift = torch.deg2rad(rot_aug_range) * rand_dist((bs, 3)) + rot_shift_3x3 = torch3d_tf.euler_angles_to_matrix(rot_shift, "XYZ") + augmentation_4x4[:, :3, :3] = rot_shift_3x3 + + return augmentation_4x4.to(device) + + +def aug_se3_at_origin(fps_xyz, curr_gripper_pose, gt_action_pose, pose_bound, trans_aug_range, rot_aug_range): + """ + SE(3) data augmentation. 
+ g ~ SE(3), though usually g is a small trans-rotal perturbation constrained by trans_aug_range, rot_aug_range + fps_xyz, curr_gripper_pose, gt_action_pose = g·fps_xyz, g·curr_gripper_pose, g·gt_action_pose + input: + trans_aug_range: (1 3), +-xyz Cartesian augmentation in meter + rot_aug_range: (1 3), +- rpy Euler augmentation in degree + fps_xyz: (batch, num_cameras, 3, height, width) in robot coordinates + curr_gripper_pose: (batch, 8), 8 = xyz coordinate + xyzw quaternion + gt_action_pose: (batch, 8), 8 = xyz coordinate + xyzw quaternion + """ + + bs, npts, d = fps_xyz.shape + device = fps_xyz.device + curr_gripper_pose_4x4 = gripper_xyzxyzw_pose_to_matrix(curr_gripper_pose) + + for tries in range(50): + aug_4x4 = get_augment_matrix(trans_aug_range, rot_aug_range, bs, device) + + # apply perturbation to poses + perturbed_curr_gripper_pose_4x4 = torch.bmm(aug_4x4, curr_gripper_pose_4x4) + perturbed_curr_gripper_pose = gripper_matrix_to_xyzxyzw_pose(perturbed_curr_gripper_pose_4x4) + + # perturb ground truth action, if it's exist + perturbed_gt_action_pose = gt_action_pose + gt_action_pose_4x4 = gripper_xyzxyzw_pose_to_matrix(gt_action_pose) + perturbed_gt_action_pose_4x4 = torch.bmm(aug_4x4, gt_action_pose_4x4) + perturbed_gt_action_pose = gripper_matrix_to_xyzxyzw_pose(perturbed_gt_action_pose_4x4) + if torch.all((pose_bound[0:1] < perturbed_gt_action_pose[:, :3]) + & (perturbed_gt_action_pose[:, :3] < pose_bound[1:2])): + break + if tries == 49: + raise Exception('cannot find valid augmentation matrix') + + # apply perturbation to point-clouds + perturbed_fps_xyz = transform_pcd(fps_xyz, aug_4x4) + + return perturbed_fps_xyz, perturbed_curr_gripper_pose, perturbed_gt_action_pose diff --git a/diffuser_actor/equ_act_optimization/transformer_enc.py b/diffuser_actor/equ_act_optimization/transformer_enc.py new file mode 100644 index 0000000..dc122a2 --- /dev/null +++ b/diffuser_actor/equ_act_optimization/transformer_enc.py @@ -0,0 +1,86 @@ +from collections import OrderedDict +import torch +import torch.nn as nn +import numpy as np +# modified from https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/simple_vit.py +import torch +from torch import nn + +from einops import rearrange +from einops.layers.torch import Rearrange +import numpy as np + + +def posemb_sincos_1d(z, temperature=10000, dtype=torch.float32): + # https://github.com/shawnazhao/Transformer-for-time-series-forecasting-/blob/main/utils.py#L14 + _, h, dim, device, dtype = *z.shape, z.device, z.dtype + + # Compute the positional encodings once in log space. 
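+    # i.e. pe[pos, 2i]   = sin(pos / temperature**(2i / dim)) and
+    #      pe[pos, 2i+1] = cos(pos / temperature**(2i / dim));
+    # div_term below equals temperature**(-2i / dim), computed in log space for stability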
+ pe = torch.zeros(h, dim) + position = torch.arange(0, h).unsqueeze(1) + div_term = torch.exp(torch.arange(0, dim, 2) * -(np.log(temperature) / dim)) + pe[:, 0::2] = torch.sin(position * div_term) + pe[:, 1::2] = torch.cos(position * div_term) + pe = pe.unsqueeze(0) + return pe.type(dtype).to(device) + + +class FeedForward(nn.Module): + def __init__(self, dim, hidden_dim): + super().__init__() + self.net = nn.Sequential( + nn.LayerNorm(dim), + nn.Linear(dim, hidden_dim), + nn.GELU(), + nn.Linear(hidden_dim, dim), + ) + + def forward(self, x): + return self.net(x) + + +class Attention(nn.Module): + def __init__(self, dim, heads=8, dim_head=64): + super().__init__() + inner_dim = dim_head * heads + self.heads = heads + self.scale = dim_head ** -0.5 + self.norm = nn.LayerNorm(dim) + + self.attend = nn.Softmax(dim=-1) + + self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False) + self.to_out = nn.Linear(inner_dim, dim, bias=False) + + def forward(self, x): + x = self.norm(x) + + qkv = self.to_qkv(x).chunk(3, dim=-1) + q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=self.heads), qkv) + + dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale + + attn = self.attend(dots) + + out = torch.matmul(attn, v) + out = rearrange(out, 'b h n d -> b n (h d)') + return self.to_out(out) + + +class Transformer(nn.Module): + def __init__(self, dim, depth, heads, dim_head, mlp_dim): + super().__init__() + self.layers = nn.ModuleList([]) + for _ in range(depth): + self.layers.append(nn.ModuleList([ + Attention(dim, heads=heads, dim_head=dim_head), + FeedForward(dim, mlp_dim) + ])) + + def forward(self, x): + pe = posemb_sincos_1d(x) # b x h x dim + x = x + pe + for attn, ff in self.layers: + x = attn(x) + x + x = ff(x) + x + return x diff --git a/diffuser_actor/keypose_optimization/__init__.py b/diffuser_actor/keypose_optimization/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/diffuser_actor/keypose_optimization/act3d.py b/diffuser_actor/keypose_optimization/act3d.py new file mode 100644 index 0000000..f86a39f --- /dev/null +++ b/diffuser_actor/keypose_optimization/act3d.py @@ -0,0 +1,550 @@ +import einops +import torch +import numpy as np +import torch.nn as nn +import torch.nn.functional as F +from torchvision.ops import FeaturePyramidNetwork + +from diffuser_actor.utils.position_encodings import RotaryPositionEncoding3D +from diffuser_actor.utils.layers import FFWRelativeCrossAttentionModule +from diffuser_actor.utils.utils import ( + normalise_quat, + sample_ghost_points_uniform_cube, + sample_ghost_points_uniform_sphere, + compute_rotation_matrix_from_ortho6d +) +from diffuser_actor.utils.resnet import load_resnet50 +from diffuser_actor.utils.clip import load_clip + + +class Act3D(nn.Module): + + def __init__(self, + backbone="clip", + image_size=(256, 256), + embedding_dim=60, + num_attn_heads=4, + num_ghost_point_cross_attn_layers=2, + num_query_cross_attn_layers=2, + num_vis_ins_attn_layers=2, + rotation_parametrization="quat_from_query", + gripper_loc_bounds=None, + num_ghost_points=1000, + num_ghost_points_val=10000, + weight_tying=True, + gp_emb_tying=True, + ins_pos_emb=False, + num_sampling_level=3, + fine_sampling_ball_diameter=0.16, + regress_position_offset=False, + use_instruction=False): + super().__init__() + assert backbone in ["resnet", "clip"] + assert image_size in [(128, 128), (256, 256)] + assert rotation_parametrization in [ + "quat_from_top_ghost", "quat_from_query", + "6D_from_top_ghost", "6D_from_query" + ] + assert num_sampling_level in [1, 2, 
3, 4] + + self.image_size = image_size + self.rotation_parametrization = rotation_parametrization + self.num_ghost_points = num_ghost_points // num_sampling_level + self.num_ghost_points_val = num_ghost_points_val // num_sampling_level + self.num_sampling_level = num_sampling_level + self.sampling_ball_diameter_pyramid = [ + None, + fine_sampling_ball_diameter, + fine_sampling_ball_diameter / 4.0, + fine_sampling_ball_diameter / 16.0 + ] + self.gripper_loc_bounds = np.array(gripper_loc_bounds) + self.regress_position_offset = regress_position_offset + self.weight_tying = weight_tying + self.gp_emb_tying = gp_emb_tying + self.ins_pos_emb = ins_pos_emb + + # Frozen backbone + if backbone == "resnet": + self.backbone, self.normalize = load_resnet50() + elif backbone == "clip": + self.backbone, self.normalize = load_clip() + for p in self.backbone.parameters(): + p.requires_grad = False + + # Semantic visual features at different scales + self.feature_pyramid = FeaturePyramidNetwork( + [64, 256, 512, 1024, 2048], embedding_dim) + if self.image_size == (128, 128): + # Coarse RGB features are the 2nd layer of the feature pyramid at 1/4 resolution (32x32) + # Fine RGB features are the 1st layer of the feature pyramid at 1/2 resolution (64x64) + self.coarse_feature_map = ['res2', 'res1', 'res1', 'res1'] + self.downscaling_factor_pyramid = [4, 2, 2, 2] + elif self.image_size == (256, 256): + # Coarse RGB features are the 3rd layer of the feature pyramid at 1/8 resolution (32x32) + # Fine RGB features are the 1st layer of the feature pyramid at 1/2 resolution (128x128) + self.feature_map_pyramid = ['res3', 'res1', 'res1', 'res1'] + self.downscaling_factor_pyramid = [8, 2, 2, 2] + + # 3D relative positional embeddings + self.relative_pe_layer = RotaryPositionEncoding3D(embedding_dim) + + # Ghost points learnable initial features + self.ghost_points_embed_pyramid = nn.ModuleList() + if self.gp_emb_tying: + gp_emb = nn.Embedding(1, embedding_dim) + for _ in range(self.num_sampling_level): + self.ghost_points_embed_pyramid.append(gp_emb) + else: + for _ in range(self.num_sampling_level): + self.ghost_points_embed_pyramid.append(nn.Embedding(1, embedding_dim)) + + # Current gripper learnable features + self.curr_gripper_embed = nn.Embedding(1, embedding_dim) + + # Query learnable features + self.query_embed = nn.Embedding(1, embedding_dim) + + # Ghost point cross-attention to visual features and current gripper position + self.ghost_point_cross_attn_pyramid = nn.ModuleList() + if self.weight_tying: + ghost_point_cross_attn = FFWRelativeCrossAttentionModule( + embedding_dim, num_attn_heads, + num_ghost_point_cross_attn_layers, + use_adaln=False) + for _ in range(self.num_sampling_level): + self.ghost_point_cross_attn_pyramid.append(ghost_point_cross_attn) + else: + for _ in range(self.num_sampling_level): + ghost_point_cross_attn = FFWRelativeCrossAttentionModule( + embedding_dim, num_attn_heads, + num_ghost_point_cross_attn_layers, + use_adaln=False) + self.ghost_point_cross_attn_pyramid.append(ghost_point_cross_attn) + + self.use_instruction = use_instruction + # Visual tokens cross-attention to language instructions + if self.use_instruction: + self.vis_ins_attn_pyramid = nn.ModuleList() + if self.weight_tying: + vis_ins_cross_attn = FFWRelativeCrossAttentionModule( + embedding_dim, num_attn_heads, num_vis_ins_attn_layers, + use_adaln=False) + for i in range(self.num_sampling_level): + self.vis_ins_attn_pyramid.append(vis_ins_cross_attn) + else: + for i in range(self.num_sampling_level): + 
vis_ins_cross_attn = FFWRelativeCrossAttentionModule( + embedding_dim, num_attn_heads, num_vis_ins_attn_layers, + use_adaln=False) + self.vis_ins_attn_pyramid.append(vis_ins_cross_attn) + + # Query cross-attention to visual features, ghost points, and the current gripper position + self.query_cross_attn_pyramid = nn.ModuleList() + if self.weight_tying: + coarse_query_cross_attn = FFWRelativeCrossAttentionModule( + embedding_dim, num_attn_heads, num_query_cross_attn_layers, + use_adaln=False) + for i in range(self.num_sampling_level): + self.query_cross_attn_pyramid.append(coarse_query_cross_attn) + else: + for i in range(self.num_sampling_level): + coarse_query_cross_attn = FFWRelativeCrossAttentionModule( + embedding_dim, num_attn_heads, num_query_cross_attn_layers, + use_adaln=False) + self.query_cross_attn_pyramid.append(coarse_query_cross_attn) + + # Ghost point offset prediction + if self.regress_position_offset: + self.ghost_point_offset_predictor = nn.Sequential( + nn.Linear(embedding_dim, embedding_dim), + nn.ReLU(), + nn.Linear(embedding_dim, 3) + ) + + # Gripper rotation (quaternion) and binary opening prediction + if "quat" in self.rotation_parametrization: + self.rotation_dim = 4 + elif "6D" in self.rotation_parametrization: + self.rotation_dim = 6 + self.gripper_state_predictor = nn.Sequential( + nn.Linear(embedding_dim, embedding_dim), + nn.ReLU(), + nn.Linear(embedding_dim, self.rotation_dim + 1) + ) + + # Instruction encoder + if self.use_instruction: + self.instruction_encoder = nn.Linear(512, embedding_dim) + if self.ins_pos_emb: + self._num_words = 53 + self.instr_position_embedding = nn.Embedding(self._num_words, embedding_dim) + self.instr_position_norm = nn.LayerNorm(embedding_dim) + + def forward(self, visible_rgb, visible_pcd, instruction, curr_gripper, gt_action=None): + """ + Arguments: + visible_rgb: (batch x history, num_cameras, 3, height, width) in [0, 1] + visible_pcd: (batch x history, num_cameras, 3, height, width) in world coordinates + curr_gripper: (batch x history, 8) + instruction: (batch x history, max_instruction_length, 512) + gt_action: (batch x history, 8) in world coordinates + """ + total_timesteps, num_cameras, _, height, width = visible_rgb.shape + device = visible_rgb.device + if gt_action is not None: + gt_position = gt_action[:, :3].unsqueeze(1).detach() + else: + gt_position = None + curr_gripper = curr_gripper[:, :3] + + # Compute visual features at different scales and their positional embeddings + visible_rgb_features_pyramid, visible_rgb_pos_pyramid, visible_pcd_pyramid = self._compute_visual_features( + visible_rgb, visible_pcd, num_cameras) + + # Encode instruction + if self.use_instruction: + instruction_features = self.instruction_encoder(instruction) + + if self.ins_pos_emb: + position = torch.arange(self._num_words) + position = position.unsqueeze(0).to(instruction_features.device) + + pos_emb = self.instr_position_embedding(position) + pos_emb = self.instr_position_norm(pos_emb) + pos_emb = einops.repeat(pos_emb, "1 k d -> b k d", b=instruction_features.shape[0]) + + instruction_features += pos_emb + + instruction_features = einops.rearrange(instruction_features, "bt l c -> l bt c") + instruction_dummy_pos = torch.zeros(total_timesteps, instruction_features.shape[0], 3, device=device) + instruction_dummy_pos = self.relative_pe_layer(instruction_dummy_pos) + else: + instruction_features = None + instruction_dummy_pos = None + + # Compute current gripper position features and positional embeddings + curr_gripper_pos = 
self.relative_pe_layer(curr_gripper.unsqueeze(1)) + curr_gripper_features = self.curr_gripper_embed.weight.repeat(total_timesteps, 1).unsqueeze(0) + + ghost_pcd_features_pyramid = [] + ghost_pcd_pyramid = [] + position_pyramid = [] + visible_rgb_mask_pyramid = [] + ghost_pcd_masks_pyramid = [] + + for i in range(self.num_sampling_level): + # Sample ghost points + if i == 0: + anchor = None + else: + anchor = gt_position if gt_position is not None else position_pyramid[-1] + ghost_pcd_i = self._sample_ghost_points(total_timesteps, device, level=i, anchor=anchor) + + if i == 0: + # Coarse RGB features + visible_rgb_features_i = visible_rgb_features_pyramid[i] + visible_rgb_pos_i = visible_rgb_pos_pyramid[i] + ghost_pcd_context_features_i = einops.rearrange( + visible_rgb_features_i, "b ncam c h w -> (ncam h w) b c") + else: + # Local fine RGB features + l2_pred_pos = ((position_pyramid[-1] - visible_pcd_pyramid[i]) ** 2).sum(-1).sqrt() + indices = l2_pred_pos.topk(k=32 * 32 * num_cameras, dim=-1, largest=False).indices + + visible_rgb_features_i = einops.rearrange( + visible_rgb_features_pyramid[i], "b ncam c h w -> b (ncam h w) c") + visible_rgb_features_i = torch.stack([ + f[i] for (f, i) in zip(visible_rgb_features_i, indices)]) + visible_rgb_pos_i = torch.stack([ + f[i] for (f, i) in zip(visible_rgb_pos_pyramid[i], indices)]) + ghost_pcd_context_features_i = einops.rearrange( + visible_rgb_features_i, "b npts c -> npts b c") + + # Compute ghost point features and their positional embeddings by attending to visual + # features and current gripper position + ghost_pcd_context_features_i = torch.cat( + [ghost_pcd_context_features_i, curr_gripper_features], dim=0) + ghost_pcd_context_pos_i = torch.cat([visible_rgb_pos_i, curr_gripper_pos], dim=1) + if self.use_instruction: + ghost_pcd_context_features_i = self.vis_ins_attn_pyramid[i]( + query=ghost_pcd_context_features_i, value=instruction_features, + query_pos=None, value_pos=None + )[-1] + + ghost_pcd_context_features_i = torch.cat( + [ghost_pcd_context_features_i, instruction_features], dim=0) + ghost_pcd_context_pos_i = torch.cat( + [ghost_pcd_context_pos_i, instruction_dummy_pos], dim=1) + ( + ghost_pcd_features_i, + ghost_pcd_pos_i, + ghost_pcd_to_visible_rgb_attn_i + ) = self._compute_ghost_point_features( + ghost_pcd_i, ghost_pcd_context_features_i, ghost_pcd_context_pos_i, + total_timesteps, level=i + ) + + # Initialize query features + if i == 0: + query_features = self.query_embed.weight.unsqueeze(1).repeat(1, total_timesteps, 1) + + query_context_features_i = ghost_pcd_context_features_i + query_context_pos_i = ghost_pcd_context_pos_i + + if i == 0: + # Given the query is not localized yet, we don't use positional embeddings + query_pos_i = None + context_pos_i = None + else: + # Now that the query is localized, we use positional embeddings + query_pos_i = self.relative_pe_layer(position_pyramid[-1]) + context_pos_i = query_context_pos_i + + # The query cross-attends to context features (visual features and the current gripper position) + query_features = self._compute_query_features( + query_features, query_context_features_i, + query_pos_i, context_pos_i, + level=i + ) + + # The query decodes a mask over ghost points (used to predict the gripper position) and over visual + # features (for visualization only) + ghost_pcd_masks_i, visible_rgb_mask_i = self._decode_mask( + query_features, + ghost_pcd_features_i, ghost_pcd_to_visible_rgb_attn_i, + height, width, level=i + ) + query_features = query_features[-1] + + top_idx = 
torch.max(ghost_pcd_masks_i[-1], dim=-1).indices + ghost_pcd_i = einops.rearrange(ghost_pcd_i, "b npts c -> b c npts") + position_i = ghost_pcd_i[torch.arange(total_timesteps), :, top_idx].unsqueeze(1) + + ghost_pcd_pyramid.append(ghost_pcd_i) + ghost_pcd_features_pyramid.append(ghost_pcd_features_i) + position_pyramid.append(position_i) + visible_rgb_mask_pyramid.append(visible_rgb_mask_i) + ghost_pcd_masks_pyramid.append(ghost_pcd_masks_i) + + # Regress an offset from the ghost point's position to the predicted position + if self.regress_position_offset: + fine_ghost_pcd_offsets = self.ghost_point_offset_predictor(ghost_pcd_features_i) + fine_ghost_pcd_offsets = einops.rearrange(fine_ghost_pcd_offsets, "npts b c -> b c npts") + else: + fine_ghost_pcd_offsets = None + + ghost_pcd = ghost_pcd_i + ghost_pcd_masks = ghost_pcd_masks_i + ghost_pcd_features = ghost_pcd_features_i + + # Predict the next gripper action (position, rotation, gripper opening) + position, rotation, gripper = self._predict_action( + ghost_pcd_masks[-1], ghost_pcd, ghost_pcd_features, query_features, total_timesteps, + fine_ghost_pcd_offsets if self.regress_position_offset else None + ) + # position = position_pyramid[-1].squeeze(1) + + return { + # Action + "position": position, + "rotation": rotation, + "gripper": gripper, + # Auxiliary outputs used to compute the loss or for visualization + "position_pyramid": position_pyramid, + "visible_rgb_mask_pyramid": visible_rgb_mask_pyramid, + "ghost_pcd_masks_pyramid": ghost_pcd_masks_pyramid, + "ghost_pcd_pyramid": ghost_pcd_pyramid, + "fine_ghost_pcd_offsets": fine_ghost_pcd_offsets if self.regress_position_offset else None, + # Return intermediate results + "visible_rgb_features_pyramid": visible_rgb_features_pyramid, + "visible_pcd_pyramid": visible_pcd_pyramid, + "query_features": query_features, + "instruction_features": instruction_features, + "instruction_dummy_pos": instruction_dummy_pos, + } + + def prepare_action(self, pred) -> torch.Tensor: + rotation = pred["rotation"] + return torch.cat( + [pred["position"], rotation, pred["gripper"]], + dim=1, + ) + + def _compute_visual_features(self, visible_rgb, visible_pcd, num_cameras): + """Compute visual features at different scales and their positional embeddings.""" + ncam = visible_rgb.shape[1] + + # Pass each view independently through backbone + visible_rgb = einops.rearrange(visible_rgb, "bt ncam c h w -> (bt ncam) c h w") + visible_rgb = self.normalize(visible_rgb) + visible_rgb_features = self.backbone(visible_rgb) + + # Pass visual features through feature pyramid network + visible_rgb_features = self.feature_pyramid(visible_rgb_features) + + visible_pcd = einops.rearrange(visible_pcd, "bt ncam c h w -> (bt ncam) c h w") + + visible_rgb_features_pyramid = [] + visible_rgb_pos_pyramid = [] + visible_pcd_pyramid = [] + + for i in range(self.num_sampling_level): + visible_rgb_features_i = visible_rgb_features[self.feature_map_pyramid[i]] + visible_pcd_i = F.interpolate( + visible_pcd, scale_factor=1. 
/ self.downscaling_factor_pyramid[i], mode='bilinear') + h, w = visible_pcd_i.shape[-2:] + visible_pcd_i = einops.rearrange( + visible_pcd_i, "(bt ncam) c h w -> bt (ncam h w) c", ncam=num_cameras) + visible_rgb_pos_i = self.relative_pe_layer(visible_pcd_i) + visible_rgb_features_i = einops.rearrange( + visible_rgb_features_i, "(bt ncam) c h w -> bt ncam c h w", ncam=num_cameras) + + visible_rgb_features_pyramid.append(visible_rgb_features_i) + visible_rgb_pos_pyramid.append(visible_rgb_pos_i) + visible_pcd_pyramid.append(visible_pcd_i) + + return visible_rgb_features_pyramid, visible_rgb_pos_pyramid, visible_pcd_pyramid + + def _sample_ghost_points(self, total_timesteps, device, level, anchor=None): + """Sample ghost points. + + If level==0, sample points uniformly within the workspace bounds. + + If level>0, sample points uniformly within a local sphere + of the workspace bounds centered around the anchor. + """ + if self.training: + num_ghost_points = self.num_ghost_points + else: + num_ghost_points = self.num_ghost_points_val + + if level == 0: + bounds = np.stack([self.gripper_loc_bounds for _ in range(total_timesteps)]) + uniform_pcd = np.stack([ + sample_ghost_points_uniform_cube( + bounds=bounds[i], + num_points=num_ghost_points + ) + for i in range(total_timesteps) + ]) + + elif level >= 1: + anchor_ = anchor[:, 0].cpu().numpy() + bounds_min = np.clip( + anchor_ - self.sampling_ball_diameter_pyramid[level] / 2, + a_min=self.gripper_loc_bounds[0], a_max=self.gripper_loc_bounds[1] + ) + bounds_max = np.clip( + anchor_ + self.sampling_ball_diameter_pyramid[level] / 2, + a_min=self.gripper_loc_bounds[0], a_max=self.gripper_loc_bounds[1] + ) + bounds = np.stack([bounds_min, bounds_max], axis=1) + uniform_pcd = np.stack([ + sample_ghost_points_uniform_sphere( + center=anchor_[i], + radius=self.sampling_ball_diameter_pyramid[level] / 2, + bounds=bounds[i], + num_points=num_ghost_points + ) + for i in range(total_timesteps) + ]) + + uniform_pcd = torch.from_numpy(uniform_pcd).float().to(device) + + return uniform_pcd + + def _compute_ghost_point_features(self, + ghost_pcd, context_features, context_pos, + total_timesteps, level): + """ + Ghost points cross-attend to context features (visual features, instruction features, + and current gripper position). 
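+
+ Shapes (following the caller above):
+ ghost_pcd: (total_timesteps, num_ghost_points, 3) sampled candidate positions
+ context_features: (num_context_tokens, total_timesteps, embedding_dim)
+ context_pos: rotary positional embeddings of the context tokens
+ Returns: ghost point features of shape (num_ghost_points, total_timesteps, embedding_dim),
+ their positional embeddings, and a placeholder (None) attention map.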
+ """ + embed = self.ghost_points_embed_pyramid[level] + attn_layers = self.ghost_point_cross_attn_pyramid[level] + + # Initialize ghost point features and positional embeddings + ghost_pcd_pos = self.relative_pe_layer(ghost_pcd) + num_ghost_points = ghost_pcd.shape[1] + ghost_pcd_features = embed.weight.unsqueeze(0).repeat(num_ghost_points, total_timesteps, 1) + + # Ghost points cross-attend to visual features and current gripper position + ghost_pcd_features = attn_layers( + query=ghost_pcd_features, value=context_features, + query_pos=ghost_pcd_pos, value_pos=context_pos + )[-1] + + ghost_pcd_to_visible_rgb_attn = None + + return ghost_pcd_features, ghost_pcd_pos, ghost_pcd_to_visible_rgb_attn + + def _compute_query_features(self, + query_features, context_features, + query_pos, context_pos, + level): + """The query cross-attends to context features (visual features, instruction features, + and current gripper position).""" + attn_layers = self.query_cross_attn_pyramid[level] + + query_features = attn_layers( + query=query_features, value=context_features, + query_pos=query_pos, value_pos=context_pos + ) + + return query_features + + def _decode_mask(self, + query_features, + ghost_pcd_features, ghost_pcd_to_visible_rgb_attn, + rgb_height, rgb_width, level): + """ + The query decodes a mask over ghost points (used to predict the gripper position) and over visual + features (for visualization only). + """ + h = rgb_height // self.downscaling_factor_pyramid[level] + w = rgb_width // self.downscaling_factor_pyramid[level] + + ghost_pcd_masks = [einops.einsum(f.squeeze(0), ghost_pcd_features, "bt c, npts bt c -> bt npts") + for f in query_features] + + # Extract attention from top ghost point to visual features for visualization + if ghost_pcd_to_visible_rgb_attn is not None: + top_idx = torch.max(ghost_pcd_masks[-1], dim=-1).indices + visible_rgb_mask = ghost_pcd_to_visible_rgb_attn[torch.arange(len(top_idx)), top_idx] + visible_rgb_mask = einops.rearrange(visible_rgb_mask, "bt (ncam h w) -> bt ncam h w", h=h, w=w) + visible_rgb_mask = F.interpolate(visible_rgb_mask, size=(rgb_height, rgb_width), mode="nearest") + else: + visible_rgb_mask = None + + return ghost_pcd_masks, visible_rgb_mask + + def _predict_action(self, + ghost_pcd_mask, ghost_pcd, ghost_pcd_features, query_features, total_timesteps, + fine_ghost_pcd_offsets=None): + """Compute the predicted action (position, rotation, opening) from the predicted mask.""" + # Select top-scoring ghost point + top_idx = torch.max(ghost_pcd_mask, dim=-1).indices + position = ghost_pcd[torch.arange(total_timesteps), :, top_idx] + + # Add an offset regressed from the ghost point's position to the predicted position + if fine_ghost_pcd_offsets is not None: + position = position + fine_ghost_pcd_offsets[torch.arange(total_timesteps), :, top_idx] + + # Predict rotation and gripper opening + if self.rotation_parametrization in ["quat_from_top_ghost", "6D_from_top_ghost"]: + ghost_pcd_features = einops.rearrange(ghost_pcd_features, "npts bt c -> bt npts c") + features = ghost_pcd_features[torch.arange(total_timesteps), top_idx] + elif self.rotation_parametrization in ["quat_from_query", "6D_from_query"]: + features = query_features.squeeze(0) + + pred = self.gripper_state_predictor(features) + + if "quat" in self.rotation_parametrization: + rotation = normalise_quat(pred[:, :self.rotation_dim]) + elif "6D" in self.rotation_parametrization: + rotation = compute_rotation_matrix_from_ortho6d(pred[:, :self.rotation_dim]) + + gripper = 
torch.sigmoid(pred[:, self.rotation_dim:]) + + return position, rotation, gripper diff --git a/diffuser_actor/trajectory_optimization/__init__.py b/diffuser_actor/trajectory_optimization/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/diffuser_actor/trajectory_optimization/diffuser_actor.py b/diffuser_actor/trajectory_optimization/diffuser_actor.py new file mode 100644 index 0000000..ee3a8c1 --- /dev/null +++ b/diffuser_actor/trajectory_optimization/diffuser_actor.py @@ -0,0 +1,745 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F +import einops +from diffusers.schedulers.scheduling_ddpm import DDPMScheduler + +from diffuser_actor.utils.layers import ( + FFWRelativeSelfAttentionModule, + FFWRelativeCrossAttentionModule, + FFWRelativeSelfCrossAttentionModule +) +from diffuser_actor.utils.encoder import Encoder +from diffuser_actor.utils.layers import ParallelAttention +from diffuser_actor.utils.position_encodings import ( + RotaryPositionEncoding3D, + SinusoidalPosEmb +) +from diffuser_actor.utils.utils import ( + compute_rotation_matrix_from_ortho6d, + get_ortho6d_from_rotation_matrix, + normalise_quat, + matrix_to_quaternion, + quaternion_to_matrix +) +# from diffuser_actor.trajectory_optimization.dist import * +# from diffuser_actor.trajectory_optimization import transforms + + +# class DiffuseTwists: +# """ +# Diffusion in the twists space, i.e., Li derivatives. +# Twists = [skew, translation] = [angular_velocity, linear_velocity] +# """ +# def __init__(self, +# num_train_timesteps=None, +# beta_schedule=None, +# prediction_type=None): +# self.K = num_train_timesteps +# self.k = None +# self.ang_mult = 1 +# self.lin_mult = 1 +# self.temperature_base = 1 # ZXP! differs from Diff-EDF +# self.time_exponent_alpha = 0.5 +# self.time_exponent_temp = 1 +# self.inference = '+' # in +, x, xx +# +# def set_timesteps(self, n_steps): +# self.k = n_steps +# +# def add_noise(self, condition_data, noise, noise_t): +# eps = self.k / 2 * (float(self.ang_mult) ** 2) # Shape: (1,) +# std = torch.sqrt(self.k) * float(self.lin_mult) # Shape: (1,) +# T, delta_T, (gt_ang_score, gt_lin_score), (gt_ang_score_ref, gt_lin_score_ref) = \ +# diffuse_isotropic_se3_batched(T0 = condition_data, eps=eps, std=std, x_ref=None, double_precision=True) +# return T, gt_ang_score, gt_lin_score +# +# def step(self, denoise, k, T): +# self.k = k +# temperature = self.temperature_base * torch.pow(k, self.time_exponent_temp) +# alpha_ang = (self.ang_mult ** 2) * torch.pow(k, self.time_exponent_alpha) * self.K +# alpha_lin = (self.lin_mult ** 2) * torch.pow(k, self.time_exponent_alpha) * self.K +# +# with torch.no_grad(): +# (ang_score_dimless, lin_score_dimless) = denoise +# ang_score = ang_score_dimless.type(torch.float64) / (self.ang_mult * torch.sqrt(k)) +# lin_score = lin_score_dimless.type(torch.float64) / (self.lin_mult * torch.sqrt(k)) +# +# ang_noise = torch.sqrt(temperature * alpha_ang) * torch.randn_like(ang_score, dtype=torch.float64) +# lin_noise = torch.sqrt(temperature * alpha_lin) * torch.randn_like(lin_score, dtype=torch.float64) +# ang_disp = (alpha_ang / 2) * ang_score + ang_noise +# lin_disp = (alpha_lin / 2) * lin_score + lin_noise +# +# if self.inference == '+': +# # T + RdT +# L = T.detach()[...,self.q_indices] * (self.q_factor.type(torch.float64)) +# q, x = T[...,:4], T[...,4:] +# dq = torch.einsum('...ij,...j->...i', L, ang_disp) +# dx = transforms.quaternion_apply(q, lin_disp) +# q = transforms.normalize_quaternion(q + dq) +# T = torch.cat([q, x+dx], dim=-1) +# 
elif self.inference == 'x': +# # TdT +# dT = transforms.se3_exp_map(torch.cat([lin_disp, ang_disp], dim=-1)) +# dT = torch.cat([transforms.matrix_to_quaternion(dT[..., :3, :3]), dT[..., :3, 3]], dim=-1) +# T = transforms.multiply_se3(T, dT) +# elif self.inference == 'xx': +# # T*dT_clean*dT_noise +# ang_clean = (alpha_ang / 2) * ang_score +# lin_clean = (alpha_lin / 2) * lin_score +# dT_clean = transforms.se3_exp_map(torch.cat([lin_clean, ang_clean], dim=-1)) +# dT_noise = transforms.se3_exp_map(torch.cat([lin_noise, ang_noise], dim=-1)) +# dT_clean = torch.cat([transforms.matrix_to_quaternion(dT_clean[..., :3, :3]), dT_clean[..., :3, 3]], dim=-1) +# dT_noise = torch.cat([transforms.matrix_to_quaternion(dT_noise[..., :3, :3]), dT_noise[..., :3, 3]], dim=-1) +# T = transforms.multiply_se3(T, dT_clean) +# T = transforms.multiply_se3(T, dT_noise) +# else: +# raise NotImplementedError(self.inference) +# +# self.prev_sample = T + + +class DiffuserActor(nn.Module): + + def __init__(self, + backbone="clip", + image_size=(256, 256), + embedding_dim=60, + num_vis_ins_attn_layers=2, + use_instruction=False, + fps_subsampling_factor=5, + gripper_loc_bounds=None, + rotation_parametrization='6D', + quaternion_format='xyzw', + diffusion_timesteps=100, + nhist=3, + relative=False, + lang_enhanced=False): + super().__init__() + self._rotation_parametrization = rotation_parametrization + self._quaternion_format = quaternion_format + self._relative = relative + self.use_instruction = use_instruction + self.encoder = Encoder( + backbone=backbone, + image_size=image_size, + embedding_dim=embedding_dim, + num_sampling_level=1, + nhist=nhist, + num_vis_ins_attn_layers=num_vis_ins_attn_layers, + fps_subsampling_factor=fps_subsampling_factor + ) + self.prediction_head = DiffusionHead( + embedding_dim=embedding_dim, + use_instruction=use_instruction, + rotation_parametrization=rotation_parametrization, + nhist=nhist, + lang_enhanced=lang_enhanced + ) + self.position_noise_scheduler = DDPMScheduler( + num_train_timesteps=diffusion_timesteps, + beta_schedule="scaled_linear", + prediction_type="epsilon" + ) + self.rotation_noise_scheduler = DDPMScheduler( + num_train_timesteps=diffusion_timesteps, + beta_schedule="squaredcos_cap_v2", + prediction_type="epsilon" + ) + self.n_steps = diffusion_timesteps + self.gripper_loc_bounds = torch.tensor(gripper_loc_bounds) + + def encode_inputs(self, visible_rgb, visible_pcd, instruction, + curr_gripper): + # Compute visual features/positional embeddings at different scales + rgb_feats_pyramid, pcd_pyramid = self.encoder.encode_images( + visible_rgb, visible_pcd + ) + # Keep only low-res scale + context_feats = einops.rearrange( + rgb_feats_pyramid[0], + "b ncam c h w -> b (ncam h w) c" + ) + context = pcd_pyramid[0] + + # Encode instruction (B, 53, F) + instr_feats = None + if self.use_instruction: + instr_feats, _ = self.encoder.encode_instruction(instruction) + + # Cross-attention vision to language + if self.use_instruction: + # Attention from vision to language + context_feats = self.encoder.vision_language_attention( + context_feats, instr_feats + ) + + # Encode gripper history (B, nhist, F) + adaln_gripper_feats, _ = self.encoder.encode_curr_gripper( + curr_gripper, context_feats, context + ) + + # FPS on visual features (N, B, F) and (B, N, F, 2) + fps_feats, fps_pos = self.encoder.run_fps( + context_feats.transpose(0, 1), + self.encoder.relative_pe_layer(context) + ) + return ( + context_feats, context, # contextualized visual features + instr_feats, # language features 
+ adaln_gripper_feats, # gripper history features + fps_feats, fps_pos # sampled visual features + ) + + def policy_forward_pass(self, trajectory, timestep, fixed_inputs): + # Parse inputs + ( + context_feats, + context, + instr_feats, + adaln_gripper_feats, + fps_feats, + fps_pos + ) = fixed_inputs + + return self.prediction_head( + trajectory, + timestep, + context_feats=context_feats, + context=context, + instr_feats=instr_feats, + adaln_gripper_feats=adaln_gripper_feats, + fps_feats=fps_feats, + fps_pos=fps_pos + ) + + def conditional_sample(self, condition_data, condition_mask, fixed_inputs): + self.position_noise_scheduler.set_timesteps(self.n_steps) + self.rotation_noise_scheduler.set_timesteps(self.n_steps) + + # Random trajectory, conditioned on start-end + noise = torch.randn( + size=condition_data.shape, + dtype=condition_data.dtype, + device=condition_data.device + ) + # Noisy condition data + # ZXP? + noise_t = torch.ones( + (len(condition_data),), device=condition_data.device + ).long().mul(self.position_noise_scheduler.timesteps[0]) + noise_pos = self.position_noise_scheduler.add_noise( + condition_data[..., :3], noise[..., :3], noise_t + ) + noise_rot = self.rotation_noise_scheduler.add_noise( + condition_data[..., 3:9], noise[..., 3:9], noise_t + ) + noisy_condition_data = torch.cat((noise_pos, noise_rot), -1) + trajectory = torch.where( + condition_mask, noisy_condition_data, noise + ) + + # Iterative denoising + timesteps = self.position_noise_scheduler.timesteps + for t in timesteps: + out = self.policy_forward_pass( + trajectory, + t * torch.ones(len(trajectory)).to(trajectory.device).long(), + fixed_inputs + ) + out = out[-1] # keep only last layer's output + pos = self.position_noise_scheduler.step( + out[..., :3], t, trajectory[..., :3] + ).prev_sample + rot = self.rotation_noise_scheduler.step( + out[..., 3:9], t, trajectory[..., 3:9] + ).prev_sample + trajectory = torch.cat((pos, rot), -1) + + trajectory = torch.cat((trajectory, out[..., 9:]), -1) + + return trajectory + + def compute_trajectory( + self, + trajectory_mask, + rgb_obs, + pcd_obs, + instruction, + curr_gripper + ): + # Normalize all pos + pcd_obs = pcd_obs.clone() + curr_gripper = curr_gripper.clone() + pcd_obs = torch.permute(self.normalize_pos( + torch.permute(pcd_obs, [0, 1, 3, 4, 2]) + ), [0, 1, 4, 2, 3]) + curr_gripper[..., :3] = self.normalize_pos(curr_gripper[..., :3]) + curr_gripper = self.convert_rot(curr_gripper) + + # Prepare inputs + fixed_inputs = self.encode_inputs( + rgb_obs, pcd_obs, instruction, curr_gripper + ) + + # Condition on start-end pose + B, nhist, D = curr_gripper.shape + cond_data = torch.zeros( + (B, trajectory_mask.size(1), D), + device=rgb_obs.device + ) + cond_mask = torch.zeros_like(cond_data) + cond_mask = cond_mask.bool() + + # Sample + trajectory = self.conditional_sample( + cond_data, + cond_mask, + fixed_inputs + ) + + # Normalize quaternion + if self._rotation_parametrization != '6D': + trajectory[:, :, 3:7] = normalise_quat(trajectory[:, :, 3:7]) + # Back to quaternion + trajectory = self.unconvert_rot(trajectory) + # unnormalize position + trajectory[:, :, :3] = self.unnormalize_pos(trajectory[:, :, :3]) + # Convert gripper status to probaility + if trajectory.shape[-1] > 7: + trajectory[..., 7] = trajectory[..., 7].sigmoid() + + return trajectory + + def normalize_pos(self, pos): + pos_min = self.gripper_loc_bounds[0].float().to(pos.device) + pos_max = self.gripper_loc_bounds[1].float().to(pos.device) + return (pos - pos_min) / (pos_max - pos_min) * 2.0 - 1.0 
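+
+ # Quick sanity check (illustrative only, with made-up bounds): normalize_pos maps
+ # each coordinate into [-1, 1] and unnormalize_pos inverts it exactly, e.g.
+ #   bounds = torch.tensor([[-0.3, -0.5, 0.6], [0.7, 0.5, 1.6]])
+ #   pos = torch.tensor([[0.2, 0.0, 1.1]])
+ #   norm = (pos - bounds[0]) / (bounds[1] - bounds[0]) * 2.0 - 1.0   # -> [[0., 0., 0.]]
+ #   (norm + 1.0) / 2.0 * (bounds[1] - bounds[0]) + bounds[0]         # -> recovers pos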
+ + def unnormalize_pos(self, pos): + pos_min = self.gripper_loc_bounds[0].float().to(pos.device) + pos_max = self.gripper_loc_bounds[1].float().to(pos.device) + return (pos + 1.0) / 2.0 * (pos_max - pos_min) + pos_min + + def convert_rot(self, signal): + signal[..., 3:7] = normalise_quat(signal[..., 3:7]) + if self._rotation_parametrization == '6D': + # The following code expects wxyz quaternion format! + if self._quaternion_format == 'xyzw': + signal[..., 3:7] = signal[..., (6, 3, 4, 5)] + rot = quaternion_to_matrix(signal[..., 3:7]) + res = signal[..., 7:] if signal.size(-1) > 7 else None + if len(rot.shape) == 4: + B, L, D1, D2 = rot.shape + rot = rot.reshape(B * L, D1, D2) + rot_6d = get_ortho6d_from_rotation_matrix(rot) + rot_6d = rot_6d.reshape(B, L, 6) + else: + rot_6d = get_ortho6d_from_rotation_matrix(rot) + signal = torch.cat([signal[..., :3], rot_6d], dim=-1) + if res is not None: + signal = torch.cat((signal, res), -1) + return signal + + def unconvert_rot(self, signal): + if self._rotation_parametrization == '6D': + res = signal[..., 9:] if signal.size(-1) > 9 else None + if len(signal.shape) == 3: + B, L, _ = signal.shape + rot = signal[..., 3:9].reshape(B * L, 6) + mat = compute_rotation_matrix_from_ortho6d(rot) + quat = matrix_to_quaternion(mat) + quat = quat.reshape(B, L, 4) + else: + rot = signal[..., 3:9] + mat = compute_rotation_matrix_from_ortho6d(rot) + quat = matrix_to_quaternion(mat) + signal = torch.cat([signal[..., :3], quat], dim=-1) + if res is not None: + signal = torch.cat((signal, res), -1) + # The above code handled wxyz quaternion format! + if self._quaternion_format == 'xyzw': + signal[..., 3:7] = signal[..., (4, 5, 6, 3)] + return signal + + def convert2rel(self, pcd, curr_gripper): + """Convert coordinate system relaative to current gripper.""" + center = curr_gripper[:, -1, :3] # (batch_size, 3) + bs = center.shape[0] + pcd = pcd - center.view(bs, 1, 3, 1, 1) + curr_gripper = curr_gripper.clone() + curr_gripper[..., :3] = curr_gripper[..., :3] - center.view(bs, 1, 3) + return pcd, curr_gripper + + def forward( + self, + gt_trajectory, + trajectory_mask, + rgb_obs, + pcd_obs, + instruction, + curr_gripper, + run_inference=False + ): + """ + Arguments: + gt_trajectory: (B, trajectory_length, 3+4+X) + trajectory_mask: (B, trajectory_length) + timestep: (B, 1) + rgb_obs: (B, num_cameras, 3, H, W) in [0, 1] + pcd_obs: (B, num_cameras, 3, H, W) in world coordinates + instruction: (B, max_instruction_length, 512) + curr_gripper: (B, nhist, 3+4+X) + + Note: + Regardless of rotation parametrization, the input rotation + is ALWAYS expressed as a quaternion form. + The model converts it to 6D internally if needed. 
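+
+ For example, with rotation_parametrization='6D' and quaternion_format='xyzw',
+ a (B, L, 3+4+X) ground-truth trajectory is reordered to wxyz and mapped by
+ convert_rot to (B, L, 3+6+X) before noise is added; unconvert_rot reverses
+ this after sampling.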
+ """ + if self._relative: + pcd_obs, curr_gripper = self.convert2rel(pcd_obs, curr_gripper) + if gt_trajectory is not None: + gt_openess = gt_trajectory[..., 7:] + gt_trajectory = gt_trajectory[..., :7] + curr_gripper = curr_gripper[..., :7] + + # gt_trajectory is expected to be in the quaternion format + if run_inference: + return self.compute_trajectory( + trajectory_mask, + rgb_obs, + pcd_obs, + instruction, + curr_gripper + ) + # Normalize all pos + gt_trajectory = gt_trajectory.clone() + pcd_obs = pcd_obs.clone() + curr_gripper = curr_gripper.clone() + gt_trajectory[:, :, :3] = self.normalize_pos(gt_trajectory[:, :, :3]) + pcd_obs = torch.permute(self.normalize_pos( + torch.permute(pcd_obs, [0, 1, 3, 4, 2]) + ), [0, 1, 4, 2, 3]) + curr_gripper[..., :3] = self.normalize_pos(curr_gripper[..., :3]) + + # Convert rotation parametrization + gt_trajectory = self.convert_rot(gt_trajectory) + curr_gripper = self.convert_rot(curr_gripper) + + # Prepare inputs + fixed_inputs = self.encode_inputs( + rgb_obs, pcd_obs, instruction, curr_gripper + ) + + # Condition on start-end pose + cond_data = torch.zeros_like(gt_trajectory) + cond_mask = torch.zeros_like(cond_data) + cond_mask = cond_mask.bool() + + # Sample noise + noise = torch.randn(gt_trajectory.shape, device=gt_trajectory.device) + + # Sample a random timestep + timesteps = torch.randint( + 0, + self.position_noise_scheduler.config.num_train_timesteps, + (len(noise),), device=noise.device + ).long() + + # Add noise to the clean trajectories + pos = self.position_noise_scheduler.add_noise( + gt_trajectory[..., :3], noise[..., :3], + timesteps + ) + rot = self.rotation_noise_scheduler.add_noise( + gt_trajectory[..., 3:9], noise[..., 3:9], + timesteps + ) + noisy_trajectory = torch.cat((pos, rot), -1) + noisy_trajectory[cond_mask] = cond_data[cond_mask] # condition + assert not cond_mask.any() + + # Predict the noise residual + pred = self.policy_forward_pass( + noisy_trajectory, timesteps, fixed_inputs + ) + + # Compute loss + total_loss = 0 + for layer_pred in pred: + trans = layer_pred[..., :3] + rot = layer_pred[..., 3:9] + loss = ( + 30 * F.l1_loss(trans, noise[..., :3], reduction='mean') + + 10 * F.l1_loss(rot, noise[..., 3:9], reduction='mean') + ) + if torch.numel(gt_openess) > 0: + openess = layer_pred[..., 9:] + loss += F.binary_cross_entropy_with_logits(openess, gt_openess) + total_loss = total_loss + loss + return total_loss + + +class DiffusionHead(nn.Module): + + def __init__(self, + embedding_dim=60, + num_attn_heads=8, + use_instruction=False, + rotation_parametrization='quat', + nhist=3, + lang_enhanced=False): + super().__init__() + self.use_instruction = use_instruction + self.lang_enhanced = lang_enhanced + if '6D' in rotation_parametrization: + rotation_dim = 6 # continuous 6D + else: + rotation_dim = 4 # quaternion + + # Encoders + self.traj_encoder = nn.Linear(9, embedding_dim) + self.relative_pe_layer = RotaryPositionEncoding3D(embedding_dim) + self.time_emb = nn.Sequential( + SinusoidalPosEmb(embedding_dim), + nn.Linear(embedding_dim, embedding_dim), + nn.ReLU(), + nn.Linear(embedding_dim, embedding_dim) + ) + self.curr_gripper_emb = nn.Sequential( + nn.Linear(embedding_dim * nhist, embedding_dim), + nn.ReLU(), + nn.Linear(embedding_dim, embedding_dim) + ) + self.traj_time_emb = SinusoidalPosEmb(embedding_dim) + + # Attention from trajectory queries to language + self.traj_lang_attention = nn.ModuleList([ + ParallelAttention( + num_layers=1, + d_model=embedding_dim, n_heads=num_attn_heads, + self_attention1=False, 
self_attention2=False, + cross_attention1=True, cross_attention2=False, + rotary_pe=False, apply_ffn=False + ) + ]) + + # Estimate attends to context (no subsampling) + self.cross_attn = FFWRelativeCrossAttentionModule( + embedding_dim, num_attn_heads, num_layers=2, use_adaln=True + ) + + # Shared attention layers + if not self.lang_enhanced: + self.self_attn = FFWRelativeSelfAttentionModule( + embedding_dim, num_attn_heads, num_layers=4, use_adaln=True + ) + else: # interleave cross-attention to language + self.self_attn = FFWRelativeSelfCrossAttentionModule( + embedding_dim, num_attn_heads, + num_self_attn_layers=4, + num_cross_attn_layers=3, + use_adaln=True + ) + + # Specific (non-shared) Output layers: + # 1. Rotation + self.rotation_proj = nn.Linear(embedding_dim, embedding_dim) + if not self.lang_enhanced: + self.rotation_self_attn = FFWRelativeSelfAttentionModule( + embedding_dim, num_attn_heads, 2, use_adaln=True + ) + else: # interleave cross-attention to language + self.rotation_self_attn = FFWRelativeSelfCrossAttentionModule( + embedding_dim, num_attn_heads, 2, 1, use_adaln=True + ) + self.rotation_predictor = nn.Sequential( + nn.Linear(embedding_dim, embedding_dim), + nn.ReLU(), + nn.Linear(embedding_dim, rotation_dim) + ) + + # 2. Position + self.position_proj = nn.Linear(embedding_dim, embedding_dim) + if not self.lang_enhanced: + self.position_self_attn = FFWRelativeSelfAttentionModule( + embedding_dim, num_attn_heads, 2, use_adaln=True + ) + else: # interleave cross-attention to language + self.position_self_attn = FFWRelativeSelfCrossAttentionModule( + embedding_dim, num_attn_heads, 2, 1, use_adaln=True + ) + self.position_predictor = nn.Sequential( + nn.Linear(embedding_dim, embedding_dim), + nn.ReLU(), + nn.Linear(embedding_dim, 3) + ) + + # 3. 
Openess + self.openess_predictor = nn.Sequential( + nn.Linear(embedding_dim, embedding_dim), + nn.ReLU(), + nn.Linear(embedding_dim, 1) + ) + + def forward(self, trajectory, timestep, + context_feats, context, instr_feats, adaln_gripper_feats, + fps_feats, fps_pos): + """ + Arguments: + trajectory: (B, trajectory_length, 3+6+X) + timestep: (B, 1) + context_feats: (B, N, F) + context: (B, N, F, 2) + instr_feats: (B, max_instruction_length, F) + adaln_gripper_feats: (B, nhist, F) + fps_feats: (N, B, F), N < context_feats.size(1) + fps_pos: (B, N, F, 2) + """ + # Trajectory features + traj_feats = self.traj_encoder(trajectory) # (B, L, F) + + # Trajectory features cross-attend to context features + traj_time_pos = self.traj_time_emb( + torch.arange(0, traj_feats.size(1), device=traj_feats.device) + )[None].repeat(len(traj_feats), 1, 1) + if self.use_instruction: + traj_feats, _ = self.traj_lang_attention[0]( + seq1=traj_feats, seq1_key_padding_mask=None, + seq2=instr_feats, seq2_key_padding_mask=None, + seq1_pos=None, seq2_pos=None, + seq1_sem_pos=traj_time_pos, seq2_sem_pos=None + ) + traj_feats = traj_feats + traj_time_pos + + # Predict position, rotation, opening + traj_feats = einops.rearrange(traj_feats, 'b l c -> l b c') + context_feats = einops.rearrange(context_feats, 'b l c -> l b c') + adaln_gripper_feats = einops.rearrange( + adaln_gripper_feats, 'b l c -> l b c' + ) + pos_pred, rot_pred, openess_pred = self.prediction_head( + trajectory[..., :3], traj_feats, + context[..., :3], context_feats, + timestep, adaln_gripper_feats, + fps_feats, fps_pos, + instr_feats + ) + return [torch.cat((pos_pred, rot_pred, openess_pred), -1)] + + def prediction_head(self, + gripper_pcd, gripper_features, + context_pcd, context_features, + timesteps, curr_gripper_features, + sampled_context_features, sampled_rel_context_pos, + instr_feats): + """ + Compute the predicted action (position, rotation, opening). 
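+ The trajectory tokens first cross-attend to the full context, then self-attend
+ together with the FPS-sampled context tokens (conditioned on the diffusion
+ timestep through adaLN), and finally pass through separate rotation and position
+ heads; the opening logit is predicted from the position head's features.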
+ + Args: + gripper_pcd: A tensor of shape (B, N, 3) + gripper_features: A tensor of shape (N, B, F) + context_pcd: A tensor of shape (B, N, 3) + context_features: A tensor of shape (N, B, F) + timesteps: A tensor of shape (B,) indicating the diffusion step + curr_gripper_features: A tensor of shape (M, B, F) + sampled_context_features: A tensor of shape (K, B, F) + sampled_rel_context_pos: A tensor of shape (B, K, F, 2) + instr_feats: (B, max_instruction_length, F) + """ + # Diffusion timestep + time_embs = self.encode_denoising_timestep( + timesteps, curr_gripper_features + ) + + # Positional embeddings + rel_gripper_pos = self.relative_pe_layer(gripper_pcd) + rel_context_pos = self.relative_pe_layer(context_pcd) + + # Cross attention from gripper to full context + gripper_features = self.cross_attn( + query=gripper_features, + value=context_features, + query_pos=rel_gripper_pos, + value_pos=rel_context_pos, + diff_ts=time_embs + )[-1] + + # Self attention among gripper and sampled context + features = torch.cat([gripper_features, sampled_context_features], 0) + rel_pos = torch.cat([rel_gripper_pos, sampled_rel_context_pos], 1) + features = self.self_attn( + query=features, + query_pos=rel_pos, + diff_ts=time_embs, + context=instr_feats, + context_pos=None + )[-1] + + num_gripper = gripper_features.shape[0] + + # Rotation head + rotation = self.predict_rot( + features, rel_pos, time_embs, num_gripper, instr_feats + ) + + # Position head + position, position_features = self.predict_pos( + features, rel_pos, time_embs, num_gripper, instr_feats + ) + + # Openess head from position head + openess = self.openess_predictor(position_features) + + return position, rotation, openess + + def encode_denoising_timestep(self, timestep, curr_gripper_features): + """ + Compute denoising timestep features and positional embeddings. 
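+ The sinusoidal timestep embedding is summed with a projection of the flattened
+ gripper-history features (curr_gripper_features, shape (nhist, B, F)), so the
+ adaLN conditioning carries both signals.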
+ + Args: + - timestep: (B,) + + Returns: + - time_feats: (B, F) + """ + time_feats = self.time_emb(timestep) + + curr_gripper_features = einops.rearrange( + curr_gripper_features, "npts b c -> b npts c" + ) + curr_gripper_features = curr_gripper_features.flatten(1) + curr_gripper_feats = self.curr_gripper_emb(curr_gripper_features) + return time_feats + curr_gripper_feats + + def predict_pos(self, features, rel_pos, time_embs, num_gripper, + instr_feats): + position_features = self.position_self_attn( + query=features, + query_pos=rel_pos, + diff_ts=time_embs, + context=instr_feats, + context_pos=None + )[-1] + position_features = einops.rearrange( + position_features[:num_gripper], "npts b c -> b npts c" + ) + position_features = self.position_proj(position_features) # (B, N, C) + position = self.position_predictor(position_features) + return position, position_features + + def predict_rot(self, features, rel_pos, time_embs, num_gripper, + instr_feats): + rotation_features = self.rotation_self_attn( + query=features, + query_pos=rel_pos, + diff_ts=time_embs, + context=instr_feats, + context_pos=None + )[-1] + rotation_features = einops.rearrange( + rotation_features[:num_gripper], "npts b c -> b npts c" + ) + rotation_features = self.rotation_proj(rotation_features) # (B, N, C) + rotation = self.rotation_predictor(rotation_features) + return rotation diff --git a/diffuser_actor/trajectory_optimization/dist.py b/diffuser_actor/trajectory_optimization/dist.py new file mode 100644 index 0000000..f28ae70 --- /dev/null +++ b/diffuser_actor/trajectory_optimization/dist.py @@ -0,0 +1,394 @@ +# from https://github.com/tomato1mule/diffusion_edf +from __future__ import annotations +from typing import Optional, Union, Dict, List, Tuple +import time +import datetime +import os +import random +import math +import warnings + +import matplotlib.pyplot as plt +from tqdm import tqdm + +import matplotlib +import matplotlib.pyplot as plt + +import torch +import torch.nn as nn +from torch.nn import functional as F + +from torch_cluster import radius_graph, radius +from torch_scatter import scatter, scatter_logsumexp, scatter_log_softmax +SE3_SCORE_TYPE = Tuple[torch.Tensor, torch.Tensor] +from xitorch.interpolate import Interp1D +from diffuser_actor.trajectory_optimization import transforms + +@torch.jit.script +def haar_measure_angle(omg: torch.Tensor) -> torch.Tensor: + assert (omg <= torch.pi).all() and (omg >= 0.).all() + return (1 - torch.cos(omg)) / torch.pi + + +@torch.jit.script +def haar_measure(q: torch.Tensor) -> torch.Tensor: + versor = q[..., :0] # cos(omg/2) + cos_omg = 2 * torch.square(versor) - 1. + assert (cos_omg <= 1.).all() and (cos_omg >= -1.).all() + + return (1 - cos_omg) / torch.pi + + +@torch.jit.script +def igso3_small_angle(omg: torch.Tensor, eps: Union[float, torch.Tensor]) -> torch.Tensor: + assert (omg <= torch.pi).all() and (omg >= 0.).all() + if not isinstance(eps, torch.Tensor): + eps = torch.tensor(eps, device=omg.device, dtype=omg.dtype) + + if eps.dtype is torch.float64: + small_number = 1e-20 + if eps.item() < 1e-10: + warnings.warn("Too small eps: {eps} is provided.") + else: + small_number = 1e-9 + if eps.item() < 1e-5: + warnings.warn("Too small eps: {eps} is provided. Consider using double precision") + + small_num = small_number / 2 + small_dnm = (1 - torch.exp(-1. 
* torch.pi ** 2 / eps) * (2 - 4 * (torch.pi ** 2) / eps)) * small_number + + return 0.5 * torch.sqrt(torch.pi) * (eps ** -1.5) * torch.exp((eps - (omg ** 2 / eps)) / 4) / ( + torch.sin(omg / 2) + small_num) \ + * (small_dnm + omg - ((omg - 2 * torch.pi) * torch.exp(torch.pi * (omg - torch.pi) / eps) + ( + omg + 2 * torch.pi) * torch.exp(-torch.pi * (omg + torch.pi) / eps))) + + +def determine_lmax(eps: float) -> int: + assert eps > 0. + thr = 10. # lmax ~= 100 is enough to guarantee exp[-lmax(lmax+1)eps] < exp(-10). for eps = 1e-3 + lmax = max(math.ceil(math.sqrt(thr / eps)), + 5) # Even for eps = 1e-7, only lmax ~= 10000 is required, which can be calculated almost immediately. + # lmax(lmax+1) > lmax^2 >= thr/eps ----> exp[-lmax(lmax+1)eps] < exp(-thr). + return lmax + + +@torch.jit.script +def igso3_angle(omg: torch.Tensor, eps: Union[float, torch.Tensor], lmax: Optional[int] = None) -> torch.Tensor: + assert (omg <= torch.pi).all() and (omg >= 0.).all() + if lmax is None: + if isinstance(eps, torch.Tensor): + lmax = determine_lmax(eps=eps.item()) + else: + lmax = determine_lmax(eps=eps) + + if not isinstance(eps, torch.Tensor): + eps = torch.tensor(eps, device=omg.device, dtype=omg.dtype) + + if eps.dtype is torch.float64: + small_number = 1e-20 + if eps.item() < 1e-10: + warnings.warn("Too small eps: {eps} is provided.") + else: + small_number = 1e-9 + if eps.item() < 1e-5: + warnings.warn("Too small eps: {eps} is provided. Consider using double precision") + + l = torch.arange(lmax + 1, device=omg.device, dtype=torch.long) + omg = omg[..., None] + sum = (2 * l + 1) * torch.exp(-l * (l + 1) * eps) * (torch.sin((l + 0.5) * omg) + (l + 0.5) * small_number) / ( + torch.sin(omg / 2) + 0.5 * small_number) + + return torch.clamp(sum.sum(dim=-1), min=0.) + + +@torch.jit.script +def igso3(q: torch.Tensor, eps: Union[float, torch.Tensor], lmax: Optional[int] = None) -> torch.Tensor: + versor = q[..., 0] # cos(omg/2) + omg = torch.acos(versor) * 2 + assert (omg <= torch.pi).all() and (omg >= 0.).all() + + return igso3_angle(omg=omg, eps=eps, lmax=lmax) + + +@torch.jit.script +def igso3_lie_deriv(q: torch.Tensor, eps: Union[float, torch.Tensor], lmax: Optional[int] = None) -> torch.Tensor: + versor = q[..., 0] # cos(omg/2) + omg = torch.acos(versor) * 2 + assert (omg <= torch.pi).all() and (omg >= 0.).all() + + if lmax is None: + if isinstance(eps, torch.Tensor): + lmax = determine_lmax(eps=eps.item()) + else: + lmax = determine_lmax(eps=eps) + + if not isinstance(eps, torch.Tensor): + eps = torch.tensor(eps, device=omg.device, dtype=omg.dtype) + + if eps.dtype is torch.float64: + small_number = 1e-20 + if eps.item() < 1e-10: + warnings.warn("Too small eps: {eps} is provided.") + else: + small_number = 1e-9 + if eps.item() < 1e-5: + warnings.warn("Too small eps: {eps} is provided. 
Consider using double precision") + + l = torch.arange(lmax + 1, device=q.device, dtype=torch.long) # shape: (lmax+1,) + omg = omg[..., None] # shape: (..., 1) + + lie_deriv_cos_omg = -2 * versor[..., None] * q[..., 1:] # shape: (..., 3) + + char_deriv = (((l + 1) * torch.sin((l) * omg)) - ((l) * torch.sin((l + 1) * omg)) + small_number * l * (l + 1) * ( + 2 * l + 1)) / ((1 - torch.cos(omg)) * torch.sin(omg) + 3 * small_number) # shape: (..., lmax_+1) + sum = ((2 * l + 1) * torch.exp(-l * (l + 1) * eps) * char_deriv).unsqueeze(-1) * lie_deriv_cos_omg.unsqueeze( + -2) # shape: (..., lmax_+1, 3) + + return sum.sum(dim=-2) + + +@torch.jit.script +def igso3_score(q: torch.Tensor, eps: Union[float, torch.Tensor], lmax: Optional[int] = None) -> torch.Tensor: + deriv = igso3_lie_deriv(q=q, eps=eps, lmax=lmax) + prob = igso3(q=q, eps=eps, lmax=lmax).unsqueeze(-1) + + if q.dtype is torch.float64: + small_number = 1e-30 + else: + small_number = 1e-10 + + return (deriv / (prob + small_number)) * (prob > 0.) + + +def get_inv_cdf(eps: Union[float, torch.Tensor], + N: int = 1000, + dtype: Optional[torch.dtype] = torch.float64, + device: Optional[Union[str, torch.device]] = None) -> Interp1D: + if not isinstance(eps, torch.Tensor): + eps = torch.tensor(eps, device=device, dtype=dtype) + + N = 1000 + omg_max_prob = 2 * math.sqrt(eps) + omg_range = min(omg_max_prob * 4, math.pi) + # omg_max_prob_idx = ((omg_max_prob) * N / omg_range) + + X = torch.linspace(0, omg_range, N, device=device, dtype=dtype) + Y = igso3_angle(X, eps=eps) * haar_measure_angle(X) + + cdf = torch.cumsum(Y, dim=-1) + cdf = cdf / cdf.max() + return Interp1D(cdf, X, 'linear') # https://gist.github.com/amarvutha/c2a3ea9d42d238551c694480019a6ce1 + + +def _sample_igso3(inv_cdf: Interp1D, + N: int, + dtype: Optional[torch.dtype] = torch.float64, + device: Optional[Union[str, torch.device]] = None) -> torch.Tensor: + angle = inv_cdf(torch.rand(N, device=device, dtype=dtype)).unsqueeze(-1) + axis = F.normalize(torch.randn(N, 3, device=device, dtype=dtype), dim=-1) + + return transforms.axis_angle_to_quaternion(axis * angle) + + +def sample_igso3(eps: Union[float, torch.Tensor], + N: int = 1, + dtype: Optional[torch.dtype] = torch.float64, + device: Optional[Union[str, torch.device]] = None) -> torch.Tensor: + inv_cdf = get_inv_cdf(eps=eps, device=device, dtype=dtype) + return _sample_igso3(inv_cdf=inv_cdf, N=N, device=device, dtype=dtype) + + +@torch.jit.script +def r3_isotropic_gaussian_score(x: torch.Tensor, std: Union[float, torch.Tensor]) -> torch.Tensor: + if not isinstance(std, torch.Tensor): + std = torch.tensor(std, device=x.device, dtype=x.dtype) + return -x / torch.square(std) + + +@torch.jit.script +def r3_log_isotropic_gaussian(x: torch.Tensor, std: Union[float, torch.Tensor]) -> torch.Tensor: + if not isinstance(std, torch.Tensor): + std = torch.tensor(std, device=x.device, dtype=x.dtype) + + return -0.5 * torch.square(x).sum(dim=-1) / torch.square(std) - 1.5 * math.log( + 2 * torch.square(std) * torch.pi) # gaussian + + +@torch.jit.script +def r3_isotropic_gaussian(x: torch.Tensor, std: Union[float, torch.Tensor]) -> torch.Tensor: + if not isinstance(std, torch.Tensor): + std = torch.tensor(std, device=x.device, dtype=x.dtype) + + return torch.exp(r3_log_isotropic_gaussian(x=x, std=std)) # gaussian + + +@torch.jit.script +def se3_isotropic_gaussian_score(T: torch.Tensor, + eps: Union[float, torch.Tensor], + std: Union[float, torch.Tensor]) -> SE3_SCORE_TYPE: + q = T[..., :4] + x = T[..., 4:] + + ang_score = igso3_score(q=q, 
eps=eps) + lin_score = r3_isotropic_gaussian_score(x=x, std=std) + lin_score = transforms.quaternion_apply(transforms.quaternion_invert(q), lin_score) + + return ang_score, lin_score + + +@torch.jit.script +def adjoint_se3_score(T_ref: torch.Tensor, ang_score: torch.Tensor, lin_score: torch.Tensor) -> SE3_SCORE_TYPE: + assert ang_score.shape[:-1] == lin_score.shape[:-1] == T_ref.shape[:-1] + assert T_ref.shape[-1] == 7 + + ang_score = transforms.quaternion_apply(T_ref[..., :4], ang_score) + lin_score = torch.cross(T_ref[..., 4:], ang_score, dim=-1) + transforms.quaternion_apply(T_ref[..., :4], lin_score) + + return ang_score, lin_score + + +@torch.jit.script +def adjoint_isotropic_se3_score(x_ref: torch.Tensor, ang_score: torch.Tensor, + lin_score: torch.Tensor) -> SE3_SCORE_TYPE: + assert ang_score.shape[:-1] == lin_score.shape[:-1] == x_ref.shape[:-1] + assert x_ref.shape[-1] == 3 + + lin_score = torch.cross(x_ref, ang_score, dim=-1) + lin_score + + return ang_score, lin_score + + +@torch.jit.script +def adjoint_inv_tr_se3_score(T_ref: torch.Tensor, ang_score: torch.Tensor, lin_score: torch.Tensor) -> SE3_SCORE_TYPE: + assert ang_score.shape[:-1] == lin_score.shape[:-1] == T_ref.shape[:-1] + assert T_ref.shape[-1] == 7 + + lin_score = transforms.quaternion_apply(T_ref[..., :4], lin_score) + ang_score = transforms.quaternion_apply(T_ref[..., :4], ang_score) + torch.cross(T_ref[..., 4:], lin_score, dim=-1) + + return ang_score, lin_score + + +@torch.jit.script +def adjoint_inv_tr_isotropic_se3_score(x_ref: torch.Tensor, ang_score: torch.Tensor, + lin_score: torch.Tensor) -> SE3_SCORE_TYPE: + assert ang_score.shape[:-1] == lin_score.shape[:-1] == x_ref.shape[:-1] + assert x_ref.shape[-1] == 3 + + ang_score = ang_score + torch.cross(x_ref, lin_score, dim=-1) + + return ang_score, lin_score + + +def sample_isotropic_se3_gaussian(eps: Union[float, torch.Tensor], std: Union[float, torch.Tensor], N: int = 1, + dtype: Optional[torch.dtype] = torch.float64, + device: Optional[Union[str, torch.device]] = None) -> torch.Tensor: + x = torch.randn(N, 3, device=device, dtype=dtype) * std + q = sample_igso3(eps=eps, N=N, dtype=dtype, device=device) + return torch.cat([q, x], dim=-1) + + +def diffuse_isotropic_se3(T0: torch.Tensor, + eps: Union[float, torch.Tensor], + std: Union[float, torch.Tensor], + x_ref: Optional[torch.Tensor] = None, + double_precision: bool = True) -> Tuple[torch.Tensor, +torch.Tensor, +SE3_SCORE_TYPE, +SE3_SCORE_TYPE]: + assert T0.ndim == 2 and T0.shape[-1] == 7 # T0: shape (nT, 7) + assert x_ref.ndim == 2 and x_ref.shape[-1] == 3 # x_ref: shape (nT, 3) + + input_dtype = T0.dtype + if double_precision: + T0 = T0.type(dtype=torch.float64) + if isinstance(eps, torch.Tensor): + eps = eps.type(dtype=torch.float64) + if isinstance(std, torch.Tensor): + std = std.type(dtype=torch.float64) + if isinstance(x_ref, torch.Tensor): + x_ref = x_ref.type(dtype=torch.float64) + + delta_T = sample_isotropic_se3_gaussian(eps=eps, std=std, N=len(T0), dtype=T0.dtype, + device=T0.device) # shape: (nT, 7) + ang_score_ref, lin_score_ref = se3_isotropic_gaussian_score(T=delta_T, eps=eps, std=std) # shape: (nT, 3), (nT, 3) + if x_ref is not None: + ang_score, lin_score = adjoint_inv_tr_isotropic_se3_score(x_ref=x_ref, ang_score=ang_score_ref, + lin_score=lin_score_ref) # shape: (nT, 3), (nT, 3) + else: + ang_score, lin_score = ang_score_ref, lin_score_ref # shape: (nT, 3), (nT, 3) + + if x_ref is not None: + delta_T = torch.cat([delta_T[..., :4], + delta_T[..., 4:] + x_ref - 
transforms.quaternion_apply(delta_T[..., :4], x_ref) + ], dim=-1) # shape: (nT, 7) + + T = transforms.multiply_se3(T0, delta_T) # shape: (nT, 7) + + return ( + T.type(dtype=input_dtype), + delta_T.type(dtype=input_dtype), + (ang_score.type(dtype=input_dtype), lin_score.type(dtype=input_dtype)), + (ang_score_ref.type(dtype=input_dtype), lin_score_ref.type(dtype=input_dtype)) + ) + + +def diffuse_isotropic_se3_batched(T0: torch.Tensor, + eps: Union[float, torch.Tensor], + std: Union[float, torch.Tensor], + x_ref: Optional[torch.Tensor], + is_left_diff: bool = False, + double_precision: bool = True) -> Tuple[torch.Tensor, + torch.Tensor, + SE3_SCORE_TYPE, + SE3_SCORE_TYPE]: + assert T0.ndim == 2 and T0.shape[-1] == 7 # T0: shape (nT, 7) + + if x_ref is not None: + assert x_ref.ndim == 2 and x_ref.shape[-1] == 3 # x_ref: shape (nT, 3) + + input_dtype = T0.dtype + if double_precision: + T0 = T0.type(dtype=torch.float64) + if isinstance(eps, torch.Tensor): + eps = eps.type(dtype=torch.float64) + if isinstance(std, torch.Tensor): + std = std.type(dtype=torch.float64) + if isinstance(x_ref, torch.Tensor): + x_ref = x_ref.type(dtype=torch.float64) + + delta_T = sample_isotropic_se3_gaussian(eps=eps, std=std, N=len(x_ref) * len(T0), dtype=T0.dtype, + device=T0.device) # shape: (nXref*nT, 7) + ang_score_ref, lin_score_ref = se3_isotropic_gaussian_score(T=delta_T, eps=eps, + std=std) # shape: (nXref*nT, 3), (nXref*nT, 3) + if x_ref is not None: + ang_score, lin_score = adjoint_inv_tr_isotropic_se3_score(x_ref=x_ref, ang_score=ang_score_ref, + lin_score=lin_score_ref) # shape: (nXref*nT, 3), (nXref*nT, 3) + else: + ang_score, lin_score = ang_score_ref, lin_score_ref # shape: (nXref*nT, 3), (nXref*nT, 3) + + delta_T = delta_T.view(len(x_ref), *T0.shape) # shape: (nXref, nT, 7) + ang_score = ang_score.view(len(x_ref), *T0.shape[:-1], 3) # shape: (nXref, nT, 3) + lin_score = lin_score.view(len(x_ref), *T0.shape[:-1], 3) # shape: (nXref, nT, 3) + ang_score_ref = ang_score_ref.view(len(x_ref), *T0.shape[:-1], 3) # shape: (nXref, nT, 3) + lin_score_ref = lin_score_ref.view(len(x_ref), *T0.shape[:-1], 3) # shape: (nXref, nT, 3) + + if x_ref is not None: + delta_T = torch.cat([delta_T[..., :4], + delta_T[..., 4:] + x_ref.unsqueeze(-2) - transforms.quaternion_apply(delta_T[..., :4], + x_ref.unsqueeze(-2)) + ], dim=-1) # shape: (nXref, nT, 7) + if is_left_diff: + T = transforms.multiply_se3(delta_T, T0.unsqueeze(-3)) # shape: (nXref, nT, 7) + else: + T = transforms.multiply_se3(T0.unsqueeze(-3), delta_T) # shape: (nXref, nT, 7) + + return ( + T.type(dtype=input_dtype), # shape: (nXref, nT, 7) + delta_T.type(dtype=input_dtype), # shape: (nXref, nT, 7) + (ang_score.type(dtype=input_dtype), lin_score.type(dtype=input_dtype)), + # shape: (nXref, nT, 3), (nXref, nT, 3), + (ang_score_ref.type(dtype=input_dtype), lin_score_ref.type(dtype=input_dtype)) + # shape: (nXref, nT, 3), (nXref, nT, 3), + ) + diff --git a/diffuser_actor/trajectory_optimization/transforms.py b/diffuser_actor/trajectory_optimization/transforms.py new file mode 100644 index 0000000..84e2b06 --- /dev/null +++ b/diffuser_actor/trajectory_optimization/transforms.py @@ -0,0 +1,923 @@ +#### Codes borrowed from pytorch3d #### + + +from typing import Optional, Union, Tuple, List +import math +import torch +import torch.nn.functional as F + +Device = Union[str, torch.device] + + +def _sqrt_positive_part(x: torch.Tensor) -> torch.Tensor: + """ + Returns torch.sqrt(torch.max(0, x)) + but with a zero subgradient where x is 0. 
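+ (A plain torch.sqrt would backpropagate an infinite gradient at x == 0, since
+ d/dx sqrt(x) = 1 / (2 * sqrt(x)); masking keeps the backward pass finite.)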
+ """ + ret = torch.zeros_like(x) + positive_mask = x > 0 + ret[positive_mask] = torch.sqrt(x[positive_mask]) + return ret + + +def matrix_to_quaternion(matrix: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as rotation matrices to quaternions. + Args: + matrix: Rotation matrices as tensor of shape (..., 3, 3). + Returns: + quaternions with real part first, as tensor of shape (..., 4). + """ + if matrix.size(-1) != 3 or matrix.size(-2) != 3: + raise ValueError(f"Invalid rotation matrix shape {matrix.shape}.") + + batch_dim = matrix.shape[:-2] + m00, m01, m02, m10, m11, m12, m20, m21, m22 = torch.unbind( + matrix.reshape(batch_dim + (9,)), dim=-1 + ) + + q_abs = _sqrt_positive_part( + torch.stack( + [ + 1.0 + m00 + m11 + m22, + 1.0 + m00 - m11 - m22, + 1.0 - m00 + m11 - m22, + 1.0 - m00 - m11 + m22, + ], + dim=-1, + ) + ) + + # we produce the desired quaternion multiplied by each of r, i, j, k + quat_by_rijk = torch.stack( + [ + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([q_abs[..., 0] ** 2, m21 - m12, m02 - m20, m10 - m01], dim=-1), + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([m21 - m12, q_abs[..., 1] ** 2, m10 + m01, m02 + m20], dim=-1), + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([m02 - m20, m10 + m01, q_abs[..., 2] ** 2, m12 + m21], dim=-1), + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([m10 - m01, m20 + m02, m21 + m12, q_abs[..., 3] ** 2], dim=-1), + ], + dim=-2, + ) + + # We floor here at 0.1 but the exact level is not important; if q_abs is small, + # the candidate won't be picked. + flr = torch.tensor(0.1).to(dtype=q_abs.dtype, device=q_abs.device) + quat_candidates = quat_by_rijk / (2.0 * q_abs[..., None].max(flr)) + + # if not for numerical problems, quat_candidates[i] should be same (up to a sign), + # forall i; we pick the best-conditioned one (with the largest denominator) + + return quat_candidates[ + F.one_hot(q_abs.argmax(dim=-1), num_classes=4) > 0.5, : + ].reshape(batch_dim + (4,)) + + +def quaternion_to_matrix(quaternions: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as quaternions to rotation matrices. + Args: + quaternions: quaternions with real part first, + as tensor of shape (..., 4). + Returns: + Rotation matrices as tensor of shape (..., 3, 3). + """ + r, i, j, k = torch.unbind(quaternions, -1) + # pyre-fixme[58]: `/` is not supported for operand types `float` and `Tensor`. + two_s = 2.0 / (quaternions * quaternions).sum(-1) + + o = torch.stack( + ( + 1 - two_s * (j * j + k * k), + two_s * (i * j - k * r), + two_s * (i * k + j * r), + two_s * (i * j + k * r), + 1 - two_s * (i * i + k * k), + two_s * (j * k - i * r), + two_s * (i * k - j * r), + two_s * (j * k + i * r), + 1 - two_s * (i * i + j * j), + ), + -1, + ) + return o.reshape(quaternions.shape[:-1] + (3, 3)) + + +def quaternion_raw_multiply(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor: + """ + Multiply two quaternions. + Usual torch rules for broadcasting apply. + Args: + a: Quaternions as tensor of shape (..., 4), real part first. + b: Quaternions as tensor of shape (..., 4), real part first. + Returns: + The product of a and b, a tensor of quaternions shape (..., 4). 
+ """ + aw, ax, ay, az = torch.unbind(a, -1) + bw, bx, by, bz = torch.unbind(b, -1) + ow = aw * bw - ax * bx - ay * by - az * bz + ox = aw * bx + ax * bw + ay * bz - az * by + oy = aw * by - ax * bz + ay * bw + az * bx + oz = aw * bz + ax * by - ay * bx + az * bw + return torch.stack((ow, ox, oy, oz), -1) + + +def quaternion_invert(quaternion: torch.Tensor) -> torch.Tensor: + """ + Given a quaternion representing rotation, get the quaternion representing + its inverse. + Args: + quaternion: Quaternions as tensor of shape (..., 4), with real part + first, which must be versors (unit quaternions). + Returns: + The inverse, a tensor of quaternions of shape (..., 4). + """ + + scaling = torch.tensor([1, -1, -1, -1], device=quaternion.device) + return quaternion * scaling + + +def quaternion_apply(quaternion: torch.Tensor, point: torch.Tensor) -> torch.Tensor: + """ + Apply the rotation given by a quaternion to a 3D point. + Usual torch rules for broadcasting apply. + Args: + quaternion: Tensor of quaternions, real part first, of shape (..., 4). + point: Tensor of 3D points of shape (..., 3). + Returns: + Tensor of rotated points of shape (..., 3). + """ + if point.size(-1) != 3: + raise ValueError(f"Points are not in 3D, {point.shape}.") + real_parts = point.new_zeros(point.shape[:-1] + (1,)) + point_as_quaternion = torch.cat((real_parts, point), -1) + out = quaternion_raw_multiply( + quaternion_raw_multiply(quaternion, point_as_quaternion), + quaternion_invert(quaternion), + ) + return out[..., 1:] + + +def axis_angle_to_quaternion(axis_angle: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as axis/angle to quaternions. + Args: + axis_angle: Rotations given as a vector in axis angle form, + as a tensor of shape (..., 3), where the magnitude is + the angle turned anticlockwise in radians around the + vector's direction. + Returns: + quaternions with real part first, as tensor of shape (..., 4). + """ + angles = torch.norm(axis_angle, p=2, dim=-1, keepdim=True) + half_angles = angles * 0.5 + eps = 1e-6 + small_angles = angles.abs() < eps + sin_half_angles_over_angles = torch.empty_like(angles) + sin_half_angles_over_angles[~small_angles] = ( + torch.sin(half_angles[~small_angles]) / angles[~small_angles] + ) + # for x small, sin(x/2) is about x/2 - (x/2)^3/6 + # so sin(x/2)/x is about 1/2 - (x*x)/48 + sin_half_angles_over_angles[small_angles] = ( + 0.5 - (angles[small_angles] * angles[small_angles]) / 48 + ) + quaternions = torch.cat( + [torch.cos(half_angles), axis_angle * sin_half_angles_over_angles], dim=-1 + ) + return quaternions + + +def standardize_quaternion(quaternions: torch.Tensor) -> torch.Tensor: + """ + Convert a unit quaternion to a standard form: one in which the real + part is non negative. + Args: + quaternions: Quaternions with real part first, + as tensor of shape (..., 4). + Returns: + Standardized quaternions as tensor of shape (..., 4). + """ + return torch.where(quaternions[..., 0:1] < 0, -quaternions, quaternions) + + +def quaternion_multiply(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor: + """ + Multiply two quaternions representing rotations, returning the quaternion + representing their composition, i.e. the versor with nonnegative real part. + Usual torch rules for broadcasting apply. + Args: + a: Quaternions as tensor of shape (..., 4), real part first. + b: Quaternions as tensor of shape (..., 4), real part first. + Returns: + The product of a and b, a tensor of quaternions of shape (..., 4). 
+ """ + ab = quaternion_raw_multiply(a, b) + return standardize_quaternion(ab) + + +@torch.jit.script +def normalize_quaternion(q: torch.Tensor) -> torch.Tensor: + return q / torch.norm(q, dim=-1, keepdim=True) + + +def _angle_from_tan( + axis: str, other_axis: str, data, horizontal: bool, tait_bryan: bool +) -> torch.Tensor: + """ + Extract the first or third Euler angle from the two members of + the matrix which are positive constant times its sine and cosine. + Args: + axis: Axis label "X" or "Y or "Z" for the angle we are finding. + other_axis: Axis label "X" or "Y or "Z" for the middle axis in the + convention. + data: Rotation matrices as tensor of shape (..., 3, 3). + horizontal: Whether we are looking for the angle for the third axis, + which means the relevant entries are in the same row of the + rotation matrix. If not, they are in the same column. + tait_bryan: Whether the first and third axes in the convention differ. + Returns: + Euler Angles in radians for each matrix in data as a tensor + of shape (...). + """ + + i1, i2 = {"X": (2, 1), "Y": (0, 2), "Z": (1, 0)}[axis] + if horizontal: + i2, i1 = i1, i2 + even = (axis + other_axis) in ["XY", "YZ", "ZX"] + if horizontal == even: + return torch.atan2(data[..., i1], data[..., i2]) + if tait_bryan: + return torch.atan2(-data[..., i2], data[..., i1]) + return torch.atan2(data[..., i2], -data[..., i1]) + + +def _index_from_letter(letter: str) -> int: + if letter == "X": + return 0 + if letter == "Y": + return 1 + if letter == "Z": + return 2 + raise ValueError("letter must be either X, Y or Z.") + + +def matrix_to_euler_angles(matrix: torch.Tensor, convention: str) -> torch.Tensor: + """ + Convert rotations given as rotation matrices to Euler angles in radians. + Args: + matrix: Rotation matrices as tensor of shape (..., 3, 3). + convention: Convention string of three uppercase letters. + Returns: + Euler angles in radians as tensor of shape (..., 3). + """ + if len(convention) != 3: + raise ValueError("Convention must have 3 letters.") + if convention[1] in (convention[0], convention[2]): + raise ValueError(f"Invalid convention {convention}.") + for letter in convention: + if letter not in ("X", "Y", "Z"): + raise ValueError(f"Invalid letter {letter} in convention string.") + if matrix.size(-1) != 3 or matrix.size(-2) != 3: + raise ValueError(f"Invalid rotation matrix shape {matrix.shape}.") + i0 = _index_from_letter(convention[0]) + i2 = _index_from_letter(convention[2]) + tait_bryan = i0 != i2 + if tait_bryan: + central_angle = torch.asin( + matrix[..., i0, i2] * (-1.0 if i0 - i2 in [-1, 2] else 1.0) + ) + else: + central_angle = torch.acos(matrix[..., i0, i0]) + + o = ( + _angle_from_tan( + convention[0], convention[1], matrix[..., i2], False, tait_bryan + ), + central_angle, + _angle_from_tan( + convention[2], convention[1], matrix[..., i0, :], True, tait_bryan + ), + ) + return torch.stack(o, -1) + + +def _copysign(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor: + """ + Return a tensor where each element has the absolute value taken from the, + corresponding element of a, with sign taken from the corresponding + element of b. This is like the standard copysign floating-point operation, + but is not careful about negative 0 and NaN. + Args: + a: source tensor. + b: tensor whose signs will be used, of the same shape as a. + Returns: + Tensor of the same shape as a with the signs of b. 
+ """ + signs_differ = (a < 0) != (b < 0) + return torch.where(signs_differ, -a, a) + + +# def random_quaternions( +# n: int, *ns, dtype: Optional[torch.dtype] = None, device: Optional[Device] = None +# ) -> torch.Tensor: +# """ +# Generate random quaternions representing rotations, +# i.e. versors with nonnegative real part. +# Args: +# n: Number of quaternions in a batch to return. +# dtype: Type to return. +# device: Desired device of returned tensor. Default: +# uses the current device for the default tensor type. +# Returns: +# Quaternions as tensor of shape (N, 4). +# """ +# if isinstance(device, str): +# device = torch.device(device) +# shape = [n] + [i for i in ns] + [4] +# o = torch.randn(shape, dtype=dtype, device=device) +# s = (o * o).sum(dim=-1) +# o = o / _copysign(torch.sqrt(s), o[:, 0])[:, None] +# return o + +@torch.jit.script +def random_quaternions(n: int, device: Optional[Union[str, torch.device]] = None, dtype: Optional[torch.dtype] = None): + if isinstance(device, str): + device = torch.device(device) + q = torch.randn(n, 4, device=device, dtype=dtype) + + return standardize_quaternion(q / torch.norm(q, dim=-1, keepdim=True)) + + +def hat(v: torch.Tensor) -> torch.Tensor: + """ + Compute the Hat operator [1] of a batch of 3D vectors. + Args: + v: Batch of vectors of shape `(minibatch , 3)`. + Returns: + Batch of skew-symmetric matrices of shape + `(minibatch, 3 , 3)` where each matrix is of the form: + `[ 0 -v_z v_y ] + [ v_z 0 -v_x ] + [ -v_y v_x 0 ]` + Raises: + ValueError if `v` is of incorrect shape. + [1] https://en.wikipedia.org/wiki/Hat_operator + """ + + N, dim = v.shape + if dim != 3: + raise ValueError("Input vectors have to be 3-dimensional.") + + h = torch.zeros((N, 3, 3), dtype=v.dtype, device=v.device) + + x, y, z = v.unbind(1) + + h[:, 0, 1] = -z + h[:, 0, 2] = y + h[:, 1, 0] = z + h[:, 1, 2] = -x + h[:, 2, 0] = -y + h[:, 2, 1] = x + + return h + + +def hat_inv(h: torch.Tensor) -> torch.Tensor: + """ + Compute the inverse Hat operator [1] of a batch of 3x3 matrices. + Args: + h: Batch of skew-symmetric matrices of shape `(minibatch, 3, 3)`. + Returns: + Batch of 3d vectors of shape `(minibatch, 3, 3)`. + Raises: + ValueError if `h` is of incorrect shape. + ValueError if `h` not skew-symmetric. + [1] https://en.wikipedia.org/wiki/Hat_operator + """ + + N, dim1, dim2 = h.shape + if dim1 != 3 or dim2 != 3: + raise ValueError("Input has to be a batch of 3x3 Tensors.") + + ss_diff = torch.abs(h + h.permute(0, 2, 1)).max() + + HAT_INV_SKEW_SYMMETRIC_TOL = 1e-5 + if float(ss_diff) > HAT_INV_SKEW_SYMMETRIC_TOL: + raise ValueError("One of input matrices is not skew-symmetric.") + + x = h[:, 2, 1] + y = h[:, 0, 2] + z = h[:, 1, 0] + + v = torch.stack((x, y, z), dim=1) + + return v + + +def _so3_exp_map( + log_rot: torch.Tensor, eps: float = 0.0001 +) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: + """ + A helper function that computes the so3 exponential map and, + apart from the rotation matrix, also returns intermediate variables + that can be re-used in other functions. + """ + _, dim = log_rot.shape + if dim != 3: + raise ValueError("Input tensor shape has to be Nx3.") + + nrms = (log_rot * log_rot).sum(1) + # phis ... rotation angles + rot_angles = torch.clamp(nrms, eps).sqrt() + # pyre-fixme[58]: `/` is not supported for operand types `float` and `Tensor`. 
+ rot_angles_inv = 1.0 / rot_angles + fac1 = rot_angles_inv * rot_angles.sin() + fac2 = rot_angles_inv * rot_angles_inv * (1.0 - rot_angles.cos()) + skews = hat(log_rot) + skews_square = torch.bmm(skews, skews) + + R = ( + fac1[:, None, None] * skews + # pyre-fixme[16]: `float` has no attribute `__getitem__`. + + fac2[:, None, None] * skews_square + + torch.eye(3, dtype=log_rot.dtype, device=log_rot.device)[None] + ) + + return R, rot_angles, skews, skews_square + + +def _se3_V_matrix( + log_rotation: torch.Tensor, + log_rotation_hat: torch.Tensor, + log_rotation_hat_square: torch.Tensor, + rotation_angles: torch.Tensor, + eps: float = 1e-4, +) -> torch.Tensor: + """ + A helper function that computes the "V" matrix from [1], Sec 9.4.2. + [1] https://jinyongjeong.github.io/Download/SE3/jlblanco2010geometry3d_techrep.pdf + """ + + V = ( + torch.eye(3, dtype=log_rotation.dtype, device=log_rotation.device)[None] + + log_rotation_hat + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and `int`. + * ((1 - torch.cos(rotation_angles)) / (rotation_angles ** 2))[:, None, None] + + ( + log_rotation_hat_square + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + * ((rotation_angles - torch.sin(rotation_angles)) / (rotation_angles ** 3))[ + :, None, None + ] + ) + ) + + return V + + +def se3_exp_map(log_transform: torch.Tensor, eps: float = 1e-4) -> torch.Tensor: + """ + Convert a batch of logarithmic representations of SE(3) matrices `log_transform` + to a batch of 4x4 SE(3) matrices using the exponential map. + See e.g. [1], Sec 9.4.2. for more detailed description. + A SE(3) matrix has the following form: + ``` + [ R T ] + [ 0 1 ] , + ``` + where `R` is a 3x3 rotation matrix and `T` is a 3-D translation vector. + SE(3) matrices are commonly used to represent rigid motions or camera extrinsics. + In the SE(3) logarithmic representation SE(3) matrices are + represented as 6-dimensional vectors `[log_translation | log_rotation]`, + i.e. a concatenation of two 3D vectors `log_translation` and `log_rotation`. + The conversion from the 6D representation to a 4x4 SE(3) matrix `transform` + is done as follows: + ``` + transform = exp( [ hat(log_rotation) log_translation ] + [ 0 1 ] ) , + ``` + where `exp` is the matrix exponential and `hat` is the Hat operator [2]. + Note that for any `log_transform` with `0 <= ||log_rotation|| < 2pi` + (i.e. the rotation angle is between 0 and 2pi), the following identity holds: + ``` + se3_log_map(se3_exponential_map(log_transform)) == log_transform + ``` + The conversion has a singularity around `||log(transform)|| = 0` + which is handled by clamping controlled with the `eps` argument. + Args: + log_transform: Batch of vectors of shape `(minibatch, 6)`. + eps: A threshold for clipping the squared norm of the rotation logarithm + to avoid unstable gradients in the singular case. + Returns: + Batch of transformation matrices of shape `(minibatch, 4, 4)`. + Raises: + ValueError if `log_transform` is of incorrect shape. 
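+
+    Example (illustrative; the zero logarithm maps to the identity transform):
+        log_T = torch.zeros(2, 6, dtype=torch.float64)
+        se3_exp_map(log_T)  # -> a batch of two 4x4 identity matrices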
+ [1] https://jinyongjeong.github.io/Download/SE3/jlblanco2010geometry3d_techrep.pdf + [2] https://en.wikipedia.org/wiki/Hat_operator + """ + + if log_transform.ndim != 2 or log_transform.shape[1] != 6: + raise ValueError("Expected input to be of shape (N, 6).") + + N, _ = log_transform.shape + + log_translation = log_transform[..., :3] + log_rotation = log_transform[..., 3:] + + # rotation is an exponential map of log_rotation + ( + R, + rotation_angles, + log_rotation_hat, + log_rotation_hat_square, + ) = _so3_exp_map(log_rotation, eps=eps) + + # translation is V @ T + V = _se3_V_matrix( + log_rotation, + log_rotation_hat, + log_rotation_hat_square, + rotation_angles, + eps=eps, + ) + T = torch.bmm(V, log_translation[:, :, None])[:, :, 0] + + transform = torch.zeros( + N, 4, 4, dtype=log_transform.dtype, device=log_transform.device + ) + + transform[:, :3, :3] = R + transform[:, :3, 3] = T + transform[:, 3, 3] = 1.0 + + return transform + + +DEFAULT_ACOS_BOUND: float = 1.0 - 1e-4 + + +def acos_linear_extrapolation( + x: torch.Tensor, + bounds: Tuple[float, float] = (-DEFAULT_ACOS_BOUND, DEFAULT_ACOS_BOUND), +) -> torch.Tensor: + """ + Implements `arccos(x)` which is linearly extrapolated outside `x`'s original + domain of `(-1, 1)`. This allows for stable backpropagation in case `x` + is not guaranteed to be strictly within `(-1, 1)`. + More specifically:: + bounds=(lower_bound, upper_bound) + if lower_bound <= x <= upper_bound: + acos_linear_extrapolation(x) = acos(x) + elif x <= lower_bound: # 1st order Taylor approximation + acos_linear_extrapolation(x) + = acos(lower_bound) + dacos/dx(lower_bound) * (x - lower_bound) + else: # x >= upper_bound + acos_linear_extrapolation(x) + = acos(upper_bound) + dacos/dx(upper_bound) * (x - upper_bound) + Args: + x: Input `Tensor`. + bounds: A float 2-tuple defining the region for the + linear extrapolation of `acos`. + The first/second element of `bound` + describes the lower/upper bound that defines the lower/upper + extrapolation region, i.e. the region where + `x <= bound[0]`/`bound[1] <= x`. + Note that all elements of `bound` have to be within (-1, 1). + Returns: + acos_linear_extrapolation: `Tensor` containing the extrapolated `arccos(x)`. + """ + + lower_bound, upper_bound = bounds + + if lower_bound > upper_bound: + raise ValueError("lower bound has to be smaller or equal to upper bound.") + + if lower_bound <= -1.0 or upper_bound >= 1.0: + raise ValueError("Both lower bound and upper bound have to be within (-1, 1).") + + # init an empty tensor and define the domain sets + acos_extrap = torch.empty_like(x) + x_upper = x >= upper_bound + x_lower = x <= lower_bound + x_mid = (~x_upper) & (~x_lower) + + # acos calculation for upper_bound < x < lower_bound + acos_extrap[x_mid] = torch.acos(x[x_mid]) + # the linear extrapolation for x >= upper_bound + acos_extrap[x_upper] = _acos_linear_approximation(x[x_upper], upper_bound) + # the linear extrapolation for x <= lower_bound + acos_extrap[x_lower] = _acos_linear_approximation(x[x_lower], lower_bound) + + return acos_extrap + + +def _acos_linear_approximation(x: torch.Tensor, x0: float) -> torch.Tensor: + """ + Calculates the 1st order Taylor expansion of `arccos(x)` around `x0`. + """ + return (x - x0) * _dacos_dx(x0) + math.acos(x0) + + +def _dacos_dx(x: float) -> float: + """ + Calculates the derivative of `arccos(x)` w.r.t. `x`. 
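+    That is, d/dx arccos(x) = -1 / sqrt(1 - x^2), defined for |x| < 1.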
+ """ + return (-1.0) / math.sqrt(1.0 - x * x) + + +def so3_rotation_angle( + R: torch.Tensor, + eps: float = 1e-4, + cos_angle: bool = False, + cos_bound: float = 1e-4, +) -> torch.Tensor: + """ + Calculates angles (in radians) of a batch of rotation matrices `R` with + `angle = acos(0.5 * (Trace(R)-1))`. The trace of the + input matrices is checked to be in the valid range `[-1-eps,3+eps]`. + The `eps` argument is a small constant that allows for small errors + caused by limited machine precision. + Args: + R: Batch of rotation matrices of shape `(minibatch, 3, 3)`. + eps: Tolerance for the valid trace check. + cos_angle: If==True return cosine of the rotation angles rather than + the angle itself. This can avoid the unstable + calculation of `acos`. + cos_bound: Clamps the cosine of the rotation angle to + [-1 + cos_bound, 1 - cos_bound] to avoid non-finite outputs/gradients + of the `acos` call. Note that the non-finite outputs/gradients + are returned when the angle is requested (i.e. `cos_angle==False`) + and the rotation angle is close to 0 or π. + Returns: + Corresponding rotation angles of shape `(minibatch,)`. + If `cos_angle==True`, returns the cosine of the angles. + Raises: + ValueError if `R` is of incorrect shape. + ValueError if `R` has an unexpected trace. + """ + + N, dim1, dim2 = R.shape + if dim1 != 3 or dim2 != 3: + raise ValueError("Input has to be a batch of 3x3 Tensors.") + + rot_trace = R[:, 0, 0] + R[:, 1, 1] + R[:, 2, 2] + + if ((rot_trace < -1.0 - eps) + (rot_trace > 3.0 + eps)).any(): + raise ValueError("A matrix has trace outside valid range [-1-eps,3+eps].") + + # phi ... rotation angle + phi_cos = (rot_trace - 1.0) * 0.5 + + if cos_angle: + return phi_cos + else: + if cos_bound > 0.0: + bound = 1.0 - cos_bound + return acos_linear_extrapolation(phi_cos, (-bound, bound)) + else: + return torch.acos(phi_cos) + + +def so3_log_map( + R: torch.Tensor, eps: float = 0.0001, cos_bound: float = 1e-4 +) -> torch.Tensor: + """ + Convert a batch of 3x3 rotation matrices `R` + to a batch of 3-dimensional matrix logarithms of rotation matrices + The conversion has a singularity around `(R=I)` which is handled + by clamping controlled with the `eps` and `cos_bound` arguments. + Args: + R: batch of rotation matrices of shape `(minibatch, 3, 3)`. + eps: A float constant handling the conversion singularity. + cos_bound: Clamps the cosine of the rotation angle to + [-1 + cos_bound, 1 - cos_bound] to avoid non-finite outputs/gradients + of the `acos` call when computing `so3_rotation_angle`. + Note that the non-finite outputs/gradients are returned when + the rotation angle is close to 0 or π. + Returns: + Batch of logarithms of input rotation matrices + of shape `(minibatch, 3)`. + Raises: + ValueError if `R` is of incorrect shape. + ValueError if `R` has an unexpected trace. + """ + + N, dim1, dim2 = R.shape + if dim1 != 3 or dim2 != 3: + raise ValueError("Input has to be a batch of 3x3 Tensors.") + + phi = so3_rotation_angle(R, cos_bound=cos_bound, eps=eps) + + phi_sin = torch.sin(phi) + + # We want to avoid a tiny denominator of phi_factor = phi / (2.0 * phi_sin). + # Hence, for phi_sin.abs() <= 0.5 * eps, we approximate phi_factor with + # 2nd order Taylor expansion: phi_factor = 0.5 + (1.0 / 12) * phi**2 + phi_factor = torch.empty_like(phi) + ok_denom = phi_sin.abs() > (0.5 * eps) + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and `int`. 
+ phi_factor[~ok_denom] = 0.5 + (phi[~ok_denom] ** 2) * (1.0 / 12) + phi_factor[ok_denom] = phi[ok_denom] / (2.0 * phi_sin[ok_denom]) + + log_rot_hat = phi_factor[:, None, None] * (R - R.permute(0, 2, 1)) + + log_rot = hat_inv(log_rot_hat) + + return log_rot + + +def _get_se3_V_input(log_rotation: torch.Tensor, eps: float = 1e-4): + """ + A helper function that computes the input variables to the `_se3_V_matrix` + function. + """ + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and `int`. + nrms = (log_rotation ** 2).sum(-1) + rotation_angles = torch.clamp(nrms, eps).sqrt() + log_rotation_hat = hat(log_rotation) + log_rotation_hat_square = torch.bmm(log_rotation_hat, log_rotation_hat) + return log_rotation, log_rotation_hat, log_rotation_hat_square, rotation_angles + + +def se3_log_map( + transform: torch.Tensor, eps: float = 1e-4, cos_bound: float = 1e-4 +) -> torch.Tensor: + """ + Convert a batch of 4x4 transformation matrices `transform` + to a batch of 6-dimensional SE(3) logarithms of the SE(3) matrices. + See e.g. [1], Sec 9.4.2. for more detailed description. + A SE(3) matrix has the following form: + ``` + [ R 0 ] + [ T 1 ] , + ``` + where `R` is an orthonormal 3x3 rotation matrix and `T` is a 3-D translation vector. + SE(3) matrices are commonly used to represent rigid motions or camera extrinsics. + In the SE(3) logarithmic representation SE(3) matrices are + represented as 6-dimensional vectors `[log_translation | log_rotation]`, + i.e. a concatenation of two 3D vectors `log_translation` and `log_rotation`. + The conversion from the 4x4 SE(3) matrix `transform` to the + 6D representation `log_transform = [log_translation | log_rotation]` + is done as follows: + ``` + log_transform = log(transform) + log_translation = log_transform[3, :3] + log_rotation = inv_hat(log_transform[:3, :3]) + ``` + where `log` is the matrix logarithm + and `inv_hat` is the inverse of the Hat operator [2]. + Note that for any valid 4x4 `transform` matrix, the following identity holds: + ``` + se3_exp_map(se3_log_map(transform)) == transform + ``` + The conversion has a singularity around `(transform=I)` which is handled + by clamping controlled with the `eps` and `cos_bound` arguments. + Args: + transform: batch of SE(3) matrices of shape `(minibatch, 4, 4)`. + eps: A threshold for clipping the squared norm of the rotation logarithm + to avoid division by zero in the singular case. + cos_bound: Clamps the cosine of the rotation angle to + [-1 + cos_bound, 3 - cos_bound] to avoid non-finite outputs. + The non-finite outputs can be caused by passing small rotation angles + to the `acos` function in `so3_rotation_angle` of `so3_log_map`. + Returns: + Batch of logarithms of input SE(3) matrices + of shape `(minibatch, 6)`. + Raises: + ValueError if `transform` is of incorrect shape. + ValueError if `R` has an unexpected trace. 
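+
+    Example (illustrative round trip, assuming `transform` holds valid SE(3) matrices):
+        log_T = se3_log_map(transform)  # (minibatch, 6)
+        se3_exp_map(log_T)              # (minibatch, 4, 4), approximately recovers `transform`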
+ [1] https://jinyongjeong.github.io/Download/SE3/jlblanco2010geometry3d_techrep.pdf + [2] https://en.wikipedia.org/wiki/Hat_operator + """ + + if transform.ndim != 3: + raise ValueError("Input tensor shape has to be (N, 4, 4).") + + N, dim1, dim2 = transform.shape + if dim1 != 4 or dim2 != 4: + raise ValueError("Input tensor shape has to be (N, 4, 4).") + + if not torch.allclose(transform[:, 3, :3], torch.zeros_like(transform[:, 3, :3])): + raise ValueError("All elements of `transform[:, 3, :3]` should be 0.") + + # log_rot is just so3_log_map of the upper left 3x3 block + R = transform[:, :3, :3] + log_rotation = so3_log_map(R, eps=eps, cos_bound=cos_bound) + + # log_translation is V^-1 @ T + T = transform[:, :3, 3] + V = _se3_V_matrix(*_get_se3_V_input(log_rotation), eps=eps) + log_translation = torch.linalg.solve(V, T[:, :])[:, :] + + return torch.cat((log_translation, log_rotation), dim=1) + + +def quaternion_to_axis_angle(quaternions: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as quaternions to axis/angle. + Args: + quaternions: quaternions with real part first, + as tensor of shape (..., 4). + Returns: + Rotations given as a vector in axis angle form, as a tensor + of shape (..., 3), where the magnitude is the angle + turned anticlockwise in radians around the vector's + direction. + """ + norms = torch.norm(quaternions[..., 1:], p=2, dim=-1, keepdim=True) + half_angles = torch.atan2(norms, quaternions[..., :1]) + angles = 2 * half_angles + eps = 1e-6 + small_angles = angles.abs() < eps + sin_half_angles_over_angles = torch.empty_like(angles) + sin_half_angles_over_angles[~small_angles] = ( + torch.sin(half_angles[~small_angles]) / angles[~small_angles] + ) + # for x small, sin(x/2) is about x/2 - (x/2)^3/6 + # so sin(x/2)/x is about 1/2 - (x*x)/48 + sin_half_angles_over_angles[small_angles] = ( + 0.5 - (angles[small_angles] * angles[small_angles]) / 48 + ) + return quaternions[..., 1:] / sin_half_angles_over_angles + + +def axis_angle_to_quaternion(axis_angle: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as axis/angle to quaternions. + Args: + axis_angle: Rotations given as a vector in axis angle form, + as a tensor of shape (..., 3), where the magnitude is + the angle turned anticlockwise in radians around the + vector's direction. + Returns: + quaternions with real part first, as tensor of shape (..., 4). 
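+
+    Example (illustrative; a rotation of pi about the z-axis):
+        axis_angle_to_quaternion(torch.tensor([0.0, 0.0, math.pi]))
+        # -> approx. tensor([0., 0., 0., 1.])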
+ """ + angles = torch.norm(axis_angle, p=2, dim=-1, keepdim=True) + half_angles = angles * 0.5 + eps = 1e-6 + small_angles = angles.abs() < eps + sin_half_angles_over_angles = torch.empty_like(angles) + sin_half_angles_over_angles[~small_angles] = ( + torch.sin(half_angles[~small_angles]) / angles[~small_angles] + ) + # for x small, sin(x/2) is about x/2 - (x/2)^3/6 + # so sin(x/2)/x is about 1/2 - (x*x)/48 + sin_half_angles_over_angles[small_angles] = ( + 0.5 - (angles[small_angles] * angles[small_angles]) / 48 + ) + quaternions = torch.cat( + [torch.cos(half_angles), axis_angle * sin_half_angles_over_angles], dim=-1 + ) + return quaternions + + +@torch.jit.script +def multiply_se3(T1: torch.Tensor, T2: torch.Tensor, pre_normalize: bool = False, + post_normalize: bool = True) -> torch.Tensor: + if len(T1) == 1 or len(T2) == 1: + assert T1.shape[1:] == T2.shape[1:], f"Shape mismatch: T1: {T1.shape} || T2: {T2.shape}" + elif T1.ndim + 1 == T2.ndim: + assert T1.shape[:] == T2.shape[1:], f"Shape mismatch: T1: {T1.shape} || T2: {T2.shape}" + elif T1.ndim == T2.ndim + 1: + assert T1.shape[1:] == T2.shape[:], f"Shape mismatch: T1: {T1.shape} || T2: {T2.shape}" + else: + assert T1.shape == T2.shape, f"Shape mismatch: T1: {T1.shape} || T2: {T2.shape}" + + q1, x1 = T1[..., :4], T1[..., 4:] + q2, x2 = T2[..., :4], T2[..., 4:] + if pre_normalize: + q1 = normalize_quaternion(q1) + q2 = normalize_quaternion(q2) + + x = quaternion_apply(q1, x2) + x1 + q = quaternion_multiply(q1, q2) + if post_normalize: + q = normalize_quaternion(q) + + return torch.cat([q, x], dim=-1) + + +@torch.jit.script +def se3_invert(T: torch.Tensor) -> torch.Tensor: + qinv = quaternion_invert(T[..., :4]) + return torch.cat([qinv, quaternion_apply(qinv, -T[..., 4:])], dim=-1) + + +@torch.jit.script +def quaternion_identity(n: int, device: Optional[torch.device] = None, + dtype: Optional[torch.dtype] = None) -> torch.Tensor: + return torch.tensor((1., 0., 0., 0.), device=device, dtype=dtype).repeat((n, 1)) + + +@torch.jit.script +def se3_from_r3(x: torch.Tensor) -> torch.Tensor: + return torch.cat([torch.ones_like(x[..., 0:1]), torch.zeros_like(x[..., :3]), x], dim=-1) \ No newline at end of file diff --git a/diffuser_actor/utils/__init__.py b/diffuser_actor/utils/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/diffuser_actor/utils/clip.py b/diffuser_actor/utils/clip.py new file mode 100644 index 0000000..e79db28 --- /dev/null +++ b/diffuser_actor/utils/clip.py @@ -0,0 +1,43 @@ +# Adapted from https://github.com/openai/CLIP/blob/main/clip/model.py + +import torch + +import clip +from clip.model import ModifiedResNet + + +def load_clip(): + clip_model, clip_transforms = clip.load("RN50") + state_dict = clip_model.state_dict() + layers = tuple([len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) + for b in [1, 2, 3, 4]]) + output_dim = state_dict["text_projection"].shape[1] + heads = state_dict["visual.layer1.0.conv1.weight"].shape[0] * 32 // 64 + backbone = ModifiedResNetFeatures(layers, output_dim, heads) + backbone.load_state_dict(clip_model.visual.state_dict()) + normalize = clip_transforms.transforms[-1] + return backbone, normalize + + +class ModifiedResNetFeatures(ModifiedResNet): + def __init__(self, layers, output_dim, heads, input_resolution=224, width=64): + super().__init__(layers, output_dim, heads, input_resolution, width) + + def forward(self, x: torch.Tensor): + x = x.type(self.conv1.weight.dtype) + x = self.relu1(self.bn1(self.conv1(x))) + x = 
self.relu2(self.bn2(self.conv2(x))) + x0 = self.relu3(self.bn3(self.conv3(x))) + x = self.avgpool(x0) + x1 = self.layer1(x) + x2 = self.layer2(x1) + x3 = self.layer3(x2) + x4 = self.layer4(x3) + + return { + "res1": x0, + "res2": x1, + "res3": x2, + "res4": x3, + "res5": x4, + } diff --git a/diffuser_actor/utils/converter.py b/diffuser_actor/utils/converter.py new file mode 100644 index 0000000..b0552a3 --- /dev/null +++ b/diffuser_actor/utils/converter.py @@ -0,0 +1,191 @@ +from typing import OrderedDict + +import torch.nn as nn +import torch.nn.functional as F +from torchvision.ops import FeaturePyramidNetwork + +from .multihead_custom_attention import MultiheadCustomAttention +# from .multihead_flash_attention import MultiheadFlashAttention + + +# def convert_multihead_flash_attention(module, process_group=None): +# r"""Helper function to convert all `MultiheadCustomAttention` layers in +# the model to `MultiheadFlashAttention` layers. +# +# Follow the implementation of torch.nn.SyncBatchNorm.convert_sync_batchnorm +# +# Args: +# module (nn.Module): module containing one or more +# `MultiheadCustomAttention` layers +# process_group (optional): process group to scope synchronization, +# default is the whole world +# +# Returns: +# The original `module` with the converted `MultiheadFlashAttention` +# layers. If the original `module` is a `MultiheadCustomAttention` +# layer, a new `MultiheadFlashAttention` layer object will be returned +# instead. +# """ +# module_output = module +# if isinstance(module, MultiheadCustomAttention): +# module_output = MultiheadFlashAttention( +# embed_dim=module.embed_dim, +# num_heads=module.num_heads, +# dropout=module.dropout, +# bias=module.in_proj_bias is not None, +# add_bias_kv=module.bias_k is not None, +# add_zero_attn=module.add_zero_attn, +# kdim=module.kdim, +# vdim=module.vdim, +# slot_competition=module.slot_competition, +# return_kv=module.return_kv, +# gate_attn=module.gate_attn is not None +# ) +# for name, child in module.named_children(): +# module_output.add_module( +# name, convert_multihead_flash_attention(child, process_group) +# ) +# del module +# return module_output + + +def convert_diffusion_scheduler(model, diffusion_scheduler, **kwargs): + """Convert model.rotation_noise_scheduler and model.position_noise_scheduler + to specified scheduler + """ + + config = {} + if diffusion_scheduler == 'DDIM': + from diffusers.schedulers.scheduling_ddim import DDIMScheduler + diffusion_class = DDIMScheduler + else: + raise NotImplementedError + + num_eval_steps = kwargs["num_eval_timesteps"] + model.rotation_noise_scheduler = diffusion_class( + num_train_timesteps=model.rotation_noise_scheduler.config.num_train_timesteps, + beta_schedule=model.rotation_noise_scheduler.config.beta_schedule, + prediction_type=model.rotation_noise_scheduler.config.prediction_type, + **config, + ) + model.position_noise_scheduler = diffusion_class( + num_train_timesteps=model.position_noise_scheduler.config.num_train_timesteps, + beta_schedule=model.position_noise_scheduler.config.beta_schedule, + prediction_type=model.position_noise_scheduler.config.prediction_type, + **config, + ) + model.n_steps = kwargs["num_eval_timesteps"] + + +class EfficientFeaturePyramidNetwork(FeaturePyramidNetwork): + def __init__( + self, + in_channels_list, + out_channels, + extra_blocks=None, + norm_layer=None, + output_level="res3" + ): + super().__init__( + in_channels_list, + out_channels, + extra_blocks, + norm_layer, + ) + self.output_level = output_level + + def forward(self, 
x): + """ + Computes the FPN for a set of feature maps. + + Args: + x (OrderedDict[Tensor]): feature maps for each feature level. + level_name: the level name to stop the FPN computation at. If None, + the entire FPN is computed. + + Returns: + results (OrderedDict[Tensor]): feature maps after FPN layers. + They are ordered from the highest resolution first. + """ + # unpack OrderedDict into two lists for easier handling + names = list(x.keys()) + x = list(x.values()) + + last_inner = self.get_result_from_inner_blocks(x[-1], -1) + results = [] + results.append(self.get_result_from_layer_blocks(last_inner, -1)) + + for idx in range(len(x) - 2, -1, -1): + inner_lateral = self.get_result_from_inner_blocks(x[idx], idx) + feat_shape = inner_lateral.shape[-2:] + inner_top_down = F.interpolate(last_inner, size=feat_shape, mode="nearest") + last_inner = inner_lateral + inner_top_down + results.insert(0, self.get_result_from_layer_blocks(last_inner, idx)) + + # Don't go over all levels to save compute + if names[idx] == self.output_level: + names = names[idx:] + break + + if self.extra_blocks is not None: + results, names = self.extra_blocks(results, x, names) + + # make it back an OrderedDict + out = OrderedDict([(k, v) for k, v in zip(names, results)]) + + return out + + +def convert_efficient_fpn(model): + + def _convert_efficient_fpn(module, output_level): + module_output = module + if isinstance(module, FeaturePyramidNetwork): + in_channels_list = [ + m[0].in_channels + for m in module.inner_blocks + ] + out_channels = module.inner_blocks[-1][-1].out_channels + module_output = EfficientFeaturePyramidNetwork( + in_channels_list, out_channels, output_level=output_level + ) + for name, child in module.named_children(): + module_output.add_module( + name, _convert_efficient_fpn(child, output_level) + ) + del module + return module_output + + # Very hackish, requires to know the inner structure of model + output_level = model.prediction_head.feature_map_pyramid[ + model.prediction_head.feat_scales-1 + ] + return _convert_efficient_fpn(model, output_level) + + +def convert_weights(model: nn.Module): + """Convert applicable model parameters to fp16""" + + def _convert_weights_to_fp16(l): + if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)): + l.weight.data = l.weight.data.half() + if l.bias is not None: + l.bias.data = l.bias.data.half() + + # if isinstance(l, (nn.MultiheadAttention, MultiheadCustomAttention, MultiheadFlashAttention)): + if isinstance(l, (nn.MultiheadAttention, MultiheadCustomAttention)): + for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]: + tensor = getattr(l, attr) + if tensor is not None: + tensor.data = tensor.data.half() + + for name in ["text_projection", "proj"]: + if hasattr(l, name): + attr = getattr(l, name) + if attr is not None: + attr.data = attr.data.half() + + model.apply(_convert_weights_to_fp16) + +def convert_half_precision(model): + convert_weights(model) \ No newline at end of file diff --git a/diffuser_actor/utils/encoder.py b/diffuser_actor/utils/encoder.py new file mode 100644 index 0000000..8b72fb0 --- /dev/null +++ b/diffuser_actor/utils/encoder.py @@ -0,0 +1,289 @@ +import dgl.geometry as dgl_geo +import einops +import torch +from torch import nn +from torch.nn import functional as F +from torchvision.ops import FeaturePyramidNetwork + +from .position_encodings import RotaryPositionEncoding3D +from .layers import FFWRelativeCrossAttentionModule, ParallelAttention +from .resnet import load_resnet50, 
load_resnet18 +from .clip import load_clip + + +class Encoder(nn.Module): + + def __init__(self, + backbone="clip", + image_size=(256, 256), + embedding_dim=60, + num_sampling_level=3, + nhist=3, + num_attn_heads=8, + num_vis_ins_attn_layers=2, + fps_subsampling_factor=5): + super().__init__() + assert backbone in ["resnet50", "resnet18", "clip"] + assert image_size in [(128, 128), (256, 256)] + assert num_sampling_level in [1, 2, 3, 4] + + self.image_size = image_size + self.num_sampling_level = num_sampling_level + self.fps_subsampling_factor = fps_subsampling_factor + + # Frozen backbone + if backbone == "resnet50": + self.backbone, self.normalize = load_resnet50() + elif backbone == "resnet18": + self.backbone, self.normalize = load_resnet18() + elif backbone == "clip": + self.backbone, self.normalize = load_clip() + for p in self.backbone.parameters(): + p.requires_grad = False + + # Semantic visual features at different scales + self.feature_pyramid = FeaturePyramidNetwork( + [64, 256, 512, 1024, 2048], embedding_dim + ) + if self.image_size == (128, 128): + # Coarse RGB features are the 2nd layer of the feature pyramid + # at 1/4 resolution (32x32) + # Fine RGB features are the 1st layer of the feature pyramid + # at 1/2 resolution (64x64) + self.coarse_feature_map = ['res2', 'res1', 'res1', 'res1'] + self.downscaling_factor_pyramid = [4, 2, 2, 2] + elif self.image_size == (256, 256): + # Coarse RGB features are the 3rd layer of the feature pyramid + # at 1/8 resolution (32x32) + # Fine RGB features are the 1st layer of the feature pyramid + # at 1/2 resolution (128x128) + self.feature_map_pyramid = ['res3', 'res1', 'res1', 'res1'] + self.downscaling_factor_pyramid = [8, 2, 2, 2] + + # 3D relative positional embeddings + self.relative_pe_layer = RotaryPositionEncoding3D(embedding_dim) + + # Current gripper learnable features + self.curr_gripper_embed = nn.Embedding(nhist, embedding_dim) + self.gripper_context_head = FFWRelativeCrossAttentionModule( + embedding_dim, num_attn_heads, num_layers=3, use_adaln=False + ) + + # Goal gripper learnable features + self.goal_gripper_embed = nn.Embedding(1, embedding_dim) + + # Instruction encoder + self.instruction_encoder = nn.Linear(512, embedding_dim) + + # Attention from vision to language + layer = ParallelAttention( + num_layers=num_vis_ins_attn_layers, + d_model=embedding_dim, n_heads=num_attn_heads, + self_attention1=False, self_attention2=False, + cross_attention1=True, cross_attention2=False + ) + self.vl_attention = nn.ModuleList([ + layer + for _ in range(1) + for _ in range(1) + ]) + + def forward(self): + return None + + def encode_curr_gripper(self, curr_gripper, context_feats, context): + """ + Compute current gripper position features and positional embeddings. + + Args: + - curr_gripper: (B, nhist, 3+) + + Returns: + - curr_gripper_feats: (B, nhist, F) + - curr_gripper_pos: (B, nhist, F, 2) + """ + return self._encode_gripper(curr_gripper, self.curr_gripper_embed, + context_feats, context) + + def encode_goal_gripper(self, goal_gripper, context_feats, context): + """ + Compute goal gripper position features and positional embeddings. 
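+        (Delegates to `_encode_gripper` with a single query point,
+        `goal_gripper[:, None]`, and the learnable goal-gripper embedding.)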
+ + Args: + - goal_gripper: (B, 3+) + + Returns: + - goal_gripper_feats: (B, 1, F) + - goal_gripper_pos: (B, 1, F, 2) + """ + goal_gripper_feats, goal_gripper_pos = self._encode_gripper( + goal_gripper[:, None], self.goal_gripper_embed, + context_feats, context + ) + return goal_gripper_feats, goal_gripper_pos + + def _encode_gripper(self, gripper, gripper_embed, context_feats, context): + """ + Compute gripper position features and positional embeddings. + + Args: + - gripper: (B, npt, 3+) + - context_feats: (B, npt, C) + - context: (B, npt, 3) + + Returns: + - gripper_feats: (B, npt, F) + - gripper_pos: (B, npt, F, 2) + """ + # Learnable embedding for gripper + gripper_feats = gripper_embed.weight.unsqueeze(0).repeat( + len(gripper), 1, 1 + ) + + # Rotary positional encoding + gripper_pos = self.relative_pe_layer(gripper[..., :3]) + context_pos = self.relative_pe_layer(context) + + gripper_feats = einops.rearrange( + gripper_feats, 'b npt c -> npt b c' + ) + context_feats = einops.rearrange( + context_feats, 'b npt c -> npt b c' + ) + gripper_feats = self.gripper_context_head( + query=gripper_feats, value=context_feats, + query_pos=gripper_pos, value_pos=context_pos + )[-1] + gripper_feats = einops.rearrange( + gripper_feats, 'nhist b c -> b nhist c' + ) + + return gripper_feats, gripper_pos + + def encode_images(self, rgb, pcd): + """ + Compute visual features/pos embeddings at different scales. + + Args: + - rgb: (B, ncam, 3, H, W), pixel intensities + - pcd: (B, ncam, 3, H, W), positions + + Returns: + - rgb_feats_pyramid: [(B, ncam, F, H_i, W_i)] + - pcd_pyramid: [(B, ncam * H_i * W_i, 3)] + """ + num_cameras = rgb.shape[1] + + # Pass each view independently through backbone + rgb = einops.rearrange(rgb, "bt ncam c h w -> (bt ncam) c h w") + rgb = self.normalize(rgb) + rgb_features = self.backbone(rgb) + + # Pass visual features through feature pyramid network + rgb_features = self.feature_pyramid(rgb_features) + + # Treat different cameras separately + pcd = einops.rearrange(pcd, "bt ncam c h w -> (bt ncam) c h w") + + rgb_feats_pyramid = [] + pcd_pyramid = [] + for i in range(self.num_sampling_level): + # Isolate level's visual features + rgb_features_i = rgb_features[self.feature_map_pyramid[i]] + + # Interpolate xy-depth to get the locations for this level + feat_h, feat_w = rgb_features_i.shape[-2:] + pcd_i = F.interpolate( + pcd, + (feat_h, feat_w), + mode='bilinear' + ) + + # # ZXP visualize pcd_i[0] + # import matplotlib.pyplot as plt + # pcd_0 = pcd_i.reshape(40, 3, 3, 32, 32)[0].clone().detach().cpu().permute(0, 2, 3, 1).reshape(-1, 3) + # fig = plt.figure(figsize=(15, 15)) + # ax = fig.add_subplot(projection='3d') + # ax.scatter(pcd_0[:, 0], pcd_0[:, 1], pcd_0[:, 2], marker='.') + # ax.set_xlabel('X Label') + # ax.set_ylabel('Y Label') + # ax.set_zlabel('Z Label') + # plt.show() + + # Merge different cameras for clouds, separate for rgb features + h, w = pcd_i.shape[-2:] + pcd_i = einops.rearrange( + pcd_i, + "(bt ncam) c h w -> bt (ncam h w) c", ncam=num_cameras + ) + rgb_features_i = einops.rearrange( + rgb_features_i, + "(bt ncam) c h w -> bt ncam c h w", ncam=num_cameras + ) + + rgb_feats_pyramid.append(rgb_features_i) + pcd_pyramid.append(pcd_i) + + return rgb_feats_pyramid, pcd_pyramid + + def encode_instruction(self, instruction): + """ + Compute language features/pos embeddings on top of CLIP features. 
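+        (Language tokens are assigned dummy all-zero 3D positions before the
+        rotary positional encoding, i.e. every token shares the same position.)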
+ + Args: + - instruction: (B, max_instruction_length, 512) + + Returns: + - instr_feats: (B, 53, F) + - instr_dummy_pos: (B, 53, F, 2) + """ + instr_feats = self.instruction_encoder(instruction) + # Dummy positional embeddings, all 0s + instr_dummy_pos = torch.zeros( + len(instruction), instr_feats.shape[1], 3, + device=instruction.device + ) + instr_dummy_pos = self.relative_pe_layer(instr_dummy_pos) + return instr_feats, instr_dummy_pos + + def run_fps(self, context_features, context_pos): + # context_features (Np, B, F) + # context_pos (B, Np, F, 2) + # outputs of analogous shape, with smaller Np + npts, bs, ch = context_features.shape + + # Sample points with FPS + sampled_inds = dgl_geo.farthest_point_sampler( + einops.rearrange( + context_features, + "npts b c -> b npts c" + ).to(torch.float64), + max(npts // self.fps_subsampling_factor, 1), 0 + ).long() + + # Sample features + expanded_sampled_inds = sampled_inds.unsqueeze(-1).expand(-1, -1, ch) + sampled_context_features = torch.gather( + context_features, + 0, + einops.rearrange(expanded_sampled_inds, "b npts c -> npts b c") + ) + + # Sample positional embeddings + _, _, ch, npos = context_pos.shape + expanded_sampled_inds = ( + sampled_inds.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, ch, npos) + ) + sampled_context_pos = torch.gather( + context_pos, 1, expanded_sampled_inds + ) + return sampled_context_features, sampled_context_pos + + def vision_language_attention(self, feats, instr_feats): + feats, _ = self.vl_attention[0]( + seq1=feats, seq1_key_padding_mask=None, + seq2=instr_feats, seq2_key_padding_mask=None, + seq1_pos=None, seq2_pos=None, + seq1_sem_pos=None, seq2_sem_pos=None + ) + return feats diff --git a/diffuser_actor/utils/layers.py b/diffuser_actor/utils/layers.py new file mode 100644 index 0000000..4273704 --- /dev/null +++ b/diffuser_actor/utils/layers.py @@ -0,0 +1,489 @@ +import numpy as np +from torch import nn +from torch.nn import functional as F + +from .multihead_custom_attention import MultiheadCustomAttention + + +class ParallelAttentionLayer(nn.Module): + """Self-/Cross-attention between two sequences.""" + + def __init__(self, d_model=256, dropout=0.1, n_heads=8, pre_norm=False, + self_attention1=True, self_attention2=True, + cross_attention1=True, cross_attention2=True, + apply_ffn=True, + slot_attention12=False, slot_attention21=False, + rotary_pe=False, use_adaln=False): + """Initialize layers, d_model is the encoder dimension.""" + super().__init__() + self.pre_norm = pre_norm + self.self_attention1 = self_attention1 + self.self_attention2 = self_attention2 + self.cross_attention1 = cross_attention1 + self.cross_attention2 = cross_attention2 + self.apply_ffn = apply_ffn + self.rotary_pe = rotary_pe + + # Self-attention for seq1 + if self.self_attention1: + self.adaln_1 = None + if use_adaln: + self.adaln_1 = AdaLN(d_model) + self.sa1 = MultiheadCustomAttention( + d_model, n_heads, dropout=dropout + ) + self.dropout_1 = nn.Dropout(dropout) + self.norm_1 = nn.LayerNorm(d_model) + + # Self-attention for seq2 + if self.self_attention2: + self.adaln_2 = None + if use_adaln: + self.adaln_2 = AdaLN(d_model) + self.sa2 = MultiheadCustomAttention( + d_model, n_heads, dropout=dropout + ) + self.dropout_2 = nn.Dropout(dropout) + self.norm_2 = nn.LayerNorm(d_model) + + # Cross attention from seq1 to seq2 + self.norm_12 = None + if cross_attention1: + self.adaln_12 = None + if use_adaln: + self.adaln_12 = AdaLN(d_model) + self.cross_12 = MultiheadCustomAttention( + d_model, n_heads, dropout=dropout, + 
slot_competition=slot_attention12 + ) + self.dropout_12 = nn.Dropout(dropout) + self.norm_12 = nn.LayerNorm(d_model) + + # Cross attention from seq2 to seq1 + self.norm_21 = None + if cross_attention2: + self.adaln_21 = None + if use_adaln: + self.adaln_21 = AdaLN(d_model) + self.cross_21 = MultiheadCustomAttention( + d_model, n_heads, dropout=dropout, + slot_competition=slot_attention21 + ) + self.dropout_21 = nn.Dropout(dropout) + self.norm_21 = nn.LayerNorm(d_model) + + # FFN-1 + if self_attention1 or cross_attention1: + self.adaln_ff1 = None + if use_adaln: + self.adaln_ff1 = AdaLN(d_model) + self.ffn_12 = nn.Sequential( + nn.Linear(d_model, 4 * d_model), + nn.ReLU(), + nn.Dropout(dropout), + nn.Linear(4 * d_model, d_model), + nn.Dropout(dropout) + ) + self.norm_122 = nn.LayerNorm(d_model) + + # FFN-2 + if self_attention2 or cross_attention2: + self.adaln_ff2 = None + if use_adaln: + self.adaln_ff2 = AdaLN(d_model) + self.ffn_21 = nn.Sequential( + nn.Linear(d_model, 4 * d_model), + nn.ReLU(), + nn.Dropout(dropout), + nn.Linear(4 * d_model, d_model), + nn.Dropout(dropout) + ) + self.norm_212 = nn.LayerNorm(d_model) + + def _norm(self, x, layer, normalize=True): + if normalize and layer is not None: + return layer(x) + return x + + def with_pos_embed(self, tensor, pos=None): + return tensor if pos is None else tensor + pos + + def _adaln(self, x, layer, ada_sgnl): + if layer is not None and ada_sgnl is not None: + return layer(x.transpose(0, 1), ada_sgnl).transpose(0, 1) + return x + + def forward(self, seq1, seq1_key_padding_mask, seq2, + seq2_key_padding_mask, + seq1_pos=None, seq2_pos=None, + seq1_sem_pos=None, seq2_sem_pos=None, + ada_sgnl=None): + """Forward pass, seq1 (B, S1, F), seq2 (B, S2, F).""" + rot_args = {} + + # Create key, query, value for seq1, seq2 + q1 = k1 = v1 = self._norm(seq1, self.norm_12, self.pre_norm) + q2 = k2 = v2 = self._norm(seq2, self.norm_21, self.pre_norm) + if not self.rotary_pe: + q1 = k1 = self.with_pos_embed(seq1, seq1_pos) + q2 = k2 = self.with_pos_embed(seq2, seq2_pos) + q1 = self.with_pos_embed(q1, seq1_sem_pos) + k1 = self.with_pos_embed(k1, seq1_sem_pos) + q2 = self.with_pos_embed(q2, seq2_sem_pos) + k2 = self.with_pos_embed(k2, seq2_sem_pos) + + # Cross-attention from seq1 to seq2 + if self.cross_attention1: + if self.rotary_pe: + rot_args['rotary_pe'] = (seq1_pos, seq2_pos) + seq1b = self.cross_12( + query=self._adaln(q1, self.adaln_12, ada_sgnl).transpose(0, 1), + key=k2.transpose(0, 1), + value=v2.transpose(0, 1), + attn_mask=None, + key_padding_mask=seq2_key_padding_mask, # (B, S2) + **rot_args + )[0].transpose(0, 1) + seq1 = seq1 + self.dropout_12(seq1b) + seq1 = self._norm(seq1, self.norm_12, not self.pre_norm) + + # Cross-attention from seq2 to seq1 + if self.cross_attention2: + if self.rotary_pe: + rot_args['rotary_pe'] = (seq2_pos, seq1_pos) + seq2b = self.cross_21( + query=self._adaln(q2, self.adaln_21, ada_sgnl).transpose(0, 1), + key=k1.transpose(0, 1), + value=v1.transpose(0, 1), + attn_mask=None, + key_padding_mask=seq1_key_padding_mask, # (B, S1) + **rot_args + )[0].transpose(0, 1) + seq2 = seq2 + self.dropout_21(seq2b) + seq2 = self._norm(seq2, self.norm_21, not self.pre_norm) + + # Self-attention for seq1 + if self.self_attention1: + q1 = k1 = v1 = self._norm(seq1, self.norm_1, self.pre_norm) + if self.rotary_pe: + rot_args['rotary_pe'] = (seq1_pos, seq1_pos) + else: + q1 = k1 = self.with_pos_embed(seq1, seq1_pos) + q1 = self.with_pos_embed(q1, seq1_sem_pos) + k1 = self.with_pos_embed(k1, seq1_sem_pos) + seq1b = self.sa1( + 
query=self._adaln(q1, self.adaln_1, ada_sgnl).transpose(0, 1), + key=self._adaln(k1, self.adaln_1, ada_sgnl).transpose(0, 1), + value=self._adaln(v1, self.adaln_1, ada_sgnl).transpose(0, 1), + attn_mask=None, + key_padding_mask=seq1_key_padding_mask, # (B, S1) + **rot_args + )[0].transpose(0, 1) + seq1 = seq1 + self.dropout_1(seq1b) + seq1 = self._norm(seq1, self.norm_1, not self.pre_norm) + + # Self-attention for seq2 + if self.self_attention2: + q2 = k2 = v2 = self._norm(seq2, self.norm_2, self.pre_norm) + if self.rotary_pe: + rot_args['rotary_pe'] = (seq2_pos, seq2_pos) + else: + q2 = k2 = self.with_pos_embed(seq2, seq2_pos) + q2 = self.with_pos_embed(q2, seq2_sem_pos) + k2 = self.with_pos_embed(k2, seq2_sem_pos) + seq2b = self.sa2( + query=self._adaln(q2, self.adaln_2, ada_sgnl).transpose(0, 1), + key=self._adaln(k2, self.adaln_2, ada_sgnl).transpose(0, 1), + value=self._adaln(v2, self.adaln_2, ada_sgnl).transpose(0, 1), + attn_mask=None, + key_padding_mask=seq2_key_padding_mask, # (B, S2) + **rot_args + )[0].transpose(0, 1) + seq2 = seq2 + self.dropout_2(seq2b) + seq2 = self._norm(seq2, self.norm_2, not self.pre_norm) + + # FFN-1 + if (self.self_attention1 or self.cross_attention1) and self.apply_ffn: + seq1 = self._norm(seq1, self.norm_122, self.pre_norm) + seq1 = self._adaln(seq1, self.adaln_ff1, ada_sgnl) + seq1 = seq1 + self.ffn_12(seq1) + seq1 = self._norm(seq1, self.norm_122, not self.pre_norm) + + # FFN-2 + if (self.self_attention2 or self.cross_attention2) and self.apply_ffn: + seq2 = self._norm(seq2, self.norm_212, self.pre_norm) + seq2 = self._adaln(seq2, self.adaln_ff2, ada_sgnl) + seq2 = seq2 + self.ffn_21(seq2) + seq2 = self._norm(seq2, self.norm_212, not self.pre_norm) + + return seq1, seq2 + + +class ParallelAttention(nn.Module): + """Self-/Cross-attention between two sequences.""" + + def __init__(self, num_layers=1, + d_model=256, dropout=0.1, n_heads=8, pre_norm=False, + self_attention1=True, self_attention2=True, + cross_attention1=True, cross_attention2=True, + apply_ffn=True, + slot_attention12=False, slot_attention21=False, + rotary_pe=False, use_adaln=False): + super().__init__() + self.layers = nn.ModuleList() + self.update_seq1 = self_attention1 or cross_attention1 + self.update_seq2 = self_attention2 or cross_attention2 + for _ in range(num_layers): + self.layers.append(ParallelAttentionLayer( + d_model=d_model, + dropout=dropout, + n_heads=n_heads, + pre_norm=pre_norm, + self_attention1=self_attention1, + self_attention2=self_attention2, + cross_attention1=cross_attention1, + cross_attention2=cross_attention2, + apply_ffn=apply_ffn, + slot_attention12=slot_attention12, + slot_attention21=slot_attention21, + rotary_pe=rotary_pe, + use_adaln=use_adaln + )) + + def forward(self, seq1, seq1_key_padding_mask, seq2, + seq2_key_padding_mask, + seq1_pos=None, seq2_pos=None, + seq1_sem_pos=None, seq2_sem_pos=None, + ada_sgnl=None): + """Forward pass, seq1 (B, S1, F), seq2 (B, S2, F).""" + for layer in self.layers: + seq1_, seq2_ = layer( + seq1=seq1, seq1_key_padding_mask=seq1_key_padding_mask, + seq2=seq2, seq2_key_padding_mask=seq2_key_padding_mask, + seq1_pos=seq1_pos, seq2_pos=seq2_pos, + seq1_sem_pos=seq1_sem_pos, seq2_sem_pos=seq2_sem_pos, + ada_sgnl=ada_sgnl + ) + if self.update_seq1: + seq1 = seq1_ + if self.update_seq2: + seq2 = seq2_ + return seq1, seq2 + + +class AdaLN(nn.Module): + + def __init__(self, embedding_dim): + super().__init__() + self.modulation = nn.Sequential( + nn.SiLU(), nn.Linear(embedding_dim, 2 * embedding_dim, bias=True) + ) + 
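+        # Zero-initialise the modulation head so that, at initialisation,
+        # scale = 0 and shift = 0 and AdaLN acts as the identity mapping
+        # (forward computes x * (1 + scale) + shift).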
nn.init.constant_(self.modulation[-1].weight, 0) + nn.init.constant_(self.modulation[-1].bias, 0) + + def forward(self, x, t): + """ + Args: + x: A tensor of shape (N, B, C) + t: A tensor of shape (B, C) + """ + scale, shift = self.modulation(t).chunk(2, dim=-1) # (B, C), (B, C) + x = x * (1 + scale.unsqueeze(0)) + shift.unsqueeze(0) + return x + + +class FeedforwardLayer(nn.Module): + + def __init__(self, embedding_dim, hidden_dim, dropout=0.0, + use_adaln=False): + super().__init__() + self.linear1 = nn.Linear(embedding_dim, hidden_dim) + self.dropout = nn.Dropout(dropout) + self.linear2 = nn.Linear(hidden_dim, embedding_dim) + self.norm = nn.LayerNorm(embedding_dim) + self.activation = F.relu + self._reset_parameters() + if use_adaln: + self.adaln = AdaLN(embedding_dim) + + def _reset_parameters(self): + for p in self.parameters(): + if p.dim() > 1: + nn.init.xavier_uniform_(p) + + def forward(self, x, diff_ts=None): + if diff_ts is not None: + x = self.adaln(x, diff_ts) + output = self.linear2(self.dropout(self.activation(self.linear1(x)))) + output = x + self.dropout(output) + output = self.norm(output) + return output + + +class RelativeCrossAttentionLayer(nn.Module): + + def __init__(self, embedding_dim, num_heads, dropout=0.0, use_adaln=False): + super().__init__() + self.multihead_attn = MultiheadCustomAttention( + embedding_dim, num_heads, dropout=dropout + ) + self.norm = nn.LayerNorm(embedding_dim) + self.dropout = nn.Dropout(dropout) + if use_adaln: + self.adaln = AdaLN(embedding_dim) + + def forward(self, query, value, diff_ts=None, + query_pos=None, value_pos=None, pad_mask=None): + if diff_ts is not None: + adaln_query = self.adaln(query, diff_ts) + else: + adaln_query = query + attn_output, _ = self.multihead_attn( + query=adaln_query, + key=value, + value=value, + rotary_pe=None if query_pos is None else (query_pos, value_pos), + key_padding_mask=pad_mask + ) + output = query + self.dropout(attn_output) + output = self.norm(output) + return output + + +class SelfAttentionLayer(nn.Module): + + def __init__(self, embedding_dim, num_heads, dropout=0.0, use_adaln=False): + super().__init__() + self.multihead_attn = MultiheadCustomAttention( + embedding_dim, num_heads, dropout=dropout + ) + self.norm = nn.LayerNorm(embedding_dim) + self.dropout = nn.Dropout(dropout) + if use_adaln: + self.adaln = AdaLN(embedding_dim) + + def forward(self, query, diff_ts=None, + query_pos=None, value_pos=None, pad_mask=None): + if diff_ts is not None: + adaln_query = self.adaln(query, diff_ts) + else: + adaln_query = query + attn_output, _ = self.multihead_attn( + query=adaln_query, + key=adaln_query, + value=adaln_query, + ) + output = query + self.dropout(attn_output) + output = self.norm(output) + return output + + +class FFWRelativeCrossAttentionModule(nn.Module): + + def __init__(self, embedding_dim, num_attn_heads, num_layers, + use_adaln=True): + super().__init__() + + self.num_layers = num_layers + self.attn_layers = nn.ModuleList() + self.ffw_layers = nn.ModuleList() + for _ in range(num_layers): + self.attn_layers.append(RelativeCrossAttentionLayer( + embedding_dim, num_attn_heads, use_adaln=use_adaln + )) + self.ffw_layers.append(FeedforwardLayer( + embedding_dim, embedding_dim, use_adaln=use_adaln + )) + + def forward(self, query, value, diff_ts=None, + query_pos=None, value_pos=None): + output = [] + for i in range(self.num_layers): + query = self.attn_layers[i]( + query, value, diff_ts, query_pos, value_pos + ) + query = self.ffw_layers[i](query, diff_ts) + output.append(query) + 
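+        # Return the per-layer outputs; callers typically use the last element
+        # (e.g. `[-1]`), but intermediate layers remain available if needed.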
return output + + +class FFWRelativeSelfAttentionModule(nn.Module): + + def __init__(self, embedding_dim, num_attn_heads, num_layers, + use_adaln=True): + super().__init__() + + self.num_layers = num_layers + self.attn_layers = nn.ModuleList() + self.ffw_layers = nn.ModuleList() + for _ in range(num_layers): + self.attn_layers.append(RelativeCrossAttentionLayer( + embedding_dim, num_attn_heads, use_adaln=use_adaln + )) + self.ffw_layers.append(FeedforwardLayer( + embedding_dim, embedding_dim, use_adaln=use_adaln + )) + + def forward(self, query, diff_ts=None, + query_pos=None, context=None, context_pos=None): + output = [] + for i in range(self.num_layers): + query = self.attn_layers[i]( + query, query, diff_ts, query_pos, query_pos + ) + query = self.ffw_layers[i](query, diff_ts) + output.append(query) + return output + + +class FFWRelativeSelfCrossAttentionModule(nn.Module): + + def __init__(self, embedding_dim, num_attn_heads, + num_self_attn_layers, num_cross_attn_layers, use_adaln=True): + super().__init__() + + self.num_layers = num_self_attn_layers + self.self_attn_layers = nn.ModuleList() + self.cross_attn_layers = nn.ModuleList() + self.ffw_layers = nn.ModuleList() + + cross_inds = np.linspace( + 0, + num_self_attn_layers, + num_cross_attn_layers + 1, + dtype=np.int32 + ).tolist() + for ind in range(num_self_attn_layers): + self.self_attn_layers.append(RelativeCrossAttentionLayer( + embedding_dim, num_attn_heads, use_adaln=use_adaln + )) + if ind in cross_inds: + self.cross_attn_layers.append(RelativeCrossAttentionLayer( + embedding_dim, num_attn_heads, use_adaln=use_adaln + )) + else: + self.cross_attn_layers.append(None) + self.ffw_layers.append(FeedforwardLayer( + embedding_dim, embedding_dim, use_adaln=use_adaln + )) + + def forward(self, query, context, diff_ts=None, + query_pos=None, context_pos=None): + output = [] + for i in range(self.num_layers): + # Cross attend to the context first + if self.cross_attn_layers[i] is not None: + if context_pos is None: + cur_query_pos = None + else: + cur_query_pos = query_pos + query = self.cross_attn_layers[i]( + query, context, diff_ts, cur_query_pos, context_pos + ) + # Self attend next + query = self.self_attn_layers[i]( + query, query, diff_ts, query_pos, query_pos + ) + query = self.ffw_layers[i](query, diff_ts) + output.append(query) + return output diff --git a/diffuser_actor/utils/multihead_custom_attention.py b/diffuser_actor/utils/multihead_custom_attention.py new file mode 100644 index 0000000..064d6cb --- /dev/null +++ b/diffuser_actor/utils/multihead_custom_attention.py @@ -0,0 +1,466 @@ +import warnings +import torch +from torch.nn import Linear +from torch.nn.init import xavier_uniform_ +from torch.nn.init import constant_ +from torch.nn.init import xavier_normal_ +from torch.nn.parameter import Parameter +from torch.nn import Module +from torch.nn import functional as F + +from .position_encodings import RotaryPositionEncoding + + +class MultiheadCustomAttention(Module): + r"""Allows the model to jointly attend to information + from different representation subspaces. + See reference: Attention Is All You Need + .. math:: + \text{MultiHead}(Q, K, V) = \text{Concat}(head_1,\dots,head_h)W^O + \text{where} head_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) + Args: + embed_dim: total dimension of the model. + num_heads: parallel attention heads. + dropout: a Dropout layer on attn_output_weights. Default: 0.0. + bias: add bias as module parameter. Default: True. + add_bias_kv: add bias to the key and value sequences at dim=0. 
+ add_zero_attn: add a new batch of zeros to the key and + value sequences at dim=1. + kdim: total number of features in key. Default: None. + vdim: total number of features in key. Default: None. + Note: if kdim and vdim are None, they will be set to embed_dim such that + query, key, and value have the same number of features. + Examples:: + >>> multihead_attn = nn.MultiheadAttention(embed_dim, num_heads) + >>> attn_output, attn_output_weights = multihead_attn(query, key, value) + """ + + def __init__(self, embed_dim, num_heads, dropout=0., bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, + vdim=None, slot_competition=False, return_kv=False, gate_attn=False): + super().__init__() + self.embed_dim = embed_dim + self.kdim = kdim if kdim is not None else embed_dim + self.vdim = vdim if vdim is not None else embed_dim + self._qkv_same_embed_dim = self.kdim == embed_dim and self.vdim == embed_dim + + self.num_heads = num_heads + self.dropout = dropout + self.head_dim = embed_dim // num_heads + assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads" + + self.in_proj_weight = Parameter(torch.empty(3 * embed_dim, embed_dim)) + ##### Custom + self.slot_competition = slot_competition + self.return_kv = return_kv + self.gate_attn = None + if gate_attn: + self.gate_attn = Parameter(torch.randn(num_heads)) # randn + ##### + if self._qkv_same_embed_dim is False: + self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim)) + self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim)) + self.v_proj_weight = Parameter(torch.Tensor(embed_dim, self.vdim)) + else: + self.register_parameter('q_proj_weight', None) + self.register_parameter('k_proj_weight', None) + self.register_parameter('v_proj_weight', None) + + if bias: + self.in_proj_bias = Parameter(torch.empty(3 * embed_dim)) + else: + self.register_parameter('in_proj_bias', None) + self.out_proj = Linear(embed_dim, embed_dim, bias=bias) + + if add_bias_kv: + self.bias_k = Parameter(torch.empty(1, 1, embed_dim)) + self.bias_v = Parameter(torch.empty(1, 1, embed_dim)) + else: + self.bias_k = self.bias_v = None + + self.add_zero_attn = add_zero_attn + + self._reset_parameters() + + def _reset_parameters(self): + if self._qkv_same_embed_dim: + xavier_uniform_(self.in_proj_weight) + else: + xavier_uniform_(self.q_proj_weight) + xavier_uniform_(self.k_proj_weight) + xavier_uniform_(self.v_proj_weight) + + if self.in_proj_bias is not None: + constant_(self.in_proj_bias, 0.) + constant_(self.out_proj.bias, 0.) + if self.bias_k is not None: + xavier_normal_(self.bias_k) + if self.bias_v is not None: + xavier_normal_(self.bias_v) + + def forward(self, query, key, value, key_padding_mask=None, need_weights=True, + attn_mask=None, k_mem=None, v_mem=None, mem_mask=None, rotary_pe=None): + r""" + Args: + query, key, value: map a query and a set of key-value pairs to an output. + See "Attention Is All You Need" for more details. + key_padding_mask: if provided, specified padding elements in the key will + be ignored by the attention. This is an binary mask. When the value is True, + the corresponding value on the attention layer will be filled with -inf. + need_weights: output attn_output_weights. + attn_mask: mask that prevents attention to certain positions. This is an additive mask + (i.e. the values will be added to the attention layer). + Shape: + - Inputs: + - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is + the embedding dimension. 
+ - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is + the embedding dimension. + - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is + the embedding dimension. + - key_padding_mask: :math:`(N, S)`, ByteTensor, where N is the batch size, S is the source sequence length. + - attn_mask: :math:`(L, S)` where L is the target sequence length, S is the source sequence length. + - Outputs: + - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, + E is the embedding dimension. + - attn_output_weights: :math:`(N, L, S)` where N is the batch size, + L is the target sequence length, S is the source sequence length. + """ + if hasattr(self, '_qkv_same_embed_dim') and self._qkv_same_embed_dim is False: + return multi_head_attention_forward( + query, key, value, self.embed_dim, self.num_heads, + self.in_proj_weight, self.in_proj_bias, + self.bias_k, self.bias_v, self.add_zero_attn, + self.dropout, self.out_proj.weight, self.out_proj.bias, + training=self.training, + key_padding_mask=key_padding_mask, need_weights=need_weights, + attn_mask=attn_mask, use_separate_proj_weight=True, + q_proj_weight=self.q_proj_weight, k_proj_weight=self.k_proj_weight, + v_proj_weight=self.v_proj_weight, slot_competition=self.slot_competition, + return_kv=self.return_kv, k_mem=k_mem, v_mem=v_mem, + gate_attn=self.gate_attn, mem_mask=mem_mask, + rotary_pe=rotary_pe) + else: + if not hasattr(self, '_qkv_same_embed_dim'): + warnings.warn('A new version of MultiheadAttention module has been implemented. \ + Please re-train your model with the new module', + UserWarning) + + return multi_head_attention_forward( + query, key, value, self.embed_dim, self.num_heads, + self.in_proj_weight, self.in_proj_bias, + self.bias_k, self.bias_v, self.add_zero_attn, + self.dropout, self.out_proj.weight, self.out_proj.bias, + training=self.training, + key_padding_mask=key_padding_mask, need_weights=need_weights, + attn_mask=attn_mask, slot_competition=self.slot_competition, + return_kv=self.return_kv, k_mem=k_mem, v_mem=v_mem, + gate_attn=self.gate_attn, mem_mask=mem_mask, + rotary_pe=rotary_pe) + + +def multi_head_attention_forward(query, # type: Tensor + key, # type: Tensor + value, # type: Tensor + embed_dim_to_check, # type: int + num_heads, # type: int + in_proj_weight, # type: Tensor + in_proj_bias, # type: Tensor + bias_k, # type: Optional[Tensor] + bias_v, # type: Optional[Tensor] + add_zero_attn, # type: bool + dropout_p, # type: float + out_proj_weight, # type: Tensor + out_proj_bias, # type: Tensor + training=True, # type: bool + key_padding_mask=None, # type: Optional[Tensor] + need_weights=True, # type: bool + attn_mask=None, # type: Optional[Tensor] + use_separate_proj_weight=False, # type: bool + q_proj_weight=None, # type: Optional[Tensor] + k_proj_weight=None, # type: Optional[Tensor] + v_proj_weight=None, # type: Optional[Tensor] + static_k=None, # type: Optional[Tensor] + static_v=None, # type: Optional[Tensor] + slot_competition=False, + rotary_pe=None, + return_kv=False, + k_mem=None, + v_mem=None, + gate_attn=None, + mem_mask=None + ): + # type: (...) -> Tuple[Tensor, Optional[Tensor]] + r""" + Args: + query, key, value: map a query and a set of key-value pairs to an output. + See "Attention Is All You Need" for more details. + embed_dim_to_check: total dimension of the model. + num_heads: parallel attention heads. + in_proj_weight, in_proj_bias: input projection weight and bias. 
+ bias_k, bias_v: bias of the key and value sequences to be added at dim=0. + add_zero_attn: add a new batch of zeros to the key and + value sequences at dim=1. + dropout_p: probability of an element to be zeroed. + out_proj_weight, out_proj_bias: the output projection weight and bias. + training: apply dropout if is ``True``. + key_padding_mask: if provided, specified padding elements in the key will + be ignored by the attention. This is an binary mask. When the value is True, + the corresponding value on the attention layer will be filled with -inf. + need_weights: output attn_output_weights. + attn_mask: mask that prevents attention to certain positions. This is an additive mask + (i.e. the values will be added to the attention layer). + use_separate_proj_weight: the function accept the proj. weights for query, key, + and value in differnt forms. If false, in_proj_weight will be used, which is + a combination of q_proj_weight, k_proj_weight, v_proj_weight. + q_proj_weight, k_proj_weight, v_proj_weight, in_proj_bias: input projection weight and bias. + static_k, static_v: static key and value used for attention operators. + Shape: + Inputs: + - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is + the embedding dimension. + - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is + the embedding dimension. + - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is + the embedding dimension. + - key_padding_mask: :math:`(N, S)`, ByteTensor, where N is the batch size, S is the source sequence length. + - attn_mask: :math:`(L, S)` where L is the target sequence length, S is the source sequence length. + - static_k: :math:`(N*num_heads, S, E/num_heads)`, where S is the source sequence length, + N is the batch size, E is the embedding dimension. E/num_heads is the head dimension. + - static_v: :math:`(N*num_heads, S, E/num_heads)`, where S is the source sequence length, + N is the batch size, E is the embedding dimension. E/num_heads is the head dimension. + Outputs: + - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, + E is the embedding dimension. + - attn_output_weights: :math:`(N, L, S)` where N is the batch size, + L is the target sequence length, S is the source sequence length. 
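+
+    Note on the custom arguments: ``rotary_pe``, if given, is a pair
+    ``(query_pe, key_pe)`` of rotary position codes as produced by
+    ``RotaryPositionEncoding``/``RotaryPositionEncoding3D``, with shapes
+    :math:`(N, L, E, 2)` and :math:`(N, S, E, 2)` (cosine and sine terms stacked
+    in the last dimension); it is applied to the projected queries and keys before
+    the attention scores are computed. ``slot_competition`` applies the softmax over
+    queries (slot-attention style) before renormalizing over keys, and
+    ``k_mem``/``v_mem`` together with ``gate_attn`` and ``mem_mask`` implement the
+    memorizing-transformer style gated attention over an external memory.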
+ """ + + qkv_same = torch.equal(query, key) and torch.equal(key, value) + kv_same = torch.equal(key, value) + + tgt_len, bsz, embed_dim = query.size() + assert embed_dim == embed_dim_to_check + assert list(query.size()) == [tgt_len, bsz, embed_dim] + assert key.size() == value.size() + + head_dim = embed_dim // num_heads + assert head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads" + scaling = float(head_dim) ** -0.5 + + if use_separate_proj_weight is not True: + if qkv_same: + # self-attention + q, k, v = F.linear(query, in_proj_weight, in_proj_bias).chunk(3, dim=-1) + + elif kv_same: + # encoder-decoder attention + # This is inline in_proj function with in_proj_weight and in_proj_bias + _b = in_proj_bias + _start = 0 + _end = embed_dim + _w = in_proj_weight[_start:_end, :] + if _b is not None: + _b = _b[_start:_end] + q = F.linear(query, _w, _b) + + if key is None: + assert value is None + k = None + v = None + else: + + # This is inline in_proj function with in_proj_weight and in_proj_bias + _b = in_proj_bias + _start = embed_dim + _end = None + _w = in_proj_weight[_start:, :] + if _b is not None: + _b = _b[_start:] + k, v = F.linear(key, _w, _b).chunk(2, dim=-1) + + else: + # This is inline in_proj function with in_proj_weight and in_proj_bias + _b = in_proj_bias + _start = 0 + _end = embed_dim + _w = in_proj_weight[_start:_end, :] + if _b is not None: + _b = _b[_start:_end] + q = F.linear(query, _w, _b) + + # This is inline in_proj function with in_proj_weight and in_proj_bias + _b = in_proj_bias + _start = embed_dim + _end = embed_dim * 2 + _w = in_proj_weight[_start:_end, :] + if _b is not None: + _b = _b[_start:_end] + k = F.linear(key, _w, _b) + + # This is inline in_proj function with in_proj_weight and in_proj_bias + _b = in_proj_bias + _start = embed_dim * 2 + _end = None + _w = in_proj_weight[_start:, :] + if _b is not None: + _b = _b[_start:] + v = F.linear(value, _w, _b) + else: + q_proj_weight_non_opt = torch.jit._unwrap_optional(q_proj_weight) + len1, len2 = q_proj_weight_non_opt.size() + assert len1 == embed_dim and len2 == query.size(-1) + + k_proj_weight_non_opt = torch.jit._unwrap_optional(k_proj_weight) + len1, len2 = k_proj_weight_non_opt.size() + assert len1 == embed_dim and len2 == key.size(-1) + + v_proj_weight_non_opt = torch.jit._unwrap_optional(v_proj_weight) + len1, len2 = v_proj_weight_non_opt.size() + assert len1 == embed_dim and len2 == value.size(-1) + + if in_proj_bias is not None: + q = F.linear(query, q_proj_weight_non_opt, in_proj_bias[0:embed_dim]) + k = F.linear(key, k_proj_weight_non_opt, in_proj_bias[embed_dim:(embed_dim * 2)]) + v = F.linear(value, v_proj_weight_non_opt, in_proj_bias[(embed_dim * 2):]) + else: + q = F.linear(query, q_proj_weight_non_opt, in_proj_bias) + k = F.linear(key, k_proj_weight_non_opt, in_proj_bias) + v = F.linear(value, v_proj_weight_non_opt, in_proj_bias) + q = q * scaling + + if bias_k is not None and bias_v is not None: + if static_k is None and static_v is None: + k = torch.cat([k, bias_k.repeat(1, bsz, 1)]) + v = torch.cat([v, bias_v.repeat(1, bsz, 1)]) + if attn_mask is not None: + attn_mask = torch.cat([attn_mask, + torch.zeros((attn_mask.size(0), 1), + dtype=attn_mask.dtype, + device=attn_mask.device)], dim=1) + if key_padding_mask is not None: + key_padding_mask = torch.cat( + [key_padding_mask, torch.zeros((key_padding_mask.size(0), 1), + dtype=key_padding_mask.dtype, + device=key_padding_mask.device)], dim=1) + else: + assert static_k is None, "bias cannot be added to static key." 
+ assert static_v is None, "bias cannot be added to static value." + else: + assert bias_k is None + assert bias_v is None + + if rotary_pe is not None: # rotary pe ROPE disentangeld + qp, kvp = rotary_pe + q_cos, q_sin = qp[..., 0], qp[..., 1] + k_cos, k_sin = kvp[..., 0], kvp[..., 1] + q = RotaryPositionEncoding.embed_rotary(q.transpose(0, 1), q_cos, q_sin).transpose(0, 1) + k = RotaryPositionEncoding.embed_rotary(k.transpose(0, 1), k_cos, k_sin).transpose(0, 1) + + q = q.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1) + if k is not None: + k = k.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1) + if v is not None: + v = v.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1) + + if static_k is not None: + assert static_k.size(0) == bsz * num_heads + assert static_k.size(2) == head_dim + k = static_k + + if static_v is not None: + assert static_v.size(0) == bsz * num_heads + assert static_v.size(2) == head_dim + v = static_v + + src_len = k.size(1) + + if key_padding_mask is not None: + assert key_padding_mask.size(0) == bsz + assert key_padding_mask.size(1) == src_len + + if add_zero_attn: + src_len += 1 + k = torch.cat([k, torch.zeros((k.size(0), 1) + k.size()[2:], dtype=k.dtype, device=k.device)], dim=1) + v = torch.cat([v, torch.zeros((v.size(0), 1) + v.size()[2:], dtype=v.dtype, device=v.device)], dim=1) + if attn_mask is not None: + attn_mask = torch.cat([attn_mask, torch.zeros((attn_mask.size(0), 1), + dtype=attn_mask.dtype, + device=attn_mask.device)], dim=1) + if key_padding_mask is not None: + key_padding_mask = torch.cat( + [key_padding_mask, torch.zeros((key_padding_mask.size(0), 1), + dtype=key_padding_mask.dtype, + device=key_padding_mask.device)], dim=1) + + attn_output_weights = torch.bmm(q, k.transpose(1, 2)) + assert list(attn_output_weights.size()) == [bsz * num_heads, tgt_len, src_len] + + if attn_mask is not None: + attn_mask = attn_mask.unsqueeze(0) + attn_output_weights += attn_mask + + if key_padding_mask is not None: + attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len) + attn_output_weights = attn_output_weights.masked_fill( + key_padding_mask.unsqueeze(1).unsqueeze(2), + float('-inf'), + ) + attn_output_weights = attn_output_weights.view(bsz * num_heads, tgt_len, src_len) + + if slot_competition: + attn_output_weights = F.softmax(attn_output_weights, dim=-2) + 1e-8 + attn_output_weights = attn_output_weights / attn_output_weights.sum(dim=-1, keepdim=True) + else: + attn_output_weights = F.softmax( + attn_output_weights, dim=-1) + + attn_output_weights = F.dropout(attn_output_weights, p=dropout_p, training=training) + + attn_output = torch.bmm(attn_output_weights, v) + assert list(attn_output.size()) == [bsz * num_heads, tgt_len, head_dim] + + # do memorizing transformer gating + if (gate_attn is not None) and (k_mem is not None) and (v_mem is not None): + k_mem = k_mem.permute((2, 0, 1)) + key_mem_len = k_mem.shape[0] + k_mem = k_mem.contiguous().view(key_mem_len, bsz * num_heads, head_dim).transpose(0, 1) + v_mem = v_mem.permute((2, 0, 1)) + v_mem = v_mem.contiguous().view(key_mem_len, bsz * num_heads, head_dim).transpose(0, 1) + # if True: + # k_mem = F.normalize(k_mem, dim = -1) + + attn_output_weights_mem = torch.bmm(q, k_mem.transpose(1, 2)) # [24, 16, 110] + # bcz correspondance b/w key key is good not query, key visually + # attn_output_weights_mem = torch.bmm(k, k_mem.transpose(1, 2)) + attn_output_weights_mem = F.softmax(attn_output_weights_mem, dim=-1) + if mem_mask is not 
None: + mem_mask = mem_mask.unsqueeze(1).unsqueeze(1) # [B, 1, 1, key_mem_len] + attn_output_weights_mem = attn_output_weights_mem.reshape(bsz, num_heads, tgt_len, key_mem_len) + attn_output_weights_mem = attn_output_weights_mem * mem_mask + attn_output_weights_mem = attn_output_weights_mem.reshape(bsz * num_heads, tgt_len, key_mem_len) + + attn_output_weights_mem = F.dropout(attn_output_weights_mem, p=dropout_p, training=training) + attn_output_mem = torch.bmm(attn_output_weights_mem, v_mem) # [bsz * num_heads, tgt_len, head_dim] + + # gated learnable attention like memorizing transformers + print("gate_attn ", torch.sigmoid(gate_attn)) + gate = torch.sigmoid(gate_attn).reshape(-1, 1, 1, 1) # (n_head, 1, 1, 1) + attn_output_mem = attn_output_mem.view(bsz, num_heads, tgt_len, head_dim).transpose(0, + 1) # [num_heads, bsz, tgt_len, head_dim] + attn_output = attn_output.view(bsz, num_heads, tgt_len, head_dim).transpose(0, + 1) # [num_heads, bsz, tgt_len, head_dim] + attn_output = gate * attn_output_mem + (1. - gate) * attn_output + attn_output = attn_output.transpose(1, 0).view(bsz * num_heads, tgt_len, head_dim) + + attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim) + attn_output = F.linear(attn_output, out_proj_weight, out_proj_bias) + + if return_kv: + return attn_output, q, k, v + elif need_weights: + # average attention weights over heads + attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len) + # return attn_output, attn_output_weights.sum(dim=1) / num_heads + return attn_output, attn_output_weights + else: + return attn_output, None diff --git a/diffuser_actor/utils/multihead_flash_attention.py b/diffuser_actor/utils/multihead_flash_attention.py new file mode 100644 index 0000000..bcebd84 --- /dev/null +++ b/diffuser_actor/utils/multihead_flash_attention.py @@ -0,0 +1,412 @@ +"""Mostly follow multihead_custom_attention.py +This script needed to be cleaned up +""" +import warnings +import einops +import torch +from torch.nn import Linear +from torch.nn.init import xavier_uniform_ +from torch.nn.init import constant_ +from torch.nn.init import xavier_normal_ +from torch.nn.parameter import Parameter +from torch.nn import Module +from torch.nn import functional as F + +from flash_attn import flash_attn_func + +from .position_encodings import RotaryPositionEncoding +from .multihead_custom_attention import MultiheadCustomAttention + + +class MultiheadFlashAttention(Module): + r"""Allows the model to jointly attend to information + from different representation subspaces. + See reference: Attention Is All You Need + .. math:: + \text{MultiHead}(Q, K, V) = \text{Concat}(head_1,\dots,head_h)W^O + \text{where} head_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) + Args: + embed_dim: total dimension of the model. + num_heads: parallel attention heads. + dropout: a Dropout layer on attn_output_weights. Default: 0.0. + bias: add bias as module parameter. Default: True. + add_bias_kv: add bias to the key and value sequences at dim=0. + add_zero_attn: add a new batch of zeros to the key and + value sequences at dim=1. + kdim: total number of features in key. Default: None. + vdim: total number of features in key. Default: None. + Note: if kdim and vdim are None, they will be set to embed_dim such that + query, key, and value have the same number of features. 
+ Examples:: + >>> multihead_attn = nn.MultiheadAttention(embed_dim, num_heads) + >>> attn_output, attn_output_weights = multihead_attn(query, key, value) + """ + + def __init__(self, embed_dim, num_heads, dropout=0., bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, + vdim=None, slot_competition=False, return_kv=False, gate_attn=False): + super().__init__() + self.embed_dim = embed_dim + self.kdim = kdim if kdim is not None else embed_dim + self.vdim = vdim if vdim is not None else embed_dim + self._qkv_same_embed_dim = self.kdim == embed_dim and self.vdim == embed_dim + + self.num_heads = num_heads + self.dropout = dropout + self.head_dim = embed_dim // num_heads + assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads" + + self.in_proj_weight = Parameter(torch.empty(3 * embed_dim, embed_dim)) + ##### Custom + self.slot_competition = slot_competition + self.return_kv = return_kv + self.gate_attn = None + if gate_attn: + self.gate_attn = Parameter(torch.randn(num_heads)) # randn + ##### + if self._qkv_same_embed_dim is False: + self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim)) + self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim)) + self.v_proj_weight = Parameter(torch.Tensor(embed_dim, self.vdim)) + else: + self.register_parameter('q_proj_weight', None) + self.register_parameter('k_proj_weight', None) + self.register_parameter('v_proj_weight', None) + + if bias: + self.in_proj_bias = Parameter(torch.empty(3 * embed_dim)) + else: + self.register_parameter('in_proj_bias', None) + self.out_proj = Linear(embed_dim, embed_dim, bias=bias) + + if add_bias_kv: + self.bias_k = Parameter(torch.empty(1, 1, embed_dim)) + self.bias_v = Parameter(torch.empty(1, 1, embed_dim)) + else: + self.bias_k = self.bias_v = None + + self.add_zero_attn = add_zero_attn + + self._reset_parameters() + + def _reset_parameters(self): + if self._qkv_same_embed_dim: + xavier_uniform_(self.in_proj_weight) + else: + xavier_uniform_(self.q_proj_weight) + xavier_uniform_(self.k_proj_weight) + xavier_uniform_(self.v_proj_weight) + + if self.in_proj_bias is not None: + constant_(self.in_proj_bias, 0.) + constant_(self.out_proj.bias, 0.) + if self.bias_k is not None: + xavier_normal_(self.bias_k) + if self.bias_v is not None: + xavier_normal_(self.bias_v) + + def forward(self, query, key, value, key_padding_mask=None, need_weights=True, + attn_mask=None, k_mem=None, v_mem=None, mem_mask=None, rotary_pe=None): + r""" + Args: + query, key, value: map a query and a set of key-value pairs to an output. + See "Attention Is All You Need" for more details. + key_padding_mask: if provided, specified padding elements in the key will + be ignored by the attention. This is an binary mask. When the value is True, + the corresponding value on the attention layer will be filled with -inf. + need_weights: output attn_output_weights. + attn_mask: mask that prevents attention to certain positions. This is an additive mask + (i.e. the values will be added to the attention layer). + Shape: + - Inputs: + - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is + the embedding dimension. + - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is + the embedding dimension. + - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is + the embedding dimension. + - key_padding_mask: :math:`(N, S)`, ByteTensor, where N is the batch size, S is the source sequence length. 
+ - attn_mask: :math:`(L, S)` where L is the target sequence length, S is the source sequence length. + - Outputs: + - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, + E is the embedding dimension. + - attn_output_weights: :math:`(N, L, S)` where N is the batch size, + L is the target sequence length, S is the source sequence length. + """ + if hasattr(self, '_qkv_same_embed_dim') and self._qkv_same_embed_dim is False: + return multi_head_attention_forward( + query, key, value, self.embed_dim, self.num_heads, + self.in_proj_weight, self.in_proj_bias, + self.bias_k, self.bias_v, self.add_zero_attn, + self.dropout, self.out_proj.weight, self.out_proj.bias, + training=self.training, + key_padding_mask=key_padding_mask, need_weights=need_weights, + attn_mask=attn_mask, use_separate_proj_weight=True, + q_proj_weight=self.q_proj_weight, k_proj_weight=self.k_proj_weight, + v_proj_weight=self.v_proj_weight, slot_competition=self.slot_competition, + return_kv=self.return_kv, k_mem=k_mem, v_mem=v_mem, + gate_attn=self.gate_attn, mem_mask=mem_mask, + rotary_pe=rotary_pe) + else: + if not hasattr(self, '_qkv_same_embed_dim'): + warnings.warn('A new version of MultiheadAttention module has been implemented. \ + Please re-train your model with the new module', + UserWarning) + + return multi_head_attention_forward( + query, key, value, self.embed_dim, self.num_heads, + self.in_proj_weight, self.in_proj_bias, + self.bias_k, self.bias_v, self.add_zero_attn, + self.dropout, self.out_proj.weight, self.out_proj.bias, + training=self.training, + key_padding_mask=key_padding_mask, need_weights=need_weights, + attn_mask=attn_mask, slot_competition=self.slot_competition, + return_kv=self.return_kv, k_mem=k_mem, v_mem=v_mem, + gate_attn=self.gate_attn, mem_mask=mem_mask, + rotary_pe=rotary_pe) + + +def multi_head_attention_forward(query, # type: Tensor + key, # type: Tensor + value, # type: Tensor + embed_dim_to_check, # type: int + num_heads, # type: int + in_proj_weight, # type: Tensor + in_proj_bias, # type: Tensor + bias_k, # type: Optional[Tensor] + bias_v, # type: Optional[Tensor] + add_zero_attn, # type: bool + dropout_p, # type: float + out_proj_weight, # type: Tensor + out_proj_bias, # type: Tensor + training=True, # type: bool + key_padding_mask=None, # type: Optional[Tensor] + need_weights=True, # type: bool + attn_mask=None, # type: Optional[Tensor] + use_separate_proj_weight=False, # type: bool + q_proj_weight=None, # type: Optional[Tensor] + k_proj_weight=None, # type: Optional[Tensor] + v_proj_weight=None, # type: Optional[Tensor] + static_k=None, # type: Optional[Tensor] + static_v=None, # type: Optional[Tensor] + slot_competition=False, + rotary_pe=None, + return_kv=False, + k_mem=None, + v_mem=None, + gate_attn=None, + mem_mask=None + ): + # type: (...) -> Tuple[Tensor, Optional[Tensor]] + r""" + Args: + query, key, value: map a query and a set of key-value pairs to an output. + See "Attention Is All You Need" for more details. + embed_dim_to_check: total dimension of the model. + num_heads: parallel attention heads. + in_proj_weight, in_proj_bias: input projection weight and bias. + bias_k, bias_v: bias of the key and value sequences to be added at dim=0. + add_zero_attn: add a new batch of zeros to the key and + value sequences at dim=1. + dropout_p: probability of an element to be zeroed. + out_proj_weight, out_proj_bias: the output projection weight and bias. + training: apply dropout if is ``True``. 
+ key_padding_mask: if provided, specified padding elements in the key will + be ignored by the attention. This is an binary mask. When the value is True, + the corresponding value on the attention layer will be filled with -inf. + need_weights: output attn_output_weights. + attn_mask: mask that prevents attention to certain positions. This is an additive mask + (i.e. the values will be added to the attention layer). + use_separate_proj_weight: the function accept the proj. weights for query, key, + and value in differnt forms. If false, in_proj_weight will be used, which is + a combination of q_proj_weight, k_proj_weight, v_proj_weight. + q_proj_weight, k_proj_weight, v_proj_weight, in_proj_bias: input projection weight and bias. + static_k, static_v: static key and value used for attention operators. + Shape: + Inputs: + - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is + the embedding dimension. + - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is + the embedding dimension. + - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is + the embedding dimension. + - key_padding_mask: :math:`(N, S)`, ByteTensor, where N is the batch size, S is the source sequence length. + - attn_mask: :math:`(L, S)` where L is the target sequence length, S is the source sequence length. + - static_k: :math:`(N*num_heads, S, E/num_heads)`, where S is the source sequence length, + N is the batch size, E is the embedding dimension. E/num_heads is the head dimension. + - static_v: :math:`(N*num_heads, S, E/num_heads)`, where S is the source sequence length, + N is the batch size, E is the embedding dimension. E/num_heads is the head dimension. + Outputs: + - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, + E is the embedding dimension. + - attn_output_weights: :math:`(N, L, S)` where N is the batch size, + L is the target sequence length, S is the source sequence length. 
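+
+    Note: this flash-attention variant casts the projected queries, keys and values to
+    ``float16`` before calling ``flash_attn_func`` and casts the result back to the query
+    dtype; attention weights are not returned (the second return value is always ``None``),
+    and ``key_padding_mask``/``attn_mask`` are not forwarded to the flash kernel.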
+ """ + + qkv_same = torch.equal(query, key) and torch.equal(key, value) + kv_same = torch.equal(key, value) + + tgt_len, bsz, embed_dim = query.size() + assert embed_dim == embed_dim_to_check + assert list(query.size()) == [tgt_len, bsz, embed_dim] + assert key.size() == value.size() + + head_dim = embed_dim // num_heads + assert head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads" + + if use_separate_proj_weight is not True: + if qkv_same: + # self-attention + q, k, v = F.linear(query, in_proj_weight, in_proj_bias).chunk(3, dim=-1) + + elif kv_same: + # encoder-decoder attention + # This is inline in_proj function with in_proj_weight and in_proj_bias + _b = in_proj_bias + _start = 0 + _end = embed_dim + _w = in_proj_weight[_start:_end, :] + if _b is not None: + _b = _b[_start:_end] + q = F.linear(query, _w, _b) + + if key is None: + assert value is None + k = None + v = None + else: + + # This is inline in_proj function with in_proj_weight and in_proj_bias + _b = in_proj_bias + _start = embed_dim + _end = None + _w = in_proj_weight[_start:, :] + if _b is not None: + _b = _b[_start:] + k, v = F.linear(key, _w, _b).chunk(2, dim=-1) + + else: + # This is inline in_proj function with in_proj_weight and in_proj_bias + _b = in_proj_bias + _start = 0 + _end = embed_dim + _w = in_proj_weight[_start:_end, :] + if _b is not None: + _b = _b[_start:_end] + q = F.linear(query, _w, _b) + + # This is inline in_proj function with in_proj_weight and in_proj_bias + _b = in_proj_bias + _start = embed_dim + _end = embed_dim * 2 + _w = in_proj_weight[_start:_end, :] + if _b is not None: + _b = _b[_start:_end] + k = F.linear(key, _w, _b) + + # This is inline in_proj function with in_proj_weight and in_proj_bias + _b = in_proj_bias + _start = embed_dim * 2 + _end = None + _w = in_proj_weight[_start:, :] + if _b is not None: + _b = _b[_start:] + v = F.linear(value, _w, _b) + else: + q_proj_weight_non_opt = torch.jit._unwrap_optional(q_proj_weight) + len1, len2 = q_proj_weight_non_opt.size() + assert len1 == embed_dim and len2 == query.size(-1) + + k_proj_weight_non_opt = torch.jit._unwrap_optional(k_proj_weight) + len1, len2 = k_proj_weight_non_opt.size() + assert len1 == embed_dim and len2 == key.size(-1) + + v_proj_weight_non_opt = torch.jit._unwrap_optional(v_proj_weight) + len1, len2 = v_proj_weight_non_opt.size() + assert len1 == embed_dim and len2 == value.size(-1) + + if in_proj_bias is not None: + q = F.linear(query, q_proj_weight_non_opt, in_proj_bias[0:embed_dim]) + k = F.linear(key, k_proj_weight_non_opt, in_proj_bias[embed_dim:(embed_dim * 2)]) + v = F.linear(value, v_proj_weight_non_opt, in_proj_bias[(embed_dim * 2):]) + else: + q = F.linear(query, q_proj_weight_non_opt, in_proj_bias) + k = F.linear(key, k_proj_weight_non_opt, in_proj_bias) + v = F.linear(value, v_proj_weight_non_opt, in_proj_bias) + + if bias_k is not None and bias_v is not None: + if static_k is None and static_v is None: + k = torch.cat([k, bias_k.repeat(1, bsz, 1)]) + v = torch.cat([v, bias_v.repeat(1, bsz, 1)]) + if attn_mask is not None: + attn_mask = torch.cat([attn_mask, + torch.zeros((attn_mask.size(0), 1), + dtype=attn_mask.dtype, + device=attn_mask.device)], dim=1) + if key_padding_mask is not None: + key_padding_mask = torch.cat( + [key_padding_mask, torch.zeros((key_padding_mask.size(0), 1), + dtype=key_padding_mask.dtype, + device=key_padding_mask.device)], dim=1) + else: + assert static_k is None, "bias cannot be added to static key." 
+ assert static_v is None, "bias cannot be added to static value." + else: + assert bias_k is None + assert bias_v is None + + if rotary_pe is not None: # rotary pe ROPE disentangeld + qp, kvp = rotary_pe + q_cos, q_sin = qp[..., 0], qp[..., 1] + k_cos, k_sin = kvp[..., 0], kvp[..., 1] + q = RotaryPositionEncoding.embed_rotary(q.transpose(0, 1), q_cos, q_sin).transpose(0, 1) + k = RotaryPositionEncoding.embed_rotary(k.transpose(0, 1), k_cos, k_sin).transpose(0, 1) + + q = q.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1) + if k is not None: + k = k.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1) + if v is not None: + v = v.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1) + + if static_k is not None: + assert static_k.size(0) == bsz * num_heads + assert static_k.size(2) == head_dim + k = static_k + + if static_v is not None: + assert static_v.size(0) == bsz * num_heads + assert static_v.size(2) == head_dim + v = static_v + + src_len = k.size(1) + + if key_padding_mask is not None: + assert key_padding_mask.size(0) == bsz + assert key_padding_mask.size(1) == src_len + + if add_zero_attn: + src_len += 1 + k = torch.cat([k, torch.zeros((k.size(0), 1) + k.size()[2:], dtype=k.dtype, device=k.device)], dim=1) + v = torch.cat([v, torch.zeros((v.size(0), 1) + v.size()[2:], dtype=v.dtype, device=v.device)], dim=1) + if attn_mask is not None: + attn_mask = torch.cat([attn_mask, torch.zeros((attn_mask.size(0), 1), + dtype=attn_mask.dtype, + device=attn_mask.device)], dim=1) + if key_padding_mask is not None: + key_padding_mask = torch.cat( + [key_padding_mask, torch.zeros((key_padding_mask.size(0), 1), + dtype=key_padding_mask.dtype, + device=key_padding_mask.device)], dim=1) + + q = q.unflatten(0, (bsz, num_heads)).transpose(1, 2).to(torch.float16) + k = k.unflatten(0, (bsz, num_heads)).transpose(1, 2).to(torch.float16) + v = v.unflatten(0, (bsz, num_heads)).transpose(1, 2).to(torch.float16) + attn_output = flash_attn_func( + q, k, v, dropout_p=dropout_p if training else 0.0 + ).to(query.dtype) # (bs, tgt_len, nheads, dim) + attn_output = attn_output.flatten(-2) # (bs, tgt_len, nheads * dim) + + attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim) + attn_output = F.linear(attn_output, out_proj_weight, out_proj_bias) + + # Return None to be compatible with MultiheadCustomAttention + return attn_output, None diff --git a/diffuser_actor/utils/position_encodings.py b/diffuser_actor/utils/position_encodings.py new file mode 100644 index 0000000..1f9b1b7 --- /dev/null +++ b/diffuser_actor/utils/position_encodings.py @@ -0,0 +1,143 @@ +import math + +import torch +import torch.nn as nn + + +class SinusoidalPosEmb(nn.Module): + + def __init__(self, dim): + super().__init__() + self.dim = dim + + def forward(self, x): + device = x.device + half_dim = self.dim // 2 + emb = math.log(10000) / (half_dim - 1) + emb = torch.exp(torch.arange(half_dim, device=device) * -emb) + emb = x[:, None] * emb[None, :] + emb = torch.cat((emb.sin(), emb.cos()), dim=-1) + return emb + + +class RotaryPositionEncoding(nn.Module): + def __init__(self, feature_dim, pe_type='Rotary1D'): + super().__init__() + + self.feature_dim = feature_dim + self.pe_type = pe_type + + @staticmethod + def embed_rotary(x, cos, sin): + x2 = torch.stack([-x[..., 1::2], x[..., ::2]], dim=-1).reshape_as(x).contiguous() + x = x * cos + x2 * sin + return x + + def forward(self, x_position): + bsize, npoint = x_position.shape + div_term = torch.exp( + torch.arange(0, 
self.feature_dim, 2, dtype=torch.float, device=x_position.device) + * (-math.log(10000.0) / (self.feature_dim))) + div_term = div_term.view(1, 1, -1) # [1, 1, d] + + sinx = torch.sin(x_position * div_term) # [B, N, d] + cosx = torch.cos(x_position * div_term) + + sin_pos, cos_pos = map( + lambda feat: torch.stack([feat, feat], dim=-1).view(bsize, npoint, -1), + [sinx, cosx] + ) + position_code = torch.stack([cos_pos, sin_pos] , dim=-1) + + if position_code.requires_grad: + position_code = position_code.detach() + + return position_code + + +class RotaryPositionEncoding3D(RotaryPositionEncoding): + + def __init__(self, feature_dim, pe_type='Rotary3D'): + super().__init__(feature_dim, pe_type) + + @torch.no_grad() + def forward(self, XYZ): + ''' + @param XYZ: [B,N,3] + @return: + ''' + bsize, npoint, _ = XYZ.shape + x_position, y_position, z_position = XYZ[..., 0:1], XYZ[..., 1:2], XYZ[..., 2:3] + div_term = torch.exp( + torch.arange(0, self.feature_dim // 3, 2, dtype=torch.float, device=XYZ.device) + * (-math.log(10000.0) / (self.feature_dim // 3)) + ) + div_term = div_term.view(1, 1, -1) # [1, 1, d//6] + + sinx = torch.sin(x_position * div_term) # [B, N, d//6] + cosx = torch.cos(x_position * div_term) + siny = torch.sin(y_position * div_term) + cosy = torch.cos(y_position * div_term) + sinz = torch.sin(z_position * div_term) + cosz = torch.cos(z_position * div_term) + + sinx, cosx, siny, cosy, sinz, cosz = map( + lambda feat: torch.stack([feat, feat], -1).view(bsize, npoint, -1), + [sinx, cosx, siny, cosy, sinz, cosz] + ) + + position_code = torch.stack([ + torch.cat([cosx, cosy, cosz], dim=-1), # cos_pos + torch.cat([sinx, siny, sinz], dim=-1) # sin_pos + ], dim=-1) + + if position_code.requires_grad: + position_code = position_code.detach() + + return position_code + + +class LearnedAbsolutePositionEncoding3D(nn.Module): + def __init__(self, input_dim, embedding_dim): + super().__init__() + self.absolute_pe_layer = nn.Sequential( + nn.Conv1d(input_dim, embedding_dim, kernel_size=1), + nn.BatchNorm1d(embedding_dim), + nn.ReLU(inplace=True), + nn.Conv1d(embedding_dim, embedding_dim, kernel_size=1) + ) + + def forward(self, xyz): + """ + Arguments: + xyz: (B, N, 3) tensor of the (x, y, z) coordinates of the points + + Returns: + absolute_pe: (B, N, embedding_dim) tensor of the absolute position encoding + """ + return self.absolute_pe_layer(xyz.permute(0, 2, 1)).permute(0, 2, 1) + + +class LearnedAbsolutePositionEncoding3Dv2(nn.Module): + def __init__(self, input_dim, embedding_dim, norm="none"): + super().__init__() + norm_tb = { + "none": nn.Identity(), + "bn": nn.BatchNorm1d(embedding_dim), + } + self.absolute_pe_layer = nn.Sequential( + nn.Conv1d(input_dim, embedding_dim, kernel_size=1), + norm_tb[norm], + nn.ReLU(inplace=True), + nn.Conv1d(embedding_dim, embedding_dim, kernel_size=1) + ) + + def forward(self, xyz): + """ + Arguments: + xyz: (B, N, 3) tensor of the (x, y, z) coordinates of the points + + Returns: + absolute_pe: (B, N, embedding_dim) tensor of the absolute position encoding + """ + return self.absolute_pe_layer(xyz.permute(0, 2, 1)).permute(0, 2, 1) diff --git a/diffuser_actor/utils/resnet.py b/diffuser_actor/utils/resnet.py new file mode 100644 index 0000000..dd305d3 --- /dev/null +++ b/diffuser_actor/utils/resnet.py @@ -0,0 +1,60 @@ +# Adapted from https://github.com/pytorch/vision/blob/v0.11.0/torchvision/models/resnet.py + +import torch +from torchvision import transforms +from typing import Type, Union, List, Any +from torchvision.models.resnet import _resnet, 
BasicBlock, Bottleneck, ResNet + + +def load_resnet50(pretrained: bool = False): + backbone = _resnet('resnet50', Bottleneck, [3, 4, 6, 3], pretrained=pretrained, progress=True) + normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) + return backbone, normalize + +def load_resnet18(pretrained: bool = False): + backbone = _resnet('resnet18', Bottleneck, [2, 2, 2, 2], pretrained=pretrained, progress=True) + normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) + return backbone, normalize + +def _resnet( + arch: str, + block: Type[Union[BasicBlock, Bottleneck]], + layers: List[int], + pretrained: bool, + progress: bool, + **kwargs: Any +) -> ResNet: + model = ResNetFeatures(block, layers, **kwargs) + if pretrained: + if int(torch.__version__[0]) <= 1: + from torch.hub import load_state_dict_from_url + from torchvision.models.resnet import model_urls + state_dict = load_state_dict_from_url(model_urls[arch], progress=progress) + model.load_state_dict(state_dict) + else: + raise NotImplementedError("Pretrained models not supported in PyTorch 2.0+") + return model + + +class ResNetFeatures(ResNet): + def __init__(self, block, layers, **kwargs): + super().__init__(block, layers, **kwargs) + + def _forward_impl(self, x: torch.Tensor): + x = self.conv1(x) + x = self.bn1(x) + x0 = self.relu(x) + x = self.maxpool(x0) + + x1 = self.layer1(x) + x2 = self.layer2(x1) + x3 = self.layer3(x2) + x4 = self.layer4(x3) + + return { + "res1": x0, + "res2": x1, + "res3": x2, + "res4": x3, + "res5": x4, + } diff --git a/diffuser_actor/utils/utils.py b/diffuser_actor/utils/utils.py new file mode 100644 index 0000000..b87993d --- /dev/null +++ b/diffuser_actor/utils/utils.py @@ -0,0 +1,264 @@ +import numpy as np +import einops +import torch +import torch.nn.functional as F + + +def normalise_quat(x: torch.Tensor): + return x / torch.clamp(x.square().sum(dim=-1).sqrt().unsqueeze(-1), min=1e-10) + + +def norm_tensor(tensor: torch.Tensor) -> torch.Tensor: + return tensor / torch.linalg.norm(tensor, ord=2, dim=-1, keepdim=True) + + +def sample_ghost_points_grid(bounds, num_points_per_dim=10): + x_ = np.linspace(bounds[0][0], bounds[1][0], num_points_per_dim) + y_ = np.linspace(bounds[0][1], bounds[1][1], num_points_per_dim) + z_ = np.linspace(bounds[0][2], bounds[1][2], num_points_per_dim) + x, y, z = np.meshgrid(x_, y_, z_, indexing='ij') + ghost_points = einops.rearrange(np.stack([x, y, z]), "n x y z -> (x y z) n") + return ghost_points + + +def sample_ghost_points_uniform_cube(bounds, num_points=1000): + x = np.random.uniform(bounds[0][0], bounds[1][0], num_points) + y = np.random.uniform(bounds[0][1], bounds[1][1], num_points) + z = np.random.uniform(bounds[0][2], bounds[1][2], num_points) + ghost_points = np.stack([x, y, z], axis=1) + return ghost_points + + +def sample_ghost_points_uniform_sphere(center, radius, bounds, num_points=1000): + """Sample points uniformly within a sphere through rejection sampling.""" + ghost_points = np.empty((0, 3)) + tries = 0 + while ghost_points.shape[0] < num_points: + points = sample_ghost_points_uniform_cube(bounds, num_points) + l2 = np.linalg.norm(points - center, axis=1) + ghost_points = np.concatenate([ghost_points, points[l2 < radius]]) + tries += 1 + if tries > 100: + raise Exception('cannot find valid action sample') + ghost_points = ghost_points[:num_points] + return ghost_points + + +""" +Below is a continuous 6D rotation representation adapted from +On the Continuity of Rotation Representations in 
Neural Networks +https://arxiv.org/pdf/1812.07035.pdf +https://github.com/papagina/RotationContinuity/blob/master/sanity_test/code/tools.py +""" + + +def normalize_vector(v, return_mag=False): + device = v.device + batch = v.shape[0] + v_mag = torch.sqrt(v.pow(2).sum(1)) + v_mag = torch.max(v_mag, torch.autograd.Variable(torch.FloatTensor([1e-8]).to(device))) + v_mag = v_mag.view(batch, 1).expand(batch, v.shape[1]) + v = v / v_mag + if return_mag: + return v, v_mag[:, 0] + else: + return v + + +def cross_product(u, v): + batch = u.shape[0] + i = u[:, 1] * v[:, 2] - u[:, 2] * v[:, 1] + j = u[:, 2] * v[:, 0] - u[:, 0] * v[:, 2] + k = u[:, 0] * v[:, 1] - u[:, 1] * v[:, 0] + out = torch.cat((i.view(batch, 1), j.view(batch, 1), k.view(batch, 1)), 1) + return out # batch*3 + + +def compute_rotation_matrix_from_ortho6d(ortho6d): + x_raw = ortho6d[:, 0:3] # batch*3 + y_raw = ortho6d[:, 3:6] # batch*3 + + x = normalize_vector(x_raw) # batch*3 + z = cross_product(x, y_raw) # batch*3 + z = normalize_vector(z) # batch*3 + y = cross_product(z, x) # batch*3 + + x = x.view(-1, 3, 1) + y = y.view(-1, 3, 1) + z = z.view(-1, 3, 1) + matrix = torch.cat((x, y, z), 2) # batch*3*3 + return matrix + + +def get_ortho6d_from_rotation_matrix(matrix): + # The orhto6d represents the first two column vectors a1 and a2 of the + # rotation matrix: [ | , |, | ] + # [ a1, a2, a3] + # [ | , |, | ] + ortho6d = matrix[:, :, :2].permute(0, 2, 1).flatten(-2) + return ortho6d + + +def orthonormalize_by_gram_schmidt(matrix): + """Post-processing a 9D matrix with Gram-Schmidt orthogonalization. + + Args: + matrix: A tensor of shape (..., 3, 3) + + Returns: + A tensor of shape (..., 3, 3) with orthogonal rows. + """ + a1, a2, a3 = matrix[..., :, 0], matrix[..., :, 1], matrix[..., :, 2] + b1 = F.normalize(a1, dim=-1) + + b2 = a2 - (b1 * a2).sum(-1, keepdim=True) * b1 + b2 = F.normalize(b2, dim=-1) + + b3 = a3 - (b1 * a3).sum(-1, keepdim=True) * b1 - (b2 * a3).sum(-1, keepdim=True) * b2 + b3 = F.normalize(b3, dim=-1) + + return torch.stack([b1, b2, b3], dim=-1) + + +def rotation_from_svd(points1, points2, center1=None, center2=None): + """Compute rotation matrix from two point clouds using SVD. + + Args: + points1: A tensor of shape (..., npts, 3) + points2: A tensor of shape (..., npts, 3) + cetner1: A tensor of shape (..., 3) representing the center of points1. + cetner2: A tensor of shape (..., 3) representing the center of points2. + + Returns: + A tensor of shape (..., 3, 3) representing the rotation matrix. + """ + if center1 is None: + points1 = points1 - points1.mean(dim=-2, keepdim=True) + else: + points1 = points1 - center1.unsqueeze(-2) + + if center2 is None: + points2 = points2 - points2.mean(dim=-2, keepdim=True) + else: + points2 = points2 - center2.unsqueeze(-2) + + # compute svd + H = points2.transpose(-2, -1) @ points1 + U, S, Vh = torch.linalg.svd(H) + V = Vh.transpose(-2, -1) + R = V @ U.transpose(-2, -1) + + # if the determinant(R) < 0, multiply the 3rd column of V with -1 + inverse_V = torch.stack([ + V[..., 0], V[..., 1], -V[..., 2] + ], dim=-1) + V = torch.where(torch.linalg.det(R).unsqueeze(-1).unsqueeze(-1) < 0, + inverse_V, V) + R = V @ U.transpose(-2, -1) + + return R + + +def quaternion_to_matrix(quaternions: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as quaternions to rotation matrices. + + Args: + quaternions: quaternions with real part first, + as tensor of shape (..., 4). + + Returns: + Rotation matrices as tensor of shape (..., 3, 3). 
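+
+    Example: the identity quaternion maps to the identity rotation, e.g.
+    ``quaternion_to_matrix(torch.tensor([[1., 0., 0., 0.]]))`` returns a
+    (1, 3, 3) tensor holding the 3x3 identity matrix.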
+ """ + r, i, j, k = torch.unbind(quaternions, -1) + # pyre-fixme[58]: `/` is not supported for operand types `float` and `Tensor`. + two_s = 2.0 / (quaternions * quaternions).sum(-1) + + o = torch.stack( + ( + 1 - two_s * (j * j + k * k), + two_s * (i * j - k * r), + two_s * (i * k + j * r), + two_s * (i * j + k * r), + 1 - two_s * (i * i + k * k), + two_s * (j * k - i * r), + two_s * (i * k - j * r), + two_s * (j * k + i * r), + 1 - two_s * (i * i + j * j), + ), + -1, + ) + return o.reshape(quaternions.shape[:-1] + (3, 3)) + + +def _sqrt_positive_part(x: torch.Tensor) -> torch.Tensor: + """ + Returns torch.sqrt(torch.max(0, x)) + but with a zero subgradient where x is 0. + """ + ret = torch.zeros_like(x) + positive_mask = x > 0 + ret[positive_mask] = torch.sqrt(x[positive_mask]) + return ret + + +def matrix_to_quaternion(matrix: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as rotation matrices to quaternions. + + Args: + matrix: Rotation matrices as tensor of shape (..., 3, 3). + + Returns: + quaternions with real part first, as tensor of shape (..., 4). + """ + if matrix.size(-1) != 3 or matrix.size(-2) != 3: + raise ValueError(f"Invalid rotation matrix shape {matrix.shape}.") + + batch_dim = matrix.shape[:-2] + m00, m01, m02, m10, m11, m12, m20, m21, m22 = torch.unbind( + matrix.reshape(batch_dim + (9,)), dim=-1 + ) + + q_abs = _sqrt_positive_part( + torch.stack( + [ + 1.0 + m00 + m11 + m22, + 1.0 + m00 - m11 - m22, + 1.0 - m00 + m11 - m22, + 1.0 - m00 - m11 + m22, + ], + dim=-1, + ) + ) + + # we produce the desired quaternion multiplied by each of r, i, j, k + quat_by_rijk = torch.stack( + [ + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([q_abs[..., 0] ** 2, m21 - m12, m02 - m20, m10 - m01], dim=-1), + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([m21 - m12, q_abs[..., 1] ** 2, m10 + m01, m02 + m20], dim=-1), + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([m02 - m20, m10 + m01, q_abs[..., 2] ** 2, m12 + m21], dim=-1), + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([m10 - m01, m20 + m02, m21 + m12, q_abs[..., 3] ** 2], dim=-1), + ], + dim=-2, + ) + + # We floor here at 0.1 but the exact level is not important; if q_abs is small, + # the candidate won't be picked. + flr = torch.tensor(0.1).to(dtype=q_abs.dtype, device=q_abs.device) + quat_candidates = quat_by_rijk / (2.0 * q_abs[..., None].max(flr)) + + # if not for numerical problems, quat_candidates[i] should be same (up to a sign), + # forall i; we pick the best-conditioned one (with the largest denominator) + + return quat_candidates[ + F.one_hot(q_abs.argmax(dim=-1), num_classes=4) > 0.5, : + ].reshape(batch_dim + (4,)) diff --git a/docs/DATA_PREPARATION_CALVIN.md b/docs/DATA_PREPARATION_CALVIN.md new file mode 100644 index 0000000..7f4bcfa --- /dev/null +++ b/docs/DATA_PREPARATION_CALVIN.md @@ -0,0 +1,32 @@ +# Prepare data on CALVIN + +* Download the play demonstrations from [Calvin](https://github.com/mees/calvin) repo. 
+``` +> cd calvin/dataset +> sh download_data.sh ABC +``` + +* Package the demonstrations for training +``` +> python data_preprocessing/package_calvin.py --split training +> python data_preprocessing/package_calvin.py --split validation +``` + +### Expected directory layout +``` +./calvin/dataset/task_ABC_D + |------- training/ + |------- validation/ + +./data/calvin/packaged_ABC_D + |------- training/ + | |------- A+0/ + | | |------- ann_1.dat + | | |------- ... + | | + | |------- B+0/ + | |------- C+0/ + | + |------- validation/ + |------- D+0/ +``` diff --git a/docs/DATA_PREPARATION_RLBENCH.md b/docs/DATA_PREPARATION_RLBENCH.md new file mode 100644 index 0000000..c6e50c3 --- /dev/null +++ b/docs/DATA_PREPARATION_RLBENCH.md @@ -0,0 +1,65 @@ +# Prepare data on RLBench + +## PerAct setup + +We use exactly the same train/test set on RLBench as [PerAct](https://github.com/peract/peract). We re-render the same episodes with higher camera resolution ($256 \times 256$ compared $128 \times 128$ of PerAct). + +### Prepare for testing episodes + +1. Download the testing episodes from [PerAct](https://github.com/peract/peract?tab=readme-ov-file#pre-generated-datasets) repo. Extract the zip files to `./data/peract/raw/test` +2. Rearrange episodes by their variantions +``` +# For each task, separate episodes in all_variations/ to variations0/ ... variationsN/ +> python data_preprocessing/rearrange_rlbench_demos.py --root_dir $(pwd)/data/peract/raw/test +``` + + +### Prepare for training/validation episodes + +1. Download our packaged demonstrations for training from [here](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/Peract_packaged.zip). Extract the zip file to `./data/peract/` + + +### Optional: Re-render training/validation episodes in higher resolution + +1. Download the training/validation episodes from [PerAct](https://github.com/peract/peract?tab=readme-ov-file#pre-generated-datasets) repo. Extract the zip files to `./data/peract/raw/train` or `./data/peract/raw/val` +2. Run this bashscript for re-rendering and packaging them into `.dat` files +``` +# set SPLIT=train for training episodes +# set SPLIT=val for validation episodes +> bash scripts/rerender_highres_cameraview.sh +``` + + +### Expected directory layout +``` +./data/peract + |------ raw/test + | |------ close_jar/ + | | |------ variation0/ + | | | |------ variation_descriptions.pkl + | | | |------ episodes/ + | | | |------ episode0/ + | | | | |------ low_dim_obs.pkl + | | | | |------ front_depth/ + | | | | |------ front_rgb/ + | | | | |------ wrist_depth/ + | | | | |------ wrist_rgb/ + | | | | |------ left_shoulder_depth/ + | | | | |------ left_shoulder_rgb/ + | | | | |------ right_shoulder_depth/ + | | | | |------ right_shoulder_rgb/ + | | | | + | | | |------ episode0/... + | | |------ variation1/... + | |------ push_buttons/a + | + |------ Peract_packaged/ + |------ train/ + | |------ close_jar+0/ + | | |------ ep0.dat + | | |------ ep1.dat + | | + | |------ close_jar+0/... + | + |------ val/... +``` diff --git a/docs/GETTING_STARTED_CALVIN.md b/docs/GETTING_STARTED_CALVIN.md new file mode 100644 index 0000000..15a1b07 --- /dev/null +++ b/docs/GETTING_STARTED_CALVIN.md @@ -0,0 +1,22 @@ +# Getting Started with Calvin + +### Step 0: Install CALVIN +``` +> git clone --recurse-submodules https://github.com/mees/calvin.git +> export CALVIN_ROOT=$(pwd)/calvin +> cd calvin +> cd calvin_env; git checkout -b main --track origin/main +> cd .. 
+> ./install.sh
+```
+
+### Step 1: Prepare data on CALVIN
+See [Preparing CALVIN dataset](./DATA_PREPARATION_CALVIN.md)
+
+### Step 2: Train the policy
+
+* Train and test 3D Diffuser Actor on CALVIN
+
+```
+> bash scripts/train_trajectory_calvin.sh
+```
diff --git a/docs/GETTING_STARTED_RLBENCH.md b/docs/GETTING_STARTED_RLBENCH.md
new file mode 100644
index 0000000..3eec53b
--- /dev/null
+++ b/docs/GETTING_STARTED_RLBENCH.md
@@ -0,0 +1,76 @@
+# Getting Started with RLBench
+
+There are three simulation setups in RLBench: 1) [PerAct](https://github.com/peract/peract), 2) [GNFactor](https://github.com/YanjieZe/GNFactor), and 3) [Hiveformer](https://github.com/vlc-robot/hiveformer). GNFactor uses exactly the same setup as PerAct. Both use different success conditions and 3D object models from Hiveformer.
+
+Before training or testing on a given setup, please install the matching version of the RLBench library.
+
+## Train and evaluate on RLBench with the PerAct/GNFactor setup
+
+### Step 0: Prepare data on RLBench
+See [Preparing RLBench dataset](./DATA_PREPARATION_RLBENCH.md)
+
+### Step 1: Install RLBench with the PerAct setup
+```
+> git clone https://github.com/MohitShridhar/RLBench.git
+> cd RLBench
+> git checkout -b peract --track origin/peract
+> pip install -r requirements.txt
+> pip install -e .
+```
+
+Remember to modify the success condition of the `close_jar` task in RLBench, as the original condition is incorrect. See this [pull request](https://github.com/MohitShridhar/RLBench/pull/1) for more details.
+
+### Step 2: Train the policy
+
+* Train 3D Diffuser Actor with the PerAct setup
+
+```
+> bash scripts/train_keypose_peract.sh
+```
+
+* Train 3D Diffuser Actor with the GNFactor setup
+
+```
+> bash scripts/train_keypose_gnfactor.sh
+```
+
+We also provide training scripts for [Act3D](https://arxiv.org/abs/2306.17817).
+
+* Train Act3D with the PerAct setup
+
+```
+> bash scripts/train_act3d_peract.sh
+```
+
+* Train Act3D with the GNFactor setup
+
+```
+> bash scripts/train_act3d_gnfactor.sh
+```
+
+### Step 3: Test the policy
+
+* Test 3D Diffuser Actor with the PerAct setup
+
+```
+> bash online_evaluation_rlbench/eval_peract.sh
+```
+
+* Test 3D Diffuser Actor with the GNFactor setup
+
+```
+> bash online_evaluation_rlbench/eval_gnfactor.sh
+```
+
+We also provide testing scripts for [Act3D](https://arxiv.org/abs/2306.17817).
+
+* Test Act3D with the PerAct setup
+
+```
+> bash online_evaluation_rlbench/eval_act3d_peract.sh
+```
+
+* Test Act3D with the GNFactor setup
+
+```
+> bash online_evaluation_rlbench/eval_act3d_gnfactor.sh
+```
diff --git a/docs/OVERVIEW.md b/docs/OVERVIEW.md
new file mode 100644
index 0000000..bbfb28f
--- /dev/null
+++ b/docs/OVERVIEW.md
@@ -0,0 +1,74 @@
+# Model overview
+In this code base, we provide our implementation of [3D Diffuser Actor](../model/trajectory_optimization/diffuser_actor.py) and [Act3D](../model/keypose_optimization/act3d.py). We provide an overview of the inputs and outputs of both models.
+
+## Common input format
+Both models take the following inputs:
+
+1. `RGB observations`: a tensor of shape (batch_size, num_cameras, 3, H, W). The pixel values are in the range [0, 1].
+2. `Point cloud observation`: a tensor of shape (batch_size, num_cameras, 3, H, W).
+3. `Instruction encodings`: a tensor of shape (batch_size, max_instruction_length, C). In this code base, the embedding dimension `C` is set to 512.
+
+
+
+
+
+
+
+## 3D Diffuser Actor
+3D Diffuser Actor is a diffusion model that takes proprioception history into account.
+
+## 3D Diffuser Actor
+3D Diffuser Actor is a diffusion model that takes the proprioception history into account. It can predict either keyposes or full trajectories. The model uses the continuous `6D` rotation representation by default.
+
+### Additional inputs
+* `curr_gripper`: a tensor of shape (batch_size, history_length, 7), where the last dimension encodes the xyz position (3D) and quaternion (4D).
+* `trajectory_mask`: a tensor of shape (batch_size, trajectory_length), which is only used to indicate the length of each trajectory. To predict keyposes, simply set its shape to (batch_size, 1).
+* `gt_trajectory`: a tensor of shape (batch_size, trajectory_length, 7), where the last dimension encodes the xyz position (3D) and quaternion (4D). This input is only used during training; it can safely be set to `None` otherwise.
+
+### Output
+The model returns the diffusion loss when `run_inference=False`; it returns pose trajectories of shape (batch_size, trajectory_length, 8) when `run_inference=True`.
+
+### Usage
+For training, forward 3D Diffuser Actor with `run_inference=False`:
+```
+> loss = model.forward(gt_trajectory,
+                       trajectory_mask,
+                       rgb_obs,
+                       pcd_obs,
+                       instruction,
+                       curr_gripper,
+                       run_inference=False)
+```
+
+For evaluation, forward 3D Diffuser Actor with `run_inference=True`:
+```
+> fake_gt_trajectory = None
+> trajectory_mask = torch.full((1, trajectory_length), False).to(device)
+> trajectory = model.forward(fake_gt_trajectory,
+                             trajectory_mask,
+                             rgb_obs,
+                             pcd_obs,
+                             instruction,
+                             curr_gripper,
+                             run_inference=True)
+```
+
+## Act3D
+Act3D does not consider the proprioception history and only predicts keyposes. The model uses `quaternion` as the rotation representation by default.
+
+### Input
+* `curr_gripper`: a tensor of shape (batch_size, 8), where the last dimension encodes the xyz position (3D), quaternion (4D), and end-effector openness (1D).
+ +### Output + +* `position`: a tensor of shape (batch_size, 3) +* `rotation`: a tensor of shape (batch_size, 4) +* `gripper`: a tensor of shape (batch_size, 1) + +### Usage +Forward Act3D, the model returns a dictionary + +``` +> out_dict = Act3D.forward(rgb_obs, + pcd_obs, + instruction, + curr_gripper) +``` + diff --git a/engine.py b/engine.py new file mode 100644 index 0000000..e9fbbb2 --- /dev/null +++ b/engine.py @@ -0,0 +1,320 @@ +"""Shared utilities for all main scripts.""" + +import os +import pickle +import random + +import numpy as np +import torch +import torch.optim as optim +from torch.utils.data import DataLoader, default_collate +from torch.utils.data.distributed import DistributedSampler +import torch.distributed as dist +from torch.nn.parallel import DistributedDataParallel +from torch.utils.tensorboard import SummaryWriter +from tqdm import trange + + +class BaseTrainTester: + """Basic train/test class to be inherited.""" + + def __init__(self, args): + """Initialize.""" + if dist.get_rank() == 0: + args.save(str(args.log_dir / "hparams.json")) + + self.args = args + + if dist.get_rank() == 0: + self.writer = SummaryWriter(log_dir=args.log_dir) + + @staticmethod + def get_datasets(): + """Initialize datasets.""" + train_dataset = None + test_dataset = None + return train_dataset, test_dataset + + def get_loaders(self, collate_fn=default_collate): + """Initialize data loaders.""" + def seed_worker(worker_id): + worker_seed = torch.initial_seed() % 2**32 + np.random.seed(worker_seed) + random.seed(worker_seed) + np.random.seed(np.random.get_state()[1][0] + worker_id) + # Datasets + train_dataset, test_dataset = self.get_datasets() + # Samplers and loaders + g = torch.Generator() + g.manual_seed(0) + train_sampler = DistributedSampler(train_dataset) + train_loader = DataLoader( + train_dataset, + batch_size=self.args.batch_size, + shuffle=False, + num_workers=self.args.num_workers, + worker_init_fn=seed_worker, + collate_fn=collate_fn, + pin_memory=True, + sampler=train_sampler, + drop_last=True, + generator=g + ) + test_sampler = DistributedSampler(test_dataset, shuffle=True) + test_loader = DataLoader( + test_dataset, + batch_size=self.args.batch_size_val, + shuffle=False, + num_workers=0, + worker_init_fn=seed_worker, + collate_fn=collate_fn, + pin_memory=True, + sampler=test_sampler, + drop_last=False, + generator=g + ) + return train_loader, test_loader + + @staticmethod + def get_model(): + """Initialize the model.""" + return None + + @staticmethod + def get_criterion(): + """Get loss criterion for training.""" + # criterion is a class, must have compute_loss and compute_metrics + return None + + def get_optimizer(self, model): + """Initialize optimizer.""" + optimizer_grouped_parameters = [ + {"params": [], "weight_decay": 0.0, "lr": self.args.lr}, + {"params": [], "weight_decay": 5e-4, "lr": self.args.lr} + ] + no_decay = ["bias", "LayerNorm.weight", "LayerNorm.bias"] + for name, param in model.named_parameters(): + if any(nd in name for nd in no_decay): + optimizer_grouped_parameters[0]["params"].append(param) + else: + optimizer_grouped_parameters[1]["params"].append(param) + optimizer = optim.AdamW(optimizer_grouped_parameters) + return optimizer + + def main(self, collate_fn=default_collate): + """Run main training/testing pipeline.""" + # Get loaders + train_loader, test_loader = self.get_loaders(collate_fn) + + # Get model + model = self.get_model() + + # Get criterion + criterion = self.get_criterion() + + # Get optimizer + optimizer = 
self.get_optimizer(model) + + # Move model to devices + if torch.cuda.is_available(): + model = model.cuda() + model = DistributedDataParallel( + model, device_ids=[self.args.local_rank], + broadcast_buffers=False, find_unused_parameters=True + ) + + # Check for a checkpoint + start_iter, best_loss = 0, None + if self.args.checkpoint: + assert os.path.isfile(self.args.checkpoint) + start_iter, best_loss = self.load_checkpoint(model, optimizer) + + # Eval only + if bool(self.args.eval_only): + print("Test evaluation.......") + model.eval() + new_loss = self.evaluate_nsteps( + model, criterion, test_loader, step_id=-1, + val_iters=max( + 5, + int(4 * len(self.args.tasks)/self.args.batch_size_val) + ) + ) + return model + + # Training loop + iter_loader = iter(train_loader) + model.train() + for step_id in trange(start_iter, self.args.train_iters): + try: + sample = next(iter_loader) + except StopIteration: + iter_loader = iter(train_loader) + sample = next(iter_loader) + + # import matplotlib.pyplot as plt + # for i in range(10): + # plt.figure() + # plt.imshow(sample['rgbs'][i][1].cpu().detach().permute(1, 2, 0)) + # plt.show() + + self.train_one_step(model, criterion, optimizer, step_id, sample) + if (step_id + 1) % self.args.val_freq == 0: + print("Train evaluation.......") + model.eval() + new_loss = self.evaluate_nsteps( + model, criterion, train_loader, step_id, + val_iters=max( + 5, + int(4 * len(self.args.tasks)/self.args.batch_size_val) + ), + split='train' + ) + print("Test evaluation.......") + model.eval() + new_loss = self.evaluate_nsteps( + model, criterion, test_loader, step_id, + val_iters=max( + 5, + int(4 * len(self.args.tasks)/self.args.batch_size_val) + ) + ) + if dist.get_rank() == 0: # save model + best_loss = self.save_checkpoint( + model, optimizer, step_id, + new_loss, best_loss + ) + model.train() + + return model + + def train_one_step(self, model, criterion, optimizer, step_id, sample): + """Run a single training step.""" + pass + + @torch.no_grad() + def evaluate_nsteps(self, model, criterion, loader, step_id, val_iters, + split='val'): + """Run a given number of evaluation steps.""" + return None + + def load_checkpoint(self, model, optimizer): + """Load from checkpoint.""" + print("=> loading checkpoint '{}'".format(self.args.checkpoint)) + + model_dict = torch.load(self.args.checkpoint, map_location="cpu") + model.load_state_dict(model_dict["weight"]) + if 'optimizer' in model_dict: + optimizer.load_state_dict(model_dict["optimizer"]) + for p in range(len(optimizer.param_groups)): + optimizer.param_groups[p]['lr'] = self.args.lr + start_iter = model_dict.get("iter", 0) + best_loss = model_dict.get("best_loss", None) + + print("=> loaded successfully '{}' (step {})".format( + self.args.checkpoint, model_dict.get("iter", 0) + )) + del model_dict + torch.cuda.empty_cache() + return start_iter, best_loss + + def save_checkpoint(self, model, optimizer, step_id, new_loss, best_loss): + """Save checkpoint if requested.""" + if new_loss is None or best_loss is None or new_loss <= best_loss: + best_loss = new_loss + torch.save({ + "weight": model.state_dict(), + "optimizer": optimizer.state_dict(), + "iter": step_id + 1, + "best_loss": best_loss + }, self.args.log_dir / "best.pth") + # torch.save({ + # "weight": model.state_dict(), + # "optimizer": optimizer.state_dict(), + # "iter": step_id + 1, + # "best_loss": best_loss + # }, self.args.log_dir / "last.pth") + torch.save({ + "weight": model.state_dict(), + "optimizer": optimizer.state_dict(), + "iter": step_id + 1, 
+ "best_loss": best_loss + }, self.args.log_dir / (str(step_id+1)+".pth")) + return best_loss + + def synchronize_between_processes(self, a_dict): + all_dicts = all_gather(a_dict) + + if not is_dist_avail_and_initialized() or dist.get_rank() == 0: + merged = {} + for key in all_dicts[0].keys(): + device = all_dicts[0][key].device + merged[key] = torch.cat([ + p[key].to(device) for p in all_dicts + if key in p + ]) + a_dict = merged + return a_dict + + +def all_gather(data): + """ + Run all_gather on arbitrary picklable data (not necessarily tensors) + + Args: + data: any picklable object + Returns: + list[data]: list of data gathered from each rank + """ + world_size = get_world_size() + if world_size == 1: + return [data] + + # serialized to a Tensor + buffer = pickle.dumps(data) + storage = torch.ByteStorage.from_buffer(buffer) + tensor = torch.ByteTensor(storage).to("cuda") + + # obtain Tensor size of each rank + local_size = torch.tensor([tensor.numel()], device="cuda") + size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)] + dist.all_gather(size_list, local_size) + size_list = [int(size.item()) for size in size_list] + max_size = max(size_list) + + # receiving Tensor from all ranks + # we pad the tensor because torch all_gather does not support + # gathering tensors of different shapes + tensor_list = [] + for _ in size_list: + tensor_list.append(torch.empty( + (max_size,), dtype=torch.uint8, device="cuda" + )) + if local_size != max_size: + padding = torch.empty( + size=(max_size - local_size,), + dtype=torch.uint8, device="cuda" + ) + tensor = torch.cat((tensor, padding), dim=0) + dist.all_gather(tensor_list, tensor) + + data_list = [] + for size, tensor in zip(size_list, tensor_list): + buffer = tensor.cpu().numpy().tobytes()[:size] + data_list.append(pickle.loads(buffer)) + + return data_list + + +def is_dist_avail_and_initialized(): + if not dist.is_available(): + return False + if not dist.is_initialized(): + return False + return True + + +def get_world_size(): + if not is_dist_avail_and_initialized(): + return 1 + return dist.get_world_size() diff --git a/environment.yaml b/environment.yaml new file mode 100644 index 0000000..117930c --- /dev/null +++ b/environment.yaml @@ -0,0 +1,44 @@ +name: equ_act +channels: + - pytorch + - nvidia + - conda-forge + - defaults +dependencies: + - python=3.8 + - pip + - pytorch + - torchvision + - torchaudio + - pip: + - torch-cluster + - torch-geometric + - torch-scatter + - e3nn + - healpy + - git+https://github.com/openai/CLIP.git + - numpy==1.23.5 + - pillow + - einops + - typed-argument-parser + - tqdm + - transformers + - absl-py + - matplotlib + - scipy + - tensorboard + - opencv-python + - blosc + - setuptools==57.5.0 + - beautifulsoup4 + - bleach>=6.0.0 + - defusedxml + - jinja2>=3.0 + - jupyter-core>=4.7 + - jupyterlab-pygments + - mistune==2.0.5 + - nbclient>=0.5.0 + - nbformat>=5.7 + - pandocfilters>=1.4.1 + - tinycss2 + - traitlets>=5.1 \ No newline at end of file diff --git a/fig/sota_calvin.png b/fig/sota_calvin.png new file mode 100644 index 0000000..2d35914 Binary files /dev/null and b/fig/sota_calvin.png differ diff --git a/fig/sota_rlbench.png b/fig/sota_rlbench.png new file mode 100644 index 0000000..2f85e1b Binary files /dev/null and b/fig/sota_rlbench.png differ diff --git a/fig/teaser.gif b/fig/teaser.gif new file mode 100644 index 0000000..0ad4ef2 Binary files /dev/null and b/fig/teaser.gif differ diff --git a/main_equ_act.py b/main_equ_act.py new file mode 100644 index 0000000..4877db8 --- 
/dev/null +++ b/main_equ_act.py @@ -0,0 +1,517 @@ +"""Main script for keypose optimization.""" + +import os +from pathlib import Path +import random +from typing import Tuple, Optional + +import numpy as np +import tap +import torch +import torch.distributed as dist +from torch.nn import functional as F + +from datasets.dataset_engine import RLBenchDataset +from diffuser_actor.equ_act_optimization.equ_act import EquAct +from engine import BaseTrainTester +from diffuser_actor import Act3D +from utils.common_utils import ( + load_instructions, count_parameters, get_gripper_loc_bounds +) + + +class Arguments(tap.Tap): + # cameras: Tuple[str, ...] = ("wrist", "left_shoulder", "right_shoulder") + cameras: Tuple[str, ...] = ("left_shoulder", "right_shoulder", "wrist", "front") + image_size: str = "256,256" + max_episodes_per_task: int = 100 + instructions: Optional[Path] = "instructions.pkl" + seed: int = 0 + # tasks: Tuple[str, ...] = ('place_cups', 'close_jar', 'insert_onto_square_peg', 'light_bulb_in', 'meat_off_grill', + # 'open_drawer', 'place_shape_in_shape_sorter', 'place_wine_at_rack_location', + # 'push_buttons', 'put_groceries_in_cupboard', 'put_item_in_drawer', 'put_money_in_safe', + # 'reach_and_drag', 'slide_block_to_color_target', 'stack_blocks', 'stack_cups', + # 'sweep_to_dustpan_of_size', 'turn_tap') + tasks: Tuple[str, ...] = ('place_wine_at_rack_location') + variations: int = 0 + checkpoint: Optional[Path] = None + accumulate_grad_batches: int = 1 + val_freq: int = 500 + gripper_loc_bounds: Optional[str] = None + gripper_loc_bounds_buffer: float = 0.04 + eval_only: int = 0 + + # Training and validation datasets + dataset: Path + valset: Path + + # Logging to base_log_dir/exp_log_dir/run_log_dir + base_log_dir: Path = Path(__file__).parent / "train_logs" + exp_log_dir: str = "exp" + run_log_dir: str = "run" + + # Main training parameters + num_workers: int = 1 + batch_size: int = 16 + batch_size_val: int = 4 + cache_size: int = 100 + cache_size_val: int = 100 + lr: float = 1e-4 + train_iters: int = 200_000 + # max_episode_length: int = 5 # -1 for no limit + max_episode_length: int = 1 # -1 for no limit + + # Data augmentations + image_rescale: str = "1.0,1.0" # (min, max), "1.0,1.0" for no rescaling + + # Loss + position_loss: str = "ce" # one of "ce" (our model), "mse" (HiveFormer) + trans_temperature: float = 0.01 # used for converting expert trans to Boltzmann distribution + compute_loss_at_all_layers: int = 0 + position_loss_coeff: float = 1.0 + position_offset_loss_coeff: float = 10000.0 + rotation_loss_coeff: float = 10.0 + symmetric_rotation_loss: int = 0 + gripper_loss_coeff: float = 1.0 + label_smoothing: float = 0.0 + regress_position_offset: int = 0 + + # Ghost points + num_sampling_level: int = 3 + fine_sampling_ball_diameter: float = 0.16 + weight_tying: int = 1 + gp_emb_tying: int = 1 + num_ghost_points: int = 1000 + num_ghost_points_val: int = 10000 + use_ground_truth_position_for_sampling_train: int = 1 # considerably speeds up training + + # Model + action_dim: int = 8 # xyz Cartesian coordinate + xyzw Quaternion + 1 gripper open + backbone: str = "clip" # one of "resnet", "clip" + embedding_dim: int = 120 + num_ghost_point_cross_attn_layers: int = 2 + num_query_cross_attn_layers: int = 2 + num_vis_ins_attn_layers: int = 2 + rotation_parametrization: str = "quat_from_query" + use_instruction: int = 0 + + +class TrainTester(BaseTrainTester): + """Train/test a keypose optimization algorithm.""" + + def __init__(self, args): + """Initialize.""" + 
super().__init__(args) + + def get_datasets(self): + """Initialize datasets.""" + # Load instruction, based on which we load tasks/variations + instruction = load_instructions( + self.args.instructions, + tasks=self.args.tasks, + variations=tuple(i for i in range(self.args.variations)) + ) + if instruction is None: + raise NotImplementedError() + else: + taskvar = [ + (task, var) + for task, var_instr in instruction.items() + for var in var_instr.keys() + ] + + # Initialize datasets with arguments + train_dataset = RLBenchDataset( + root=self.args.dataset, + instructions=instruction, + taskvar=taskvar, + max_episode_length=self.args.max_episode_length, + cache_size=self.args.cache_size, + max_episodes_per_task=self.args.max_episodes_per_task, + num_iters=self.args.train_iters, + cameras=self.args.cameras, + training=True, + image_rescale=tuple( + float(x) for x in self.args.image_rescale.split(",") + ), + return_low_lvl_trajectory=False, + dense_interpolation=False, + interpolation_length=0 + ) + test_dataset = RLBenchDataset( + root=self.args.valset, + instructions=instruction, + taskvar=taskvar, + max_episode_length=self.args.max_episode_length, + cache_size=self.args.cache_size_val, + max_episodes_per_task=self.args.max_episodes_per_task, + cameras=self.args.cameras, + training=False, + image_rescale=tuple( + float(x) for x in self.args.image_rescale.split(",") + ), + return_low_lvl_trajectory=False, + dense_interpolation=False, + interpolation_length=0 + ) + return train_dataset, test_dataset + + def get_model(self): + """Initialize the model.""" + # Initialize model with arguments + args = self.args + _model = EquAct( + backbone=args.backbone, + image_size=tuple(int(x) for x in args.image_size.split(",")), + embedding_dim=args.embedding_dim, + num_ghost_point_cross_attn_layers=args.num_ghost_point_cross_attn_layers, + num_query_cross_attn_layers=args.num_query_cross_attn_layers, + num_vis_ins_attn_layers=args.num_vis_ins_attn_layers, + rotation_parametrization=args.rotation_parametrization, + gripper_loc_bounds=self.args.gripper_loc_bounds, + num_ghost_points=args.num_ghost_points, + num_ghost_points_val=args.num_ghost_points_val, + weight_tying=bool(args.weight_tying), + gp_emb_tying=bool(args.gp_emb_tying), + num_sampling_level=args.num_sampling_level, + fine_sampling_ball_diameter=args.fine_sampling_ball_diameter, + regress_position_offset=bool(args.regress_position_offset), + use_instruction=bool(args.use_instruction), + trans_temperature=args.trans_temperature + ) + print("Model parameters:", count_parameters(_model)) + + return _model + + def get_criterion(self): + args = self.args + return LossAndMetrics( + rotation_parametrization=args.rotation_parametrization, + position_loss=args.position_loss, + compute_loss_at_all_layers=bool(args.compute_loss_at_all_layers), + ground_truth_gaussian_spread=args.trans_temperature, + label_smoothing=args.label_smoothing, + position_loss_coeff=args.position_loss_coeff, + position_offset_loss_coeff=args.position_offset_loss_coeff, + rotation_loss_coeff=args.rotation_loss_coeff, + gripper_loss_coeff=args.gripper_loss_coeff, + symmetric_rotation_loss=bool(args.symmetric_rotation_loss) + ) + + def train_one_step(self, model, criterion, optimizer, step_id, sample): + """Run a single training step.""" + if step_id % self.args.accumulate_grad_batches == 0: + optimizer.zero_grad() + + # Forward pass + info = model( + sample["rgbs"], + sample["pcds"], + sample["instr"], + sample["curr_gripper"], + # Provide ground-truth action to bias ghost point 
sampling at training time + gt_action=sample["action"] if self.args.use_ground_truth_position_for_sampling_train else None + ) + + # Backward pass + loss = info['loss'] + loss = sum(list(loss.values())) + loss.backward() + + # Update + if step_id % self.args.accumulate_grad_batches == self.args.accumulate_grad_batches - 1: + optimizer.step() + + # Log + if dist.get_rank() == 0 and (step_id + 1) % self.args.val_freq == 0: + self.writer.add_scalar("lr", self.args.lr, step_id) + self.writer.add_scalar("train-loss/total_loss", loss, step_id) + + @torch.no_grad() + def evaluate_nsteps(self, model, criterion, loader, step_id, val_iters, + split='val'): + """Run a given number of evaluation steps.""" + values = {} + device = next(model.parameters()).device + model.eval() + + for i, sample in enumerate(loader): + if i == val_iters: + break + + info = model( + sample["rgbs"], + sample["pcds"], + sample["instr"], + sample["curr_gripper"], + # DO NOT provide ground-truth action to sample ghost points at validation time + gt_action=None + ) + losses = criterion.compute_metrics( + info, + sample + ) + + # Gather global statistics + for n, l in losses.items(): + key = f"{split}-losses/{n}" + if key not in values: + values[key] = torch.Tensor([]).to(device) + values[key] = torch.cat([values[key], l.unsqueeze(0)]) + + # Log all statistics + values = { + k: torch.as_tensor(v).mean().item() for k, v in values.items() + } + if dist.get_rank() == 0: + for key, val in values.items(): + self.writer.add_scalar(key, val, step_id) + + # Also log to terminal + print(f"Step {step_id}:") + for key, value in values.items(): + print(f"{key}: {value:.03f}") + + return values.get('val-losses/action_mse', None) + + +def keypose_collate_fn(batch): + # Unfold multi-step demos to form a longer batch + keys = ["rgbs", "pcds", "curr_gripper", "action", "instr"] + ret_dict = {key: torch.cat([item[key] for item in batch]) for key in keys} + + ret_dict["task"] = [] + for item in batch: + ret_dict["task"] += item['task'] + return ret_dict + + +class LossAndMetrics: + """ + Each method expects two dictionaries: + - pred: { + 'position': (B, 3) gripper position, + 'rotation': (B, 4) gripper rotation, + 'gripper': (B, 1) whether gripper should open/close (0/1), + 'position_pyramid': list of 3 elements, (B, 1, 3) interm gripper pos, + 'visible_rgb_mask_pyramid': not used in loss, + 'ghost_pcd_masks_pyramid', + 'ghost_pcd_pyramid', + 'fine_ghost_pcd_offsets', + 'task' + } + - sample: { + 'frame_id', + 'task_id', + 'task', + 'variation', + 'rgbs', + 'pcds', + 'action': (B, 1, 8), + 'padding_mask': (B, 1), + 'instr', + 'gripper' + } + """ + def __init__( + self, + position_loss, + rotation_parametrization, + ground_truth_gaussian_spread, + compute_loss_at_all_layers=False, + label_smoothing=0.0, + position_loss_coeff=1.0, + position_offset_loss_coeff=10000.0, + rotation_loss_coeff=10.0, + gripper_loss_coeff=1.0, + symmetric_rotation_loss=False, + ): + assert position_loss in ["mse", "ce", "ce+mse"] + assert rotation_parametrization in [ + "quat_from_top_ghost", "quat_from_query", + "6D_from_top_ghost", "6D_from_query" + ] + self.position_loss = position_loss + self.rotation_parametrization = rotation_parametrization + self.compute_loss_at_all_layers = compute_loss_at_all_layers + self.ground_truth_gaussian_spread = ground_truth_gaussian_spread + self.label_smoothing = label_smoothing + self.position_loss_coeff = position_loss_coeff + self.position_offset_loss_coeff = position_offset_loss_coeff + self.rotation_loss_coeff = 
rotation_loss_coeff + self.gripper_loss_coeff = gripper_loss_coeff + self.symmetric_rotation_loss = symmetric_rotation_loss + + def compute_loss(self, pred, sample): + device = pred["position"].device + # padding_mask = sample["padding_mask"].to(device) + gt_action = sample["action"].to(device) # [padding_mask] + + losses = {} + + self._compute_position_loss(pred, gt_action[:, :3], losses) + + self._compute_rotation_loss(pred, gt_action[:, 3:7], losses) + + losses["gripper"] = F.binary_cross_entropy(pred["gripper"], gt_action[:, 7:8]) + losses["gripper"] *= self.gripper_loss_coeff + + return losses + + def _compute_rotation_loss(self, pred, gt_quat, losses): + if "quat" in self.rotation_parametrization: + if self.symmetric_rotation_loss: + gt_quat_ = -gt_quat.clone() + quat_loss = F.mse_loss(pred["rotation"], gt_quat, reduction='none').mean(1) + quat_loss_ = F.mse_loss(pred["rotation"], gt_quat_, reduction='none').mean(1) + select_mask = (quat_loss < quat_loss_).float() + losses['rotation'] = (select_mask * quat_loss + (1 - select_mask) * quat_loss_).mean() + else: + losses["rotation"] = F.mse_loss(pred["rotation"], gt_quat) + + losses["rotation"] *= self.rotation_loss_coeff + + def _compute_position_loss(self, pred, gt_position, losses): + if self.position_loss == "mse": + # Only used for original HiveFormer + losses["position_mse"] = F.mse_loss(pred["position"], gt_position) * self.position_loss_coeff + + elif self.position_loss in ["ce", "ce+mse"]: + # Select a normalized Gaussian ball around the ground-truth + # as a proxy label for a soft cross-entropy loss + l2_pyramid = [] + label_pyramid = [] + for ghost_pcd_i in pred['ghost_pcd_pyramid']: + l2_i = ((ghost_pcd_i - gt_position.unsqueeze(-1)) ** 2).sum(1).sqrt() + label_i = torch.softmax(-l2_i / self.ground_truth_gaussian_spread, dim=-1).detach() + l2_pyramid.append(l2_i) + label_pyramid.append(label_i) + + loss_layers = range(len(pred['ghost_pcd_masks_pyramid'][0])) if self.compute_loss_at_all_layers else [-1] + + for j in loss_layers: + for i, ghost_pcd_masks_i in enumerate(pred["ghost_pcd_masks_pyramid"]): + losses[f"position_ce_level{i}"] = F.cross_entropy( + ghost_pcd_masks_i[j], label_pyramid[i], + label_smoothing=self.label_smoothing + ).mean() * self.position_loss_coeff / len(pred["ghost_pcd_masks_pyramid"]) + + # Supervise offset from the ghost point's position to the predicted position + num_sampling_level = len(pred['ghost_pcd_masks_pyramid']) + if pred.get("fine_ghost_pcd_offsets") is not None: + if pred["ghost_pcd_pyramid"][-1].shape[-1] != pred["ghost_pcd_pyramid"][0].shape[-1]: + npts = pred["ghost_pcd_pyramid"][-1].shape[-1] // num_sampling_level + pred_with_offset = (pred["ghost_pcd_pyramid"][-1] + pred["fine_ghost_pcd_offsets"])[:, :, -npts:] + else: + pred_with_offset = (pred["ghost_pcd_pyramid"][-1] + pred["fine_ghost_pcd_offsets"]) + losses["position_offset"] = F.mse_loss( + pred_with_offset, + gt_position.unsqueeze(-1).repeat(1, 1, pred_with_offset.shape[-1]) + ) + losses["position_offset"] *= (self.position_offset_loss_coeff * self.position_loss_coeff) + + if self.position_loss == "ce": + # Clear gradient on pred["position"] to avoid a memory leak since we don't + # use it in the loss + pred["position"] = pred["position"].detach() + else: + losses["position_mse"] = ( + F.mse_loss(pred["position"], gt_position) + * self.position_loss_coeff + ) + + def compute_metrics(self, pred, sample): + device = pred["position"].device + dtype = pred["position"].dtype + # padding_mask = sample["padding_mask"].to(device) + 
outputs = sample["action"].to(device) # [padding_mask] + + metrics = {} + + tasks = np.array(sample["task"]) + + final_pos_l2 = ((pred["position"] - outputs[:, :3]) ** 2).sum(1).sqrt() + metrics["mean/pos_l2_final"] = final_pos_l2.to(dtype).mean() + metrics["mean/pos_l2_final<0.01"] = (final_pos_l2 < 0.01).to(dtype).mean() + + for i in range(len(pred["position_pyramid"])): + pos_l2_i = ((pred["position_pyramid"][i].squeeze(1) - outputs[:, :3]) ** 2).sum(1).sqrt() + metrics[f"mean/pos_l2_level{i}"] = pos_l2_i.to(dtype).mean() + + for task in np.unique(tasks): + task_l2 = final_pos_l2[tasks == task] + metrics[f"{task}/pos_l2_final"] = task_l2.to(dtype).mean() + metrics[f"{task}/pos_l2_final<0.01"] = (task_l2 < 0.01).to(dtype).mean() + + # Gripper accuracy + pred_gripper = (pred["gripper"] > 0.5).squeeze(-1) + true_gripper = outputs[:, 7].bool() + acc = pred_gripper == true_gripper + metrics["gripper"] = acc.to(dtype).mean() + + # Rotation accuracy + gt_quat = outputs[:, 3:7] + if "quat" in self.rotation_parametrization: + if self.symmetric_rotation_loss: + gt_quat_ = -gt_quat.clone() + l1 = (pred["rotation"] - gt_quat).abs().sum(1) + l1_ = (pred["rotation"] - gt_quat_).abs().sum(1) + select_mask = (l1 < l1_).float() + l1 = (select_mask * l1 + (1 - select_mask) * l1_) + else: + l1 = ((pred["rotation"] - gt_quat).abs().sum(1)) + + metrics["mean/rot_l1"] = l1.to(dtype).mean() + metrics["mean/rot_l1<0.05"] = (l1 < 0.05).to(dtype).mean() + metrics["mean/rot_l1<0.025"] = (l1 < 0.025).to(dtype).mean() + + for task in np.unique(tasks): + task_l1 = l1[tasks == task] + metrics[f"{task}/rot_l1"] = task_l1.to(dtype).mean() + metrics[f"{task}/rot_l1<0.05"] = (task_l1 < 0.05).to(dtype).mean() + metrics[f"{task}/rot_l1<0.025"] = (task_l1 < 0.025).to(dtype).mean() + + return metrics + + +if __name__ == '__main__': + os.environ["TOKENIZERS_PARALLELISM"] = "false" + # Arguments + args = Arguments().parse_args() + print("Arguments:") + print(args) + print("-" * 100) + if args.gripper_loc_bounds is None: + args.gripper_loc_bounds = np.array([[-2, -2, -2], [2, 2, 2]]) * 1.0 + else: + args.gripper_loc_bounds = get_gripper_loc_bounds( + args.gripper_loc_bounds, + task=args.tasks[0] if len(args.tasks) == 1 else None, + buffer=args.gripper_loc_bounds_buffer + ) + log_dir = args.base_log_dir / args.exp_log_dir / args.run_log_dir + args.log_dir = log_dir + log_dir.mkdir(exist_ok=True, parents=True) + print("Logging:", log_dir) + print( + "Available devices (CUDA_VISIBLE_DEVICES):", + os.environ.get("CUDA_VISIBLE_DEVICES") + ) + print("Device count", torch.cuda.device_count()) + args.local_rank = int(os.environ["LOCAL_RANK"]) + + # Seeds + torch.manual_seed(args.seed) + np.random.seed(args.seed) + random.seed(args.seed) + + # DDP initialization + torch.cuda.set_device(args.local_rank) + torch.distributed.init_process_group(backend='nccl', init_method='env://') + torch.backends.cudnn.enabled = True + torch.backends.cudnn.benchmark = True + torch.backends.cudnn.deterministic = True + + # Run + train_tester = TrainTester(args) + train_tester.main(collate_fn=keypose_collate_fn) diff --git a/main_keypose.py b/main_keypose.py new file mode 100644 index 0000000..f049d5e --- /dev/null +++ b/main_keypose.py @@ -0,0 +1,515 @@ +"""Main script for keypose optimization.""" + +import os +from pathlib import Path +import random +from typing import Tuple, Optional + +import numpy as np +import tap +import torch +import torch.distributed as dist +from torch.nn import functional as F + +from datasets.dataset_engine import RLBenchDataset 
+from engine import BaseTrainTester +from diffuser_actor import Act3D +from utils.common_utils import ( + load_instructions, count_parameters, get_gripper_loc_bounds +) + + +class Arguments(tap.Tap): + # cameras: Tuple[str, ...] = ("wrist", "left_shoulder", "right_shoulder") + cameras: Tuple[str, ...] = ("left_shoulder", "right_shoulder", "wrist", "front") + image_size: str = "256,256" + max_episodes_per_task: int = 100 + instructions: Optional[Path] = "instructions.pkl" + seed: int = 0 + # tasks: Tuple[str, ...] = ('place_cups', 'close_jar', 'insert_onto_square_peg', 'light_bulb_in', 'meat_off_grill', + # 'open_drawer', 'place_shape_in_shape_sorter', 'place_wine_at_rack_location', + # 'push_buttons', 'put_groceries_in_cupboard', 'put_item_in_drawer', 'put_money_in_safe', + # 'reach_and_drag', 'slide_block_to_color_target', 'stack_blocks', 'stack_cups', + # 'sweep_to_dustpan_of_size', 'turn_tap') + tasks: Tuple[str, ...] = ('insert_onto_square_peg') + variations: int = 0 + checkpoint: Optional[Path] = None + accumulate_grad_batches: int = 1 + val_freq: int = 500 + gripper_loc_bounds: Optional[str] = None + gripper_loc_bounds_buffer: float = 0.04 + eval_only: int = 0 + + # Training and validation datasets + dataset: Path + valset: Path + + # Logging to base_log_dir/exp_log_dir/run_log_dir + base_log_dir: Path = Path(__file__).parent / "train_logs" + exp_log_dir: str = "exp" + run_log_dir: str = "run" + + # Main training parameters + num_workers: int = 1 + batch_size: int = 16 + batch_size_val: int = 4 + cache_size: int = 100 + cache_size_val: int = 100 + lr: float = 1e-4 + train_iters: int = 200_000 + # max_episode_length: int = 5 # -1 for no limit + max_episode_length: int = 1 # -1 for no limit + + # Data augmentations + image_rescale: str = "0.75,1.25" # (min, max), "1.0,1.0" for no rescaling + + # Loss + position_loss: str = "ce" # one of "ce" (our model), "mse" (HiveFormer) + ground_truth_gaussian_spread: float = 0.01 + compute_loss_at_all_layers: int = 0 + position_loss_coeff: float = 1.0 + position_offset_loss_coeff: float = 10000.0 + rotation_loss_coeff: float = 10.0 + symmetric_rotation_loss: int = 0 + gripper_loss_coeff: float = 1.0 + label_smoothing: float = 0.0 + regress_position_offset: int = 0 + + # Ghost points + num_sampling_level: int = 3 + fine_sampling_ball_diameter: float = 0.16 + weight_tying: int = 1 + gp_emb_tying: int = 1 + num_ghost_points: int = 1000 + num_ghost_points_val: int = 10000 + use_ground_truth_position_for_sampling_train: int = 1 # considerably speeds up training + + # Model + action_dim: int = 8 + backbone: str = "clip" # one of "resnet", "clip" + embedding_dim: int = 120 + num_ghost_point_cross_attn_layers: int = 2 + num_query_cross_attn_layers: int = 2 + num_vis_ins_attn_layers: int = 2 + rotation_parametrization: str = "quat_from_query" + use_instruction: int = 0 + + +class TrainTester(BaseTrainTester): + """Train/test a keypose optimization algorithm.""" + + def __init__(self, args): + """Initialize.""" + super().__init__(args) + + def get_datasets(self): + """Initialize datasets.""" + # Load instruction, based on which we load tasks/variations + instruction = load_instructions( + self.args.instructions, + tasks=self.args.tasks, + variations=tuple(i for i in range(self.args.variations)) + ) + if instruction is None: + raise NotImplementedError() + else: + taskvar = [ + (task, var) + for task, var_instr in instruction.items() + for var in var_instr.keys() + ] + + # Initialize datasets with arguments + train_dataset = RLBenchDataset( + 
root=self.args.dataset, + instructions=instruction, + taskvar=taskvar, + max_episode_length=self.args.max_episode_length, + cache_size=self.args.cache_size, + max_episodes_per_task=self.args.max_episodes_per_task, + num_iters=self.args.train_iters, + cameras=self.args.cameras, + training=True, + image_rescale=tuple( + float(x) for x in self.args.image_rescale.split(",") + ), + return_low_lvl_trajectory=False, + dense_interpolation=False, + interpolation_length=0 + ) + test_dataset = RLBenchDataset( + root=self.args.valset, + instructions=instruction, + taskvar=taskvar, + max_episode_length=self.args.max_episode_length, + cache_size=self.args.cache_size_val, + max_episodes_per_task=self.args.max_episodes_per_task, + cameras=self.args.cameras, + training=False, + image_rescale=tuple( + float(x) for x in self.args.image_rescale.split(",") + ), + return_low_lvl_trajectory=False, + dense_interpolation=False, + interpolation_length=0 + ) + return train_dataset, test_dataset + + def get_model(self): + """Initialize the model.""" + # Initialize model with arguments + args = self.args + _model = Act3D( + backbone=args.backbone, + image_size=tuple(int(x) for x in args.image_size.split(",")), + embedding_dim=args.embedding_dim, + num_ghost_point_cross_attn_layers=args.num_ghost_point_cross_attn_layers, + num_query_cross_attn_layers=args.num_query_cross_attn_layers, + num_vis_ins_attn_layers=args.num_vis_ins_attn_layers, + rotation_parametrization=args.rotation_parametrization, + gripper_loc_bounds=self.args.gripper_loc_bounds, + num_ghost_points=args.num_ghost_points, + num_ghost_points_val=args.num_ghost_points_val, + weight_tying=bool(args.weight_tying), + gp_emb_tying=bool(args.gp_emb_tying), + num_sampling_level=args.num_sampling_level, + fine_sampling_ball_diameter=args.fine_sampling_ball_diameter, + regress_position_offset=bool(args.regress_position_offset), + use_instruction=bool(args.use_instruction) + ) + print("Model parameters:", count_parameters(_model)) + + return _model + + def get_criterion(self): + args = self.args + return LossAndMetrics( + rotation_parametrization=args.rotation_parametrization, + position_loss=args.position_loss, + compute_loss_at_all_layers=bool(args.compute_loss_at_all_layers), + ground_truth_gaussian_spread=args.ground_truth_gaussian_spread, + label_smoothing=args.label_smoothing, + position_loss_coeff=args.position_loss_coeff, + position_offset_loss_coeff=args.position_offset_loss_coeff, + rotation_loss_coeff=args.rotation_loss_coeff, + gripper_loss_coeff=args.gripper_loss_coeff, + symmetric_rotation_loss=bool(args.symmetric_rotation_loss) + ) + + def train_one_step(self, model, criterion, optimizer, step_id, sample): + """Run a single training step.""" + if step_id % self.args.accumulate_grad_batches == 0: + optimizer.zero_grad() + + # Forward pass + out = model( + sample["rgbs"], + sample["pcds"], + sample["instr"], + sample["curr_gripper"], + # Provide ground-truth action to bias ghost point sampling at training time + gt_action=sample["action"] if self.args.use_ground_truth_position_for_sampling_train else None + ) + + # Backward pass + loss = criterion.compute_loss(out, sample) + loss = sum(list(loss.values())) + loss.backward() + + # Update + if step_id % self.args.accumulate_grad_batches == self.args.accumulate_grad_batches - 1: + optimizer.step() + + # Log + if dist.get_rank() == 0 and (step_id + 1) % self.args.val_freq == 0: + self.writer.add_scalar("lr", self.args.lr, step_id) + self.writer.add_scalar("train-loss/noise_mse", loss, step_id) + + 
@torch.no_grad() + def evaluate_nsteps(self, model, criterion, loader, step_id, val_iters, + split='val'): + """Run a given number of evaluation steps.""" + values = {} + device = next(model.parameters()).device + model.eval() + + for i, sample in enumerate(loader): + if i == val_iters: + break + + action = model( + sample["rgbs"], + sample["pcds"], + sample["instr"], + sample["curr_gripper"], + # DO NOT provide ground-truth action to sample ghost points at validation time + gt_action=None + ) + losses = criterion.compute_metrics( + action, + sample + ) + + # Gather global statistics + for n, l in losses.items(): + key = f"{split}-losses/{n}" + if key not in values: + values[key] = torch.Tensor([]).to(device) + values[key] = torch.cat([values[key], l.unsqueeze(0)]) + + # Log all statistics + values = { + k: torch.as_tensor(v).mean().item() for k, v in values.items() + } + if dist.get_rank() == 0: + for key, val in values.items(): + self.writer.add_scalar(key, val, step_id) + + # Also log to terminal + print(f"Step {step_id}:") + for key, value in values.items(): + print(f"{key}: {value:.03f}") + + return values.get('val-losses/action_mse', None) + + +def keypose_collate_fn(batch): + # Unfold multi-step demos to form a longer batch + keys = ["rgbs", "pcds", "curr_gripper", "action", "instr"] + ret_dict = {key: torch.cat([item[key] for item in batch]) for key in keys} + + ret_dict["task"] = [] + for item in batch: + ret_dict["task"] += item['task'] + return ret_dict + + +class LossAndMetrics: + """ + Each method expects two dictionaries: + - pred: { + 'position': (B, 3) gripper position, + 'rotation': (B, 4) gripper rotation, + 'gripper': (B, 1) whether gripper should open/close (0/1), + 'position_pyramid': list of 3 elements, (B, 1, 3) interm gripper pos, + 'visible_rgb_mask_pyramid': not used in loss, + 'ghost_pcd_masks_pyramid', + 'ghost_pcd_pyramid', + 'fine_ghost_pcd_offsets', + 'task' + } + - sample: { + 'frame_id', + 'task_id', + 'task', + 'variation', + 'rgbs', + 'pcds', + 'action': (B, 1, 8), + 'padding_mask': (B, 1), + 'instr', + 'gripper' + } + """ + def __init__( + self, + position_loss, + rotation_parametrization, + ground_truth_gaussian_spread, + compute_loss_at_all_layers=False, + label_smoothing=0.0, + position_loss_coeff=1.0, + position_offset_loss_coeff=10000.0, + rotation_loss_coeff=10.0, + gripper_loss_coeff=1.0, + symmetric_rotation_loss=False, + ): + assert position_loss in ["mse", "ce", "ce+mse"] + assert rotation_parametrization in [ + "quat_from_top_ghost", "quat_from_query", + "6D_from_top_ghost", "6D_from_query" + ] + self.position_loss = position_loss + self.rotation_parametrization = rotation_parametrization + self.compute_loss_at_all_layers = compute_loss_at_all_layers + self.ground_truth_gaussian_spread = ground_truth_gaussian_spread + self.label_smoothing = label_smoothing + self.position_loss_coeff = position_loss_coeff + self.position_offset_loss_coeff = position_offset_loss_coeff + self.rotation_loss_coeff = rotation_loss_coeff + self.gripper_loss_coeff = gripper_loss_coeff + self.symmetric_rotation_loss = symmetric_rotation_loss + + def compute_loss(self, pred, sample): + device = pred["position"].device + # padding_mask = sample["padding_mask"].to(device) + gt_action = sample["action"].to(device) # [padding_mask] + + losses = {} + + self._compute_position_loss(pred, gt_action[:, :3], losses) + + self._compute_rotation_loss(pred, gt_action[:, 3:7], losses) + + losses["gripper"] = F.binary_cross_entropy(pred["gripper"], gt_action[:, 7:8]) + 
losses["gripper"] *= self.gripper_loss_coeff + + return losses + + def _compute_rotation_loss(self, pred, gt_quat, losses): + if "quat" in self.rotation_parametrization: + if self.symmetric_rotation_loss: + gt_quat_ = -gt_quat.clone() + quat_loss = F.mse_loss(pred["rotation"], gt_quat, reduction='none').mean(1) + quat_loss_ = F.mse_loss(pred["rotation"], gt_quat_, reduction='none').mean(1) + select_mask = (quat_loss < quat_loss_).float() + losses['rotation'] = (select_mask * quat_loss + (1 - select_mask) * quat_loss_).mean() + else: + losses["rotation"] = F.mse_loss(pred["rotation"], gt_quat) + + losses["rotation"] *= self.rotation_loss_coeff + + def _compute_position_loss(self, pred, gt_position, losses): + if self.position_loss == "mse": + # Only used for original HiveFormer + losses["position_mse"] = F.mse_loss(pred["position"], gt_position) * self.position_loss_coeff + + elif self.position_loss in ["ce", "ce+mse"]: + # Select a normalized Gaussian ball around the ground-truth + # as a proxy label for a soft cross-entropy loss + l2_pyramid = [] + label_pyramid = [] + for ghost_pcd_i in pred['ghost_pcd_pyramid']: + l2_i = ((ghost_pcd_i - gt_position.unsqueeze(-1)) ** 2).sum(1).sqrt() + label_i = torch.softmax(-l2_i / self.ground_truth_gaussian_spread, dim=-1).detach() + l2_pyramid.append(l2_i) + label_pyramid.append(label_i) + + loss_layers = range(len(pred['ghost_pcd_masks_pyramid'][0])) if self.compute_loss_at_all_layers else [-1] + + for j in loss_layers: + for i, ghost_pcd_masks_i in enumerate(pred["ghost_pcd_masks_pyramid"]): + losses[f"position_ce_level{i}"] = F.cross_entropy( + ghost_pcd_masks_i[j], label_pyramid[i], + label_smoothing=self.label_smoothing + ).mean() * self.position_loss_coeff / len(pred["ghost_pcd_masks_pyramid"]) + + # Supervise offset from the ghost point's position to the predicted position + num_sampling_level = len(pred['ghost_pcd_masks_pyramid']) + if pred.get("fine_ghost_pcd_offsets") is not None: + if pred["ghost_pcd_pyramid"][-1].shape[-1] != pred["ghost_pcd_pyramid"][0].shape[-1]: + npts = pred["ghost_pcd_pyramid"][-1].shape[-1] // num_sampling_level + pred_with_offset = (pred["ghost_pcd_pyramid"][-1] + pred["fine_ghost_pcd_offsets"])[:, :, -npts:] + else: + pred_with_offset = (pred["ghost_pcd_pyramid"][-1] + pred["fine_ghost_pcd_offsets"]) + losses["position_offset"] = F.mse_loss( + pred_with_offset, + gt_position.unsqueeze(-1).repeat(1, 1, pred_with_offset.shape[-1]) + ) + losses["position_offset"] *= (self.position_offset_loss_coeff * self.position_loss_coeff) + + if self.position_loss == "ce": + # Clear gradient on pred["position"] to avoid a memory leak since we don't + # use it in the loss + pred["position"] = pred["position"].detach() + else: + losses["position_mse"] = ( + F.mse_loss(pred["position"], gt_position) + * self.position_loss_coeff + ) + + def compute_metrics(self, pred, sample): + device = pred["position"].device + dtype = pred["position"].dtype + # padding_mask = sample["padding_mask"].to(device) + outputs = sample["action"].to(device) # [padding_mask] + + metrics = {} + + tasks = np.array(sample["task"]) + + final_pos_l2 = ((pred["position"] - outputs[:, :3]) ** 2).sum(1).sqrt() + metrics["mean/pos_l2_final"] = final_pos_l2.to(dtype).mean() + metrics["mean/pos_l2_final<0.01"] = (final_pos_l2 < 0.01).to(dtype).mean() + + for i in range(len(pred["position_pyramid"])): + pos_l2_i = ((pred["position_pyramid"][i].squeeze(1) - outputs[:, :3]) ** 2).sum(1).sqrt() + metrics[f"mean/pos_l2_level{i}"] = pos_l2_i.to(dtype).mean() + + for task in 
np.unique(tasks): + task_l2 = final_pos_l2[tasks == task] + metrics[f"{task}/pos_l2_final"] = task_l2.to(dtype).mean() + metrics[f"{task}/pos_l2_final<0.01"] = (task_l2 < 0.01).to(dtype).mean() + + # Gripper accuracy + pred_gripper = (pred["gripper"] > 0.5).squeeze(-1) + true_gripper = outputs[:, 7].bool() + acc = pred_gripper == true_gripper + metrics["gripper"] = acc.to(dtype).mean() + + # Rotation accuracy + gt_quat = outputs[:, 3:7] + if "quat" in self.rotation_parametrization: + if self.symmetric_rotation_loss: + gt_quat_ = -gt_quat.clone() + l1 = (pred["rotation"] - gt_quat).abs().sum(1) + l1_ = (pred["rotation"] - gt_quat_).abs().sum(1) + select_mask = (l1 < l1_).float() + l1 = (select_mask * l1 + (1 - select_mask) * l1_) + else: + l1 = ((pred["rotation"] - gt_quat).abs().sum(1)) + + metrics["mean/rot_l1"] = l1.to(dtype).mean() + metrics["mean/rot_l1<0.05"] = (l1 < 0.05).to(dtype).mean() + metrics["mean/rot_l1<0.025"] = (l1 < 0.025).to(dtype).mean() + + for task in np.unique(tasks): + task_l1 = l1[tasks == task] + metrics[f"{task}/rot_l1"] = task_l1.to(dtype).mean() + metrics[f"{task}/rot_l1<0.05"] = (task_l1 < 0.05).to(dtype).mean() + metrics[f"{task}/rot_l1<0.025"] = (task_l1 < 0.025).to(dtype).mean() + + return metrics + + +if __name__ == '__main__': + os.environ["TOKENIZERS_PARALLELISM"] = "false" + # Arguments + args = Arguments().parse_args() + print("Arguments:") + print(args) + print("-" * 100) + if args.gripper_loc_bounds is None: + args.gripper_loc_bounds = np.array([[-2, -2, -2], [2, 2, 2]]) * 1.0 + else: + args.gripper_loc_bounds = get_gripper_loc_bounds( + args.gripper_loc_bounds, + task=args.tasks[0] if len(args.tasks) == 1 else None, + buffer=args.gripper_loc_bounds_buffer + ) + log_dir = args.base_log_dir / args.exp_log_dir / args.run_log_dir + args.log_dir = log_dir + log_dir.mkdir(exist_ok=True, parents=True) + print("Logging:", log_dir) + print( + "Available devices (CUDA_VISIBLE_DEVICES):", + os.environ.get("CUDA_VISIBLE_DEVICES") + ) + print("Device count", torch.cuda.device_count()) + args.local_rank = int(os.environ["LOCAL_RANK"]) + + # Seeds + torch.manual_seed(args.seed) + np.random.seed(args.seed) + random.seed(args.seed) + + # DDP initialization + torch.cuda.set_device(args.local_rank) + torch.distributed.init_process_group(backend='nccl', init_method='env://') + torch.backends.cudnn.enabled = True + torch.backends.cudnn.benchmark = True + torch.backends.cudnn.deterministic = True + + # Run + train_tester = TrainTester(args) + train_tester.main(collate_fn=keypose_collate_fn) diff --git a/main_trajectory.py b/main_trajectory.py new file mode 100644 index 0000000..685b8d1 --- /dev/null +++ b/main_trajectory.py @@ -0,0 +1,470 @@ +"""Main script for trajectory optimization.""" + +import io +import os +from pathlib import Path +import random +from typing import Tuple, Optional, Union + +import cv2 +from matplotlib import pyplot as plt +import numpy as np +import tap +import torch +import torch.distributed as dist +from torch.nn import functional as F + +from datasets.dataset_engine import RLBenchDataset +from engine import BaseTrainTester +from diffuser_actor import DiffuserActor + +from utils.common_utils import ( + load_instructions, count_parameters, get_gripper_loc_bounds +) + + +class Arguments(tap.Tap): + # cameras: Tuple[str, ...] = ("wrist", "left_shoulder", "right_shoulder") + cameras: Tuple[str, ...] 
= ("left_shoulder", "right_shoulder", "wrist", "front") + image_size: str = "256,256" + max_episodes_per_task: int = 100 + instructions: Optional[Path] = "instructions.pkl" + seed: int = 0 + # tasks: Tuple[str, ...] = ('place_cups', 'close_jar', 'insert_onto_square_peg', 'light_bulb_in', 'meat_off_grill', + # 'open_drawer', 'place_shape_in_shape_sorter', 'place_wine_at_rack_location', + # 'push_buttons', 'put_groceries_in_cupboard', 'put_item_in_drawer', 'put_money_in_safe', + # 'reach_and_drag', 'slide_block_to_color_target', 'stack_blocks', 'stack_cups', + # 'sweep_to_dustpan_of_size', 'turn_tap') + tasks: Tuple[str, ...] = ('insert_onto_square_peg') + variations: int = 0 + checkpoint: Optional[Path] = None + accumulate_grad_batches: int = 1 + val_freq: int = 500 + gripper_loc_bounds: Optional[str] = None + gripper_loc_bounds_buffer: float = 0.04 + eval_only: int = 0 + + # Training and validation datasets + dataset: Path + valset: Path + dense_interpolation: int = 0 + interpolation_length: int = 100 + + # Logging to base_log_dir/exp_log_dir/run_log_dir + base_log_dir: Path = Path(__file__).parent / "train_logs" + exp_log_dir: str = "exp" + run_log_dir: str = "run" + + # Main training parameters + num_workers: int = 1 + batch_size: int = 16 + batch_size_val: int = 4 + cache_size: int = 100 + cache_size_val: int = 100 + lr: float = 1e-4 + wd: float = 5e-3 # used only for CALVIN + train_iters: int = 200_000 + val_iters: int = -1 # -1 means heuristically-defined + max_episode_length: int = 5 # -1 for no limit + + # Data augmentations + image_rescale: str = "0.75,1.25" # (min, max), "1.0,1.0" for no rescaling + + # Model + backbone: str = "clip" # one of "resnet", "clip" + embedding_dim: int = 120 + num_vis_ins_attn_layers: int = 2 + use_instruction: int = 0 + rotation_parametrization: str = 'quat' + quaternion_format: str = 'wxyz' + diffusion_timesteps: int = 100 + keypose_only: int = 0 + num_history: int = 0 + relative_action: int = 0 + lang_enhanced: int = 0 + fps_subsampling_factor: int = 5 + + # # DDP + # master_addr: str = "localhost" + # master_port: int = 29500 + # visible_devices: int = 1 + + +class TrainTester(BaseTrainTester): + """Train/test a trajectory optimization algorithm.""" + + def __init__(self, args): + """Initialize.""" + super().__init__(args) + + def get_datasets(self): + """Initialize datasets.""" + # Load instruction, based on which we load tasks/variations + instruction = load_instructions( + self.args.instructions, + tasks=self.args.tasks, + variations=tuple(i for i in range(self.args.variations)) + ) + if instruction is None: + raise NotImplementedError() + else: + taskvar = [ + (task, var) + for task, var_instr in instruction.items() + for var in var_instr.keys() + ] + + # Initialize datasets with arguments + train_dataset = RLBenchDataset( + root=self.args.dataset, + instructions=instruction, + taskvar=taskvar, + max_episode_length=self.args.max_episode_length, + cache_size=self.args.cache_size, + max_episodes_per_task=self.args.max_episodes_per_task, + num_iters=self.args.train_iters, + cameras=self.args.cameras, + training=True, + image_rescale=tuple( + float(x) for x in self.args.image_rescale.split(",") + ), + return_low_lvl_trajectory=True, + dense_interpolation=bool(self.args.dense_interpolation), + interpolation_length=self.args.interpolation_length + ) + test_dataset = RLBenchDataset( + root=self.args.valset, + instructions=instruction, + taskvar=taskvar, + max_episode_length=self.args.max_episode_length, + cache_size=self.args.cache_size_val, + 
max_episodes_per_task=self.args.max_episodes_per_task, + cameras=self.args.cameras, + training=False, + image_rescale=tuple( + float(x) for x in self.args.image_rescale.split(",") + ), + return_low_lvl_trajectory=True, + dense_interpolation=bool(self.args.dense_interpolation), + interpolation_length=self.args.interpolation_length + ) + return train_dataset, test_dataset + + def get_model(self): + """Initialize the model.""" + # Initialize model with arguments + _model = DiffuserActor( + backbone=self.args.backbone, + image_size=tuple(int(x) for x in self.args.image_size.split(",")), + embedding_dim=self.args.embedding_dim, + num_vis_ins_attn_layers=self.args.num_vis_ins_attn_layers, + use_instruction=bool(self.args.use_instruction), + fps_subsampling_factor=self.args.fps_subsampling_factor, + gripper_loc_bounds=self.args.gripper_loc_bounds, + rotation_parametrization=self.args.rotation_parametrization, + quaternion_format=self.args.quaternion_format, + diffusion_timesteps=self.args.diffusion_timesteps, + nhist=self.args.num_history, + relative=bool(self.args.relative_action), + lang_enhanced=bool(self.args.lang_enhanced) + ) + print("Model parameters:", count_parameters(_model)) + + return _model + + @staticmethod + def get_criterion(): + return TrajectoryCriterion() + + def train_one_step(self, model, criterion, optimizer, step_id, sample): + """Run a single training step.""" + if step_id % self.args.accumulate_grad_batches == 0: + optimizer.zero_grad() + + if self.args.keypose_only: + sample["trajectory"] = sample["trajectory"][:, [-1]] + sample["trajectory_mask"] = sample["trajectory_mask"][:, [-1]] + else: + sample["trajectory"] = sample["trajectory"][:, 1:] + sample["trajectory_mask"] = sample["trajectory_mask"][:, 1:] + + # Forward pass + curr_gripper = ( + sample["curr_gripper"] if self.args.num_history < 1 + else sample["curr_gripper_history"][:, -self.args.num_history:] + ) + out = model( + sample["trajectory"], + sample["trajectory_mask"], + sample["rgbs"], + sample["pcds"], + sample["instr"], + curr_gripper + ) + + # Backward pass + loss = criterion.compute_loss(out) + loss.backward() + + # Update + if step_id % self.args.accumulate_grad_batches == self.args.accumulate_grad_batches - 1: + optimizer.step() + + # Log + if dist.get_rank() == 0 and (step_id + 1) % self.args.val_freq == 0: + self.writer.add_scalar("lr", self.args.lr, step_id) + self.writer.add_scalar("train-loss/noise_mse", loss, step_id) + + @torch.no_grad() + def evaluate_nsteps(self, model, criterion, loader, step_id, val_iters, + split='val'): + """Run a given number of evaluation steps.""" + if self.args.val_iters != -1: + val_iters = self.args.val_iters + values = {} + device = next(model.parameters()).device + model.eval() + + for i, sample in enumerate(loader): + if i == val_iters: + break + + if self.args.keypose_only: + sample["trajectory"] = sample["trajectory"][:, [-1]] + sample["trajectory_mask"] = sample["trajectory_mask"][:, [-1]] + else: + sample["trajectory"] = sample["trajectory"][:, 1:] + sample["trajectory_mask"] = sample["trajectory_mask"][:, 1:] + + curr_gripper = ( + sample["curr_gripper"] if self.args.num_history < 1 + else sample["curr_gripper_history"][:, -self.args.num_history:] + ) + action = model( + sample["trajectory"].to(device), + sample["trajectory_mask"].to(device), + sample["rgbs"].to(device), + sample["pcds"].to(device), + sample["instr"].to(device), + curr_gripper.to(device), + run_inference=True + ) + losses, losses_B = criterion.compute_metrics( + action, + 
sample["trajectory"].to(device), + sample["trajectory_mask"].to(device) + ) + + # Gather global statistics + for n, l in losses.items(): + key = f"{split}-losses/mean/{n}" + if key not in values: + values[key] = torch.Tensor([]).to(device) + values[key] = torch.cat([values[key], l.unsqueeze(0)]) + + # Gather per-task statistics + tasks = np.array(sample["task"]) + for n, l in losses_B.items(): + for task in np.unique(tasks): + key = f"{split}-loss/{task}/{n}" + l_task = l[tasks == task].mean() + if key not in values: + values[key] = torch.Tensor([]).to(device) + values[key] = torch.cat([values[key], l_task.unsqueeze(0)]) + + # Generate visualizations + if i == 0 and dist.get_rank() == 0 and step_id > -1: + viz_key = f'{split}-viz/viz' + viz = generate_visualizations( + action, + sample["trajectory"].to(device), + sample["trajectory_mask"].to(device) + ) + self.writer.add_image(viz_key, viz, step_id) + + # Log all statistics + values = self.synchronize_between_processes(values) + values = {k: v.mean().item() for k, v in values.items()} + if dist.get_rank() == 0: + if step_id > -1: + for key, val in values.items(): + self.writer.add_scalar(key, val, step_id) + + # Also log to terminal + print(f"Step {step_id}:") + for key, value in values.items(): + print(f"{key}: {value:.03f}") + + return values.get('val-losses/traj_pos_acc_001', None) + + +def traj_collate_fn(batch): + keys = [ + "trajectory", "trajectory_mask", + "rgbs", "pcds", + "curr_gripper", "curr_gripper_history", "action", "instr" + ] + ret_dict = { + key: torch.cat([ + item[key].float() if key != 'trajectory_mask' else item[key] + for item in batch + ]) for key in keys + } + + ret_dict["task"] = [] + for item in batch: + ret_dict["task"] += item['task'] + return ret_dict + + +class TrajectoryCriterion: + + def __init__(self): + pass + + def compute_loss(self, pred, gt=None, mask=None, is_loss=True): + if not is_loss: + assert gt is not None and mask is not None + return self.compute_metrics(pred, gt, mask)[0]['action_mse'] + return pred + + @staticmethod + def compute_metrics(pred, gt, mask): + # pred/gt are (B, L, 7), mask (B, L) + pos_l2 = ((pred[..., :3] - gt[..., :3]) ** 2).sum(-1).sqrt() + # symmetric quaternion eval + quat_l1 = (pred[..., 3:7] - gt[..., 3:7]).abs().sum(-1) + quat_l1_ = (pred[..., 3:7] + gt[..., 3:7]).abs().sum(-1) + select_mask = (quat_l1 < quat_l1_).float() + quat_l1 = (select_mask * quat_l1 + (1 - select_mask) * quat_l1_) + # gripper openess + openess = ((pred[..., 7:] >= 0.5) == (gt[..., 7:] > 0.0)).bool() + tr = 'traj_' + + # Trajectory metrics + ret_1, ret_2 = { + tr + 'action_mse': F.mse_loss(pred, gt), + tr + 'pos_l2': pos_l2.mean(), + tr + 'pos_acc_001': (pos_l2 < 0.01).float().mean(), + tr + 'rot_l1': quat_l1.mean(), + tr + 'rot_acc_0025': (quat_l1 < 0.025).float().mean(), + tr + 'gripper': openess.flatten().float().mean() + }, { + tr + 'pos_l2': pos_l2.mean(-1), + tr + 'pos_acc_001': (pos_l2 < 0.01).float().mean(-1), + tr + 'rot_l1': quat_l1.mean(-1), + tr + 'rot_acc_0025': (quat_l1 < 0.025).float().mean(-1) + } + + # Keypose metrics + pos_l2 = ((pred[:, -1, :3] - gt[:, -1, :3]) ** 2).sum(-1).sqrt() + quat_l1 = (pred[:, -1, 3:7] - gt[:, -1, 3:7]).abs().sum(-1) + quat_l1_ = (pred[:, -1, 3:7] + gt[:, -1, 3:7]).abs().sum(-1) + select_mask = (quat_l1 < quat_l1_).float() + quat_l1 = (select_mask * quat_l1 + (1 - select_mask) * quat_l1_) + ret_1.update({ + 'pos_l2_final': pos_l2.mean(), + 'pos_l2_final<0.01': (pos_l2 < 0.01).float().mean(), + 'rot_l1': quat_l1.mean(), + 'rot_l1<0025': (quat_l1 < 
0.025).float().mean() + }) + ret_2.update({ + 'pos_l2_final': pos_l2, + 'pos_l2_final<0.01': (pos_l2 < 0.01).float(), + 'rot_l1': quat_l1, + 'rot_l1<0.025': (quat_l1 < 0.025).float(), + }) + + return ret_1, ret_2 + + +def fig_to_numpy(fig, dpi=60): + buf = io.BytesIO() + fig.savefig(buf, format="png", dpi=dpi) + buf.seek(0) + img_arr = np.frombuffer(buf.getvalue(), dtype=np.uint8) + buf.close() + img = cv2.imdecode(img_arr, 1) + return img + + +def generate_visualizations(pred, gt, mask, box_size=0.3): + batch_idx = 0 + pred = pred[batch_idx].detach().cpu().numpy() + gt = gt[batch_idx].detach().cpu().numpy() + mask = mask[batch_idx].detach().cpu().numpy() + + fig = plt.figure(figsize=(10, 10)) + ax = plt.axes(projection='3d') + ax.scatter3D( + pred[~mask][:, 0], pred[~mask][:, 1], pred[~mask][:, 2], + color='red', label='pred' + ) + ax.scatter3D( + gt[~mask][:, 0], gt[~mask][:, 1], gt[~mask][:, 2], + color='blue', label='gt' + ) + + center = gt[~mask].mean(0) + ax.set_xlim(center[0] - box_size, center[0] + box_size) + ax.set_ylim(center[1] - box_size, center[1] + box_size) + ax.set_zlim(center[2] - box_size, center[2] + box_size) + ax.set_xticklabels([]) + ax.set_yticklabels([]) + ax.set_zticklabels([]) + plt.legend() + fig.subplots_adjust(left=0, right=1, bottom=0, top=1) + + img = fig_to_numpy(fig, dpi=120) + plt.close() + return img.transpose(2, 0, 1) + + +if __name__ == '__main__': + os.environ["TOKENIZERS_PARALLELISM"] = "false" + # Arguments + args = Arguments().parse_args() + print("Arguments:") + print(args) + print("-" * 100) + # os.environ['MASTER_ADDR'] = str(args.master_addr) + # os.environ['MASTER_PORT'] = str(args.master_port) + # os.environ['LOCAL_RANK'] = str(0) + # os.environ['RANK'] = str(0) + # os.environ['WORLD_SIZE'] = str(0) + # os.environ['CUDA_VISIBLE_DEVICES'] = str(args.visible_devices) + if args.gripper_loc_bounds is None: + args.gripper_loc_bounds = np.array([[-2, -2, -2], [2, 2, 2]]) * 1.0 + else: + args.gripper_loc_bounds = get_gripper_loc_bounds( + args.gripper_loc_bounds, + task=args.tasks[0] if len(args.tasks) == 1 else None, + buffer=args.gripper_loc_bounds_buffer, + ) + log_dir = args.base_log_dir / args.exp_log_dir / args.run_log_dir + args.log_dir = log_dir + log_dir.mkdir(exist_ok=True, parents=True) + print("Logging:", log_dir) + print( + "Available devices (CUDA_VISIBLE_DEVICES):", + os.environ.get("CUDA_VISIBLE_DEVICES") + ) + print("Device count", torch.cuda.device_count()) + # print("Local rank", os.environ["LOCAL_RANK"]) + args.local_rank = int(os.environ["LOCAL_RANK"]) + + # Seeds + torch.manual_seed(args.seed) + np.random.seed(args.seed) + random.seed(args.seed) + + # DDP initialization + torch.cuda.set_device(args.local_rank) + torch.distributed.init_process_group(backend='nccl', init_method='env://') + torch.backends.cudnn.enabled = True + torch.backends.cudnn.benchmark = True + torch.backends.cudnn.deterministic = True + + # Run + train_tester = TrainTester(args) + train_tester.main(collate_fn=traj_collate_fn) diff --git a/main_trajectory_calvin.py b/main_trajectory_calvin.py new file mode 100644 index 0000000..482c3c0 --- /dev/null +++ b/main_trajectory_calvin.py @@ -0,0 +1,205 @@ +"""Main script for trajectory optimization.""" + +import os +import random +import pickle + +import torch +import torch.optim as optim +from matplotlib import pyplot as plt +import numpy as np + +from datasets.dataset_calvin import CalvinDataset +from main_trajectory import TrainTester as BaseTrainTester +from main_trajectory import traj_collate_fn, fig_to_numpy, 
Arguments +from utils.common_utils import ( + load_instructions, get_gripper_loc_bounds +) + + +def load_instructions(instructions, split): + instructions = pickle.load( + open(f"{instructions}/{split}.pkl", "rb") + )['embeddings'] + return instructions + + +class TrainTester(BaseTrainTester): + """Train/test a trajectory optimization algorithm.""" + + def __init__(self, args): + """Initialize.""" + super().__init__(args) + + def get_datasets(self): + """Initialize datasets.""" + # Load instruction, based on which we load tasks/variations + train_instruction = load_instructions( + self.args.instructions, 'training' + ) + test_instruction = load_instructions( + self.args.instructions, 'validation' + ) + taskvar = [ + ("A", 0), ("B", 0), ("C", 0), ("D", 0), + ] + + # Initialize datasets with arguments + train_dataset = CalvinDataset( + root=self.args.dataset, + instructions=train_instruction, + taskvar=taskvar, + max_episode_length=self.args.max_episode_length, + cache_size=self.args.cache_size, + max_episodes_per_task=self.args.max_episodes_per_task, + num_iters=self.args.train_iters, + cameras=self.args.cameras, + training=True, + image_rescale=tuple( + float(x) for x in self.args.image_rescale.split(",") + ), + return_low_lvl_trajectory=True, + dense_interpolation=bool(self.args.dense_interpolation), + interpolation_length=self.args.interpolation_length, + relative_action=bool(self.args.relative_action) + ) + test_dataset = CalvinDataset( + root=self.args.valset, + instructions=test_instruction, + taskvar=taskvar, + max_episode_length=self.args.max_episode_length, + cache_size=self.args.cache_size_val, + max_episodes_per_task=self.args.max_episodes_per_task, + cameras=self.args.cameras, + training=False, + image_rescale=tuple( + float(x) for x in self.args.image_rescale.split(",") + ), + return_low_lvl_trajectory=True, + dense_interpolation=bool(self.args.dense_interpolation), + interpolation_length=self.args.interpolation_length, + relative_action=bool(self.args.relative_action) + ) + return train_dataset, test_dataset + + def save_checkpoint(self, model, optimizer, step_id, new_loss, best_loss): + """Save checkpoint if requested.""" + if new_loss is None or best_loss is None or new_loss <= best_loss: + best_loss = new_loss + torch.save({ + "weight": model.state_dict(), + "optimizer": optimizer.state_dict(), + "iter": step_id + 1, + "best_loss": best_loss + }, self.args.log_dir / "best.pth") + torch.save({ + "weight": model.state_dict(), + "optimizer": optimizer.state_dict(), + "iter": step_id + 1, + "best_loss": best_loss + }, self.args.log_dir / '{:07d}.pth'.format(step_id)) + torch.save({ + "weight": model.state_dict(), + "optimizer": optimizer.state_dict(), + "iter": step_id + 1, + "best_loss": best_loss + }, self.args.log_dir / "last.pth") + return best_loss + + def get_optimizer(self, model): + """Initialize optimizer.""" + optimizer_grouped_parameters = [ + {"params": [], "weight_decay": 0.0, "lr": self.args.lr}, + {"params": [], "weight_decay": self.args.wd, "lr": self.args.lr} + ] + no_decay = ["bias", "LayerNorm.weight", "LayerNorm.bias"] + for name, param in model.named_parameters(): + if any(nd in name for nd in no_decay): + optimizer_grouped_parameters[0]["params"].append(param) + else: + optimizer_grouped_parameters[1]["params"].append(param) + optimizer = optim.AdamW(optimizer_grouped_parameters) + return optimizer + + +def generate_visualizations(pred, gt, mask, box_size=0.05): + batch_idx = 0 + images = [] + for batch_idx in range(min(pred.shape[0], 5)): + cur_pred = 
pred[batch_idx].detach().cpu().numpy() + cur_gt = gt[batch_idx].detach().cpu().numpy() + cur_mask = mask[batch_idx].detach().cpu().numpy() + + fig = plt.figure(figsize=(5, 5)) + ax = plt.axes(projection='3d') + ax.scatter3D( + cur_pred[~cur_mask][:, 0], + cur_pred[~cur_mask][:, 1], + cur_pred[~cur_mask][:, 2], + color='red', label='pred' + ) + ax.scatter3D( + cur_gt[~cur_mask][:, 0], + cur_gt[~cur_mask][:, 1], + cur_gt[~cur_mask][:, 2], + color='blue', label='gt' + ) + + center = cur_gt[~cur_mask].mean(0) + ax.set_xlim(center[0] - box_size, center[0] + box_size) + ax.set_ylim(center[1] - box_size, center[1] + box_size) + ax.set_zlim(center[2] - box_size, center[2] + box_size) + ax.set_xticklabels([]) + ax.set_yticklabels([]) + ax.set_zticklabels([]) + plt.legend() + fig.subplots_adjust(left=0, right=1, bottom=0, top=1) + + img = fig_to_numpy(fig, dpi=120) + plt.close() + images.append(img) + images = np.concatenate(images, axis=1) + return images.transpose(2, 0, 1) + + +if __name__ == '__main__': + os.environ["TOKENIZERS_PARALLELISM"] = "false" + # Arguments + args = Arguments().parse_args() + print("Arguments:") + print(args) + print("-" * 100) + if args.gripper_loc_bounds is None: + args.gripper_loc_bounds = np.array([[-2, -2, -2], [2, 2, 2]]) * 1.0 + else: + args.gripper_loc_bounds = get_gripper_loc_bounds( + args.gripper_loc_bounds, + task=args.tasks[0] if len(args.tasks) == 1 else None, + buffer=args.gripper_loc_bounds_buffer, + ) + log_dir = args.base_log_dir / args.exp_log_dir / args.run_log_dir + args.log_dir = log_dir + log_dir.mkdir(exist_ok=True, parents=True) + print("Logging:", log_dir) + print( + "Available devices (CUDA_VISIBLE_DEVICES):", + os.environ.get("CUDA_VISIBLE_DEVICES") + ) + print("Device count", torch.cuda.device_count()) + args.local_rank = int(os.environ["LOCAL_RANK"]) + + # Seeds + torch.manual_seed(args.seed) + np.random.seed(args.seed) + random.seed(args.seed) + + # DDP initialization + torch.cuda.set_device(args.local_rank) + torch.distributed.init_process_group(backend='nccl', init_method='env://') + torch.backends.cudnn.enabled = True + torch.backends.cudnn.benchmark = True + torch.backends.cudnn.deterministic = True + + # Run + train_tester = TrainTester(args) + train_tester.main(collate_fn=traj_collate_fn) diff --git a/online_evaluation_calvin/evaluate_model.py b/online_evaluation_calvin/evaluate_model.py new file mode 100644 index 0000000..49a8797 --- /dev/null +++ b/online_evaluation_calvin/evaluate_model.py @@ -0,0 +1,199 @@ +import logging + +import transformers +import torch +import numpy as np + +# This is for using the locally installed repo clone when using slurm +from calvin_agent.models.calvin_base_model import CalvinBaseModel +from diffuser_actor.trajectory_optimization.diffuser_actor import DiffuserActor +from online_evaluation_calvin.evaluate_utils import convert_action +from utils.utils_with_calvin import relative_to_absolute + + +logger = logging.getLogger(__name__) + + +def create_model(args, pretrained=True): + model = DiffusionModel(args) + if pretrained: + model.load_pretrained_weights() + return model + + +class DiffusionModel(CalvinBaseModel): + """A wrapper for the DiffuserActor model, which handles + 1. Model initialization + 2. Encodings of instructions + 3. Model inference + 4. 
Action post-processing + - quaternion to Euler angles + - relative to absolute action + """ + def __init__(self, args): + self.args = args + self.policy = self.get_policy() + self.text_tokenizer, self.text_model = self.get_text_encoder() + self.reset() + + def get_policy(self): + """Initialize the model.""" + # Initialize model with arguments + _model = DiffuserActor( + backbone=self.args.backbone, + image_size=tuple(int(x) for x in self.args.image_size.split(",")), + embedding_dim=self.args.embedding_dim, + num_vis_ins_attn_layers=self.args.num_vis_ins_attn_layers, + use_instruction=bool(self.args.use_instruction), + fps_subsampling_factor=self.args.fps_subsampling_factor, + gripper_loc_bounds=self.args.gripper_loc_bounds, + rotation_parametrization=self.args.rotation_parametrization, + quaternion_format=self.args.quaternion_format, + diffusion_timesteps=self.args.diffusion_timesteps, + nhist=self.args.num_history, + relative=bool(self.args.relative_action), + lang_enhanced=bool(self.args.lang_enhanced), + ) + + return _model + + def get_text_encoder(self): + def load_model(encoder) -> transformers.PreTrainedModel: + if encoder == "bert": + model = transformers.BertModel.from_pretrained("bert-base-uncased") + elif encoder == "clip": + model = transformers.CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32") + else: + raise ValueError(f"Unexpected encoder {encoder}") + if not isinstance(model, transformers.PreTrainedModel): + raise ValueError(f"Unexpected encoder {encoder}") + return model + + + def load_tokenizer(encoder) -> transformers.PreTrainedTokenizer: + if encoder == "bert": + tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased") + elif encoder == "clip": + tokenizer = transformers.CLIPTokenizer.from_pretrained( + "openai/clip-vit-base-patch32" + ) + else: + raise ValueError(f"Unexpected encoder {encoder}") + if not isinstance(tokenizer, transformers.PreTrainedTokenizer): + raise ValueError(f"Unexpected encoder {encoder}") + return tokenizer + + + tokenizer = load_tokenizer(self.args.text_encoder) + tokenizer.model_max_length = self.args.text_max_length + + model = load_model(self.args.text_encoder) + + return tokenizer, model + + def reset(self): + """Set model to evaluation mode. + """ + device = self.args.device + self.policy.eval() + self.text_model.eval() + + self.policy = self.policy.to(device) + self.text_model = self.text_model.to(device) + + def load_pretrained_weights(self, state_dict=None): + if state_dict is None: + state_dict = torch.load(self.args.checkpoint, map_location="cpu")["weight"] + model_weights = {} + for key in state_dict: + _key = key[7:] + model_weights[_key] = state_dict[key] + print(f'Loading weights from {self.args.checkpoint}') + self.policy.load_state_dict(model_weights) + + def encode_instruction(self, instruction, device="cuda"): + """Encode string instruction to latent embeddings. + + Args: + instruction: a string of instruction + device: a string of device + + Returns: + pred: a tensor of latent embeddings of shape (text_max_length, 512) + """ + instr = instruction + '.' 
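+        # padding="max_length" pads the token ids to tokenizer.model_max_length,
+        # which get_text_encoder() set to args.text_max_length above, so the
+        # returned embedding has the (text_max_length, 512) shape noted in the docstring.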
+ tokens = self.text_tokenizer(instr, padding="max_length")["input_ids"] + + tokens = torch.tensor(tokens).to(device) + tokens = tokens.view(1, -1) + with torch.no_grad(): + pred = self.text_model(tokens).last_hidden_state + + return pred + + def step(self, obs, instruction): + """ + Args: + obs: a dictionary of observations + - rgb_obs: a dictionary of RGB images + - depth_obs: a dictionary of depth images + - robot_obs: a dictionary of proprioceptive states + lang_annotation: a string indicates the instruction of the task + + Returns: + action: predicted action + """ + device = self.args.device + + # Organize inputs + trajectory_mask = torch.full( + [1, self.args.interpolation_length - 1], False + ).to(device) + fake_trajectory = torch.full( + [1, self.args.interpolation_length - 1, self.args.action_dim], 0 + ).to(device) + rgbs = np.stack([ + obs["rgb_obs"]["rgb_static"], obs["rgb_obs"]["rgb_gripper"] + ], axis=0).transpose(0, 3, 1, 2) # [ncam, 3, H, W] + pcds = np.stack([ + obs["pcd_obs"]["pcd_static"], obs["pcd_obs"]["pcd_gripper"] + ], axis=0).transpose(0, 3, 1, 2) # [ncam, 3, H, W] + + rgbs = torch.as_tensor(rgbs).to(device).unsqueeze(0) + pcds = torch.as_tensor(pcds).to(device).unsqueeze(0) + + # Crop the images. See Line 165-166 in datasets/dataset_calvin.py + rgbs = rgbs[..., 20:180, 20:180] + pcds = pcds[..., 20:180, 20:180] + + # history of actions + gripper = torch.as_tensor(obs["proprio"]).to(device).unsqueeze(0) + + trajectory = self.policy( + fake_trajectory.float(), + trajectory_mask, + rgbs.float(), + pcds.float(), + instruction.float(), + curr_gripper=gripper[..., :7].float(), + run_inference=True + ) + + # Convert quaternion to Euler angles + trajectory = convert_action(trajectory) + + if self.args.relative_action: + # Convert quaternion to Euler angles + gripper = convert_action(gripper[:, [-1], :]) + # Convert relative action to absolute action + trajectory = relative_to_absolute(trajectory, gripper) + + # Bound final action by CALVIN statistics + if self.args.calvin_gripper_loc_bounds is not None: + trajectory[:, :, :3] = np.clip( + trajectory[:, :, :3], + a_min=self.args.calvin_gripper_loc_bounds[0].reshape(1, 1, 3), + a_max=self.args.calvin_gripper_loc_bounds[1].reshape(1, 1, 3) + ) + + return trajectory diff --git a/online_evaluation_calvin/evaluate_policy.py b/online_evaluation_calvin/evaluate_policy.py new file mode 100644 index 0000000..9ae5eeb --- /dev/null +++ b/online_evaluation_calvin/evaluate_policy.py @@ -0,0 +1,321 @@ +"""Modified from +https://github.com/mees/calvin/blob/main/calvin_models/calvin_agent/evaluation/evaluate_policy.py +""" +import os +import gc +from typing import Tuple, Optional, List +import random +import logging +from pathlib import Path + +import tap +import hydra +from omegaconf import OmegaConf +import torch +import numpy as np +import yaml +from tqdm import tqdm + +from utils.common_utils import get_gripper_loc_bounds +from online_evaluation_calvin.evaluate_model import create_model +from online_evaluation_calvin.evaluate_utils import ( + prepare_visual_states, + prepare_proprio_states, + count_success, + get_env_state_for_initial_condition, + collect_results, + write_results, + get_log_dir +) +from online_evaluation_calvin.multistep_sequences import get_sequences +from online_evaluation_calvin.evaluate_utils import get_env + +logger = logging.getLogger(__name__) + +EP_LEN = 60 +NUM_SEQUENCES = 1000 +EXECUTE_LEN = 20 + + +class Arguments(tap.Tap): + # Online enviornment + calvin_dataset_path: Path = 
"/home/tsungwek/repos/calvin/dataset/task_ABC_D" + calvin_model_path: Path = "/home/tsungwek/repos/calvin/calvin_models" + calvin_demo_tasks: Optional[List[str]] = None + device: str = "cuda" + text_encoder: str = "clip" + text_max_length: int = 16 + save_video: int = 0 + + # Offline data loader + seed: int = 0 + tasks: Tuple[str, ...] # indicates the environment + checkpoint: Path + gripper_loc_bounds: Optional[str] = None + gripper_loc_bounds_buffer: float = 0.04 + calvin_gripper_loc_bounds: Optional[str] = None + relative_action: int = 0 + + # Logging to base_log_dir/exp_log_dir/run_log_dir + base_log_dir: Path = Path(__file__).parent / "eval_logs" / "calvin" + + # Model + action_dim: int = 7 # dummy, as DiffuserActor assumes action_dim is 7 + image_size: str = "256,256" # decides the FPN architecture + backbone: str = "clip" # one of "resnet", "clip" + embedding_dim: int = 120 + num_vis_ins_attn_layers: int = 2 + use_instruction: int = 0 + rotation_parametrization: str = 'quat' + quaternion_format: str = 'wxyz' + diffusion_timesteps: int = 100 + lang_enhanced: int = 0 + fps_subsampling_factor: int = 3 + num_history: int = 0 + interpolation_length: int = 2 # the number of steps to reach keypose + + +def make_env(dataset_path, show_gui=True, split="validation", scene=None): + val_folder = Path(dataset_path) / f"{split}" + if scene is not None: + env = get_env(val_folder, show_gui=show_gui, scene=scene) + else: + env = get_env(val_folder, show_gui=show_gui) + + return env + + +def evaluate_policy(model, env, conf_dir, eval_log_dir=None, save_video=False, + sequence_indices=[]): + """ + Run this function to evaluate a model on the CALVIN challenge. + + Args: + model: an instance of CalvinBaseModel + env: an instance of CALVIN_ENV + conf_dir: Path to the directory containing the config files of CALVIN + eval_log_dir: Path where to log evaluation results + save_video: a boolean indicates whether to save the video + sequence_indices: a list of integers indicates the indices of the + instruction chains to evaluate + + Returns: + results: a list of integers indicates the number of tasks completed + """ + task_cfg = OmegaConf.load(conf_dir / "callbacks/rollout/tasks/new_playtable_tasks.yaml") + task_oracle = hydra.utils.instantiate(task_cfg) + val_annotations = OmegaConf.load(conf_dir / "annotations/new_playtable_validation.yaml") + + eval_log_dir = get_log_dir(eval_log_dir) + + eval_sequences = get_sequences(NUM_SEQUENCES) + + results, tested_sequence_indices = collect_results(eval_log_dir) + + for seq_ind, (initial_state, eval_sequence) in enumerate(eval_sequences): + if sequence_indices and seq_ind not in sequence_indices: + continue + if seq_ind in tested_sequence_indices: + continue + result, videos = evaluate_sequence( + env, model, task_oracle, initial_state, + eval_sequence, val_annotations, save_video + ) + write_results(eval_log_dir, seq_ind, result) + results.append(result) + str_results = ( + " ".join([f"{i + 1}/5 : {v * 100:.1f}% |" + for i, v in enumerate(count_success(results))]) + "|" + ) + print(str_results + "\n") + + if save_video: + import moviepy.video.io.ImageSequenceClip + from moviepy.editor import vfx + clip = [] + import cv2 + for task_ind, (subtask, video) in enumerate(zip(eval_sequence, videos)): + for img_ind, img in enumerate(video): + cv2.putText(img, + f'{task_ind}: {subtask}', + (10, 180), + cv2.FONT_HERSHEY_SIMPLEX, + 0.5, + (0, 0, 0), + 1, + 2) + video[img_ind] = img + clip.extend(video) + clip = moviepy.video.io.ImageSequenceClip.ImageSequenceClip(clip, 
fps=30) + clip.write_videofile(f"calvin_seq{seq_ind}.mp4") + + + return results + + +def evaluate_sequence(env, model, task_checker, initial_state, eval_sequence, + val_annotations, save_video): + """ + Evaluates a sequence of language instructions. + + Args: + env: an instance of CALVIN_ENV + model: an instance of CalvinBaseModel + task_checker: an indicator of whether the current task is completed + initial_state: a tuple of `robot_obs` and `scene_obs` + see: https://github.com/mees/calvin/blob/main/dataset/README.md#state-observation + eval_sequence: a list indicates the instruction chain + val_annotations: a dictionary of task instructions + save_video: a boolean indicates whether to save the video + + Returns: + success_counter: an integer indicates the number of tasks completed + video_aggregator: a list of lists of images that shows the trajectory + of the robot + + """ + robot_obs, scene_obs = get_env_state_for_initial_condition(initial_state) + env.reset(robot_obs=robot_obs, scene_obs=scene_obs) + + success_counter, video_aggregators = 0, [] + for subtask in eval_sequence: + # get lang annotation for subtask + lang_annotation = val_annotations[subtask][0] + success, video = rollout(env, model, task_checker, + subtask, lang_annotation) + video_aggregators.append(video) + + if success: + success_counter += 1 + else: + return success_counter, video_aggregators + return success_counter, video_aggregators + + +def rollout(env, model, task_oracle, subtask, lang_annotation): + """ + Run the actual rollout on one subtask (which is one natural language instruction). + + Args: + env: an instance of CALVIN_ENV + model: an instance of CalvinBaseModel + task_oracle: an indicator of whether the current task is completed + subtask: a string indicates the task name + lang_annotation: a string indicates the instruction of the task + + Returns: + Success/Fail: a boolean indicates whether the task is completed + video: a list of images that shows the trajectory of the robot + """ + video = [] # show video for debugging + obs = env.get_obs() + + model.reset() + start_info = env.get_info() + + print('------------------------------') + print(f'task: {lang_annotation}') + video.append(obs["rgb_obs"]["rgb_static"]) + + pbar = tqdm(range(EP_LEN)) + for step in pbar: + obs = prepare_visual_states(obs, env) + obs = prepare_proprio_states(obs, env) + lang_embeddings = model.encode_instruction(lang_annotation, model.args.device) + with torch.cuda.amp.autocast(): + trajectory = model.step(obs, lang_embeddings) + for act_ind in range(min(trajectory.shape[1], EXECUTE_LEN)): + # calvin_env executes absolute action in the format of: + # [[x, y, z], [euler_x, euler_y, euler_z], [open]] + curr_action = [ + trajectory[0, act_ind, :3], + trajectory[0, act_ind, 3:6], + trajectory[0, act_ind, [6]] + ] + pbar.set_description(f"step: {step}") + curr_proprio = obs['proprio'] + obs, _, _, current_info = env.step(curr_action) + obs['proprio'] = curr_proprio + + # check if current step solves a task + current_task_info = task_oracle.get_task_info_for_set( + start_info, current_info, {subtask} + ) + + video.append(obs["rgb_obs"]["rgb_static"]) + + if len(current_task_info) > 0: + return True, video + + return False, video + + +def get_calvin_gripper_loc_bounds(args): + with open(args.calvin_gripper_loc_bounds, "r") as stream: + bounds = yaml.safe_load(stream) + min_bound = bounds['act_min_bound'][:3] + max_bound = bounds['act_max_bound'][:3] + gripper_loc_bounds = np.stack([min_bound, max_bound]) + + return gripper_loc_bounds 
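For context, the (2, 3) array returned above is consumed in DiffusionModel.step, which clips predicted end-effector positions to these CALVIN statistics. Below is a minimal sketch of that use, assuming a hypothetical statistics file with act_min_bound/act_max_bound entries; the file name and the dummy trajectory are placeholders, not part of the repository.

import numpy as np
import yaml

# Load workspace bounds the same way get_calvin_gripper_loc_bounds does
# (the file name here is a placeholder).
with open("statistics.yaml", "r") as stream:
    bounds = yaml.safe_load(stream)
loc_bounds = np.stack([bounds['act_min_bound'][:3], bounds['act_max_bound'][:3]])  # (2, 3)

# Clip a (batch, horizon, 7) trajectory prediction to the workspace,
# mirroring the clipping done at the end of DiffusionModel.step.
trajectory = np.random.randn(1, 20, 7)  # dummy prediction for illustration
trajectory[:, :, :3] = np.clip(
    trajectory[:, :, :3],
    a_min=loc_bounds[0].reshape(1, 1, 3),
    a_max=loc_bounds[1].reshape(1, 1, 3),
)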
+ + +def main(args): + + # These location bounds are extracted from language-annotated episodes + if args.gripper_loc_bounds is None: + args.gripper_loc_bounds = np.array([[-2, -2, -2], [2, 2, 2]]) * 1.0 + else: + args.gripper_loc_bounds = get_gripper_loc_bounds( + args.gripper_loc_bounds, + task=args.tasks[0] if len(args.tasks) == 1 else None, + buffer=args.gripper_loc_bounds_buffer, + ) + + # These location bounds are extracted from every episode in play trajectory + if args.calvin_gripper_loc_bounds is not None: + args.calvin_gripper_loc_bounds = get_calvin_gripper_loc_bounds(args) + + # set random seeds + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + torch.cuda.manual_seed_all(args.seed) + + # evaluate a custom model + model = create_model(args) + + sequence_indices = [ + i for i in range(args.local_rank, NUM_SEQUENCES, int(os.environ["WORLD_SIZE"])) + ] + + env = make_env(args.calvin_dataset_path, show_gui=False) + evaluate_policy(model, env, + conf_dir=Path(args.calvin_model_path) / "conf", + eval_log_dir=args.base_log_dir, + sequence_indices=sequence_indices, + save_video=args.save_video) + + results, sequence_inds = collect_results(args.base_log_dir) + str_results = ( + " ".join([f"{i + 1}/5 : {v * 100:.1f}% |" + for i, v in enumerate(count_success(results))]) + "|" + ) + print(f'Load {len(results)}/1000 episodes...') + print(str_results + "\n") + + del env + gc.collect() + +if __name__ == "__main__": + args = Arguments().parse_args() + args.local_rank = int(os.environ["LOCAL_RANK"]) + + # DDP initialization + torch.cuda.set_device(args.local_rank) + torch.distributed.init_process_group(backend='nccl', init_method='env://') + torch.backends.cudnn.enabled = True + torch.backends.cudnn.benchmark = True + torch.backends.cudnn.deterministic = True + + main(args) diff --git a/online_evaluation_calvin/evaluate_utils.py b/online_evaluation_calvin/evaluate_utils.py new file mode 100644 index 0000000..d2d34d2 --- /dev/null +++ b/online_evaluation_calvin/evaluate_utils.py @@ -0,0 +1,322 @@ +from typing import Dict, Any +import os +from pathlib import Path +import contextlib +from collections import Counter +import glob + +import numpy as np +from numpy import pi +import torch +import torch.nn.functional as F +import utils.pytorch3d_transforms as pytorch3d_transforms +import pybullet +import hydra + +import calvin_env +from calvin_env.envs.play_table_env import PlayTableSimEnv +from utils.utils_with_calvin import ( + deproject, + get_gripper_camera_view_matrix, + convert_rotation +) + + +############################################################ +# Functions to prepare inputs/outputs of 3D diffuser Actor # +############################################################ +def prepare_visual_states(obs: Dict[str, Dict[str, Any]], + env: PlayTableSimEnv): + + """Prepare point cloud given RGB-D observations. In-place add point clouds + to the observation dictionary. 
+ + Args: + obs: a dictionary of observations + - rgb_obs: a dictionary of RGB images + - depth_obs: a dictionary of depth images + - robot_obs: a dictionary of proprioceptive states + env: a PlayTableSimEnv instance which contains camera information + """ + rgb_static = obs["rgb_obs"]["rgb_static"] + rgb_gripper = obs["rgb_obs"]["rgb_gripper"] + depth_static = obs["depth_obs"]["depth_static"] + depth_gripper = obs["depth_obs"]["depth_gripper"] + + static_cam = env.cameras[0] + gripper_cam = env.cameras[1] + gripper_cam.viewMatrix = get_gripper_camera_view_matrix(gripper_cam) + + static_pcd = deproject( + static_cam, depth_static, + homogeneous=False, sanity_check=False + ).transpose(1, 0) + static_pcd = np.reshape( + static_pcd, (depth_static.shape[0], depth_static.shape[1], 3) + ) + gripper_pcd = deproject( + gripper_cam, depth_gripper, + homogeneous=False, sanity_check=False + ).transpose(1, 0) + gripper_pcd = np.reshape( + gripper_pcd, (depth_gripper.shape[0], depth_gripper.shape[1], 3) + ) + + # map RGB to [0, 1] + rgb_static = rgb_static / 255. + rgb_gripper = rgb_gripper / 255. + + h, w = rgb_static.shape[:2] + rgb_gripper = F.interpolate( + torch.as_tensor(rgb_gripper).permute(2, 0, 1).unsqueeze(0), + size=(h, w), mode='bilinear', align_corners=False + ).squeeze(0).permute(1, 2, 0).numpy() + gripper_pcd = F.interpolate( + torch.as_tensor(gripper_pcd).permute(2, 0, 1).unsqueeze(0), + size=(h, w), mode='nearest' + ).squeeze(0).permute(1, 2, 0).numpy() + + obs["rgb_obs"]["rgb_static"] = rgb_static + obs["rgb_obs"]["rgb_gripper"] = rgb_gripper + obs["pcd_obs"] = {} + obs["pcd_obs"]["pcd_static"] = static_pcd + obs["pcd_obs"]["pcd_gripper"] = gripper_pcd + + return obs + + +def prepare_proprio_states(obs: Dict[str, Dict[str, Any]], + env: PlayTableSimEnv): + """Prepare robot proprioceptive states. In-place add proprioceptive states + to the observation dictionary. 
+ + Args: + obs: a dictionary of observations + - rgb_obs: a dictionary of RGB images + - depth_obs: a dictionary of depth images + - robot_obs: a dictionary of proprioceptive states + env: a PlayTableSimEnv instance which contains camera information + """ + # Map gripper openess to [0, 1] + proprio = np.concatenate([ + obs['robot_obs'][:3], + convert_rotation(obs['robot_obs'][3:6]), + (obs['robot_obs'][[-1]] + 1) / 2 + ], axis=-1) + + if 'proprio' not in obs: + obs['proprio'] = np.stack([proprio] * 3, axis=0) + else: + obs['proprio'] = np.concatenate([obs['proprio'][1:], proprio[None]], axis=0) + + return obs + + +def convert_quaternion_to_euler(quat): + """Convert a quaternion to 'XYZ' Euler angles. + """ + quat = torch.as_tensor(quat) + mat = pytorch3d_transforms.quaternion_to_matrix(quat) + rot = pytorch3d_transforms.matrix_to_euler_angles(mat, "XYZ") + rot = rot.data.cpu().numpy() + + return rot + + +def convert_action(trajectory): + """Convert [position, rotation, openess] to the same format as Calvin + + Args: + trajectory: a torch.Tensor or np.ndarray of shape [bs, traj_len, 8] + - position: absolute [x, y, z] in the world coordinates + - rotation: absolute quaternion in the world coordinates + - openess: [0, 1] + + Returns: + trajectory: a torch.Tensor or np.ndarray of shape [bs, traj_len, 8] + - position: absolute [x, y, z] in the world coordinates + - rotation: absolute 'XYZ' Euler angles in the world coordinates + - openess: [-1, 1] + """ + assert trajectory.shape[-1] == 8 + position, rotation, openess = ( + trajectory[..., :3], trajectory[..., 3:7], trajectory[..., -1:] + ) + position = position.data.cpu().numpy() + _rot = convert_quaternion_to_euler(rotation) + # pytorch3d.transforms does not handle gimbal lock; the conversion + of some rotation matrices results in NaN values. We use pybullet's + implementation in this case. + if (_rot != _rot).any(): + # Pybullet uses a different quaternion convention. 
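+        # Fall back to pybullet: convert each quaternion row by row,
+        # reordering its components for pybullet's convention before
+        # calling getEulerFromQuaternion.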
+ _rot_shape = list(rotation.shape)[:-1] + [3] + _rot = rotation.reshape(-1, 4).data.cpu().numpy() + rotation = np.array([ + pybullet.getEulerFromQuaternion([r[-1], r[0], r[1], r[2]]) + for r in _rot + ]).reshape(_rot_shape) + else: + rotation = _rot + openess = (2 * (openess >= 0.5).long() - 1).data.cpu().numpy() + + trajectory = np.concatenate([position, rotation, openess], axis=-1) + return trajectory + + +###################################################### +# Functions in calvin_agent.evaluation.utils # +###################################################### +def count_success(results): + count = Counter(results) + step_success = [] + for i in range(1, 6): + n_success = sum(count[j] for j in reversed(range(i, 6))) + sr = n_success / len(results) + step_success.append(sr) + return step_success + + +def get_env(dataset_path, obs_space=None, show_gui=True, **kwargs): + from pathlib import Path + + from omegaconf import OmegaConf + + render_conf = OmegaConf.load(Path(dataset_path) / ".hydra" / "merged_config.yaml") + + if obs_space is not None: + exclude_keys = set(render_conf.cameras.keys()) - { + re.split("_", key)[1] for key in obs_space["rgb_obs"] + obs_space["depth_obs"] + } + for k in exclude_keys: + del render_conf.cameras[k] + if "scene" in kwargs: + scene_cfg = OmegaConf.load(Path(calvin_env.__file__).parents[1] / "conf/scene" / f"{kwargs['scene']}.yaml") + OmegaConf.update(render_conf, "scene", scene_cfg) + if not hydra.core.global_hydra.GlobalHydra.instance().is_initialized(): + hydra.initialize(".") + env = hydra.utils.instantiate(render_conf.env, show_gui=show_gui, use_vr=False, use_scene_info=True) + return env + + +def get_env_state_for_initial_condition(initial_condition): + robot_obs = np.array( + [ + 0.02586889, + -0.2313129, + 0.5712808, + 3.09045411, + -0.02908596, + 1.50013585, + 0.07999963, + -1.21779124, + 1.03987629, + 2.11978254, + -2.34205014, + -0.87015899, + 1.64119093, + 0.55344928, + 1.0, + ] + ) + block_rot_z_range = (pi / 2 - pi / 8, pi / 2 + pi / 8) + block_slider_left = np.array([-2.40851662e-01, 9.24044687e-02, 4.60990009e-01]) + block_slider_right = np.array([7.03416330e-02, 9.24044687e-02, 4.60990009e-01]) + block_table = [ + np.array([5.00000896e-02, -1.20000177e-01, 4.59990009e-01]), + np.array([2.29995412e-01, -1.19995140e-01, 4.59990010e-01]), + ] + # we want to have a "deterministic" random seed for each initial condition + import pyhash + hasher = pyhash.fnv1_32() + seed = hasher(str(initial_condition.values())) + with temp_seed(seed): + np.random.shuffle(block_table) + + scene_obs = np.zeros(24) + if initial_condition["slider"] == "left": + scene_obs[0] = 0.28 + if initial_condition["drawer"] == "open": + scene_obs[1] = 0.22 + if initial_condition["lightbulb"] == 1: + scene_obs[3] = 0.088 + scene_obs[4] = initial_condition["lightbulb"] + scene_obs[5] = initial_condition["led"] + # red block + if initial_condition["red_block"] == "slider_right": + scene_obs[6:9] = block_slider_right + elif initial_condition["red_block"] == "slider_left": + scene_obs[6:9] = block_slider_left + else: + scene_obs[6:9] = block_table[0] + scene_obs[11] = np.random.uniform(*block_rot_z_range) + # blue block + if initial_condition["blue_block"] == "slider_right": + scene_obs[12:15] = block_slider_right + elif initial_condition["blue_block"] == "slider_left": + scene_obs[12:15] = block_slider_left + elif initial_condition["red_block"] == "table": + scene_obs[12:15] = block_table[1] + else: + scene_obs[12:15] = block_table[0] + scene_obs[17] = 
np.random.uniform(*block_rot_z_range) + # pink block + if initial_condition["pink_block"] == "slider_right": + scene_obs[18:21] = block_slider_right + elif initial_condition["pink_block"] == "slider_left": + scene_obs[18:21] = block_slider_left + else: + scene_obs[18:21] = block_table[1] + scene_obs[23] = np.random.uniform(*block_rot_z_range) + + return robot_obs, scene_obs + + +@contextlib.contextmanager +def temp_seed(seed): + state = np.random.get_state() + np.random.seed(seed) + try: + yield + finally: + np.random.set_state(state) + + +def get_log_dir(log_dir): + if log_dir is not None: + log_dir = Path(log_dir) + os.makedirs(log_dir, exist_ok=True) + else: + log_dir = Path(__file__).parents[3] / "evaluation" + if not log_dir.exists(): + log_dir = Path("/tmp/evaluation") + os.makedirs(log_dir, exist_ok=True) + print(f"logging to {log_dir}") + return log_dir + + +###################################################### +# Functions to cache the evaluation results # +###################################################### +def collect_results(log_dir): + """Load the number of completed tasks of each instruction chain from a file. + """ + if os.path.isfile(str(Path(log_dir) / "result.txt")): + with open(str(Path(log_dir) / "result.txt")) as f: + lines = f.read().split("\n")[:-1] + else: + lines = [] + + results, seq_inds = [], [] + for line in lines: + seq, res= line.split(" ") + results.append(int(res)) + seq_inds.append(int(seq)) + + return results, seq_inds + + +def write_results(log_dir, seq_ind, result): + """Write the number of completed tasks of each instruction chain to a file. + """ + with open(log_dir / f"result.txt", "a") as write_file: + write_file.write(f"{seq_ind} {result}\n") \ No newline at end of file diff --git a/online_evaluation_calvin/multistep_sequences.py b/online_evaluation_calvin/multistep_sequences.py new file mode 100644 index 0000000..56f0fed --- /dev/null +++ b/online_evaluation_calvin/multistep_sequences.py @@ -0,0 +1,403 @@ +from collections import Counter +from concurrent.futures import ProcessPoolExecutor +from copy import deepcopy +import functools +from itertools import product +import logging +import multiprocessing +from operator import add + +import numpy as np +from online_evaluation_calvin.evaluate_utils import temp_seed + +logger = logging.getLogger(__name__) + + +task_categories = { + "rotate_red_block_right": 1, + "rotate_red_block_left": 1, + "rotate_blue_block_right": 1, + "rotate_blue_block_left": 1, + "rotate_pink_block_right": 1, + "rotate_pink_block_left": 1, + "push_red_block_right": 1, + "push_red_block_left": 1, + "push_blue_block_right": 1, + "push_blue_block_left": 1, + "push_pink_block_right": 1, + "push_pink_block_left": 1, + "move_slider_left": 2, + "move_slider_right": 2, + "open_drawer": 3, + "close_drawer": 3, + "lift_red_block_table": 4, + "lift_red_block_slider": 5, + "lift_red_block_drawer": 6, + "lift_blue_block_table": 4, + "lift_blue_block_slider": 5, + "lift_blue_block_drawer": 6, + "lift_pink_block_table": 4, + "lift_pink_block_slider": 5, + "lift_pink_block_drawer": 6, + "place_in_slider": 7, + "place_in_drawer": 7, + "turn_on_lightbulb": 8, + "turn_off_lightbulb": 8, + "turn_on_led": 8, + "turn_off_led": 8, + "push_into_drawer": 9, + "stack_block": 10, + "unstack_block": 11, +} + +tasks = { + "rotate_red_block_right": [{"condition": {"red_block": "table", "grasped": 0}, "effect": {"red_block": "table"}}], + "rotate_red_block_left": [{"condition": {"red_block": "table", "grasped": 0}, "effect": {"red_block": "table"}}], + 
"rotate_blue_block_right": [ + {"condition": {"blue_block": "table", "grasped": 0}, "effect": {"blue_block": "table"}} + ], + "rotate_blue_block_left": [{"condition": {"blue_block": "table", "grasped": 0}, "effect": {"blue_block": "table"}}], + "rotate_pink_block_right": [ + {"condition": {"pink_block": "table", "grasped": 0}, "effect": {"pink_block": "table"}} + ], + "rotate_pink_block_left": [{"condition": {"pink_block": "table", "grasped": 0}, "effect": {"pink_block": "table"}}], + "push_red_block_right": [{"condition": {"red_block": "table", "grasped": 0}, "effect": {"red_block": "table"}}], + "push_red_block_left": [{"condition": {"red_block": "table", "grasped": 0}, "effect": {"red_block": "table"}}], + "push_blue_block_right": [{"condition": {"blue_block": "table", "grasped": 0}, "effect": {"blue_block": "table"}}], + "push_blue_block_left": [{"condition": {"blue_block": "table", "grasped": 0}, "effect": {"blue_block": "table"}}], + "push_pink_block_right": [{"condition": {"pink_block": "table", "grasped": 0}, "effect": {"pink_block": "table"}}], + "push_pink_block_left": [{"condition": {"pink_block": "table", "grasped": 0}, "effect": {"pink_block": "table"}}], + "move_slider_left": [{"condition": {"slider": "right", "grasped": 0}, "effect": {"slider": "left"}}], + "move_slider_right": [{"condition": {"slider": "left", "grasped": 0}, "effect": {"slider": "right"}}], + "open_drawer": [{"condition": {"drawer": "closed", "grasped": 0}, "effect": {"drawer": "open"}}], + "close_drawer": [{"condition": {"drawer": "open", "grasped": 0}, "effect": {"drawer": "closed"}}], + "lift_red_block_table": [ + {"condition": {"red_block": "table", "grasped": 0}, "effect": {"red_block": "grasped", "grasped": 1}} + ], + "lift_red_block_slider": [ + { + "condition": {"red_block": "slider_left", "slider": "right", "grasped": 0}, + "effect": {"red_block": "grasped", "grasped": 1}, + }, + { + "condition": {"red_block": "slider_right", "slider": "left", "grasped": 0}, + "effect": {"red_block": "grasped", "grasped": 1}, + }, + ], + "lift_red_block_drawer": [ + { + "condition": {"red_block": "drawer", "drawer": "open", "grasped": 0}, + "effect": {"red_block": "grasped", "grasped": 1}, + } + ], + "lift_blue_block_table": [ + {"condition": {"blue_block": "table", "grasped": 0}, "effect": {"blue_block": "grasped", "grasped": 1}} + ], + "lift_blue_block_slider": [ + { + "condition": {"blue_block": "slider_left", "slider": "right", "grasped": 0}, + "effect": {"blue_block": "grasped", "grasped": 1}, + }, + { + "condition": {"blue_block": "slider_right", "slider": "left", "grasped": 0}, + "effect": {"blue_block": "grasped", "grasped": 1}, + }, + ], + "lift_blue_block_drawer": [ + { + "condition": {"blue_block": "drawer", "drawer": "open", "grasped": 0}, + "effect": {"blue_block": "grasped", "grasped": 1}, + } + ], + "lift_pink_block_table": [ + {"condition": {"pink_block": "table", "grasped": 0}, "effect": {"pink_block": "grasped", "grasped": 1}} + ], + "lift_pink_block_slider": [ + { + "condition": {"pink_block": "slider_left", "slider": "right", "grasped": 0}, + "effect": {"pink_block": "grasped", "grasped": 1}, + }, + { + "condition": {"pink_block": "slider_right", "slider": "left", "grasped": 0}, + "effect": {"pink_block": "grasped", "grasped": 1}, + }, + ], + "lift_pink_block_drawer": [ + { + "condition": {"pink_block": "drawer", "drawer": "open", "grasped": 0}, + "effect": {"pink_block": "grasped", "grasped": 1}, + } + ], + "place_in_slider": [ + { + "condition": {"red_block": "grasped", "slider": "right", 
"grasped": 1}, + "effect": {"red_block": "slider_right", "grasped": 0}, + }, + { + "condition": {"red_block": "grasped", "slider": "left", "grasped": 1}, + "effect": {"red_block": "slider_left", "grasped": 0}, + }, + { + "condition": {"blue_block": "grasped", "slider": "right", "grasped": 1}, + "effect": {"blue_block": "slider_right", "grasped": 0}, + }, + { + "condition": {"blue_block": "grasped", "slider": "left", "grasped": 1}, + "effect": {"blue_block": "slider_left", "grasped": 0}, + }, + { + "condition": {"pink_block": "grasped", "slider": "right", "grasped": 1}, + "effect": {"pink_block": "slider_right", "grasped": 0}, + }, + { + "condition": {"pink_block": "grasped", "slider": "left", "grasped": 1}, + "effect": {"pink_block": "slider_left", "grasped": 0}, + }, + ], + "place_in_drawer": [ + { + "condition": {"red_block": "grasped", "drawer": "open", "grasped": 1}, + "effect": {"red_block": "drawer", "grasped": 0}, + }, + { + "condition": {"blue_block": "grasped", "drawer": "open", "grasped": 1}, + "effect": {"blue_block": "drawer", "grasped": 0}, + }, + { + "condition": {"pink_block": "grasped", "drawer": "open", "grasped": 1}, + "effect": {"pink_block": "drawer", "grasped": 0}, + }, + ], + "stack_block": [ + { + "condition": {"red_block": "grasped", "blue_block": "table", "grasped": 1}, + "effect": {"red_block": "stacked_top", "blue_block": "stacked_bottom", "grasped": 0}, + }, + { + "condition": {"red_block": "grasped", "pink_block": "table", "grasped": 1}, + "effect": {"red_block": "stacked_top", "pink_block": "stacked_bottom", "grasped": 0}, + }, + { + "condition": {"blue_block": "grasped", "red_block": "table", "grasped": 1}, + "effect": {"blue_block": "stacked_top", "red_block": "stacked_bottom", "grasped": 0}, + }, + { + "condition": {"blue_block": "grasped", "pink_block": "table", "grasped": 1}, + "effect": {"blue_block": "stacked_top", "pink_block": "stacked_bottom", "grasped": 0}, + }, + { + "condition": {"pink_block": "grasped", "red_block": "table", "grasped": 1}, + "effect": {"pink_block": "stacked_top", "red_block": "stacked_bottom", "grasped": 0}, + }, + { + "condition": {"pink_block": "grasped", "blue_block": "table", "grasped": 1}, + "effect": {"pink_block": "stacked_top", "blue_block": "stacked_bottom", "grasped": 0}, + }, + ], + "unstack_block": [ + { + "condition": {"red_block": "stacked_top", "blue_block": "stacked_bottom", "grasped": 0}, + "effect": {"red_block": "table", "blue_block": "table"}, + }, + { + "condition": {"red_block": "stacked_top", "pink_block": "stacked_bottom", "grasped": 0}, + "effect": {"red_block": "table", "pink_block": "table"}, + }, + { + "condition": {"blue_block": "stacked_top", "red_block": "stacked_bottom", "grasped": 0}, + "effect": {"blue_block": "table", "red_block": "table"}, + }, + { + "condition": {"blue_block": "stacked_top", "pink_block": "stacked_bottom", "grasped": 0}, + "effect": {"blue_block": "table", "pink_block": "table"}, + }, + { + "condition": {"pink_block": "stacked_top", "red_block": "stacked_bottom", "grasped": 0}, + "effect": {"pink_block": "table", "red_block": "table"}, + }, + { + "condition": {"pink_block": "stacked_top", "blue_block": "stacked_bottom", "grasped": 0}, + "effect": {"pink_block": "table", "blue_block": "table"}, + }, + ], + "turn_on_lightbulb": [{"condition": {"lightbulb": 0, "grasped": 0}, "effect": {"lightbulb": 1}}], + "turn_off_lightbulb": [{"condition": {"lightbulb": 1, "grasped": 0}, "effect": {"lightbulb": 0}}], + "turn_on_led": [{"condition": {"led": 0, "grasped": 0}, "effect": {"led": 
1}}], + "turn_off_led": [{"condition": {"led": 1, "grasped": 0}, "effect": {"led": 0}}], + "push_into_drawer": [ + { + "condition": { + "red_block": "table", + "blue_block": ["slider_right", "slider_left"], + "pink_block": ["slider_right", "slider_left"], + "drawer": "open", + "grasped": 0, + }, + "effect": {"red_block": "drawer", "grasped": 0}, + }, + { + "condition": { + "blue_block": "table", + "red_block": ["slider_right", "slider_left"], + "pink_block": ["slider_right", "slider_left"], + "drawer": "open", + "grasped": 0, + }, + "effect": {"blue_block": "drawer", "grasped": 0}, + }, + { + "condition": { + "pink_block": "table", + "blue_block": ["slider_right", "slider_left"], + "red_block": ["slider_right", "slider_left"], + "drawer": "open", + "grasped": 0, + }, + "effect": {"pink_block": "drawer", "grasped": 0}, + }, + ], +} + + +def check_condition(state, condition): + for k, v in condition.items(): + if isinstance(v, (str, int)): + if not state[k] == v: + return False + elif isinstance(v, list): + if not state[k] in v: + return False + else: + raise TypeError + return True + + +def update_state(state, effect): + next_state = deepcopy(state) + for k, v in effect.items(): + next_state[k] = v + return next_state + + +def valid_task(curr_state, task): + next_states = [] + for _task in task: + if check_condition(curr_state, _task["condition"]): + next_state = update_state(curr_state, _task["effect"]) + next_states.append(next_state) + return next_states + + +def get_sequences_for_state(state, num_sequences=None): + state = deepcopy(state) + + seq_len = 5 + valid_seqs = [[] for x in range(seq_len)] + with temp_seed(0): + for step in range(seq_len): + for task_name, task in tasks.items(): + if step == 0: + for next_state in valid_task(state, task): + valid_seqs[0].append([(task_name, next_state)]) + else: + for seq in valid_seqs[step - 1]: + curr_state = seq[-1][1] + for next_state in valid_task(curr_state, task): + valid_seqs[step].append([*seq, (task_name, next_state)]) + + results = [] + result_set = [] + # set the numpy seed temporarily to 0 + + for seq in np.random.permutation(valid_seqs[-1]): + _seq = list(zip(*seq))[0] + categories = [task_categories[name] for name in _seq] + if len(categories) == len(set(categories)) and set(_seq) not in result_set: + results.append(_seq) + result_set.append(set(_seq)) + if num_sequences is not None: + results = results[:num_sequences] + return results + + +def check_sequence(state, seq): + for task_name in seq: + states = valid_task(state, tasks[task_name]) + if len(states) != 1: + return False + state = states[0] + categories = [task_categories[name] for name in seq] + return len(categories) == len(set(categories)) + + +def get_sequences_for_state2(args): + state, num_sequences, i = args + np.random.seed(i) + seq_len = 5 + results = [] + + while len(results) < num_sequences: + seq = np.random.choice(list(tasks.keys()), size=seq_len, replace=False) + if check_sequence(state, seq): + results.append(seq) + return results + + +def flatten(t): + return [tuple(item.tolist()) for sublist in t for item in sublist] + + +@functools.lru_cache +def get_sequences(num_sequences=1000, num_workers=None): + possible_conditions = { + "led": [0, 1], + "lightbulb": [0, 1], + "slider": ["right", "left"], + "drawer": ["closed", "open"], + "red_block": ["table", "slider_right", "slider_left"], + "blue_block": ["table", "slider_right", "slider_left"], + "pink_block": ["table", "slider_right", "slider_left"], + "grasped": [0], + } + + f = lambda l: l.count("table") in 
[1, 2] and l.count("slider_right") < 2 and l.count("slider_left") < 2 + value_combinations = filter(f, product(*possible_conditions.values())) + initial_states = [dict(zip(possible_conditions.keys(), vals)) for vals in value_combinations] + + num_sequences_per_state = list(map(len, np.array_split(range(num_sequences), len(initial_states)))) + logger.info("Start generating evaluation sequences.") + # set the numpy seed temporarily to 0 + with temp_seed(0): + num_workers = multiprocessing.cpu_count() if num_workers is None else num_workers + with ProcessPoolExecutor(max_workers=num_workers) as executor: + results = flatten( + executor.map( + get_sequences_for_state2, zip(initial_states, num_sequences_per_state, range(len(initial_states))) + ) + ) + results = list(zip(np.repeat(initial_states, num_sequences_per_state), results)) + np.random.shuffle(results) + logger.info("Done generating evaluation sequences.") + + return results + + +if __name__ == "__main__": + results = get_sequences(1000) + counters = [Counter() for _ in range(5)] # type: ignore + for initial_state, seq in results: + for i, task in enumerate(seq): + counters[i][task] += 1 + + for i, counter in enumerate(counters): + print(f"Task {i+1}") + print() + for task, freq in sorted(counter.items(), key=lambda x: x[1], reverse=True): + print(f"{task}: {freq / sum(counter.values()) * 100:.2f}") + print() + print() + + print("overall task probability:") + all_counters = functools.reduce(add, counters) + for task, freq in sorted(all_counters.items(), key=lambda x: x[1], reverse=True): + print(f"{task}: {freq / sum(all_counters.values()) * 100:.2f}") diff --git a/online_evaluation_rlbench/eval_act3d_gnfactor.sh b/online_evaluation_rlbench/eval_act3d_gnfactor.sh new file mode 100644 index 0000000..107c507 --- /dev/null +++ b/online_evaluation_rlbench/eval_act3d_gnfactor.sh @@ -0,0 +1,45 @@ +exp=act3d_gnfactor + +tasks=( + close_jar insert_onto_square_peg light_bulb_in meat_off_grill open_drawer place_shape_in_shape_sorter place_wine_at_rack_location push_buttons put_groceries_in_cupboard put_item_in_drawer put_money_in_safe reach_and_drag slide_block_to_color_target stack_blocks stack_cups sweep_to_dustpan_of_size turn_tap +) +data_dir=./data/peract/raw/test/ +num_episodes=100 +gripper_loc_bounds_file=tasks/18_peract_tasks_location_bounds.json +use_instruction=1 +max_tries=2 +verbose=1 +single_task_gripper_loc_bounds=0 +embedding_dim=120 +cameras="front" +seed=0 +checkpoint=train_logs/act3d_gnfactor.pth + +num_ckpts=${#tasks[@]} +for ((i=0; i<$num_ckpts; i++)); do + CUDA_LAUNCH_BLOCKING=1 python online_evaluation_rlbench/evaluate_policy.py \ + --tasks ${tasks[$i]} \ + --checkpoint $checkpoint \ + --num_history 1 \ + --test_model act3d \ + --cameras $cameras \ + --verbose $verbose \ + --action_dim 8 \ + --collision_checking 0 \ + --predict_trajectory 0 \ + --embedding_dim $embedding_dim \ + --rotation_parametrization "quat_from_query" \ + --single_task_gripper_loc_bounds $single_task_gripper_loc_bounds \ + --data_dir $data_dir \ + --num_episodes $num_episodes \ + --output_file eval_logs/$exp/seed$seed/${tasks[$i]}.json \ + --use_instruction $use_instruction \ + --instructions instructions/peract/instructions.pkl \ + --variations {0..60} \ + --max_tries $max_tries \ + --max_steps 20 \ + --seed $seed \ + --gripper_loc_bounds_file $gripper_loc_bounds_file \ + --gripper_loc_bounds_buffer 0.08 +done + diff --git a/online_evaluation_rlbench/eval_act3d_peract.sh b/online_evaluation_rlbench/eval_act3d_peract.sh new file mode 100644 index 
0000000..da32ddc --- /dev/null +++ b/online_evaluation_rlbench/eval_act3d_peract.sh @@ -0,0 +1,46 @@ +exp=act3d + +tasks=( + close_jar insert_onto_square_peg light_bulb_in meat_off_grill open_drawer place_shape_in_shape_sorter place_wine_at_rack_location push_buttons put_groceries_in_cupboard put_item_in_drawer put_money_in_safe reach_and_drag slide_block_to_color_target stack_blocks stack_cups sweep_to_dustpan_of_size turn_tap place_cups +) +data_dir=./data/peract/raw/test/ +num_episodes=100 +gripper_loc_bounds_file=tasks/18_peract_tasks_location_bounds.json +use_instruction=1 +max_tries=2 +verbose=1 +single_task_gripper_loc_bounds=0 +embedding_dim=120 +cameras="left_shoulder,right_shoulder,wrist,front" +seed=0 +checkpoint=train_logs/act3d_peract.pth +export PYTHONPATH="$PWD:$PYTHONPATH" + +num_ckpts=${#tasks[@]} +for ((i=0; i<$num_ckpts; i++)); do + CUDA_LAUNCH_BLOCKING=1 python online_evaluation_rlbench/evaluate_policy.py \ + --tasks ${tasks[$i]} \ + --checkpoint $checkpoint \ + --num_history 1 \ + --test_model act3d \ + --cameras $cameras \ + --verbose $verbose \ + --action_dim 8 \ + --collision_checking 0 \ + --predict_trajectory 0 \ + --embedding_dim $embedding_dim \ + --rotation_parametrization "quat_from_query" \ + --single_task_gripper_loc_bounds $single_task_gripper_loc_bounds \ + --data_dir $data_dir \ + --num_episodes $num_episodes \ + --output_file eval_logs/$exp/seed$seed/${tasks[$i]}.json \ + --use_instruction $use_instruction \ + --instructions instructions/peract/instructions.pkl \ + --variations {0..60} \ + --max_tries $max_tries \ + --max_steps 25 \ + --seed $seed \ + --gripper_loc_bounds_file $gripper_loc_bounds_file \ + --gripper_loc_bounds_buffer 0.04 +done + diff --git a/online_evaluation_rlbench/eval_equact_peract.sh b/online_evaluation_rlbench/eval_equact_peract.sh new file mode 100755 index 0000000..21e662f --- /dev/null +++ b/online_evaluation_rlbench/eval_equact_peract.sh @@ -0,0 +1,27 @@ +tashon online_evaluation_rlbench/evaluate_policy.py \ + --tasks ${tasks[$i]} \ + --checkpoint $checkpoint \ + --num_history 1 \ + --test_model act3d \ + --cameras $cameras \ + --verbose $verbose \ + --num_ghost_points_val 6000 \ + --action_dim 8 \ + --collision_checking 0 \ + --predict_trajectory 0 \ + --embedding_dim $embedding_dim \ + --rotation_parametrization "quat_from_query" \ + --single_task_gripper_loc_bounds $single_task_gripper_loc_bounds \ + --data_dir $data_dir \ + --num_episodes $num_episodes \ + --output_file eval_logs/$exp/seed$seed/${tasks[$i]}${num_checkpoint}.json \ + --use_instruction $use_instruction \ + --instructions instructions/peract/instructions.pkl \ + --variations {0..200} \ + --max_tries $max_tries \ + --max_steps 25 \ + --seed $seed \ + --gripper_loc_bounds_file $gripper_loc_bounds_file \ + --gripper_loc_bounds_buffer 0.04 +done + diff --git a/online_evaluation_rlbench/eval_gnfactor.sh b/online_evaluation_rlbench/eval_gnfactor.sh new file mode 100644 index 0000000..9943956 --- /dev/null +++ b/online_evaluation_rlbench/eval_gnfactor.sh @@ -0,0 +1,57 @@ +exp=3d_diffuser_actor_gnfactor + +tasks=( + close_jar insert_onto_square_peg light_bulb_in meat_off_grill open_drawer place_shape_in_shape_sorter place_wine_at_rack_location push_buttons put_groceries_in_cupboard put_item_in_drawer put_money_in_safe reach_and_drag slide_block_to_color_target stack_blocks stack_cups sweep_to_dustpan_of_size turn_tap +) +data_dir=./data/peract/raw/test/ +num_episodes=100 +gripper_loc_bounds_file=tasks/18_peract_tasks_location_bounds.json +use_instruction=1 
+max_tries=2 +verbose=1 +interpolation_length=2 +single_task_gripper_loc_bounds=0 +embedding_dim=120 +cameras="front" +fps_subsampling_factor=5 +lang_enhanced=0 +relative_action=0 +seed=0 +checkpoint=train_logs/diffuser_actor_gnfactor.pth +quaternion_format=wxyz + +num_ckpts=${#tasks[@]} +for ((i=0; i<$num_ckpts; i++)); do + CUDA_LAUNCH_BLOCKING=1 python online_evaluation_rlbench/evaluate_policy.py \ + --tasks ${tasks[$i]} \ + --checkpoint $checkpoint \ + --diffusion_timesteps 100 \ + --fps_subsampling_factor $fps_subsampling_factor \ + --lang_enhanced $lang_enhanced \ + --relative_action $relative_action \ + --num_history 3 \ + --test_model 3d_diffuser_actor \ + --cameras $cameras \ + --verbose $verbose \ + --action_dim 8 \ + --collision_checking 0 \ + --predict_trajectory 1 \ + --embedding_dim $embedding_dim \ + --rotation_parametrization "6D" \ + --single_task_gripper_loc_bounds $single_task_gripper_loc_bounds \ + --data_dir $data_dir \ + --num_episodes $num_episodes \ + --output_file eval_logs/$exp/seed$seed/${tasks[$i]}.json \ + --use_instruction $use_instruction \ + --instructions instructions/peract/instructions.pkl \ + --variations {0..60} \ + --max_tries $max_tries \ + --max_steps 20 \ + --seed $seed \ + --gripper_loc_bounds_file $gripper_loc_bounds_file \ + --gripper_loc_bounds_buffer 0.08 \ + --quaternion_format $quaternion_format \ + --interpolation_length $interpolation_length \ + --dense_interpolation 1 +done + diff --git a/online_evaluation_rlbench/eval_peract.sh b/online_evaluation_rlbench/eval_peract.sh new file mode 100755 index 0000000..87dde44 --- /dev/null +++ b/online_evaluation_rlbench/eval_peract.sh @@ -0,0 +1,61 @@ +tasks=( + insert_onto_square_peg + ) +data_dir=/media/zxp/large/project_data/SE3_bi_equ_data/peract/raw/test/ +num_episodes=25 +gripper_loc_bounds_file=tasks/18_peract_tasks_location_bounds.json +use_instruction=1 +max_tries=2 +verbose=1 +interpolation_length=2 +single_task_gripper_loc_bounds=0 +embedding_dim=120 +cameras="left_shoulder,right_shoulder,wrist,front" +fps_subsampling_factor=5 +lang_enhanced=0 +relative_action=0 +seed=0 +#checkpoint=/home/zxp/projects/3d_diffuser_actor/train_logs/Actor_18Peract_100Demo_multitask/diffusion_multitask-peg-original/last.pth +quaternion_format=xyzw # for local training +quaternion_format=wxyz # for pretrianed weight +export PYTHONPATH="$PWD:$PYTHONPATH" +num_ckpts=${#tasks[@]} +exp=3DDA +#for ckp in 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 55000 60000; do +for ckp in 0; do +#checkpoint=train_logs/$exp/diffusion_multitask_peg_ori_60k/${ckp}.pth +checkpoint=train_logs/Actor_18Peract_100Demo_multitask/diffusion_multitask-C120-B8-lr1e-4-DI1-2-H3-DT100/diffuser_actor_peract.pth +for ((i=0; i<$num_ckpts; i++)); do + CUDA_LAUNCH_BLOCKING=1 python online_evaluation_rlbench/evaluate_policy.py \ + --tasks ${tasks[$i]} \ + --checkpoint $checkpoint \ + --diffusion_timesteps 100 \ + --fps_subsampling_factor $fps_subsampling_factor \ + --lang_enhanced $lang_enhanced \ + --relative_action $relative_action \ + --num_history 3 \ + --test_model 3d_diffuser_actor \ + --cameras $cameras \ + --verbose $verbose \ + --action_dim 8 \ + --collision_checking 0 \ + --predict_trajectory 1 \ + --embedding_dim $embedding_dim \ + --rotation_parametrization "6D" \ + --single_task_gripper_loc_bounds $single_task_gripper_loc_bounds \ + --data_dir $data_dir \ + --num_episodes $num_episodes \ + --output_file eval_logs/$exp/seed$seed/${tasks[$i]}${ckp}.json \ + --use_instruction $use_instruction \ + --instructions 
instructions/peract/instructions.pkl \ + --variations {0..199} \ + --max_tries $max_tries \ + --max_steps 25 \ + --seed $seed \ + --gripper_loc_bounds_file $gripper_loc_bounds_file \ + --gripper_loc_bounds_buffer 0.04 \ + --quaternion_format $quaternion_format \ + --interpolation_length $interpolation_length \ + --dense_interpolation 1 +done +done diff --git a/online_evaluation_rlbench/evaluate_policy.py b/online_evaluation_rlbench/evaluate_policy.py new file mode 100644 index 0000000..433deb4 --- /dev/null +++ b/online_evaluation_rlbench/evaluate_policy.py @@ -0,0 +1,242 @@ +"""Online evaluation script on RLBench.""" +import random +from typing import Tuple, Optional +from pathlib import Path +import json +import os + +import torch +import numpy as np +import tap + +from diffuser_actor.equ_act_optimization.equ_act import EquAct +from diffuser_actor.keypose_optimization.act3d import Act3D +from diffuser_actor.trajectory_optimization.diffuser_actor import DiffuserActor +from utils.common_utils import ( + load_instructions, + get_gripper_loc_bounds, + round_floats +) +from utils.utils_with_rlbench import RLBenchEnv, Actioner, load_episodes + + +class Arguments(tap.Tap): + checkpoint: Path = "" + seed: int = 2 + device: str = "cuda" + num_episodes: int = 1 + headless: int = 0 + max_tries: int = 10 + tasks: Optional[Tuple[str, ...]] = None + instructions: Optional[Path] = "instructions.pkl" + variations: Tuple[int, ...] = (-1,) + data_dir: Path = Path(__file__).parent / "demos" + cameras: Tuple[str, ...] = ("left_shoulder", "right_shoulder", "wrist") + image_size: str = "256,256" + verbose: int = 0 + output_file: Path = Path(__file__).parent / "eval.json" + max_steps: int = 25 + test_model: str = "3d_diffuser_actor" + collision_checking: int = 0 + gripper_loc_bounds_file: str = "tasks/74_hiveformer_tasks_location_bounds.json" + gripper_loc_bounds_buffer: float = 0.04 + single_task_gripper_loc_bounds: int = 0 + predict_trajectory: int = 1 + + # Act3D model parameters + num_query_cross_attn_layers: int = 2 + num_ghost_point_cross_attn_layers: int = 2 + num_ghost_points: int = 10000 + num_ghost_points_val: int = 10000 + weight_tying: int = 1 + gp_emb_tying: int = 1 + num_sampling_level: int = 3 + fine_sampling_ball_diameter: float = 0.16 + regress_position_offset: int = 0 + + # 3D Diffuser Actor model parameters + diffusion_timesteps: int = 100 + num_history: int = 3 + fps_subsampling_factor: int = 5 + lang_enhanced: int = 0 + dense_interpolation: int = 1 + interpolation_length: int = 2 + relative_action: int = 0 + + # Shared model parameters + action_dim: int = 8 + backbone: str = "clip" # one of "resnet", "clip" + embedding_dim: int = 120 + num_vis_ins_attn_layers: int = 2 + use_instruction: int = 1 + rotation_parametrization: str = '6D' + quaternion_format: str = 'xyzw' + + +def load_models(args): + device = torch.device(args.device) + + print("Loading model from", args.checkpoint, flush=True) + + # Gripper workspace is the union of workspaces for all tasks + if args.single_task_gripper_loc_bounds and len(args.tasks) == 1: + task = args.tasks[0] + else: + task = None + print('Gripper workspace') + gripper_loc_bounds = get_gripper_loc_bounds( + args.gripper_loc_bounds_file, + task=task, buffer=args.gripper_loc_bounds_buffer, + ) + + if args.test_model == "act3d": + model = EquAct( + backbone=args.backbone, + image_size=tuple(int(x) for x in args.image_size.split(",")), + embedding_dim=args.embedding_dim, + num_ghost_point_cross_attn_layers=( + args.num_ghost_point_cross_attn_layers), + 
num_query_cross_attn_layers=( + args.num_query_cross_attn_layers), + num_vis_ins_attn_layers=( + args.num_vis_ins_attn_layers), + rotation_parametrization=args.rotation_parametrization, + gripper_loc_bounds=gripper_loc_bounds, + num_ghost_points=args.num_ghost_points, + num_ghost_points_val=args.num_ghost_points_val, + weight_tying=bool(args.weight_tying), + gp_emb_tying=bool(args.gp_emb_tying), + num_sampling_level=args.num_sampling_level, + fine_sampling_ball_diameter=( + args.fine_sampling_ball_diameter), + regress_position_offset=bool( + args.regress_position_offset), + use_instruction=bool(args.use_instruction) + ).to(device) + elif args.test_model == "3d_diffuser_actor": + model = DiffuserActor( + backbone=args.backbone, + image_size=tuple(int(x) for x in args.image_size.split(",")), + embedding_dim=args.embedding_dim, + num_vis_ins_attn_layers=args.num_vis_ins_attn_layers, + use_instruction=bool(args.use_instruction), + fps_subsampling_factor=args.fps_subsampling_factor, + gripper_loc_bounds=gripper_loc_bounds, + rotation_parametrization=args.rotation_parametrization, + quaternion_format=args.quaternion_format, + diffusion_timesteps=args.diffusion_timesteps, + nhist=args.num_history, + relative=bool(args.relative_action), + lang_enhanced=bool(args.lang_enhanced), + ) + elif args.test_model == "act3d": + model = Act3D( + backbone=args.backbone, + image_size=tuple(int(x) for x in args.image_size.split(",")), + embedding_dim=args.embedding_dim, + num_ghost_point_cross_attn_layers=( + args.num_ghost_point_cross_attn_layers), + num_query_cross_attn_layers=( + args.num_query_cross_attn_layers), + num_vis_ins_attn_layers=( + args.num_vis_ins_attn_layers), + rotation_parametrization=args.rotation_parametrization, + gripper_loc_bounds=gripper_loc_bounds, + num_ghost_points=args.num_ghost_points, + num_ghost_points_val=args.num_ghost_points_val, + weight_tying=bool(args.weight_tying), + gp_emb_tying=bool(args.gp_emb_tying), + num_sampling_level=args.num_sampling_level, + fine_sampling_ball_diameter=( + args.fine_sampling_ball_diameter), + regress_position_offset=bool( + args.regress_position_offset), + use_instruction=bool(args.use_instruction) + ).to(device) + else: + raise NotImplementedError + + # Load model weights + model_dict = torch.load(args.checkpoint, map_location="cpu") + model_dict_weight = {} + for key in model_dict["weight"]: + _key = key[7:] + model_dict_weight[_key] = model_dict["weight"][key] + model.load_state_dict(model_dict_weight) + model.eval() + + return model + + +if __name__ == "__main__": + # Arguments + args = Arguments().parse_args() + args.cameras = tuple(x for y in args.cameras for x in y.split(",")) + print("Arguments:") + print(args) + print("-" * 100) + # Save results here + os.makedirs(os.path.dirname(args.output_file), exist_ok=True) + + # Seeds + torch.manual_seed(args.seed) + np.random.seed(args.seed) + random.seed(args.seed) + + # Load models + model = load_models(args) + + # Load RLBench environment + env = RLBenchEnv( + data_path=args.data_dir, + image_size=[int(x) for x in args.image_size.split(",")], + apply_rgb=True, + apply_pc=True, + headless=bool(args.headless), + apply_cameras=args.cameras, + collision_checking=bool(args.collision_checking) + ) + + instruction = load_instructions(args.instructions) + if instruction is None: + raise NotImplementedError() + + actioner = Actioner( + policy=model, + instructions=instruction, + apply_cameras=args.cameras, + action_dim=args.action_dim, + predict_trajectory=bool(args.predict_trajectory) + ) + 
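The checkpoint loader in `load_models` above reads a dictionary of the form `{"weight": state_dict}` and strips the first seven characters of every key before calling `load_state_dict`. Below is a minimal sketch of that step, assuming (this is not stated in the diff) that those seven characters are the `module.` prefix that `torch.nn.DataParallel` / `DistributedDataParallel` prepends at training time; the helper name `load_unwrapped_state_dict` is illustrative and not part of the repository.

```python
import torch


def load_unwrapped_state_dict(model: torch.nn.Module, checkpoint_path: str) -> None:
    """Load a {"weight": state_dict} checkpoint, stripping a wrapper prefix if present."""
    model_dict = torch.load(checkpoint_path, map_location="cpu")
    cleaned = {
        # key[7:] in the script assumes every key starts with "module." (7 characters);
        # guarding with startswith also handles checkpoints saved without the wrapper.
        (key[len("module."):] if key.startswith("module.") else key): value
        for key, value in model_dict["weight"].items()
    }
    model.load_state_dict(cleaned)
    model.eval()
```

Only the key names change here; the tensor values themselves are loaded unmodified.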
max_eps_dict = load_episodes()["max_episode_length"] + task_success_rates = {} + + for task_str in args.tasks: + var_success_rates = env.evaluate_task_on_multiple_variations( + task_str, + max_steps=( + max_eps_dict[task_str] if args.max_steps == -1 + else args.max_steps + ), + num_variations=args.variations[-1] + 1, + num_demos=args.num_episodes, + actioner=actioner, + max_tries=args.max_tries, + dense_interpolation=bool(args.dense_interpolation), + interpolation_length=args.interpolation_length, + verbose=bool(args.verbose), + num_history=args.num_history + ) + print() + print( + f"{task_str} variation success rates:", + round_floats(var_success_rates) + ) + print( + f"{task_str} mean success rate:", + round_floats(var_success_rates["mean"]) + ) + + task_success_rates[task_str] = var_success_rates + with open(args.output_file, "w") as f: + json.dump(round_floats(task_success_rates), f, indent=4) diff --git a/online_evaluation_rlbench/server_eval_equact_peract.sh b/online_evaluation_rlbench/server_eval_equact_peract.sh new file mode 100755 index 0000000..58169d9 --- /dev/null +++ b/online_evaluation_rlbench/server_eval_equact_peract.sh @@ -0,0 +1,38 @@ +tasks=(place_wine_at_rack_location) +data_dir=./data/peract/raw/test/ +num_episodes=10 +gripper_loc_eract_tasks_location_bounds.json +use_instruction=1 +max_tries=2 +verbose=1 +single_task_g=120 +cameras="left_ATH="$PWD:$PYTHONPATH" +num_ckpts=${#tasks[@]} +exp=equact +checkpoint=train_lA_VISIBLE_DEVICES=0 python online_evaluation_rlbench/evaluate_policy.py \ + --tasks ${tasks[$i]} \ + --checkpoint $checkpoint \ + --num_history 1 \ + --test_model act3d \ + --cameras $cameras \ + --verbose $verbose \ + --num_ghost_points_val 10000 \ + --action_dim 8 \ + --collision_checking 0 \ + --predict_trajectory 0 \ + --embedding_dim $embedding_dim \ + --rotation_parametrization "quat_from_query" \ + --single_task_gripper_loc_bounds $single_task_gripper_loc_bounds \ + --data_dir $data_dir \ + --num_episodes $num_episodes \ + --output_file eval_logs/$exp/seed$seed/${tasks[$i]}.json \ + --use_instruction $use_instruction \ + --instructions instructions/peract/instructions.pkl \ + --variations {0..60} \ + --max_tries $max_tries \ + --max_steps 25 \ + --seed $seed \ + --gripper_loc_bounds_file $gripper_loc_bounds_file \ + --gripper_loc_bounds_buffer 0.04 +done + diff --git a/setup.py b/setup.py new file mode 100644 index 0000000..a9758da --- /dev/null +++ b/setup.py @@ -0,0 +1,26 @@ +import os.path + +from setuptools import setup + + +core_requirements = [ + "einops", + "torch", + "numpy", + "torchvision", + "diffusers", + "dgl", + "flash_attn", +] + +setup(name='diffuser_actor', + version='0.1', + description='3D Diffuser Actor', + author='Nikolaos Gkanatsios', + author_email='ngkanats@cs.cmu.edu', + url='https://nickgkan.github.io/', + install_requires=core_requirements, + packages=[ + 'diffuser_actor', + ], +) diff --git a/tasks/18_peract_tasks_location_bounds.json b/tasks/18_peract_tasks_location_bounds.json new file mode 100644 index 0000000..e2227d9 --- /dev/null +++ b/tasks/18_peract_tasks_location_bounds.json @@ -0,0 +1,206 @@ +{ + "close_jar": [ + [ + -0.07010312378406525, + -0.33692488074302673, + 0.761122465133667 + ], + [ + 0.525260329246521, + 0.40034544467926025, + 0.9308611154556274 + ] + ], + "insert_onto_square_peg": [ + [ + -0.059705402702093124, + -0.4115047752857208, + 0.752871036529541 + ], + [ + 0.4984903931617737, + 0.37613850831985474, + 0.9694802165031433 + ] + ], + "light_bulb_in": [ + [ + 0.001793097471818328, + 
-0.3829135000705719, + 0.8298617005348206 + ], + [ + 0.5069130063056946, + 0.39087292551994324, + 1.1727949380874634 + ] + ], + "meat_off_grill": [ + [ + -0.04940209910273552, + -0.36016079783439636, + 1.0511349439620972 + ], + [ + 0.4341655671596527, + 0.3369610607624054, + 1.204141616821289 + ] + ], + "open_drawer": [ + [ + 0.10261330753564835, + -0.06472617387771606, + 0.9140162467956543 + ], + [ + 0.3657068908214569, + 0.4357367157936096, + 1.2285298109054565 + ] + ], + "place_shape_in_shape_sorter": [ + [ + -0.024208372458815575, + -0.3702586591243744, + 0.7619463205337524 + ], + [ + 0.5094270706176758, + 0.3993597626686096, + 0.9559605121612549 + ] + ], + "place_wine_at_rack_location": [ + [ + 0.024810487404465675, + -0.4179409146308899, + 0.9237268567085266 + ], + [ + 0.4800413250923157, + 0.21476222574710846, + 0.9975715279579163 + ] + ], + "push_buttons": [ + [ + 0.025005526840686798, + -0.35272449254989624, + 0.7704468369483948 + ], + [ + 0.4750578701496124, + 0.3518211841583252, + 0.8459218144416809 + ] + ], + "put_groceries_in_cupboard": [ + [ + -0.028621509671211243, + -0.5157904028892517, + 0.7696740627288818 + ], + [ + 0.6081326603889465, + 0.40632256865501404, + 1.39157235622406 + ] + ], + "put_item_in_drawer": [ + [ + 0.07061523199081421, + -0.22724466025829315, + 0.91417396068573 + ], + [ + 0.3534654974937439, + 0.4783594608306885, + 1.471603512763977 + ] + ], + "put_money_in_safe": [ + [ + -0.05866488814353943, + -0.2698633372783661, + 0.788620114326477 + ], + [ + 0.2451273500919342, + 0.41097840666770935, + 1.2644104957580566 + ] + ], + "reach_and_drag": [ + [ + 0.04688065126538277, + -0.35880398750305176, + 0.8532240390777588 + ], + [ + 0.449143648147583, + 0.3613968789577484, + 0.9964548945426941 + ] + ], + "slide_block_to_color_target": [ + [ + -0.018425436690449715, + -0.3899671733379364, + 0.7591497898101807 + ], + [ + 0.5140739679336548, + 0.37699267268180847, + 0.779471755027771 + ] + ], + "stack_blocks": [ + [ + -0.03135331720113754, + -0.39724862575531006, + 0.7737647294998169 + ], + [ + 0.5072773694992065, + 0.4063461422920227, + 1.0446953773498535 + ] + ], + "stack_cups": [ + [ + -0.0322563461959362, + -0.3579411506652832, + 0.8703824281692505 + ], + [ + 0.4745761752128601, + 0.413276731967926, + 1.068880558013916 + ] + ], + "sweep_to_dustpan_of_size": [ + [ + 0.0757683739066124, + -0.30235686898231506, + 1.0714032649993896 + ], + [ + 0.43045830726623535, + 0.3017929792404175, + 1.1362435817718506 + ] + ], + "turn_tap": [ + [ + -0.01380503922700882, + -0.1964014321565628, + 0.884225606918335 + ], + [ + 0.3865160644054413, + 0.20687243342399597, + 0.904780924320221 + ] + ] +} \ No newline at end of file diff --git a/tasks/18_peract_tasks_traj_location_bounds.json b/tasks/18_peract_tasks_traj_location_bounds.json new file mode 100644 index 0000000..f369a69 --- /dev/null +++ b/tasks/18_peract_tasks_traj_location_bounds.json @@ -0,0 +1,218 @@ +{ + "place_cups": [ + [ + -0.1252501755952835, + -0.7335996031761169, + 0.7846644520759583 + ], + [ + 0.5250778794288635, + 0.5957655310630798, + 1.647340178489685 + ] + ], + "close_jar": [ + [ + -0.06962238252162933, + -0.3390176296234131, + 0.7580129504203796 + ], + [ + 0.4743981659412384, + 0.40305784344673157, + 1.4720525741577148 + ] + ], + "insert_onto_square_peg": [ + [ + -0.04692745581269264, + -0.34305936098098755, + 0.7528713941574097 + ], + [ + 0.4967808425426483, + 0.3542478084564209, + 1.4720430374145508 + ] + ], + "light_bulb_in": [ + [ + -0.0338476225733757, + -0.533994734287262, + 0.8294161558151245 + ], + [ 
+ 0.5137674808502197, + 0.39591047167778015, + 1.6218366622924805 + ] + ], + "meat_off_grill": [ + [ + -0.03323213756084442, + -0.3421420454978943, + 1.0510294437408447 + ], + [ + 0.4484463930130005, + 0.3519609868526459, + 1.542190670967102 + ] + ], + "open_drawer": [ + [ + 0.012617605738341808, + -0.06152236461639404, + 0.8662055134773254 + ], + [ + 0.38047805428504944, + 0.44173672795295715, + 1.511168122291565 + ] + ], + "place_shape_in_shape_sorter": [ + [ + -0.10743069648742676, + -0.389064759016037, + 0.7574412822723389 + ], + [ + 0.4931349456310272, + 0.49367135763168335, + 1.4727647304534912 + ] + ], + "place_wine_at_rack_location": [ + [ + 0.006204614881426096, + -0.44704610109329224, + 0.8617715835571289 + ], + [ + 0.48758140206336975, + 0.19787049293518066, + 1.480934977531433 + ] + ], + "push_buttons": [ + [ + 0.021184002980589867, + -0.36834508180618286, + 0.7702418565750122 + ], + [ + 0.484455406665802, + 0.3391384184360504, + 1.4720654487609863 + ] + ], + "put_groceries_in_cupboard": [ + [ + -0.3042418658733368, + -0.5136904120445251, + 0.7632160186767578 + ], + [ + 0.5890765190124512, + 0.7268779873847961, + 1.7139211893081665 + ] + ], + "put_item_in_drawer": [ + [ + -0.050217170268297195, + -0.23539048433303833, + 0.9012432098388672 + ], + [ + 0.3524276316165924, + 0.6159590482711792, + 1.5808026790618896 + ] + ], + "put_money_in_safe": [ + [ + -0.04701787978410721, + -0.28717517852783203, + 0.7885763049125671 + ], + [ + 0.2786584496498108, + 0.42030829191207886, + 1.4720394611358643 + ] + ], + "reach_and_drag": [ + [ + 0.047469839453697205, + -0.351046621799469, + 0.8530889749526978 + ], + [ + 0.4512064456939697, + 0.35267987847328186, + 1.4765229225158691 + ] + ], + "slide_block_to_color_target": [ + [ + -0.09089697897434235, + -0.3527182638645172, + 0.7487191557884216 + ], + [ + 0.5151700973510742, + 0.3649747371673584, + 1.4720333814620972 + ] + ], + "stack_blocks": [ + [ + -0.2030583769083023, + -0.4176129400730133, + 0.7713249325752258 + ], + [ + 0.5036024451255798, + 0.49026909470558167, + 1.4720449447631836 + ] + ], + "stack_cups": [ + [ + 0.02441444993019104, + -0.383113831281662, + 0.8532769680023193 + ], + [ + 0.5056630969047546, + 0.39290016889572144, + 1.4720371961593628 + ] + ], + "sweep_to_dustpan_of_size": [ + [ + 0.06897379457950592, + -0.30288058519363403, + 1.0708848237991333 + ], + [ + 0.4295439124107361, + 0.30295315384864807, + 1.4720780849456787 + ] + ], + "turn_tap": [ + [ + -0.03261891007423401, + -0.5482321381568909, + 0.7561731934547424 + ], + [ + 0.4150252938270569, + 0.5150893926620483, + 1.742407202720642 + ] + ] +} \ No newline at end of file diff --git a/tasks/74_hiveformer_tasks_location_bounds.json b/tasks/74_hiveformer_tasks_location_bounds.json new file mode 100644 index 0000000..7d4dc6d --- /dev/null +++ b/tasks/74_hiveformer_tasks_location_bounds.json @@ -0,0 +1,854 @@ +{ + "close_drawer": [ + [ + 0.05520801246166229, + -0.04362984001636505, + 0.913456380367279 + ], + [ + 0.3847907483577728, + 0.4153033494949341, + 0.9183140397071838 + ] + ], + "close_fridge": [ + [ + 0.12962226569652557, + -0.2261878252029419, + 1.1533762216567993 + ], + [ + 0.5694770216941833, + 0.5198359489440918, + 1.404355764389038 + ] + ], + "close_microwave": [ + [ + -0.007069115526974201, + -0.32004350423812866, + 0.9799190163612366 + ], + [ + 0.514249861240387, + 0.38787585496902466, + 1.011217474937439 + ] + ], + "lamp_off": [ + [ + -0.0027194598224014044, + -0.27229809761047363, + 0.7697545289993286 + ], + [ + 0.5214748978614807, + 0.2507045269012451, + 
0.8458210229873657 + ] + ], + "press_switch": [ + [ + -0.13941217958927155, + -0.34383150935173035, + 1.0101796388626099 + ], + [ + 0.6251123547554016, + 0.3364018201828003, + 1.0153617858886719 + ] + ], + "push_button": [ + [ + -0.01055082306265831, + -0.39317312836647034, + 0.7696011066436768 + ], + [ + 0.5058262348175049, + 0.38404375314712524, + 0.8471043109893799 + ] + ], + "close_door": [ + [ + 0.343924343585968, + -0.26788851618766785, + 1.2165775299072266 + ], + [ + 0.6564226746559143, + 0.39716383814811707, + 1.2285727262496948 + ] + ], + "lamp_on": [ + [ + -0.019794438034296036, + -0.2584077715873718, + 0.769023597240448 + ], + [ + 0.45888689160346985, + 0.33079615235328674, + 0.9615674018859863 + ] + ], + "lift_numbered_block": [ + [ + -0.00807125959545374, + -0.2585907280445099, + 0.7741650342941284 + ], + [ + 0.3580719530582428, + 0.2405797690153122, + 0.9986135363578796 + ] + ], + "open_box": [ + [ + 0.07366747409105301, + -0.4549475610256195, + 0.9025465250015259 + ], + [ + 0.41112571954727173, + 0.27450403571128845, + 1.1480439901351929 + ] + ], + "open_drawer": [ + [ + 0.10380099713802338, + -0.057057853788137436, + 0.913453221321106 + ], + [ + 0.35965532064437866, + 0.3861672878265381, + 0.9157843589782715 + ] + ], + "open_fridge": [ + [ + 0.10825006663799286, + -0.422019362449646, + 1.2701553106307983 + ], + [ + 0.5726563930511475, + 0.47571462392807007, + 1.3042691946029663 + ] + ], + "open_grill": [ + [ + 0.058253925293684006, + -0.3733425438404083, + 1.058421015739441 + ], + [ + 0.44918161630630493, + 0.30944257974624634, + 1.2623568773269653 + ] + ], + "open_microwave": [ + [ + -0.18643583357334137, + -0.4803631007671356, + 0.8878881931304932 + ], + [ + 0.5407922863960266, + 0.4031131863594055, + 0.8912981748580933 + ] + ], + "open_wine_bottle": [ + [ + -0.08063027262687683, + -0.4649062752723694, + 1.0382683277130127 + ], + [ + 0.45242801308631897, + 0.4818035662174225, + 1.1218299865722656 + ] + ], + "pick_up_cup": [ + [ + -0.022600434720516205, + -0.3359810411930084, + 0.8704742789268494 + ], + [ + 0.46342408657073975, + 0.40098345279693604, + 1.1301686763763428 + ] + ], + "play_jenga": [ + [ + -0.006246957927942276, + -0.19559571146965027, + 0.863406240940094 + ], + [ + 0.4334056079387665, + 0.4303490221500397, + 0.865523099899292 + ] + ], + "basketball_in_hoop": [ + [ + -0.023083284497261047, + -0.22607305645942688, + 0.792326033115387 + ], + [ + 0.4755136966705322, + 0.25571566820144653, + 1.345174789428711 + ] + ], + "beat_the_buzz": [ + [ + -0.021758683025836945, + -0.38458719849586487, + 0.8241182565689087 + ], + [ + 0.45031118392944336, + 0.37848544120788574, + 0.9231715202331543 + ] + ], + "change_clock": [ + [ + -0.0699954479932785, + -0.5243577361106873, + 0.9816225171089172 + ], + [ + 0.4077199697494507, + 0.4329493045806885, + 1.214512825012207 + ] + ], + "close_grill": [ + [ + 0.0524887889623642, + -0.3383767008781433, + 1.0845285654067993 + ], + [ + 0.47876015305519104, + 0.3101511597633362, + 1.472058892250061 + ] + ], + "close_laptop_lid": [ + [ + 0.011848852038383484, + -0.38256341218948364, + 0.8772374391555786 + ], + [ + 0.3688029646873474, + 0.4147731065750122, + 1.1045390367507935 + ] + ], + "hang_frame_on_hanger": [ + [ + 0.043214116245508194, + -0.28521138429641724, + 0.9211452007293701 + ], + [ + 0.44389939308166504, + 0.31135404109954834, + 1.127386212348938 + ] + ], + "open_door": [ + [ + 0.2319784015417099, + -0.23808732628822327, + 1.2023106813430786 + ], + [ + 0.663882851600647, + 0.477662056684494, + 1.2290065288543701 + ] + ], + 
"open_window": [ + [ + 0.3415573835372925, + -0.050131309777498245, + 1.4462834596633911 + ], + [ + 0.5942109823226929, + 0.4595590829849243, + 1.471477746963501 + ] + ], + "pick_and_lift": [ + [ + -0.031366169452667236, + -0.49629777669906616, + 0.7737409472465515 + ], + [ + 0.5079741477966309, + 0.4650239944458008, + 0.9998928904533386 + ] + ], + "pick_and_lift_small": [ + [ + -0.023089151829481125, + -0.4800979793071747, + 0.7633732557296753 + ], + [ + 0.5317438244819641, + 0.5182185769081116, + 1.0001109838485718 + ] + ], + "put_knife_on_chopping_board": [ + [ + -0.07565971463918686, + -0.38134026527404785, + 0.7774950265884399 + ], + [ + 0.5557281374931335, + 0.3550781309604645, + 1.1006814241409302 + ] + ], + "put_rubbish_in_bin": [ + [ + 0.054013825953006744, + -0.3895466923713684, + 0.7692569494247437 + ], + [ + 0.4595278799533844, + 0.3992043733596802, + 1.005134105682373 + ] + ], + "put_umbrella_in_umbrella_stand": [ + [ + -0.22110864520072937, + -0.4743722677230835, + 0.9437026381492615 + ], + [ + 0.3887127935886383, + 0.3742111325263977, + 1.2623322010040283 + ] + ], + "close_box": [ + [ + 0.11078422516584396, + -0.3795444071292877, + 1.02414870262146 + ], + [ + 0.38548916578292847, + 0.35642939805984497, + 1.1775909662246704 + ] + ], + "insert_onto_square_peg": [ + [ + -0.027013996616005898, + -0.3489493429660797, + 0.7529370784759521 + ], + [ + 0.492866575717926, + 0.3602404296398163, + 0.9688409566879272 + ] + ], + "insert_usb_in_computer": [ + [ + 0.04595436900854111, + 0.06526651233434677, + 0.8526456356048584 + ], + [ + 0.4724811911582947, + 0.40549588203430176, + 0.8763399720191956 + ] + ], + "meat_off_grill": [ + [ + -0.057705171406269073, + -0.3351640999317169, + 1.051164150238037 + ], + [ + 0.4716114103794098, + 0.33802372217178345, + 1.204638123512268 + ] + ], + "meat_on_grill": [ + [ + 0.06563456356525421, + -0.27733394503593445, + 1.0740067958831787 + ], + [ + 0.447649210691452, + 0.3214495778083801, + 1.1979222297668457 + ] + ], + "move_hanger": [ + [ + 0.38034331798553467, + -0.3899763822555542, + 1.1142523288726807 + ], + [ + 0.46991047263145447, + 0.3774334788322449, + 1.1950732469558716 + ] + ], + "open_oven": [ + [ + 0.04770670831203461, + -0.24700629711151123, + 0.84061199426651 + ], + [ + 0.4726908802986145, + 0.2337438017129898, + 1.1464097499847412 + ] + ], + "phone_on_base": [ + [ + -0.16709405183792114, + -0.37052440643310547, + 0.7955707311630249 + ], + [ + 0.5250019431114197, + 0.3732898533344269, + 0.9100998044013977 + ] + ], + "place_hanger_on_rack": [ + [ + -0.003252629190683365, + -0.3548135757446289, + 0.8912384510040283 + ], + [ + 0.4941902160644531, + 0.3287307024002075, + 1.1676400899887085 + ] + ], + "place_shape_in_shape_sorter": [ + [ + -0.0023695684503763914, + -0.3471399247646332, + 0.7735769152641296 + ], + [ + 0.5193414688110352, + 0.34918704628944397, + 0.9548373222351074 + ] + ], + "plug_charger_in_power_supply": [ + [ + -0.046957697719335556, + -0.29485616087913513, + 0.7860503196716309 + ], + [ + 0.4539012908935547, + 0.3596971035003662, + 0.9706856608390808 + ] + ], + "put_books_on_bookshelf": [ + [ + 0.013554506003856659, + -0.3623334765434265, + 1.0726959705352783 + ], + [ + 0.5092811584472656, + 0.35812243819236755, + 1.4607254266738892 + ] + ], + "put_money_in_safe": [ + [ + -0.05646437406539917, + -0.3174158036708832, + 0.7883930206298828 + ], + [ + 0.2536238431930542, + 0.41792258620262146, + 1.2646543979644775 + ] + ], + "push_buttons": [ + [ + 0.03105316124856472, + -0.33636635541915894, + 0.7705590128898621 + ], + [ + 
0.5008437633514404, + 0.31735536456108093, + 0.8444609642028809 + ] + ], + "change_channel": [ + [ + 0.026051633059978485, + -0.287720650434494, + 0.7575168609619141 + ], + [ + 0.43372923135757446, + 0.23768047988414764, + 0.8430830240249634 + ] + ], + "reach_target": [ + [ + 0.02677992545068264, + -0.29710718989372253, + 0.7775897979736328 + ], + [ + 0.4708552956581116, + 0.31881025433540344, + 1.21543550491333 + ] + ], + "slide_block_to_target": [ + [ + -0.03281932696700096, + -0.4474794864654541, + 0.7598549723625183 + ], + [ + 0.522365391254425, + 0.4317557215690613, + 0.7789291143417358 + ] + ], + "take_usb_out_of_computer": [ + [ + 0.02406327798962593, + -0.48011332750320435, + 0.8749684691429138 + ], + [ + 0.4298414885997772, + 0.3747768998146057, + 0.8763575553894043 + ] + ], + "turn_tap": [ + [ + 0.09151671826839447, + -0.20481260120868683, + 0.8799134492874146 + ], + [ + 0.40136539936065674, + 0.199827641248703, + 0.8992501497268677 + ] + ], + "unplug_charger": [ + [ + -0.02945627272129059, + -0.29719048738479614, + 0.9654761552810669 + ], + [ + 0.42611706256866455, + 0.23461900651454926, + 0.9759531021118164 + ] + ], + "take_lid_off_saucepan": [ + [ + 0.1488279104232788, + -0.29649272561073303, + 0.8473185300827026 + ], + [ + 0.3940749168395996, + 0.25496160984039307, + 0.9680889844894409 + ] + ], + "take_umbrella_out_of_umbrella_stand": [ + [ + -0.13834935426712036, + -0.4171779453754425, + 0.9873540997505188 + ], + [ + 0.4437467157840729, + 0.4159110188484192, + 1.2332133054733276 + ] + ], + "toilet_seat_up": [ + [ + -0.12428393959999084, + -0.40748703479766846, + 0.9867345094680786 + ], + [ + 0.4470200538635254, + 0.30486559867858887, + 1.22821843624115 + ] + ], + "turn_oven_on": [ + [ + -0.0005436539649963379, + -0.03253566473722458, + 1.2411201000213623 + ], + [ + 0.526003897190094, + 0.39879217743873596, + 1.2458980083465576 + ] + ], + "scoop_with_spatula": [ + [ + -0.036182232201099396, + -0.3804638385772705, + 0.7833231687545776 + ], + [ + 0.4552796483039856, + 0.5420633554458618, + 0.9075736403465271 + ] + ], + "take_frame_off_hanger": [ + [ + -0.07169166207313538, + -0.3341614902019501, + 0.8592662811279297 + ], + [ + 0.5680904388427734, + 0.3948049247264862, + 0.9845865964889526 + ] + ], + "take_money_out_safe": [ + [ + 0.2172972708940506, + -0.33512425422668457, + 0.7757259607315063 + ], + [ + 0.4576074182987213, + 0.245660200715065, + 0.7945342659950256 + ] + ], + "take_toilet_roll_off_stand": [ + [ + -0.08030451834201813, + -0.37377458810806274, + 0.9497026205062866 + ], + [ + 0.46053266525268555, + 0.4470222592353821, + 0.9632148742675781 + ] + ], + "toilet_seat_down": [ + [ + -0.1316622495651245, + -0.35816922783851624, + 1.1636159420013428 + ], + [ + 0.43321409821510315, + 0.3083770275115967, + 1.2223985195159912 + ] + ], + "sweep_to_dustpan": [ + [ + 0.06908198446035385, + -0.2627582848072052, + 1.0712590217590332 + ], + [ + 0.4241946041584015, + 0.26090675592422485, + 1.0795507431030273 + ] + ], + "take_plate_off_colored_dish_rack": [ + [ + -0.08732761442661285, + -0.3865083158016205, + 0.8594199419021606 + ], + [ + 0.558426558971405, + 0.36841002106666565, + 1.0599428415298462 + ] + ], + "water_plants": [ + [ + 0.17294424772262573, + -0.3593810200691223, + 0.8823964595794678 + ], + [ + 0.33111628890037537, + 0.3513432741165161, + 1.4722193479537964 + ] + ], + "reach_and_drag": [ + [ + 0.05097993090748787, + -0.3523094058036804, + 0.8531230092048645 + ], + [ + 0.4526465833187103, + 0.34976446628570557, + 0.9966686964035034 + ] + ], + "screw_nail": [ + [ + 
-0.14362122118473053, + -0.3697453737258911, + 0.8315156698226929 + ], + [ + 0.46530231833457947, + 0.44696399569511414, + 1.1810046434402466 + ] + ], + "setup_checkers": [ + [ + -0.008768725208938122, + -0.422406941652298, + 0.7525140047073364 + ], + [ + 0.485668420791626, + 0.3901132643222809, + 0.8568703532218933 + ] + ], + "stack_wine": [ + [ + 0.10748039186000824, + -0.41677460074424744, + 0.8907675743103027 + ], + [ + 0.44981035590171814, + 0.2031947672367096, + 1.471533179283142 + ] + ], + "tower3": [ + [ + -0.026656748726963997, + -0.4079053997993469, + 0.7738525867462158 + ], + [ + 0.516545295715332, + 0.3907974660396576, + 0.9740554094314575 + ] + ], + "wipe_desk": [ + [ + -0.09986346960067749, + -0.45636600255966187, + 0.7651481628417969 + ], + [ + 0.5270109176635742, + 0.4887261986732483, + 1.4717410802841187 + ] + ], + "straighten_rope": [ + [ + -0.19136671721935272, + -0.47088682651519775, + 0.7506194114685059 + ], + [ + 0.5068666338920593, + 0.47348397970199585, + 1.4716072082519531 + ] + ], + "tv_on": [ + [ + 0.021157406270503998, + -0.2944306433200836, + 0.7575903534889221 + ], + [ + 0.3788570761680603, + 0.2701631486415863, + 0.8426451683044434 + ] + ], + "slide_cabinet_open_and_place_cups": [ + [ + -0.9038863182067871, + -0.4016008973121643, + 0.7759418487548828 + ], + [ + 0.4754253625869751, + 0.3876255750656128, + 1.1737593412399292 + ] + ] +} \ No newline at end of file diff --git a/tasks/all_82_tasks.csv b/tasks/all_82_tasks.csv new file mode 100644 index 0000000..36011fb --- /dev/null +++ b/tasks/all_82_tasks.csv @@ -0,0 +1,82 @@ +basketball_in_hoop +put_rubbish_in_bin +meat_off_grill +meat_on_grill +change_channel +tv_on +tower3 +push_buttons +stack_wine +slide_block_to_target +slide_block_to_color_target +reach_and_drag +take_frame_off_hanger +water_plants +hang_frame_on_hanger +scoop_with_spatula +place_hanger_on_rack +move_hanger +sweep_to_dustpan +sweep_to_dustpan_of_size +take_plate_off_colored_dish_rack +screw_nail +wipe_desk +stack_blocks +take_shoes_out_of_box +slide_cabinet_open_and_place_cups +reach_target +push_button +lamp_on +lamp_off +pick_and_lift +take_lid_off_saucepan +toilet_seat_down +close_laptop_lid +open_box +open_drawer +close_drawer +close_box +phone_on_base +toilet_seat_up +put_books_on_bookshelf +pick_up_cup +turn_tap +put_item_in_drawer +lift_numbered_block +beat_the_buzz +stack_cups +take_usb_out_of_computer +play_jenga +insert_onto_square_peg +take_umbrella_out_of_umbrella_stand +insert_usb_in_computer +straighten_rope +pick_and_lift_small +put_knife_on_chopping_board +place_shape_in_shape_sorter +take_toilet_roll_off_stand +put_umbrella_in_umbrella_stand +setup_checkers +turn_oven_on +change_clock +open_window +open_wine_bottle +close_microwave +close_fridge +close_grill +open_grill +unplug_charger +press_switch +take_money_out_safe +open_microwave +put_money_in_safe +open_door +close_door +open_fridge +open_oven +plug_charger_in_power_supply +close_jar +light_bulb_in +place_wine_at_rack_location +put_groceries_in_cupboard +place_cups \ No newline at end of file diff --git a/tasks/calvin_rel_traj_location_bounds_task_ABC_D.json b/tasks/calvin_rel_traj_location_bounds_task_ABC_D.json new file mode 100644 index 0000000..ac1679b --- /dev/null +++ b/tasks/calvin_rel_traj_location_bounds_task_ABC_D.json @@ -0,0 +1,50 @@ +{ + "A": [ + [ + -0.2691913843154907, + -0.21995729207992554, + -0.182277649641037 + ], + [ + 0.35127854347229004, + 0.2769763469696045, + 0.17159393429756165 + ] + ], + "B": [ + [ + -0.2576896846294403, + 
-0.22244493663311005, + -0.20557966828346252 + ], + [ + 0.32854634523391724, + 0.2922680974006653, + 0.17373555898666382 + ] + ], + "C": [ + [ + -0.29205888509750366, + -0.24688798189163208, + -0.17577645182609558 + ], + [ + 0.25053921341896057, + 0.3277084231376648, + 0.16431939601898193 + ] + ], + "D": [ + [ + -0.25131964683532715, + -0.15233077108860016, + -0.13294968008995056 + ], + [ + 0.19209328293800354, + 0.19344553351402283, + 0.1370421051979065 + ] + ] +} \ No newline at end of file diff --git a/tasks/hiveformer_74_tasks.csv b/tasks/hiveformer_74_tasks.csv new file mode 100644 index 0000000..36aed3f --- /dev/null +++ b/tasks/hiveformer_74_tasks.csv @@ -0,0 +1,74 @@ +reach_target +close_drawer +close_fridge +close_microwave +lamp_off +press_switch +push_button +slide_block_to_target +take_usb_out_of_computer +turn_tap +unplug_charger +close_door +lamp_on +lift_numbered_block +open_box +open_drawer +open_fridge +open_grill +open_microwave +open_wine_bottle +pick_up_cup +play_jenga +take_lid_off_saucepan +take_umbrella_out_of_umbrella_stand +toilet_seat_up +turn_oven_on +basketball_in_hoop +beat_the_buzz +change_clock +close_grill +close_laptop_lid +hang_frame_on_hanger +open_door +open_window +pick_and_lift +pick_and_lift_small +put_knife_on_chopping_board +put_rubbish_in_bin +put_umbrella_in_umbrella_stand +scoop_with_spatula +take_frame_off_hanger +take_money_out_safe +take_toilet_roll_off_stand +toilet_seat_down +close_box +insert_onto_square_peg +insert_usb_in_computer +meat_off_grill +meat_on_grill +move_hanger +open_oven +phone_on_base +place_hanger_on_rack +place_shape_in_shape_sorter +plug_charger_in_power_supply +put_books_on_bookshelf +put_money_in_safe +sweep_to_dustpan +take_plate_off_colored_dish_rack +water_plants +push_buttons +reach_and_drag +screw_nail +setup_checkers +stack_wine +tower3 +wipe_desk +straighten_rope +change_channel +tv_on +slide_cabinet_open_and_place_cups +stack_cups +take_shoes_out_of_box +stack_blocks \ No newline at end of file diff --git a/tasks/hiveformer_74_tasks_11_20.csv b/tasks/hiveformer_74_tasks_11_20.csv new file mode 100644 index 0000000..3fc88ed --- /dev/null +++ b/tasks/hiveformer_74_tasks_11_20.csv @@ -0,0 +1,10 @@ +unplug_charger +close_door +lamp_on +lift_numbered_block +open_box +open_drawer +open_fridge +open_grill +open_microwave +open_wine_bottle \ No newline at end of file diff --git a/tasks/hiveformer_74_tasks_1_10.csv b/tasks/hiveformer_74_tasks_1_10.csv new file mode 100644 index 0000000..840ba08 --- /dev/null +++ b/tasks/hiveformer_74_tasks_1_10.csv @@ -0,0 +1,10 @@ +reach_target +close_drawer +close_fridge +close_microwave +lamp_off +press_switch +push_button +slide_block_to_target +take_usb_out_of_computer +turn_tap \ No newline at end of file diff --git a/tasks/hiveformer_74_tasks_21_30.csv b/tasks/hiveformer_74_tasks_21_30.csv new file mode 100644 index 0000000..1620eb5 --- /dev/null +++ b/tasks/hiveformer_74_tasks_21_30.csv @@ -0,0 +1,10 @@ +pick_up_cup +play_jenga +take_lid_off_saucepan +take_umbrella_out_of_umbrella_stand +toilet_seat_up +turn_oven_on +basketball_in_hoop +beat_the_buzz +change_clock +close_grill \ No newline at end of file diff --git a/tasks/hiveformer_74_tasks_31_40.csv b/tasks/hiveformer_74_tasks_31_40.csv new file mode 100644 index 0000000..5d3a903 --- /dev/null +++ b/tasks/hiveformer_74_tasks_31_40.csv @@ -0,0 +1,10 @@ +close_laptop_lid +hang_frame_on_hanger +open_door +open_window +pick_and_lift +pick_and_lift_small +put_knife_on_chopping_board +put_rubbish_in_bin 
+put_umbrella_in_umbrella_stand +scoop_with_spatula \ No newline at end of file diff --git a/tasks/hiveformer_74_tasks_41_50.csv b/tasks/hiveformer_74_tasks_41_50.csv new file mode 100644 index 0000000..7cc843c --- /dev/null +++ b/tasks/hiveformer_74_tasks_41_50.csv @@ -0,0 +1,10 @@ +take_frame_off_hanger +take_money_out_safe +take_toilet_roll_off_stand +toilet_seat_down +close_box +insert_onto_square_peg +insert_usb_in_computer +meat_off_grill +meat_on_grill +move_hanger \ No newline at end of file diff --git a/tasks/hiveformer_74_tasks_51_60.csv b/tasks/hiveformer_74_tasks_51_60.csv new file mode 100644 index 0000000..4da1265 --- /dev/null +++ b/tasks/hiveformer_74_tasks_51_60.csv @@ -0,0 +1,10 @@ +open_oven +phone_on_base +place_hanger_on_rack +place_shape_in_shape_sorter +plug_charger_in_power_supply +put_books_on_bookshelf +put_money_in_safe +sweep_to_dustpan +take_plate_off_colored_dish_rack +water_plants \ No newline at end of file diff --git a/tasks/hiveformer_74_tasks_61_74.csv b/tasks/hiveformer_74_tasks_61_74.csv new file mode 100644 index 0000000..0fb58b3 --- /dev/null +++ b/tasks/hiveformer_74_tasks_61_74.csv @@ -0,0 +1,14 @@ +push_buttons +reach_and_drag +screw_nail +setup_checkers +stack_wine +tower3 +wipe_desk +straighten_rope +change_channel +tv_on +slide_cabinet_open_and_place_cups +stack_cups +take_shoes_out_of_box +stack_blocks \ No newline at end of file diff --git a/tasks/hiveformer_74_tasks_grouped.txt b/tasks/hiveformer_74_tasks_grouped.txt new file mode 100644 index 0000000..7f1f566 --- /dev/null +++ b/tasks/hiveformer_74_tasks_grouped.txt @@ -0,0 +1,92 @@ +# PLANNING +basketball_in_hoop +put_rubbish_in_bin +meat_off_grill +meat_on_grill +change_channel +tv_on +tower3 +push_buttons +stack_wine + +# TOOLS +slide_block_to_target +reach_and_drag +take_frame_off_hanger +water_plants +hang_frame_on_hanger +scoop_with_spatula +place_hanger_on_rack +move_hanger +sweep_to_dustpan +take_plate_off_colored_dish_rack +screw_nail + +# LONG-TERM +wipe_desk +stack_blocks +take_shoes_out_of_box +slide_cabinet_open_and_place_cups + +# ROTATION-INVARIANT +reach_target +push_button +lamp_on +lamp_off +push_buttons # This is a duplicate of the planning task +pick_and_lift +take_lid_off_saucepan + +# MOTION_PLANNER +toilet_seat_down +close_laptop_lid +open_box +open_drawer +close_drawer +close_box +phone_on_base +toilet_seat_up +put_books_on_bookshelf + +# MULTIMODAL +pick_up_cup +turn_tap +lift_numbered_block +beat_the_buzz +stack_cups + +# PRECISION +take_usb_out_of_computer +play_jenga +insert_onto_square_peg +take_umbrella_out_of_umbrella_stand +insert_usb_in_computer +straighten_rope +pick_and_lift_small +put_knife_on_chopping_board +place_shape_in_shape_sorter +take_toilet_roll_off_stand +put_umbrella_in_umbrella_stand +setup_checkers + +# SCREW +turn_oven_on +change_clock +open_window +open_wine_bottle + +# VISUAL_OCCLUSION +close_microwave +close_fridge +close_grill +open_grill +unplug_charger +press_switch +take_money_out_safe +open_microwave +put_money_in_safe +open_door +close_door +open_fridge +open_oven +plug_charger_in_power_supply diff --git a/tasks/hiveformer_fix_data_tasks.csv b/tasks/hiveformer_fix_data_tasks.csv new file mode 100644 index 0000000..441ae38 --- /dev/null +++ b/tasks/hiveformer_fix_data_tasks.csv @@ -0,0 +1,3 @@ +unplug_charger +close_door +open_fridge \ No newline at end of file diff --git a/tasks/hiveformer_hard_10_demo_tasks.csv b/tasks/hiveformer_hard_10_demo_tasks.csv new file mode 100644 index 0000000..2ec9705 --- /dev/null +++ 
b/tasks/hiveformer_hard_10_demo_tasks.csv @@ -0,0 +1,3 @@ +slide_block_to_target +pick_and_lift +put_knife_on_chopping_board \ No newline at end of file diff --git a/tasks/hiveformer_high_precision_tasks.csv b/tasks/hiveformer_high_precision_tasks.csv new file mode 100644 index 0000000..ec933b3 --- /dev/null +++ b/tasks/hiveformer_high_precision_tasks.csv @@ -0,0 +1,8 @@ +hang_frame_on_hanger +put_umbrella_in_umbrella_stand +insert_onto_square_peg +insert_usb_in_computer +place_shape_in_shape_sorter +plug_charger_in_power_supply +reach_and_drag +screw_nail \ No newline at end of file diff --git a/tasks/peract_18_tasks.csv b/tasks/peract_18_tasks.csv new file mode 100644 index 0000000..f5dad9e --- /dev/null +++ b/tasks/peract_18_tasks.csv @@ -0,0 +1,18 @@ +turn_tap +open_drawer +push_buttons +sweep_to_dustpan_of_size +slide_block_to_color_target +insert_onto_square_peg +meat_off_grill +place_shape_in_shape_sorter +place_wine_at_rack_location +put_groceries_in_cupboard +put_money_in_safe +close_jar +reach_and_drag +light_bulb_in +stack_cups +place_cups +put_item_in_drawer +stack_blocks \ No newline at end of file diff --git a/tasks/peract_18_tasks_11_18.csv b/tasks/peract_18_tasks_11_18.csv new file mode 100644 index 0000000..76dd600 --- /dev/null +++ b/tasks/peract_18_tasks_11_18.csv @@ -0,0 +1,8 @@ +put_money_in_safe +close_jar +reach_and_drag +light_bulb_in +stack_cups +place_cups +put_item_in_drawer +stack_blocks \ No newline at end of file diff --git a/tasks/peract_18_tasks_1_10.csv b/tasks/peract_18_tasks_1_10.csv new file mode 100644 index 0000000..ad6041a --- /dev/null +++ b/tasks/peract_18_tasks_1_10.csv @@ -0,0 +1,10 @@ +turn_tap +open_drawer +push_buttons +sweep_to_dustpan_of_size +slide_block_to_color_target +insert_onto_square_peg +meat_off_grill +place_shape_in_shape_sorter +place_wine_at_rack_location +put_groceries_in_cupboard \ No newline at end of file diff --git a/utils/__init__.py b/utils/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/utils/common_utils.py b/utils/common_utils.py new file mode 100644 index 0000000..44b87fd --- /dev/null +++ b/utils/common_utils.py @@ -0,0 +1,66 @@ +import pickle +from typing import Dict, Optional, Sequence +from pathlib import Path +import json +import torch +import numpy as np + + +Instructions = Dict[str, Dict[int, torch.Tensor]] + + +def round_floats(o): + if isinstance(o, float): return round(o, 2) + if isinstance(o, dict): return {k: round_floats(v) for k, v in o.items()} + if isinstance(o, (list, tuple)): return [round_floats(x) for x in o] + return o + + +def normalise_quat(x: torch.Tensor): + return x / x.square().sum(dim=-1).sqrt().unsqueeze(-1) + + +def get_gripper_loc_bounds(path: str, buffer: float = 0.0, task: Optional[str] = None): + gripper_loc_bounds = json.load(open(path, "r")) + if task is not None and task in gripper_loc_bounds: + gripper_loc_bounds = gripper_loc_bounds[task] + gripper_loc_bounds_min = np.array(gripper_loc_bounds[0]) - buffer + gripper_loc_bounds_max = np.array(gripper_loc_bounds[1]) + buffer + gripper_loc_bounds = np.stack([gripper_loc_bounds_min, gripper_loc_bounds_max]) + else: + # Gripper workspace is the union of workspaces for all tasks + gripper_loc_bounds = json.load(open(path, "r")) + gripper_loc_bounds_min = np.min(np.stack([bounds[0] for bounds in gripper_loc_bounds.values()]), axis=0) - buffer + gripper_loc_bounds_max = np.max(np.stack([bounds[1] for bounds in gripper_loc_bounds.values()]), axis=0) + buffer + gripper_loc_bounds = np.stack([gripper_loc_bounds_min, 
gripper_loc_bounds_max]) + print("Gripper workspace size:", gripper_loc_bounds_max - gripper_loc_bounds_min) + return gripper_loc_bounds + + +def count_parameters(model): + return sum(p.numel() for p in model.parameters() if p.requires_grad) + + +def norm_tensor(tensor: torch.Tensor) -> torch.Tensor: + return tensor / torch.linalg.norm(tensor, ord=2, dim=-1, keepdim=True) + + +def load_instructions( + instructions: Optional[Path], + tasks: Optional[Sequence[str]] = None, + variations: Optional[Sequence[int]] = None, +) -> Optional[Instructions]: + if instructions is not None: + with open(instructions, "rb") as fid: + data: Instructions = pickle.load(fid) + if tasks is not None: + data = {task: var_instr for task, var_instr in data.items() if task in tasks} + if variations is not None: + data = { + task: { + var: instr for var, instr in var_instr.items() if var in variations + } + for task, var_instr in data.items() + } + return data + return None diff --git a/utils/pytorch3d_transforms.py b/utils/pytorch3d_transforms.py new file mode 100644 index 0000000..17b7efd --- /dev/null +++ b/utils/pytorch3d_transforms.py @@ -0,0 +1,618 @@ +"""In order not to install pytorch3d, we steal from: + https://github.com/facebookresearch/pytorch3d/blob/main/pytorch3d/transforms/rotation_conversions.py +""" +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. +# +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. + +from typing import Optional, Union + +import torch +import torch.nn.functional as F + +Device = Union[str, torch.device] + + +""" +The transformation matrices returned from the functions in this file assume +the points on which the transformation will be applied are column vectors. +i.e. the R matrix is structured as + + R = [ + [Rxx, Rxy, Rxz], + [Ryx, Ryy, Ryz], + [Rzx, Rzy, Rzz], + ] # (3, 3) + +This matrix can be applied to column vectors by post multiplication +by the points e.g. + + points = [[0], [1], [2]] # (3 x 1) xyz coordinates of a point + transformed_points = R * points + +To apply the same matrix to points which are row vectors, the R matrix +can be transposed and pre multiplied by the points: + +e.g. + points = [[0, 1, 2]] # (1 x 3) xyz coordinates of a point + transformed_points = points * R.transpose(1, 0) +""" + + +def quaternion_to_matrix(quaternions: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as quaternions to rotation matrices. + + Args: + quaternions: quaternions with real part first, + as tensor of shape (..., 4). + + Returns: + Rotation matrices as tensor of shape (..., 3, 3). + """ + r, i, j, k = torch.unbind(quaternions, -1) + # pyre-fixme[58]: `/` is not supported for operand types `float` and `Tensor`. + two_s = 2.0 / (quaternions * quaternions).sum(-1) + + o = torch.stack( + ( + 1 - two_s * (j * j + k * k), + two_s * (i * j - k * r), + two_s * (i * k + j * r), + two_s * (i * j + k * r), + 1 - two_s * (i * i + k * k), + two_s * (j * k - i * r), + two_s * (i * k - j * r), + two_s * (j * k + i * r), + 1 - two_s * (i * i + j * j), + ), + -1, + ) + return o.reshape(quaternions.shape[:-1] + (3, 3)) + + +def _copysign(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor: + """ + Return a tensor where each element has the absolute value taken from the, + corresponding element of a, with sign taken from the corresponding + element of b. This is like the standard copysign floating-point operation, + but is not careful about negative 0 and NaN. 
+ + Args: + a: source tensor. + b: tensor whose signs will be used, of the same shape as a. + + Returns: + Tensor of the same shape as a with the signs of b. + """ + signs_differ = (a < 0) != (b < 0) + return torch.where(signs_differ, -a, a) + + +def _sqrt_positive_part(x: torch.Tensor) -> torch.Tensor: + """ + Returns torch.sqrt(torch.max(0, x)) + but with a zero subgradient where x is 0. + """ + ret = torch.zeros_like(x) + positive_mask = x > 0 + ret[positive_mask] = torch.sqrt(x[positive_mask]) + return ret + + +def matrix_to_quaternion(matrix: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as rotation matrices to quaternions. + + Args: + matrix: Rotation matrices as tensor of shape (..., 3, 3). + + Returns: + quaternions with real part first, as tensor of shape (..., 4). + """ + if matrix.size(-1) != 3 or matrix.size(-2) != 3: + raise ValueError(f"Invalid rotation matrix shape {matrix.shape}.") + + batch_dim = matrix.shape[:-2] + m00, m01, m02, m10, m11, m12, m20, m21, m22 = torch.unbind( + matrix.reshape(batch_dim + (9,)), dim=-1 + ) + + q_abs = _sqrt_positive_part( + torch.stack( + [ + 1.0 + m00 + m11 + m22, + 1.0 + m00 - m11 - m22, + 1.0 - m00 + m11 - m22, + 1.0 - m00 - m11 + m22, + ], + dim=-1, + ) + ) + + # we produce the desired quaternion multiplied by each of r, i, j, k + quat_by_rijk = torch.stack( + [ + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([q_abs[..., 0] ** 2, m21 - m12, m02 - m20, m10 - m01], dim=-1), + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([m21 - m12, q_abs[..., 1] ** 2, m10 + m01, m02 + m20], dim=-1), + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([m02 - m20, m10 + m01, q_abs[..., 2] ** 2, m12 + m21], dim=-1), + # pyre-fixme[58]: `**` is not supported for operand types `Tensor` and + # `int`. + torch.stack([m10 - m01, m20 + m02, m21 + m12, q_abs[..., 3] ** 2], dim=-1), + ], + dim=-2, + ) + + # We floor here at 0.1 but the exact level is not important; if q_abs is small, + # the candidate won't be picked. + flr = torch.tensor(0.1).to(dtype=q_abs.dtype, device=q_abs.device) + quat_candidates = quat_by_rijk / (2.0 * q_abs[..., None].max(flr)) + + # if not for numerical problems, quat_candidates[i] should be same (up to a sign), + # forall i; we pick the best-conditioned one (with the largest denominator) + + return quat_candidates[ + F.one_hot(q_abs.argmax(dim=-1), num_classes=4) > 0.5, : + ].reshape(batch_dim + (4,)) + + +def _axis_angle_rotation(axis: str, angle: torch.Tensor) -> torch.Tensor: + """ + Return the rotation matrices for one of the rotations about an axis + of which Euler angles describe, for each value of the angle given. + + Args: + axis: Axis label "X" or "Y or "Z". + angle: any shape tensor of Euler angles in radians + + Returns: + Rotation matrices as tensor of shape (..., 3, 3). 
+ """ + + cos = torch.cos(angle) + sin = torch.sin(angle) + one = torch.ones_like(angle) + zero = torch.zeros_like(angle) + + if axis == "X": + R_flat = (one, zero, zero, zero, cos, -sin, zero, sin, cos) + elif axis == "Y": + R_flat = (cos, zero, sin, zero, one, zero, -sin, zero, cos) + elif axis == "Z": + R_flat = (cos, -sin, zero, sin, cos, zero, zero, zero, one) + else: + raise ValueError("letter must be either X, Y or Z.") + + return torch.stack(R_flat, -1).reshape(angle.shape + (3, 3)) + + +def euler_angles_to_matrix(euler_angles: torch.Tensor, convention: str) -> torch.Tensor: + """ + Convert rotations given as Euler angles in radians to rotation matrices. + + Args: + euler_angles: Euler angles in radians as tensor of shape (..., 3). + convention: Convention string of three uppercase letters from + {"X", "Y", and "Z"}. + + Returns: + Rotation matrices as tensor of shape (..., 3, 3). + """ + if euler_angles.dim() == 0 or euler_angles.shape[-1] != 3: + raise ValueError("Invalid input euler angles.") + if len(convention) != 3: + raise ValueError("Convention must have 3 letters.") + if convention[1] in (convention[0], convention[2]): + raise ValueError(f"Invalid convention {convention}.") + for letter in convention: + if letter not in ("X", "Y", "Z"): + raise ValueError(f"Invalid letter {letter} in convention string.") + matrices = [ + _axis_angle_rotation(c, e) + for c, e in zip(convention, torch.unbind(euler_angles, -1)) + ] + # return functools.reduce(torch.matmul, matrices) + return torch.matmul(torch.matmul(matrices[0], matrices[1]), matrices[2]) + + +def _angle_from_tan( + axis: str, other_axis: str, data, horizontal: bool, tait_bryan: bool +) -> torch.Tensor: + """ + Extract the first or third Euler angle from the two members of + the matrix which are positive constant times its sine and cosine. + + Args: + axis: Axis label "X" or "Y or "Z" for the angle we are finding. + other_axis: Axis label "X" or "Y or "Z" for the middle axis in the + convention. + data: Rotation matrices as tensor of shape (..., 3, 3). + horizontal: Whether we are looking for the angle for the third axis, + which means the relevant entries are in the same row of the + rotation matrix. If not, they are in the same column. + tait_bryan: Whether the first and third axes in the convention differ. + + Returns: + Euler Angles in radians for each matrix in data as a tensor + of shape (...). + """ + + i1, i2 = {"X": (2, 1), "Y": (0, 2), "Z": (1, 0)}[axis] + if horizontal: + i2, i1 = i1, i2 + even = (axis + other_axis) in ["XY", "YZ", "ZX"] + if horizontal == even: + return torch.atan2(data[..., i1], data[..., i2]) + if tait_bryan: + return torch.atan2(-data[..., i2], data[..., i1]) + return torch.atan2(data[..., i2], -data[..., i1]) + + +def _index_from_letter(letter: str) -> int: + if letter == "X": + return 0 + if letter == "Y": + return 1 + if letter == "Z": + return 2 + raise ValueError("letter must be either X, Y or Z.") + + +def matrix_to_euler_angles(matrix: torch.Tensor, convention: str) -> torch.Tensor: + """ + Convert rotations given as rotation matrices to Euler angles in radians. + + Args: + matrix: Rotation matrices as tensor of shape (..., 3, 3). + convention: Convention string of three uppercase letters. + + Returns: + Euler angles in radians as tensor of shape (..., 3). 
+ """ + if len(convention) != 3: + raise ValueError("Convention must have 3 letters.") + if convention[1] in (convention[0], convention[2]): + raise ValueError(f"Invalid convention {convention}.") + for letter in convention: + if letter not in ("X", "Y", "Z"): + raise ValueError(f"Invalid letter {letter} in convention string.") + if matrix.size(-1) != 3 or matrix.size(-2) != 3: + raise ValueError(f"Invalid rotation matrix shape {matrix.shape}.") + i0 = _index_from_letter(convention[0]) + i2 = _index_from_letter(convention[2]) + tait_bryan = i0 != i2 + if tait_bryan: + central_angle = torch.asin( + matrix[..., i0, i2] * (-1.0 if i0 - i2 in [-1, 2] else 1.0) + ) + else: + central_angle = torch.acos(matrix[..., i0, i0]) + + o = ( + _angle_from_tan( + convention[0], convention[1], matrix[..., i2], False, tait_bryan + ), + central_angle, + _angle_from_tan( + convention[2], convention[1], matrix[..., i0, :], True, tait_bryan + ), + ) + return torch.stack(o, -1) + + +def random_quaternions( + n: int, dtype: Optional[torch.dtype] = None, device: Optional[Device] = None +) -> torch.Tensor: + """ + Generate random quaternions representing rotations, + i.e. versors with nonnegative real part. + + Args: + n: Number of quaternions in a batch to return. + dtype: Type to return. + device: Desired device of returned tensor. Default: + uses the current device for the default tensor type. + + Returns: + Quaternions as tensor of shape (N, 4). + """ + if isinstance(device, str): + device = torch.device(device) + o = torch.randn((n, 4), dtype=dtype, device=device) + s = (o * o).sum(1) + o = o / _copysign(torch.sqrt(s), o[:, 0])[:, None] + return o + + +def random_rotations( + n: int, dtype: Optional[torch.dtype] = None, device: Optional[Device] = None +) -> torch.Tensor: + """ + Generate random rotations as 3x3 rotation matrices. + + Args: + n: Number of rotation matrices in a batch to return. + dtype: Type to return. + device: Device of returned tensor. Default: if None, + uses the current device for the default tensor type. + + Returns: + Rotation matrices as tensor of shape (n, 3, 3). + """ + quaternions = random_quaternions(n, dtype=dtype, device=device) + return quaternion_to_matrix(quaternions) + + +def random_rotation( + dtype: Optional[torch.dtype] = None, device: Optional[Device] = None +) -> torch.Tensor: + """ + Generate a single random 3x3 rotation matrix. + + Args: + dtype: Type to return + device: Device of returned tensor. Default: if None, + uses the current device for the default tensor type + + Returns: + Rotation matrix as tensor of shape (3, 3). + """ + return random_rotations(1, dtype, device)[0] + + +def standardize_quaternion(quaternions: torch.Tensor) -> torch.Tensor: + """ + Convert a unit quaternion to a standard form: one in which the real + part is non negative. + + Args: + quaternions: Quaternions with real part first, + as tensor of shape (..., 4). + + Returns: + Standardized quaternions as tensor of shape (..., 4). + """ + return torch.where(quaternions[..., 0:1] < 0, -quaternions, quaternions) + + +def quaternion_raw_multiply(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor: + """ + Multiply two quaternions. + Usual torch rules for broadcasting apply. + + Args: + a: Quaternions as tensor of shape (..., 4), real part first. + b: Quaternions as tensor of shape (..., 4), real part first. + + Returns: + The product of a and b, a tensor of quaternions shape (..., 4). 
+ """ + aw, ax, ay, az = torch.unbind(a, -1) + bw, bx, by, bz = torch.unbind(b, -1) + ow = aw * bw - ax * bx - ay * by - az * bz + ox = aw * bx + ax * bw + ay * bz - az * by + oy = aw * by - ax * bz + ay * bw + az * bx + oz = aw * bz + ax * by - ay * bx + az * bw + return torch.stack((ow, ox, oy, oz), -1) + + +def quaternion_multiply(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor: + """ + Multiply two quaternions representing rotations, returning the quaternion + representing their composition, i.e. the versor with nonnegative real part. + Usual torch rules for broadcasting apply. + + Args: + a: Quaternions as tensor of shape (..., 4), real part first. + b: Quaternions as tensor of shape (..., 4), real part first. + + Returns: + The product of a and b, a tensor of quaternions of shape (..., 4). + """ + ab = quaternion_raw_multiply(a, b) + return standardize_quaternion(ab) + + +def quaternion_invert(quaternion: torch.Tensor) -> torch.Tensor: + """ + Given a quaternion representing rotation, get the quaternion representing + its inverse. + + Args: + quaternion: Quaternions as tensor of shape (..., 4), with real part + first, which must be versors (unit quaternions). + + Returns: + The inverse, a tensor of quaternions of shape (..., 4). + """ + + scaling = torch.tensor([1, -1, -1, -1], device=quaternion.device) + return quaternion * scaling + + +def quaternion_apply(quaternion: torch.Tensor, point: torch.Tensor) -> torch.Tensor: + """ + Apply the rotation given by a quaternion to a 3D point. + Usual torch rules for broadcasting apply. + + Args: + quaternion: Tensor of quaternions, real part first, of shape (..., 4). + point: Tensor of 3D points of shape (..., 3). + + Returns: + Tensor of rotated points of shape (..., 3). + """ + if point.size(-1) != 3: + raise ValueError(f"Points are not in 3D, {point.shape}.") + real_parts = point.new_zeros(point.shape[:-1] + (1,)) + point_as_quaternion = torch.cat((real_parts, point), -1) + out = quaternion_raw_multiply( + quaternion_raw_multiply(quaternion, point_as_quaternion), + quaternion_invert(quaternion), + ) + return out[..., 1:] + + +def axis_angle_to_matrix(axis_angle: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as axis/angle to rotation matrices. + + Args: + axis_angle: Rotations given as a vector in axis angle form, + as a tensor of shape (..., 3), where the magnitude is + the angle turned anticlockwise in radians around the + vector's direction. + + Returns: + Rotation matrices as tensor of shape (..., 3, 3). + """ + return quaternion_to_matrix(axis_angle_to_quaternion(axis_angle)) + + +def matrix_to_axis_angle(matrix: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as rotation matrices to axis/angle. + + Args: + matrix: Rotation matrices as tensor of shape (..., 3, 3). + + Returns: + Rotations given as a vector in axis angle form, as a tensor + of shape (..., 3), where the magnitude is the angle + turned anticlockwise in radians around the vector's + direction. + """ + return quaternion_to_axis_angle(matrix_to_quaternion(matrix)) + + +def axis_angle_to_quaternion(axis_angle: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as axis/angle to quaternions. + + Args: + axis_angle: Rotations given as a vector in axis angle form, + as a tensor of shape (..., 3), where the magnitude is + the angle turned anticlockwise in radians around the + vector's direction. + + Returns: + quaternions with real part first, as tensor of shape (..., 4). 
+ """ + angles = torch.norm(axis_angle, p=2, dim=-1, keepdim=True) + half_angles = angles * 0.5 + eps = 1e-6 + small_angles = angles.abs() < eps + sin_half_angles_over_angles = torch.empty_like(angles) + sin_half_angles_over_angles[~small_angles] = ( + torch.sin(half_angles[~small_angles]) / angles[~small_angles] + ) + # for x small, sin(x/2) is about x/2 - (x/2)^3/6 + # so sin(x/2)/x is about 1/2 - (x*x)/48 + sin_half_angles_over_angles[small_angles] = ( + 0.5 - (angles[small_angles] * angles[small_angles]) / 48 + ) + quaternions = torch.cat( + [torch.cos(half_angles), axis_angle * sin_half_angles_over_angles], dim=-1 + ) + return quaternions + + +def quaternion_to_axis_angle(quaternions: torch.Tensor) -> torch.Tensor: + """ + Convert rotations given as quaternions to axis/angle. + + Args: + quaternions: quaternions with real part first, + as tensor of shape (..., 4). + + Returns: + Rotations given as a vector in axis angle form, as a tensor + of shape (..., 3), where the magnitude is the angle + turned anticlockwise in radians around the vector's + direction. + """ + norms = torch.norm(quaternions[..., 1:], p=2, dim=-1, keepdim=True) + half_angles = torch.atan2(norms, quaternions[..., :1]) + angles = 2 * half_angles + eps = 1e-6 + small_angles = angles.abs() < eps + sin_half_angles_over_angles = torch.empty_like(angles) + sin_half_angles_over_angles[~small_angles] = ( + torch.sin(half_angles[~small_angles]) / angles[~small_angles] + ) + # for x small, sin(x/2) is about x/2 - (x/2)^3/6 + # so sin(x/2)/x is about 1/2 - (x*x)/48 + sin_half_angles_over_angles[small_angles] = ( + 0.5 - (angles[small_angles] * angles[small_angles]) / 48 + ) + return quaternions[..., 1:] / sin_half_angles_over_angles + + +def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor: + """ + Converts 6D rotation representation by Zhou et al. [1] to rotation matrix + using Gram--Schmidt orthogonalization per Section B of [1]. + Args: + d6: 6D rotation representation, of size (*, 6) + + Returns: + batch of rotation matrices of size (*, 3, 3) + + [1] Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H. + On the Continuity of Rotation Representations in Neural Networks. + IEEE Conference on Computer Vision and Pattern Recognition, 2019. + Retrieved from http://arxiv.org/abs/1812.07035 + """ + + a1, a2 = d6[..., :3], d6[..., 3:] + b1 = F.normalize(a1, dim=-1) + b2 = a2 - (b1 * a2).sum(-1, keepdim=True) * b1 + b2 = F.normalize(b2, dim=-1) + b3 = torch.cross(b1, b2, dim=-1) + return torch.stack((b1, b2, b3), dim=-2) + + +def matrix_to_rotation_6d(matrix: torch.Tensor) -> torch.Tensor: + """ + Converts rotation matrices to 6D rotation representation by Zhou et al. [1] + by dropping the last row. Note that 6D representation is not unique. + Args: + matrix: batch of rotation matrices of size (*, 3, 3) + + Returns: + 6D rotation representation, of size (*, 6) + + [1] Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H. + On the Continuity of Rotation Representations in Neural Networks. + IEEE Conference on Computer Vision and Pattern Recognition, 2019. 
+ Retrieved from http://arxiv.org/abs/1812.07035 + """ + batch_dim = matrix.size()[:-2] + return matrix[..., :2, :].clone().reshape(batch_dim + (6,)) + + +if __name__ == "__main__": + import torch + import torch.nn.functional as F + import pytorch3d.transforms + + quats = torch.randn((3000, 4), dtype=torch.float) + mat1 = pytorch3d.transforms.quaternion_to_matrix(quats) + mat2 = quaternion_to_matrix(quats) + assert((mat1 - mat2).abs().max() <= 1e-5) + + quats1 = pytorch3d.transforms.matrix_to_quaternion(mat1) + quats2 = matrix_to_quaternion(mat1) + assert((quats1 - quats2).abs().max() <= 1e-5) + + mul1 = pytorch3d.transforms.quaternion_raw_multiply(quats[:-1], quats[1:]) + mul2 = quaternion_raw_multiply(quats[:-1], quats[1:]) + assert((mul1 - mul2).abs().max() <= 1e-5) diff --git a/utils/utils_with_calvin.py b/utils/utils_with_calvin.py new file mode 100644 index 0000000..f2f2dee --- /dev/null +++ b/utils/utils_with_calvin.py @@ -0,0 +1,292 @@ +import numpy as np +from scipy.signal import argrelextrema +import torch +import utils.pytorch3d_transforms as pytorch3d_transforms +import pybullet as pb + +from calvin_env.robot.robot import Robot +from calvin_env.utils.utils import angle_between_angles + + +def get_eef_velocity_from_robot(robot: Robot): + eef_vel = [] + for i in range(2): + eef_vel.append( + pb.getJointState( + robot.robot_uid, + robot.gripper_joint_ids[i], + physicsClientId=robot.cid + )[1] + ) + + # mean over the two gripper points. + vel = sum(eef_vel) / len(eef_vel) + + return vel + + +def get_eef_velocity_from_trajectories(trajectories): + trajectories = np.stack([trajectories[0]] + trajectories, axis=0) + velocities = trajectories[1:] - trajectories[:-1] + + V = np.linalg.norm(velocities[:, :3], axis=-1) + W = np.linalg.norm(velocities[:, 3:6], axis=-1) + + velocities = np.concatenate( + [velocities, [velocities[-1]]], + # [velocities[[0]], velocities], + axis=0 + ) + accelerations = velocities[1:] - velocities[:-1] + + A = np.linalg.norm(accelerations[:, :3], axis=-1) + + return V, W, A + + +def scene_state_changes(scene_states, task): + """Return the delta of objects in the scene. + + Args: + scene_states: A list of scene_obs arrays. + Each array is 24-dimensional: + sliding door (1): joint state + drawer (1): joint state + button (1): joint state + switch (1): joint state + lightbulb (1): on=1, off=0 + green light (1): on=1, off=0 + red block (6): (x, y, z, euler_x, euler_y, euler_z) + blue block (6): (x, y, z, euler_x, euler_y, euler_z) + pink block (6): (x, y, z, euler_x, euler_y, euler_z) + + Returns: + An binary array of shape (batch_size, 24) where `1` denotes + significant state change for the object state. + """ + all_changed_inds = [] + # For lightbul/green light, we select frames when the light turns on/off. + if "lightbulb" in task or "switch" in task: + obj_inds = [4] + elif "led" in task or "button" in task: + obj_inds = [5] + else: + obj_inds = [] + for obj_ind in obj_inds: + light_states = [s[obj_ind] for s in scene_states] + light_states = np.stack( + [light_states[0]] + light_states, axis=0 + ) # current frame != previous frame + light_changes = light_states[1:] != light_states[:-1] + light_changed_inds = np.where(light_changes)[0] + if light_changed_inds.shape[0] > 0: + all_changed_inds.extend(light_changed_inds.tolist()) + + # For sliding door, drawer, button, and switch, we select the frame + # before the object is first moved. 
+    if "slider" in task:
+        obj_inds = [0]
+    elif "drawer" in task:
+        obj_inds = [1]
+    elif "led" in task or "button" in task:
+        # the led (green light) is toggled by the button
+        obj_inds = [2]
+    elif "lightbulb" in task or "switch" in task:
+        # the lightbulb is toggled by the switch
+        obj_inds = [3]
+    else:
+        obj_inds = []
+    for obj_ind in obj_inds:
+        object_states = [s[obj_ind] for s in scene_states]
+        object_states = np.stack(
+            object_states + [object_states[-1]], axis=0
+        )  # current frame != future frame
+        object_changes = object_states[:-1] != object_states[1:]
+        object_changed_inds = np.where(object_changes)[0]
+        if object_changed_inds.shape[0] > 0:
+            all_changed_inds.append(object_changed_inds.min())
+
+    # For blocks, we subsample the frames where blocks are moved
+    if "slider" in task or "drawer" in task or "block" in task:
+        object_states = [s[-18:] for s in scene_states]
+        object_states = np.stack(
+            object_states + [object_states[-1]], axis=0
+        )  # current frame != future frame
+        object_states = object_states.reshape(-1, 3, 6)
+        delta_xyz = np.linalg.norm(
+            object_states[:-1, :, :3] - object_states[1:, :, :3], axis=-1
+        )
+        delta_orn = np.linalg.norm(
+            object_states[:-1, :, 3:] - object_states[1:, :, 3:], axis=-1
+        )
+        object_changes = np.logical_or(delta_xyz > 1e-3, delta_orn > 1e-1)
+        object_changed_inds = np.where(object_changes)[0]
+
+        # subsample: keep every 6th frame with block motion
+        object_changed_inds = object_changed_inds[::6]
+        if object_changed_inds.shape[0] > 0:
+            all_changed_inds.extend(object_changed_inds.tolist())
+
+    return all_changed_inds
+
+
+def gripper_state_changed(trajectories):
+    trajectories = np.stack(
+        [trajectories[0]] + trajectories, axis=0
+    )
+    openess = trajectories[:, -1]
+    changed = openess[:-1] != openess[1:]
+
+    return np.where(changed)[0]
+
+
+def keypoint_discovery(trajectories, scene_states=None, task=None,
+                       buffer_size=5):
+    """Determine waypoints from the trajectories.
+
+    Args:
+        trajectories: a list of 1-D np arrays. Each array is
+            7-dimensional (x, y, z, euler_x, euler_y, euler_z, openness).
+        scene_states: an optional list of scene_obs arrays (unused here).
+        task: an optional task-name string (unused here).
+        buffer_size: the minimum number of frames between two
+            consecutive acceleration-based waypoints.
+
+    Returns:
+        keyframes: the list of waypoint poses.
+        keyframe_inds: an integer array with the indices of the waypoints.
+    """
+    V, W, A = get_eef_velocity_from_trajectories(trajectories)
+
+    # waypoints are local maxima of the end-effector acceleration
+    _local_max_A = argrelextrema(A, np.greater)[0]
+    # keep only local maxima whose acceleration lies in the top ~20% of frames
+    topK = np.sort(A)[::-1][int(A.shape[0] * 0.2)]
+    large_A = A[_local_max_A] >= topK
+    _local_max_A = _local_max_A[large_A].tolist()
+
+    # enforce a minimum spacing of `buffer_size` frames between waypoints
+    local_max_A = [_local_max_A.pop(0)]
+    for i in _local_max_A:
+        if i - local_max_A[-1] >= buffer_size:
+            local_max_A.append(i)
+
+    # waypoints are frames with changing gripper states,
+    # plus the frame right before each change
+    gripper_changed = gripper_state_changed(trajectories)
+    one_frame_before_gripper_changed = (
+        gripper_changed[gripper_changed > 1] - 1
+    )
+
+    # the last pose in the trajectory is always a waypoint
+    last_frame = [len(trajectories) - 1]
+
+    keyframe_inds = (
+        local_max_A
+        + gripper_changed.tolist()
+        + one_frame_before_gripper_changed.tolist()
+        + last_frame
+    )
+    keyframe_inds = np.unique(keyframe_inds)
+
+    keyframes = [trajectories[i] for i in keyframe_inds]
+
+    return keyframes, keyframe_inds
+
+
+def get_gripper_camera_view_matrix(cam):
+    camera_ls = pb.getLinkState(
+        bodyUniqueId=cam.robot_uid,
+        linkIndex=cam.gripper_cam_link,
+        physicsClientId=cam.cid
+    )
+    camera_pos, camera_orn = camera_ls[:2]
+    cam_rot = pb.getMatrixFromQuaternion(camera_orn)
+    cam_rot = np.array(cam_rot).reshape(3, 3)
+    cam_rot_y, cam_rot_z = cam_rot[:, 1], cam_rot[:, 2]
+    # camera: eye position, target position, up vector
+    view_matrix = pb.computeViewMatrix(
+        camera_pos, camera_pos + cam_rot_y, -cam_rot_z
+    )
+    return view_matrix
+
+
+def deproject(cam, depth_img, homogeneous=False, sanity_check=False):
+    """
+    Deprojects a full depth image to 3D world coordinates.
+
+    Args
+        cam: camera object providing viewMatrix, fov, width and height
+        depth_img: (H, W) np.array; depth image to deproject
+        homogeneous: bool; if True, return homogeneous coordinates (4, npts),
+                     else return world coordinates (x, y, z) of shape (3, npts)
+        sanity_check: bool; if True, compare a random subset of points
+                      against cam.deproject
+    Output
+        world_pos: (3, npts) or (4, npts) np.array; world coordinates of all pixels
+    """
+    h, w = depth_img.shape
+    u, v = np.meshgrid(np.arange(w), np.arange(h))
+    u, v = u.ravel(), v.ravel()
+
+    # Unproject to world coordinates
+    T_world_cam = np.linalg.inv(np.array(cam.viewMatrix).reshape((4, 4)).T)
+    z = depth_img[v, u]
+    foc = cam.height / (2 * np.tan(np.deg2rad(cam.fov) / 2))
+    x = (u - cam.width // 2) * z / foc
+    y = -(v - cam.height // 2) * z / foc
+    z = -z
+    ones = np.ones_like(z)
+
+    cam_pos = np.stack([x, y, z, ones], axis=0)
+    world_pos = T_world_cam @ cam_pos
+
+    # Sanity check by using camera.deproject function. Check 2000 points.
+ if sanity_check: + sample_inds = np.random.permutation(u.shape[0])[:2000] + for ind in sample_inds: + cam_world_pos = cam.deproject((u[ind], v[ind]), depth_img, homogeneous=True) + assert np.abs(cam_world_pos-world_pos[:, ind]).max() <= 1e-3 + + if not homogeneous: + world_pos = world_pos[:3] + + return world_pos + + +def convert_rotation(rot): + """Convert Euler angles to Quarternion + """ + rot = torch.as_tensor(rot) + mat = pytorch3d_transforms.euler_angles_to_matrix(rot, "XYZ") + quat = pytorch3d_transforms.matrix_to_quaternion(mat) + quat = quat.numpy() + + return quat + + +def to_relative_action(actions, robot_obs, max_pos=1.0, max_orn=1.0, clip=True): + assert isinstance(actions, np.ndarray) + assert isinstance(robot_obs, np.ndarray) + + rel_pos = actions[..., :3] - robot_obs[..., :3] + if clip: + rel_pos = np.clip(rel_pos, -max_pos, max_pos) / max_pos + else: + rel_pos = rel_pos / max_pos + + rel_orn = angle_between_angles(robot_obs[..., 3:6], actions[..., 3:6]) + if clip: + rel_orn = np.clip(rel_orn, -max_orn, max_orn) / max_orn + else: + rel_orn = rel_orn / max_orn + + gripper = actions[..., -1:] + return np.concatenate([rel_pos, rel_orn, gripper]) + + +def relative_to_absolute(action, proprio, max_rel_pos=1.0, max_rel_orn=1.0, + magic_scaling_factor_pos=1, magic_scaling_factor_orn=1): + assert action.shape[-1] == 7 + assert proprio.shape[-1] == 7 + + rel_pos, rel_orn, gripper = np.split(action, [3, 6], axis=-1) + rel_pos *= max_rel_pos * magic_scaling_factor_pos + rel_orn *= max_rel_orn * magic_scaling_factor_orn + + pos_proprio, orn_proprio = proprio[..., :3], proprio[..., 3:6] + + target_pos = pos_proprio + rel_pos + target_orn = orn_proprio + rel_orn + return np.concatenate([target_pos, target_orn, gripper], axis=-1) diff --git a/utils/utils_with_rlbench.py b/utils/utils_with_rlbench.py new file mode 100644 index 0000000..fe2d35a --- /dev/null +++ b/utils/utils_with_rlbench.py @@ -0,0 +1,843 @@ +import os +import glob +import random +from typing import List, Dict, Any +from pathlib import Path +import json + +import open3d +import traceback +from tqdm import tqdm +import numpy as np +import torch +import torch.nn.functional as F +import einops + +from rlbench.observation_config import ObservationConfig, CameraConfig +from rlbench.environment import Environment +from rlbench.task_environment import TaskEnvironment +from rlbench.action_modes.action_mode import MoveArmThenGripper +from rlbench.action_modes.gripper_action_modes import Discrete +from rlbench.action_modes.arm_action_modes import EndEffectorPoseViaPlanning +from rlbench.backend.exceptions import InvalidActionError +from rlbench.demo import Demo +from pyrep.errors import IKError, ConfigurationPathError +from pyrep.const import RenderMode + + +ALL_RLBENCH_TASKS = [ + 'basketball_in_hoop', 'beat_the_buzz', 'change_channel', 'change_clock', 'close_box', + 'close_door', 'close_drawer', 'close_fridge', 'close_grill', 'close_jar', 'close_laptop_lid', + 'close_microwave', 'hang_frame_on_hanger', 'insert_onto_square_peg', 'insert_usb_in_computer', + 'lamp_off', 'lamp_on', 'lift_numbered_block', 'light_bulb_in', 'meat_off_grill', 'meat_on_grill', + 'move_hanger', 'open_box', 'open_door', 'open_drawer', 'open_fridge', 'open_grill', + 'open_microwave', 'open_oven', 'open_window', 'open_wine_bottle', 'phone_on_base', + 'pick_and_lift', 'pick_and_lift_small', 'pick_up_cup', 'place_cups', 'place_hanger_on_rack', + 'place_shape_in_shape_sorter', 'place_wine_at_rack_location', 'play_jenga', + 'plug_charger_in_power_supply', 
'press_switch', 'push_button', 'push_buttons', 'put_books_on_bookshelf', + 'put_groceries_in_cupboard', 'put_item_in_drawer', 'put_knife_on_chopping_board', 'put_money_in_safe', + 'put_rubbish_in_bin', 'put_umbrella_in_umbrella_stand', 'reach_and_drag', 'reach_target', + 'scoop_with_spatula', 'screw_nail', 'setup_checkers', 'slide_block_to_color_target', + 'slide_block_to_target', 'slide_cabinet_open_and_place_cups', 'stack_blocks', 'stack_cups', + 'stack_wine', 'straighten_rope', 'sweep_to_dustpan', 'sweep_to_dustpan_of_size', 'take_frame_off_hanger', + 'take_lid_off_saucepan', 'take_money_out_safe', 'take_plate_off_colored_dish_rack', 'take_shoes_out_of_box', + 'take_toilet_roll_off_stand', 'take_umbrella_out_of_umbrella_stand', 'take_usb_out_of_computer', + 'toilet_seat_down', 'toilet_seat_up', 'tower3', 'turn_oven_on', 'turn_tap', 'tv_on', 'unplug_charger', + 'water_plants', 'wipe_desk' +] +TASK_TO_ID = {task: i for i, task in enumerate(ALL_RLBENCH_TASKS)} + + +def task_file_to_task_class(task_file): + import importlib + + name = task_file.replace(".py", "") + class_name = "".join([w[0].upper() + w[1:] for w in name.split("_")]) + mod = importlib.import_module("rlbench.tasks.%s" % name) + mod = importlib.reload(mod) + task_class = getattr(mod, class_name) + return task_class + + +def load_episodes() -> Dict[str, Any]: + with open(Path(__file__).parent.parent / "data_preprocessing/episodes.json") as fid: + return json.load(fid) + + +class Mover: + + def __init__(self, task, disabled=False, max_tries=1): + self._task = task + self._last_action = None + self._step_id = 0 + self._max_tries = max_tries + self._disabled = disabled + + def __call__(self, action, collision_checking=False): + if self._disabled: + return self._task.step(action) + + target = action.copy() + if self._last_action is not None: + action[7] = self._last_action[7].copy() + + images = [] + try_id = 0 + obs = None + terminate = None + reward = 0 + + for try_id in range(self._max_tries): + action_collision = np.ones(action.shape[0]+1) + action_collision[:-1] = action + if collision_checking: + action_collision[-1] = 0 + obs, reward, terminate = self._task.step(action_collision) + + pos = obs.gripper_pose[:3] + rot = obs.gripper_pose[3:7] + dist_pos = np.sqrt(np.square(target[:3] - pos).sum()) + dist_rot = np.sqrt(np.square(target[3:7] - rot).sum()) + criteria = (dist_pos < 5e-3,) + + if all(criteria) or reward == 1: + break + + print( + f"Too far away (pos: {dist_pos:.3f}, rot: {dist_rot:.3f}, step: {self._step_id})... Retrying..." 
+ ) + + # we execute the gripper action after re-tries + action = target + if ( + not reward == 1.0 + and self._last_action is not None + and action[7] != self._last_action[7] + ): + action_collision = np.ones(action.shape[0]+1) + action_collision[:-1] = action + if collision_checking: + action_collision[-1] = 0 + obs, reward, terminate = self._task.step(action_collision) + + if try_id == self._max_tries: + print(f"Failure after {self._max_tries} tries") + + self._step_id += 1 + self._last_action = action.copy() + + return obs, reward, terminate, images + + +class Actioner: + + def __init__( + self, + policy=None, + instructions=None, + apply_cameras=("left_shoulder", "right_shoulder", "wrist"), + action_dim=7, + predict_trajectory=True + ): + self._policy = policy + self._instructions = instructions + self._apply_cameras = apply_cameras + self._action_dim = action_dim + self._predict_trajectory = predict_trajectory + + self._actions = {} + self._instr = None + self._task_str = None + + self._policy.eval() + + def load_episode(self, task_str, variation): + self._task_str = task_str + instructions = list(self._instructions[task_str][variation]) + self._instr = random.choice(instructions).unsqueeze(0) + self._task_id = torch.tensor(TASK_TO_ID[task_str]).unsqueeze(0) + self._actions = {} + + def get_action_from_demo(self, demo): + """ + Fetch the desired state and action based on the provided demo. + :param demo: fetch each demo and save key-point observations + :return: a list of obs and action + """ + key_frame = keypoint_discovery(demo) + + action_ls = [] + trajectory_ls = [] + for i in range(len(key_frame)): + obs = demo[key_frame[i]] + action_np = np.concatenate([obs.gripper_pose, [obs.gripper_open]]) + action = torch.from_numpy(action_np) + action_ls.append(action.unsqueeze(0)) + + trajectory_np = [] + for j in range(key_frame[i - 1] if i > 0 else 0, key_frame[i]): + obs = demo[j] + trajectory_np.append(np.concatenate([ + obs.gripper_pose, [obs.gripper_open] + ])) + trajectory_ls.append(np.stack(trajectory_np)) + + trajectory_mask_ls = [ + torch.zeros(1, key_frame[i] - (key_frame[i - 1] if i > 0 else 0)).bool() + for i in range(len(key_frame)) + ] + + return action_ls, trajectory_ls, trajectory_mask_ls + + def predict(self, rgbs, pcds, gripper, + interpolation_length=None): + """ + Args: + rgbs: (bs, num_hist, num_cameras, 3, H, W) + pcds: (bs, num_hist, num_cameras, 3, H, W) + gripper: (B, nhist, output_dim) + interpolation_length: an integer + + Returns: + {"action": torch.Tensor, "trajectory": torch.Tensor} + """ + output = {"action": None, "trajectory": None} + + rgbs = rgbs / 2 + 0.5 # in [0, 1] + + if self._instr is None: + raise ValueError() + + self._instr = self._instr.to(rgbs.device) + self._task_id = self._task_id.to(rgbs.device) + + # Predict trajectory + if self._predict_trajectory: + print('Predict Trajectory') + fake_traj = torch.full( + [1, interpolation_length - 1, gripper.shape[-1]], 0 + ).to(rgbs.device) + traj_mask = torch.full( + [1, interpolation_length - 1], False + ).to(rgbs.device) + output["trajectory"] = self._policy( + fake_traj, + traj_mask, + rgbs[:, -1], + pcds[:, -1], + self._instr, + gripper[..., :7], + run_inference=True + ) + else: + print('Predict Keypose') + pred = self._policy( + rgbs[:, -1], + pcds[:, -1], + self._instr, + gripper[:, -1, :self._action_dim], + ) + # Hackish, assume self._policy is an instance of Act3D + output["action"] = self._policy.prepare_action(pred) + + return output + + @property + def device(self): + return 
next(self._policy.parameters()).device + + +def obs_to_attn(obs, camera): + extrinsics_44 = torch.from_numpy( + obs.misc[f"{camera}_camera_extrinsics"] + ).float() + extrinsics_44 = torch.linalg.inv(extrinsics_44) + intrinsics_33 = torch.from_numpy( + obs.misc[f"{camera}_camera_intrinsics"] + ).float() + intrinsics_34 = F.pad(intrinsics_33, (0, 1, 0, 0)) + gripper_pos_3 = torch.from_numpy(obs.gripper_pose[:3]).float() + gripper_pos_41 = F.pad(gripper_pos_3, (0, 1), value=1).unsqueeze(1) + points_cam_41 = extrinsics_44 @ gripper_pos_41 + + proj_31 = intrinsics_34 @ points_cam_41 + proj_3 = proj_31.float().squeeze(1) + u = int((proj_3[0] / proj_3[2]).round()) + v = int((proj_3[1] / proj_3[2]).round()) + + return u, v + + +class RLBenchEnv: + + def __init__( + self, + data_path, + image_size=(128, 128), + apply_rgb=False, + apply_depth=False, + apply_pc=False, + headless=False, + apply_cameras=("left_shoulder", "right_shoulder", "wrist", "front"), + fine_sampling_ball_diameter=None, + collision_checking=False + ): + + # setup required inputs + self.data_path = data_path + self.apply_rgb = apply_rgb + self.apply_depth = apply_depth + self.apply_pc = apply_pc + self.apply_cameras = apply_cameras + self.fine_sampling_ball_diameter = fine_sampling_ball_diameter + + # setup RLBench environments + self.obs_config = self.create_obs_config( + image_size, apply_rgb, apply_depth, apply_pc, apply_cameras + ) + + self.action_mode = MoveArmThenGripper( + arm_action_mode=EndEffectorPoseViaPlanning(collision_checking=collision_checking), + gripper_action_mode=Discrete() + ) + self.env = Environment( + self.action_mode, str(data_path), self.obs_config, + headless=headless + ) + self.image_size = image_size + + def get_obs_action(self, obs): + """ + Fetch the desired state and action based on the provided demo. + :param obs: incoming obs + :return: required observation and action list + """ + + # fetch state + state_dict = {"rgb": [], "depth": [], "pc": []} + for cam in self.apply_cameras: + if self.apply_rgb: + rgb = getattr(obs, "{}_rgb".format(cam)) + state_dict["rgb"] += [rgb] + + if self.apply_depth: + depth = getattr(obs, "{}_depth".format(cam)) + state_dict["depth"] += [depth] + + if self.apply_pc: + pc = getattr(obs, "{}_point_cloud".format(cam)) + state_dict["pc"] += [pc] + + # fetch action + action = np.concatenate([obs.gripper_pose, [obs.gripper_open]]) + return state_dict, torch.from_numpy(action).float() + + def get_rgb_pcd_gripper_from_obs(self, obs): + """ + Return rgb, pcd, and gripper from a given observation + :param obs: an Observation from the env + :return: rgb, pcd, gripper + """ + state_dict, gripper = self.get_obs_action(obs) + state = transform(state_dict, augmentation=False) + state = einops.rearrange( + state, + "(m n ch) h w -> n m ch h w", + ch=3, + n=len(self.apply_cameras), + m=2 + ) + rgb = state[:, 0].unsqueeze(0) # 1, N, C, H, W + pcd = state[:, 1].unsqueeze(0) # 1, N, C, H, W + gripper = gripper.unsqueeze(0) # 1, D + + attns = torch.Tensor([]) + for cam in self.apply_cameras: + u, v = obs_to_attn(obs, cam) + attn = torch.zeros(1, 1, 1, self.image_size[0], self.image_size[1]) + if not (u < 0 or u > self.image_size[1] - 1 or v < 0 or v > self.image_size[0] - 1): + attn[0, 0, 0, v, u] = 1 + attns = torch.cat([attns, attn], 1) + rgb = torch.cat([rgb, attns], 2) + + return rgb, pcd, gripper + + def get_obs_action_from_demo(self, demo): + """ + Fetch the desired state and action based on the provided demo. 
+ :param demo: fetch each demo and save key-point observations + :param normalise_rgb: normalise rgb to (-1, 1) + :return: a list of obs and action + """ + key_frame = keypoint_discovery(demo) + key_frame.insert(0, 0) + state_ls = [] + action_ls = [] + for f in key_frame: + state, action = self.get_obs_action(demo._observations[f]) + state = transform(state, augmentation=False) + state_ls.append(state.unsqueeze(0)) + action_ls.append(action.unsqueeze(0)) + return state_ls, action_ls + + def get_gripper_matrix_from_action(self, action): + action = action.cpu().numpy() + position = action[:3] + quaternion = action[3:7] + rotation = open3d.geometry.get_rotation_matrix_from_quaternion( + np.array((quaternion[3], quaternion[0], quaternion[1], quaternion[2])) + ) + gripper_matrix = np.eye(4) + gripper_matrix[:3, :3] = rotation + gripper_matrix[:3, 3] = position + return gripper_matrix + + def get_demo(self, task_name, variation, episode_index): + """ + Fetch a demo from the saved environment. + :param task_name: fetch task name + :param variation: fetch variation id + :param episode_index: fetch episode index: 0 ~ 99 + :return: desired demo + """ + demos = self.env.get_demos( + task_name=task_name, + variation_number=variation, + amount=1, + from_episode_number=episode_index, + random_selection=False + ) + return demos + + def evaluate_task_on_multiple_variations( + self, + task_str: str, + max_steps: int, + num_variations: int, # -1 means all variations + num_demos: int, + actioner: Actioner, + max_tries: int = 1, + verbose: bool = False, + dense_interpolation=False, + interpolation_length=100, + num_history=1, + ): + self.env.launch() + task_type = task_file_to_task_class(task_str) + task = self.env.get_task(task_type) + task_variations = task.variation_count() + + if num_variations > 0: + task_variations = np.minimum(num_variations, task_variations) + task_variations = range(task_variations) + else: + task_variations = glob.glob(os.path.join(self.data_path, task_str, "variation*")) + task_variations = [int(n.split('/')[-1].replace('variation', '')) for n in task_variations] + + var_success_rates = {} + var_num_valid_demos = {} + + for variation in task_variations: + task.set_variation(variation) + success_rate, valid, num_valid_demos = ( + self._evaluate_task_on_one_variation( + task_str=task_str, + task=task, + max_steps=max_steps, + variation=variation, + num_demos=num_demos // len(task_variations) + 1, + actioner=actioner, + max_tries=max_tries, + verbose=verbose, + dense_interpolation=dense_interpolation, + interpolation_length=interpolation_length, + num_history=num_history + ) + ) + if valid: + var_success_rates[variation] = success_rate + var_num_valid_demos[variation] = num_valid_demos + + self.env.shutdown() + + var_success_rates["mean"] = ( + sum(var_success_rates.values()) / + sum(var_num_valid_demos.values()) + ) + + return var_success_rates + + @torch.no_grad() + def _evaluate_task_on_one_variation( + self, + task_str: str, + task: TaskEnvironment, + max_steps: int, + variation: int, + num_demos: int, + actioner: Actioner, + max_tries: int = 1, + verbose: bool = False, + dense_interpolation=False, + interpolation_length=50, + num_history=0, + ): + device = actioner.device + + success_rate = 0 + num_valid_demos = 0 + total_reward = 0 + + for demo_id in range(num_demos): + if verbose: + print() + print(f"Starting demo {demo_id}") + + try: + demo = self.get_demo(task_str, variation, episode_index=demo_id)[0] + num_valid_demos += 1 + except: + continue + + rgbs = 
torch.Tensor([]).to(device) + pcds = torch.Tensor([]).to(device) + grippers = torch.Tensor([]).to(device) + + # descriptions, obs = task.reset() + descriptions, obs = task.reset_to_demo(demo) + + actioner.load_episode(task_str, variation) + + move = Mover(task, max_tries=max_tries) + reward = 0.0 + max_reward = 0.0 + + for step_id in range(max_steps): + + # Fetch the current observation, and predict one action + rgb, pcd, gripper = self.get_rgb_pcd_gripper_from_obs(obs) + rgb = rgb.to(device) + pcd = pcd.to(device) + gripper = gripper.to(device) + + rgbs = torch.cat([rgbs, rgb.unsqueeze(1)], dim=1) + pcds = torch.cat([pcds, pcd.unsqueeze(1)], dim=1) + grippers = torch.cat([grippers, gripper.unsqueeze(1)], dim=1) + + # Prepare proprioception history + rgbs_input = rgbs[:, -1:][:, :, :, :3] + pcds_input = pcds[:, -1:] + if num_history < 1: + gripper_input = grippers[:, -1] + else: + gripper_input = grippers[:, -num_history:] + npad = num_history - gripper_input.shape[1] + gripper_input = F.pad( + gripper_input, (0, 0, npad, 0), mode='replicate' + ) + + output = actioner.predict( + rgbs_input, + pcds_input, + gripper_input, + interpolation_length=interpolation_length + ) + + if verbose: + print(f"Step {step_id}") + + terminate = True + + # Update the observation based on the predicted action + try: + # Execute entire predicted trajectory step by step + if output.get("trajectory", None) is not None: + trajectory = output["trajectory"][-1].cpu().numpy() + trajectory[:, -1] = trajectory[:, -1].round() + + # execute + for action in tqdm(trajectory): + #try: + # collision_checking = self._collision_checking(task_str, step_id) + # obs, reward, terminate, _ = move(action_np, collision_checking=collision_checking) + #except: + # terminate = True + # pass + collision_checking = self._collision_checking(task_str, step_id) + obs, reward, terminate, _ = move(action, collision_checking=collision_checking) + + # Or plan to reach next predicted keypoint + else: + print("Plan with RRT") + action = output["action"] + action[..., -1] = torch.round(action[..., -1]) + action = action[-1].detach().cpu().numpy() + + collision_checking = self._collision_checking(task_str, step_id) + obs, reward, terminate, _ = move(action, collision_checking=collision_checking) + + max_reward = max(max_reward, reward) + + if reward == 1: + success_rate += 1 + break + + if terminate: + print("The episode has terminated!") + + except (IKError, ConfigurationPathError, InvalidActionError) as e: + print(task_str, demo, step_id, success_rate, e) + reward = 0 + #break + + total_reward += max_reward + if reward == 0: + step_id += 1 + + print( + task_str, + "Variation", + variation, + "Demo", + demo_id, + "Reward", + f"{reward:.2f}", + "max_reward", + f"{max_reward:.2f}", + f"SR: {success_rate}/{demo_id+1}", + f"SR: {total_reward:.2f}/{demo_id+1}", + "# valid demos", num_valid_demos, + ) + + # Compensate for failed demos + if num_valid_demos == 0: + assert success_rate == 0 + valid = False + else: + valid = True + + return success_rate, valid, num_valid_demos + + def _collision_checking(self, task_str, step_id): + """Collision checking for planner.""" + # collision_checking = True + collision_checking = False + # if task_str == 'close_door': + # collision_checking = True + # if task_str == 'open_fridge' and step_id == 0: + # collision_checking = True + # if task_str == 'open_oven' and step_id == 3: + # collision_checking = True + # if task_str == 'hang_frame_on_hanger' and step_id == 0: + # collision_checking = True + # if task_str == 
'take_frame_off_hanger' and step_id == 0: + # for i in range(300): + # self.env._scene.step() + # collision_checking = True + # if task_str == 'put_books_on_bookshelf' and step_id == 0: + # collision_checking = True + # if task_str == 'slide_cabinet_open_and_place_cups' and step_id == 0: + # collision_checking = True + return collision_checking + + def verify_demos( + self, + task_str: str, + variation: int, + num_demos: int, + max_tries: int = 1, + verbose: bool = False, + ): + if verbose: + print() + print(f"{task_str}, variation {variation}, {num_demos} demos") + + self.env.launch() + task_type = task_file_to_task_class(task_str) + task = self.env.get_task(task_type) + task.set_variation(variation) # type: ignore + + success_rate = 0.0 + invalid_demos = 0 + + for demo_id in range(num_demos): + if verbose: + print(f"Starting demo {demo_id}") + + try: + demo = self.get_demo(task_str, variation, episode_index=demo_id)[0] + except: + print(f"Invalid demo {demo_id} for {task_str} variation {variation}") + print() + traceback.print_exc() + invalid_demos += 1 + + task.reset_to_demo(demo) + + gt_keyframe_actions = [] + for f in keypoint_discovery(demo): + obs = demo[f] + action = np.concatenate([obs.gripper_pose, [obs.gripper_open]]) + gt_keyframe_actions.append(action) + + move = Mover(task, max_tries=max_tries) + + for step_id, action in enumerate(gt_keyframe_actions): + if verbose: + print(f"Step {step_id}") + + try: + obs, reward, terminate, step_images = move(action) + if reward == 1: + success_rate += 1 / num_demos + break + if terminate and verbose: + print("The episode has terminated!") + + except (IKError, ConfigurationPathError, InvalidActionError) as e: + print(task_type, demo, success_rate, e) + reward = 0 + break + + if verbose: + print(f"Finished demo {demo_id}, SR: {success_rate}") + + # Compensate for failed demos + if (num_demos - invalid_demos) == 0: + success_rate = 0.0 + valid = False + else: + success_rate = success_rate * num_demos / (num_demos - invalid_demos) + valid = True + + self.env.shutdown() + return success_rate, valid, invalid_demos + + def create_obs_config( + self, image_size, apply_rgb, apply_depth, apply_pc, apply_cameras, **kwargs + ): + """ + Set up observation config for RLBench environment. + :param image_size: Image size. + :param apply_rgb: Applying RGB as inputs. + :param apply_depth: Applying Depth as inputs. + :param apply_pc: Applying Point Cloud as inputs. + :param apply_cameras: Desired cameras. 
+ :return: observation config + """ + unused_cams = CameraConfig() + unused_cams.set_all(False) + used_cams = CameraConfig( + rgb=apply_rgb, + point_cloud=apply_pc, + depth=apply_depth, + mask=False, + image_size=image_size, + render_mode=RenderMode.OPENGL, + **kwargs, + ) + + camera_names = apply_cameras + kwargs = {} + for n in camera_names: + kwargs[n] = used_cams + + obs_config = ObservationConfig( + front_camera=kwargs.get("front", unused_cams), + left_shoulder_camera=kwargs.get("left_shoulder", unused_cams), + right_shoulder_camera=kwargs.get("right_shoulder", unused_cams), + wrist_camera=kwargs.get("wrist", unused_cams), + overhead_camera=kwargs.get("overhead", unused_cams), + joint_forces=False, + joint_positions=False, + joint_velocities=True, + task_low_dim_state=False, + gripper_touch_forces=False, + gripper_pose=True, + gripper_open=True, + gripper_matrix=True, + gripper_joint_positions=True, + ) + + return obs_config + + +# Identify way-point in each RLBench Demo +def _is_stopped(demo, i, obs, stopped_buffer, delta): + next_is_not_final = i == (len(demo) - 2) + # gripper_state_no_change = i < (len(demo) - 2) and ( + # obs.gripper_open == demo[i + 1].gripper_open + # and obs.gripper_open == demo[i - 1].gripper_open + # and demo[i - 2].gripper_open == demo[i - 1].gripper_open + # ) + gripper_state_no_change = i < (len(demo) - 2) and ( + obs.gripper_open == demo[i + 1].gripper_open + and obs.gripper_open == demo[max(0, i - 1)].gripper_open + and demo[max(0, i - 2)].gripper_open == demo[max(0, i - 1)].gripper_open + ) + small_delta = np.allclose(obs.joint_velocities, 0, atol=delta) + stopped = ( + stopped_buffer <= 0 + and small_delta + and (not next_is_not_final) + and gripper_state_no_change + ) + return stopped + + +def keypoint_discovery(demo: Demo, stopping_delta=0.1) -> List[int]: + episode_keypoints = [] + prev_gripper_open = demo[0].gripper_open + stopped_buffer = 0 + + for i, obs in enumerate(demo): + stopped = _is_stopped(demo, i, obs, stopped_buffer, stopping_delta) + stopped_buffer = 4 if stopped else stopped_buffer - 1 + # If change in gripper, or end of episode. + last = i == (len(demo) - 1) + if i != 0 and (obs.gripper_open != prev_gripper_open or last or stopped): + episode_keypoints.append(i) + prev_gripper_open = obs.gripper_open + + if ( + len(episode_keypoints) > 1 + and (episode_keypoints[-1] - 1) == episode_keypoints[-2] + ): + episode_keypoints.pop(-2) + + return episode_keypoints + + +def transform(obs_dict, scale_size=(0.75, 1.25), augmentation=False): + apply_depth = len(obs_dict.get("depth", [])) > 0 + apply_pc = len(obs_dict["pc"]) > 0 + num_cams = len(obs_dict["rgb"]) + + obs_rgb = [] + obs_depth = [] + obs_pc = [] + for i in range(num_cams): + rgb = torch.tensor(obs_dict["rgb"][i]).float().permute(2, 0, 1) + depth = ( + torch.tensor(obs_dict["depth"][i]).float().permute(2, 0, 1) + if apply_depth + else None + ) + pc = ( + torch.tensor(obs_dict["pc"][i]).float().permute(2, 0, 1) if apply_pc else None + ) + + if augmentation: + raise NotImplementedError() # Deprecated + + # normalise to [-1, 1] + rgb = rgb / 255.0 + rgb = 2 * (rgb - 0.5) + + obs_rgb += [rgb.float()] + if depth is not None: + obs_depth += [depth.float()] + if pc is not None: + obs_pc += [pc.float()] + obs = obs_rgb + obs_depth + obs_pc + return torch.cat(obs, dim=0)
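+
+
+if __name__ == "__main__":
+    # Illustrative smoke test, not part of the original file: it mirrors the
+    # self-test convention used at the bottom of utils/pytorch3d_transforms.py.
+    # It only exercises transform() on synthetic data (assuming the RLBench /
+    # PyRep imports at the top of this module are importable), so no simulator
+    # needs to be launched.
+    rng = np.random.default_rng(0)
+
+    # Fake a single-camera observation dict in the format produced by
+    # RLBenchEnv.get_obs_action(): HxWx3 uint8 RGB plus an HxWx3 point cloud.
+    obs_dict = {
+        "rgb": [rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)],
+        "depth": [],
+        "pc": [rng.standard_normal((128, 128, 3)).astype(np.float32)],
+    }
+    out = transform(obs_dict, augmentation=False)
+
+    # One camera with RGB and point cloud stacked along channels -> (6, 128, 128)
+    assert out.shape == (6, 128, 128)
+    # RGB channels are normalised to [-1, 1]
+    assert out[:3].min() >= -1.0
+    assert out[:3].max() <= 1.0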