CLIP-RT

[Project Page] [Paper] [Citations]

CLIP-RT (CLIP-based Robotics Transformer) is a vision-language-action (VLA) model for generalist manipulation policies. It seamlessly extends OpenAI's CLIP to robot learning: given an image and a natural language instruction, it learns to predict the robotic action specified in natural language. We find that CLIP-RT effectively learns end-to-end robotic policies for novel manipulation tasks.

Approach

(Overview figure of the CLIP-RT approach.)

Usage

CLIP-RT is built on OpenCLIP, an open-source implementation of CLIP, so you can easily use CLIP models with different configurations in a plug-and-play manner. In our project, we used PyTorch v2.3.1 and open_clip_torch v2.26.1. For more details, please consult the open_clip directory in this repository.

python3 -m venv clip-rt
source clip-rt/bin/activate
pip install -U pip
pip install open_clip_torch

import torch
from PIL import Image
import open_clip

model_name = 'ViT-H-14-378-quickgelu'
model_path = 'clip_rt_ckpt.pt'
prompt = "what motion should the robot arm perform to complete the instruction '{}'?"

model, _, preprocess = open_clip.create_model_and_transforms(model_name=model_name, pretrained=model_path)
model.eval()  # model in train mode by default, impacts some models with BatchNorm or stochastic depth active
tokenizer = open_clip.get_tokenizer(model_name)

image = preprocess(Image.open("docs/example.png")).unsqueeze(0)
inst = tokenizer(prompt.format("close the laptop"))
actions = tokenizer(["lower the arm by 5cm", "rotate the gripper 90 degrees clockwise", ...])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    inst_features = model.encode_text(inst)
    context_features = image_features + inst_features
    action_features = model.encode_text(actions)

    context_features /= context_features.norm(dim=-1, keepdim=True)
    action_features /= action_features.norm(dim=-1, keepdim=True)

    action_probs = (100.0 * context_features @ action_features.T).sigmoid()

print("Action probs:", action_probs)  # prints: [.92, .01, ...]

Pretrained Models

We provide two pretrained models:

Model                 | Training data                            | Link
CLIP-RT (pretrained)  | Open X-Embodiment data                   | Download
CLIP-RT (fine-tuned)  | Open X-Embodiment data + in-domain data  | Download

Training CLIP-RT

Install

You can then install OpenCLIP with its training dependencies via pip install 'open_clip_torch[training]'.

Pretraining

We pretrain CLIP-RT on the Open X-Embodiment dataset curated by OpenVLA. Since the dataset does not contain natural language supervision for robot learning, we extract this supervision from the low-level actions and save the result in the webdataset format (a sketch of this conversion follows the steps below):

  1. Download Open X-Embodiment data (see OpenVLA)

  2. Preprocess for pretraining

cd oxe_data_preprocess
python preprocess.py
  3. Train CLIP-RT. If you want to change configurations, please see the shell script below.
cd open_clip/src
./scripts/train.sh
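
For intuition about step 2, here is a minimal sketch of the kind of conversion performed: a low-level end-effector delta is translated into a natural language supervision string, and the image/text pair is written into a webdataset shard. The sample schema, the describe_action phrasing, and the shard keys below are hypothetical; oxe_data_preprocess/preprocess.py is the authoritative implementation.

import webdataset as wds

PROMPT = "what motion should the robot arm perform to complete the instruction '{}'?"

def describe_action(dx, dy, dz):
    # hypothetical translation of an end-effector delta (in meters) into language
    name, value = max(("x", dx), ("y", dy), ("z", dz), key=lambda t: abs(t[1]))
    direction = {"x": ("forward", "backward"),
                 "y": ("left", "right"),
                 "z": ("up", "down")}[name][value < 0]
    return "move the arm {} by {:.0f}cm".format(direction, abs(value) * 100)

def write_shard(samples, shard_path):
    # samples: list of dicts with 'image_path', 'instruction', 'action' (hypothetical schema)
    with wds.TarWriter(shard_path) as sink:
        for i, sample in enumerate(samples):
            dx, dy, dz = sample["action"][:3]
            with open(sample["image_path"], "rb") as f:
                image_bytes = f.read()
            sink.write({
                "__key__": "{:08d}".format(i),
                "jpg": image_bytes,                                   # raw image bytes
                "caption.txt": PROMPT.format(sample["instruction"]),  # instruction prompt
                "supervision.txt": describe_action(dx, dy, dz),       # natural language supervision
            })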

Fine-tuning on in-domain data

  1. Preprocess for fine-tuning

OpenCLIP supports either CSV files or the webdataset format for training. We construct the CSV file as follows:

import csv

with open(csv_path, 'w', newline='') as f:
    csv_out = csv.writer(f, delimiter=',')
    csv_out.writerow(['filepath', 'caption', 'supervision', 'label'])

    # we assume each sample is a dict with these four fields
    for sample in samples:
        item = []

        # path to the raw image
        item.append(sample['image_path'])

        # natural language instruction
        prompt = "what motion should the robot arm perform to complete the instruction '{}'?"
        item.append(prompt.format(sample['instruction']))

        # natural language supervision (e.g., move the arm forward by 1cm)
        item.append(sample['supervision'])

        # label for the natural language supervision.
        # this can be any integer; just make sure supervisions that
        # share the same low-level action get the same label
        item.append(sample['label'])
        csv_out.writerow(item)
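
Regarding the label column: a simple way to satisfy the constraint above is to map each distinct low-level action to a running integer. A minimal sketch, assuming each sample also carries a discretized low-level action under a hypothetical 'low_level_action' key:

label_table = {}

def get_label(sample):
    # one integer label per distinct low-level action, so that different
    # supervisions of the same action (e.g., "move the arm forward by 1cm"
    # and a paraphrase of it) share the same label
    key = tuple(sample['low_level_action'])
    if key not in label_table:
        label_table[key] = len(label_table)
    return label_table[key]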

Please check open_clip/src/training/data.py to see how CLIP-RT loads data.

  2. Fine-tune CLIP-RT.
cd open_clip/src
./scripts/finetune.sh

Acknowledgements

We use OpenCLIP for model implementation and OpenVLA for data preprocessing. Thanks!

Citing

If you find this repository useful, please consider citing:

@article{kang2024cliprt,
  title={CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision},
  author={Kang, Gi-Cheon and Kim, Junghyun and Shim, Kyuhwan and Lee, Jun Ki and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2411.00508},
  year={2024}
}
