This project uses gaze information from a VR headset to segment YCB objects from images or to extract saliency information.
We run experiments with the Segment Anything Model (SAM) for image segmentation and DINO for saliency extraction.
The code requires python>=3.8, pytorch>=1.7 and torchvision>=0.8. Please follow the instructions here to install both the PyTorch and TorchVision dependencies. Installing both PyTorch and TorchVision with CUDA support is strongly recommended.
First create a conda environment:
conda create --name sam-ycb python=3.8
conda activate sam-ycb
Install SegmentAnything and its dependencies (for mask post-processing):
pip install git+https://github.com/facebookresearch/segment-anything.git
pip install opencv-python pycocotools matplotlib onnxruntime onnx
Install Yolov8:
pip install ultralytics
Segment Anything has demonstrated impressive zero-shot generalization: given a prompt, it predicts a segmentation mask for the corresponding object in an image. YOLOv8 is a powerful object-detection model that can also be used for object segmentation. However, while its object-detection training is easy to set up, its segmentation training is cumbersome because it requires a specific annotation format that does not transfer directly to other downstream tasks.
DINO is a Vision Transformer trained in a self-supervised manner that exhibits a strong zero-shot object-centric prior on images. We believe that combining the attention of such a pre-trained model with the gaze information extracted from a VR headset yields meaningful saliency maps.
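As an illustration, here is a minimal sketch of how a coarse saliency map could be obtained from DINO's self-attention (the torch.hub model, image size, and file name are assumptions and not part of this repository's code):

```python
# Sketch: extract a DINO self-attention map as a saliency prior
# (illustrative only; model choice, image size, and file name are assumed).
import torch
import torchvision.transforms as T
from PIL import Image

# Self-supervised DINO ViT-S/16 backbone from torch.hub.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

transform = T.Compose([
    T.Resize((480, 480)),
    T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
img = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    # Attention of the [CLS] token over the image patches, one map per head.
    attn = model.get_last_selfattention(img)      # (1, heads, N, N)
    cls_attn = attn[0, :, 0, 1:]                  # (heads, num_patches)
    grid = 480 // 16                              # patch grid size
    saliency = cls_attn.mean(0).reshape(grid, grid)  # coarse saliency map
```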
In this project, we combine YOLOv8's capacity to detect objects with SAM's capacity to segment objects given a prompt, in order to detect and segment objects from a given dataset (here the YCB-Video dataset). We also explore the use of saliency maps to help understand the user's intention.
This base project will then be combined with VR-headset gaze in order to extract the objects of interest that a user is looking at.
We first need to pre-process the YCB-Video dataset so that YOLO can be trained on it. We follow the formatting described here. To run the pre-processing, launch the following script:
python3 process_ycb.py --dataset_path <PATH_TO YCB_Video_Dataset FOLDER> --data_config <PATH_TO_CONFIG_YAML_FILE>
This script takes a long time to run (~50 minutes), so go grab a cup of coffee while it is running.
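For reference, YOLO expects one .txt label file per image, with one line per object in the form `class x_center y_center width height`, all normalized to [0, 1]. A minimal sketch of such a conversion is shown below (the function name and raw box format are assumptions; the actual logic lives in process_ycb.py):

```python
# Sketch of the YOLO label conversion performed during pre-processing
# (illustrative only; see process_ycb.py for the actual implementation).
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space XYXY box to a normalized YOLO label line."""
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example with a hypothetical box in a 640x480 YCB-Video frame.
print(to_yolo_label(0, 120, 80, 260, 300, 640, 480))
```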
Once the data are ready, we can train YOLO to detect YCB objects in the images using the train_yolo.py script.
First, download the YOLOv8n model that we will fine-tune on the YCB-Video dataset: download.
To launch the script, use the following command line:
python3 train_yolo.py --model_path <PATH_TO_THE_MODEL> --data_config <PATH_TO_CONFIG_YAML_FILE>
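Under the hood, such a training script can be a thin wrapper around the ultralytics API; a minimal sketch is given below (the dataset config name, epoch count, and image size are assumptions, not the values used by train_yolo.py):

```python
# Minimal sketch of fine-tuning YOLOv8n with the ultralytics API
# (illustrative; train_yolo.py wires these paths from its command-line arguments).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # pre-trained YOLOv8n checkpoint
model.train(
    data="ycb_video.yaml",        # dataset config file (assumed name)
    epochs=100,                   # training budget (assumed)
    imgsz=640,                    # input resolution (assumed)
)
metrics = model.val()             # evaluate on the validation split
```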
Now that YOLO is trained, we use its predicted bounding boxes as prompts for the Segment Anything Model to extract the associated segmentation masks. To do so, launch the ycb_segment.py script with the following command line:
python3 ycb_segment.py --image_folder <PATH_TO_FOLDER_OF_IMAGES> --yolo_path <PATH_TO_YOLO_MODEL>
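The core of this step can be sketched as follows: run YOLO on an image, then pass each predicted box as a prompt to SAM's SamPredictor (checkpoint paths and the image file name are assumptions; see ycb_segment.py for the actual implementation):

```python
# Sketch: prompt SAM with YOLO bounding boxes (illustrative only;
# checkpoint paths and the image file name are assumptions).
import cv2
import numpy as np
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

yolo = YOLO("runs/detect/train/weights/best.pt")            # fine-tuned YOLO weights
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

masks = []
for box in yolo(image)[0].boxes.xyxy.cpu().numpy():         # XYXY boxes from YOLO
    mask, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])                                   # one binary mask per box
```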
The final step of the project is to further combine the bounding-box information with a gaze-point input to lock onto an object in the scene and return its segmentation mask.
This mask will then be used as input to the 6D pose estimation model DenseFusion to extract the pose of the detected object.
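A minimal sketch of how the gaze point and the bounding box could be combined into a joint SAM prompt is shown below (it reuses the SamPredictor from the sketch above; the gaze coordinates and box values are placeholder assumptions, not the final implementation):

```python
# Sketch: combine a gaze point with a YOLO bounding box as a joint SAM prompt
# (illustrative; `predictor` is the SamPredictor set up in the previous sketch,
# and the coordinates below are placeholders).
import numpy as np

gaze_xy = np.array([[312.0, 205.0]])            # gaze point in image pixels (assumed)
gaze_label = np.array([1])                      # 1 = foreground point
box = np.array([250.0, 140.0, 400.0, 330.0])    # XYXY box from YOLO (assumed)

mask, score, _ = predictor.predict(
    point_coords=gaze_xy,
    point_labels=gaze_label,
    box=box,
    multimask_output=False,
)
locked_mask = mask[0]                           # binary mask of the gazed-at object
```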
It would be interesting to further combine SAM with depth information, as done in SegmentAnyRGBD.
@article{kirillov2023segany,
title={Segment Anything},
author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
journal={arXiv:2304.02643},
year={2023}
}
@article{oquab2023dinov2,
title={DINOv2: Learning Robust Visual Features without Supervision},
author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
journal={arXiv:2304.07193},
year={2023}
}