
SPINO: Few-Shot Panoptic Segmentation With Foundation Models

arXiv | IEEE Xplore | Website | Video

This repository is the official implementation of the paper:

Few-Shot Panoptic Segmentation With Foundation Models

Markus Käppeler*, Kürsat Petek*, Niclas Vödisch*, Wolfram Burgard, and Abhinav Valada.
*Equal contribution.

IEEE International Conference on Robotics and Automation (ICRA), 2024

Overview of SPINO approach

If you find our work useful, please consider citing our paper:

@inproceedings{kaeppeler2024spino,
    title={Few-Shot Panoptic Segmentation With Foundation Models},
    author={Käppeler, Markus and Petek, Kürsat and Vödisch, Niclas and Burgard, Wolfram and Valada, Abhinav},
    booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
    year={2024},
    pages={7718-7724}
}

📔 Abstract

Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain, posing a significant challenge for their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift, leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, albeit being trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for learning complex visual recognition tasks leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems for both outdoor and indoor environments.

👩‍💻 Code

🏗 Setup

⚙️ Installation

  1. Create conda environment: conda create --name spino python=3.8
  2. Activate environment: conda activate spino
  3. Install dependencies: pip install -r requirements.txt
  4. Install PyTorch, torchvision, and torchaudio with CUDA 11.1 support: pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
  5. Compile the deformable attention ops: cd panoptic_segmentation_model/external/ms_deformable_attention && sh make.sh
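The steps above can be run back to back in an interactive shell. The following is a minimal consolidated sketch that mirrors the list; the final sanity check is an optional addition, not part of the original instructions:

```bash
# Consolidated setup sketch (run in an interactive shell so conda activate works).
conda create --name spino python=3.8
conda activate spino
pip install -r requirements.txt
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 \
    -f https://download.pytorch.org/whl/cu111/torch_stable.html
cd panoptic_segmentation_model/external/ms_deformable_attention && sh make.sh

# Optional sanity check: verify that the CUDA build of PyTorch is usable.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```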

💻 Development

  1. Install pre-commit githook scripts: pre-commit install
  2. Upgrade isort to 5.12.0: pip install isort==5.12.0
  3. Update pre-commit: pre-commit autoupdate
  4. Linter (pylint) and formatter (yapf, isort) settings are configured in pyproject.toml.

🏃 Running the Code

🎨 Pseudo-label generation

To generate pseudo-labels for the Cityscapes dataset, please set the path to the dataset in the configuration files (see list below). Then execute run_cityscapes.sh from the root of the panoptic_label_generator folder. This script will perform the following steps:

  1. Train the semantic segmentation module using the configuration file configs/semantic_cityscapes.yaml.
  2. Train the boundary estimation module using the configuration file configs/boundary_cityscapes.yaml.
  3. Generate the panoptic pseudo-labels using the configuration file configs/instance_cityscapes.yaml.
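As a minimal sketch, assuming a bash shell and that the dataset paths in the three configuration files have already been adjusted, the whole Cityscapes pipeline boils down to:

```bash
# Run the full Cityscapes pseudo-label pipeline (semantic, boundary, instance).
cd panoptic_label_generator
sh run_cityscapes.sh
```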

We also support the KITTI-360 dataset. To generate pseudo-labels for KITTI-360, please adapt the corresponding configuration files.

Instead of training the modules from scratch, you can also use the pretrained weights provided at these links:

🧠 Panoptic segmentation model

To train a panoptic segmentation model on a given dataset, e.g., the generated pseudo-labels, execute train.sh.

Before running the code, specify all settings:

  1. python_env: Set the name of the conda environment (e.g., "spino")
  2. alias_python: Set the path of the Python binary to be used
  3. WANDB_API_KEY: Set the wandb API key of your account
  4. CUDA_VISIBLE_DEVICES: Specifies the device IDs of the available GPUs
  5. Set all remaining arguments:
    • nproc_per_node: Number of processes per node (usually one node corresponds to one GPU server); this should equal the number of devices specified in CUDA_VISIBLE_DEVICES
    • master_addr: IP address of GPU server to run the code on
    • master_port: Port to be used for server access
    • run_name: Name of the current run; a folder with this name will be created to hold all generated files (pretrained weights, config file, etc.), and the name will also appear on wandb
    • project_root_dir: Path to where the folder with the run name will be created
    • mode: Mode of the training, either "train" or "eval"
    • resume: If specified, the training will be resumed from the specified checkpoint
    • pre_train: Only load the specified modules from the checkpoint
    • freeze_modules: Freeze the specified modules during training
    • filename_defaults_config: Filename of the default configuration file with all configuration parameters
    • filename_config: Filename of the configuration file that acts relative to the default configuration file
    • comment: A free-text comment string stored with the run
    • seed: Seed to initialize "torch", "random", and "numpy"
  6. Set available flags:
    • eval: Only evaluate the model specified by resume
    • debug: Start the training in debug mode
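For orientation, the sketch below shows what a typical invocation wrapped by train.sh might look like with the pinned PyTorch 1.10 release. The entry-point name (train.py), the defaults file name, and the exact argument syntax are assumptions, so consult train.sh for the authoritative command:

```bash
# Hypothetical expansion of train.sh for a single server with two GPUs.
# The entry point (train.py) and defaults.yaml are assumed names; check train.sh.
export WANDB_API_KEY="<your wandb api key>"
export CUDA_VISIBLE_DEVICES=0,1

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=127.0.0.1 \
    --master_port=29500 \
    train.py \
    --run_name spino_cityscapes_pseudo \
    --project_root_dir /path/to/experiments \
    --mode train \
    --filename_defaults_config defaults.yaml \
    --filename_config train_cityscapes_dino_adapter.yaml \
    --comment "pseudo-label training" \
    --seed 42
```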

Additionally,

  1. ensure that the dataset path is set correctly in the corresponding config file, e.g., train_cityscapes_dino_adapter.yaml.
  2. set the entity and project parameters for wandb.init(...) in misc/train_utils.py.

💾 Datasets

For the Cityscapes dataset, download the following files from the official website:

  • leftImg8bit_sequence_trainvaltest.zip (324GB)
  • gtFine_trainvaltest.zip (241MB)
  • camera_trainvaltest.zip (2MB)
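A minimal extraction sketch, assuming a bash shell and that the three archives were downloaded into a placeholder dataset root such as /data/cityscapes:

```bash
# Unpack the three Cityscapes archives into a common dataset root.
mkdir -p /data/cityscapes
cd /data/cityscapes
unzip leftImg8bit_sequence_trainvaltest.zip
unzip gtFine_trainvaltest.zip
unzip camera_trainvaltest.zip
```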

After extraction, one should obtain the following file structure:

── cityscapes
   ├── camera
   │    └── ...
   ├── gtFine
   │    └── ...
   └── leftImg8bit_sequence
        └── ...

For the KITTI-360 dataset, download the following files from the official website:

  • Perspective Images for Train & Val (128G): You can remove "01" in line 12 of download_2d_perspective.sh to only download the relevant images.
  • Test Semantic (1.5G)
  • Semantics (1.8G)
  • Calibrations (3K)

After extraction and copying of the perspective images, one should obtain the following file structure:

── kitti_360
   ├── calibration
   │    ├── calib_cam_to_pose.txt
   │    └── ...
   ├── data_2d_raw
   │   ├── 2013_05_28_drive_0000_sync
   │   └── ...
   ├── data_2d_semantics
   │    └── train
   │        ├── 2013_05_28_drive_0000_sync
   │        └── ...
   └── data_2d_test
        ├── 2013_05_28_drive_0008_sync
        └── 2013_05_28_drive_0018_sync

👩‍⚖️ License

For academic usage, the code is released under the GPLv3 license. For any commercial purpose, please contact the authors.

🙏 Acknowledgment

This work was funded by the German Research Foundation (DFG) Emmy Noether Program grant No 468878300 and the European Union’s Horizon 2020 research and innovation program grant No 871449-OpenDR.
