PyTorch implementation for the paper LiVOS: Light Video Object Segmentation with Gated Linear Matching, arXiv 2024.
Qin Liu¹, Jianfeng Wang², Zhengyuan Yang², Linjie Li², Kevin Lin², Marc Niethammer¹, Lijuan Wang²

¹UNC-Chapel Hill, ²Microsoft
The code is tested with python=3.10, torch=2.4.0, and torchvision=0.19.0.
git clone https://github.com/uncbiag/LiVOS
cd LiVOS
Create a new conda environment and install the required packages:
conda create -n livos python=3.10
conda activate livos
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
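A quick sanity check that the environment matches the tested versions (a minimal sketch; nothing LiVOS-specific is imported):

```python
# Confirm the tested versions and that a CUDA device is visible.
import torch
import torchvision

print(torch.__version__)         # tested with 2.4.0
print(torchvision.__version__)   # tested with 0.19.0
print(torch.cuda.is_available())
```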
Download the model weights and store them in the ./weights directory. The directory will be automatically created if it does not already exist.
python ./download.py
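Once the download finishes, you can check that a checkpoint deserializes (a minimal sketch; livos-nomose-480p.pth is one of the files from the results table below, and nothing about its internal key layout is assumed):

```python
# Load a downloaded checkpoint on CPU just to confirm the file is intact.
import torch

state = torch.load("./weights/livos-nomose-480p.pth", map_location="cpu")
print(type(state))
```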
| Dataset | Description | Download Link |
|---|---|---|
| DAVIS 2017 | 60 videos (train); 30 videos (val); 30 videos (test) | official site |
| YouTube VOS 2019 | 3471 videos (train); 507 videos (val) | official site |
| MOSE | 1507 videos (train); 311 videos (val) | official site |
| LVOS (v1)* | 50 videos (val); 50 videos (test) | official site |
(*) To prepare LVOS, extract only the first-frame annotations of its validation set:
python scripts/data/preprocess_lvos.py ../LVOS/valid/Annotations ../LVOS/valid/Annotations_first_only
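The script above does this for you; the following is only a minimal sketch of the idea, assuming one sub-folder of PNG masks per video and keeping the alphabetically first mask:

```python
# Sketch of first-frame-only extraction: copy the earliest PNG mask of each video.
import shutil
import sys
from pathlib import Path

src, dst = Path(sys.argv[1]), Path(sys.argv[2])
for video_dir in sorted(p for p in src.iterdir() if p.is_dir()):
    masks = sorted(video_dir.glob("*.png"))
    if not masks:
        continue
    out_dir = dst / video_dir.name
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(masks[0], out_dir / masks[0].name)
```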
Prepare the datasets in the following structure:
├── LiVOS (codebase)
├── DAVIS
│ └── 2017
│ ├── test-dev
│ │ ├── Annotations
│ │ └── ...
│ └── trainval
│ ├── Annotations
│ └── ...
├── YouTube
│ ├── all_frames
│ │ └── valid_all_frames
│ ├── train
│ └── valid
├── LVOS
│ ├── valid
│ │ ├── Annotations
│ │ └── ...
│ └── test
│ ├── Annotations
│ └── ...
└── MOSE
├── JPEGImages
└── Annotations
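Before running evaluation or training, it can help to confirm that the layout above is in place. A minimal sketch, assuming the datasets sit next to the LiVOS folder as shown (the evaluation example below also uses the ../ prefix):

```python
# Check that the expected sibling-directory layout exists relative to the LiVOS codebase.
from pathlib import Path

root = Path("..")  # datasets live next to the LiVOS folder, as in the tree above
expected = [
    "DAVIS/2017/trainval/Annotations",
    "DAVIS/2017/test-dev/Annotations",
    "YouTube/all_frames/valid_all_frames",
    "YouTube/train",
    "YouTube/valid",
    "LVOS/valid/Annotations",
    "LVOS/test/Annotations",
    "MOSE/JPEGImages",
    "MOSE/Annotations",
]
for rel in expected:
    status = "ok" if (root / rel).is_dir() else "MISSING"
    print(f"{status:7s} {rel}")
```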
You should get the following J&F results using our provided models:

| Training Dataset | Model | MOSE | DAVIS-17 val | DAVIS-17 test | YTVOS-19 val | LVOS val | LVOS test |
|---|---|---|---|---|---|---|---|
| D17+YT19 | livos-nomose-480p (135 MB) | 59.2 | 84.4 | 78.2 | 79.9 | 50.6 | 44.6 |
| D17+YT19 | livos-nomose-ft-480p (135 MB) | 58.4 | 85.1 | 81.0 | 81.3 | 51.2 | 50.9 |
| D17+YT19+MOSE | livos-wmose-480p (135 MB) | 64.8 | 84.0 | 79.6 | 82.6 | 51.2 | 47.0 |
- To run the evaluation:
python livos/eval.py dataset=[dataset] weights=[path to model file]
Example for the DAVIS 2017 validation set (more dataset options in livos/config/eval_config.yaml):
python livos/eval.py dataset=d17-val weights=./weights/livos-nomose-480p.pth
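Predictions are written under ./results/<dataset>/Annotations (see the benchmark paths below). A minimal sketch for inspecting a single predicted mask; the per-video/per-frame file layout and palette-PNG format follow the DAVIS convention and are assumptions here:

```python
# Inspect one predicted mask; nonzero palette indices correspond to object IDs.
import numpy as np
from PIL import Image

mask = np.array(Image.open("./results/d17-val/Annotations/bike-packing/00000.png"))
print(mask.shape, np.unique(mask))
```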
- To get quantitative results for DAVIS 2017 validation:
GT_DIR=../DAVIS/2017/trainval/Annotations/480p
SEG_DIR=./results/d17-val/Annotations
python ./vos-benchmark/benchmark.py -g ${GT_DIR} -m ${SEG_DIR}
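For reference, the J part of the reported J&F is the per-frame region similarity (intersection-over-union) between a predicted mask and its ground truth. Below is a minimal sketch of that measure; the actual vos-benchmark script additionally averages over objects and sequences and computes the boundary measure F:

```python
# Region similarity J for one frame: IoU between binary prediction and ground truth.
import numpy as np

def region_j(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return float(np.logical_and(pred, gt).sum() / union)
```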
- For results on the other datasets, submit your predictions to the corresponding evaluation servers:
- DAVIS 2017 test-dev: CodaLab
- YouTubeVOS 2019 validation: CodaLab
- LVOS val: LVOS
- LVOS test: CodaLab
- MOSE val: CodaLab
We conducted the training on four A6000 48GB GPUs. Without MOSE, the process required approximately 90 hours to complete 125,000 iterations.
OMP_NUM_THREADS=4 torchrun --master_port 25350 --nproc_per_node=4 livos/train.py exp_id=first_try model=base data=base
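The launch above assumes four visible GPUs (one process per GPU). A quick pre-flight check, purely illustrative:

```python
# torchrun --nproc_per_node=4 expects at least four visible CUDA devices.
import torch

n = torch.cuda.device_count()
print(f"visible GPUs: {n}")
assert n >= 4, "reduce --nproc_per_node when launching on fewer GPUs"
```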
- The training configuration is located in livos/config/train_config.yaml.
- By default, the output folder is set to ./model_mmdd_yyyy/${exp_id}. If needed, this can be modified in the training configuration file.
@article{liu2024livos,
  title={LiVOS: Light Video Object Segmentation with Gated Linear Matching},
  author={Liu, Qin and Wang, Jianfeng and Yang, Zhengyuan and Li, Linjie and Lin, Kevin and Niethammer, Marc and Wang, Lijuan},
journal={arXiv preprint arXiv:2411.02818},
year={2024}
}
Our project builds on Cutie. We appreciate the well-maintained codebase.