[Paper] [Chinese Interpretation] [Slides] [Video]
The official implementation of the paper "Parameter-Inverted Image Pyramid Networks"
NeurIPS 2024 Spotlight (Top 2.5%)
TL;DR: We introduce Parameter-Inverted Image Pyramid Networks (PIIP), a parameter-inverted paradigm that uses models with different parameter sizes to process different resolution levels of the image pyramid, thereby saving computation cost while improving performance.
- Supports object detection, instance segmentation, semantic segmentation and image classification.
- Surpasses single-branch methods with higher performance and lower computation cost.
- Improves the performance of InternViT-6B on object detection by 2.0% (55.8% $\rm AP^b$) while reducing computation cost by 62%.
Image pyramids are commonly used in modern computer vision tasks to obtain multi-scale features for precise understanding of images. However, image pyramids process multiple resolutions of images using the same large-scale model, which requires significant computational cost. To overcome this issue, we propose a novel network architecture known as Parameter-Inverted Image Pyramid Networks (PIIP). Our core idea is to use models with different parameter sizes to process different resolution levels of the image pyramid, thereby balancing computational efficiency and performance. Specifically, the input to PIIP is a set of multi-scale images, where higher resolution images are processed by smaller networks. We further propose a feature interaction mechanism to allow features of different resolutions to complement each other and effectively integrate information from different spatial scales. Extensive experiments demonstrate that PIIP achieves superior performance in tasks such as object detection, segmentation, and image classification, compared to traditional image pyramid methods and single-branch networks, while reducing computational cost. Notably, when applying our method to the large-scale vision foundation model InternViT-6B, we improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation. These results validate the effectiveness of the PIIP approach and provide a new technical direction for future vision computing tasks.
For instructions on installation, pretrained models, training and evaluation, please refer to the README files under each subfolder (detection, classification, and segmentation).
Note:
- We report the number of parameters and FLOPs of the backbone (see the measurement sketch below).
- Results in the paper were obtained with an internal codebase, which may exhibit slightly different performance than this repo ($\leq\pm0.2$).
- Experiments involving InternViT-6B do not use window attention, different from those in the paper.
Object detection and instance segmentation with Mask R-CNN:

| Backbone | Detector | Resolution | Schd | Box mAP | Mask mAP | #Param | #FLOPs | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B | Mask R-CNN | 1024 | 1x | 43.7 | 39.7 | 90M | 463G | log \| ckpt \| cfg |
| PIIP-TSB | Mask R-CNN | 1120/896/448 | 1x | 43.6 | 38.7 | 146M | 243G | log \| ckpt \| cfg |
| PIIP-TSB | Mask R-CNN | 1568/896/448 | 1x | 45.0 | 40.3 | 147M | 287G | log \| ckpt \| cfg |
| PIIP-TSB | Mask R-CNN | 1568/1120/672 | 1x | 46.5 | 41.3 | 149M | 453G | log \| ckpt \| cfg |
| ViT-L | Mask R-CNN | 1024 | 1x | 46.7 | 42.5 | 308M | 1542G | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | 1120/672/448 | 1x | 46.5 | 40.8 | 493M | 727G | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | 1344/896/448 | 1x | 48.3 | 42.7 | 495M | 1002G | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | 1568/896/672 | 1x | 49.3 | 43.7 | 497M | 1464G | log \| ckpt \| cfg |
| PIIP-TSBL | Mask R-CNN | 1344/896/672/448 | 1x | 47.1 | 41.9 | 506M | 755G | log \| ckpt \| cfg |
| PIIP-TSBL | Mask R-CNN | 1568/1120/672/448 | 1x | 48.2 | 42.9 | 507M | 861G | log \| ckpt \| cfg |
| PIIP-TSBL | Mask R-CNN | 1792/1568/1120/448 | 1x | 49.4 | 44.1 | 512M | 1535G | log \| ckpt \| cfg |
| InternViT-6B | Mask R-CNN | 1024 | 1x | 53.8 | 48.1 | 5919M | 29323G | log \| ckpt \| cfg |
| PIIP-H6B | Mask R-CNN | 1024/512 | 1x | 55.8 | 49.0 | 6872M | 11080G | log \| ckpt \| cfg |
Detection with different pretrained models:

| Backbone | Detector | Pretrain | Resolution | Schd | Box mAP | Mask mAP | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PIIP-SBL | Mask R-CNN | AugReg (384) | 1568/1120/672 | 1x | 48.3 | 42.6 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + Uni-Perceiver (BL) | 1568/1120/672 | 1x | 48.8 | 42.9 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + MAE (BL) | 1568/1120/672 | 1x | 49.1 | 43.0 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III | 1568/1120/672 | 1x | 50.0 | 44.4 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + DINOv2 (BL) | 1568/1120/672 | 1x | 51.0 | 44.7 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + BEiTv2 (BL) | 1568/1120/672 | 1x | 51.8 | 45.4 | log \| ckpt \| cfg |
| PIIP-SBL | DINO | DeiT III (384) | 1792/1120/672 | 3x | 57.8 | - | log \| ckpt \| cfg |
| PIIP-H6B | DINO | MAE (H) + InternVL (6B) | 1024/768 | 1x | 60.0 | - | log \| ckpt \| cfg |
Semantic segmentation with UperNet:

| Backbone | Segmentor | Resolution | Schd | mIoU | #Param | #FLOPs | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InternViT-6B | UperNet | 512 | 80k | 58.42 | 5910M | 6364G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/192 | 80k | 57.81 | 6745M | 1663G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/256 | 80k | 58.35 | 6745M | 2354G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/384 | 80k | 59.32 | 6746M | 4374G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/512 | 80k | 59.85 | 6747M | 7308G | log \| ckpt \| cfg |
Image classification:

| Model | Resolution | #Param | #FLOPs | Top-1 Acc | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| PIIP-TSB | 368/192/128 | 144M | 17.4G | 82.1 | config | log \| ckpt |
| PIIP-SBL | 320/160/96 | 489M | 39.0G | 85.2 | config | log \| ckpt |
| PIIP-SBL | 384/192/128 | 489M | 61.2G | 85.9 | config | log \| ckpt |
- detection code
- classification code
- segmentation code
If you find this work helpful for your research, please consider giving this repo a star ⭐ and citing our paper:
```bibtex
@article{piip,
  title={Parameter-Inverted Image Pyramid Networks},
  author={Zhu, Xizhou and Yang, Xue and Wang, Zhaokai and Li, Hao and Dou, Wenhan and Ge, Junqi and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2406.04330},
  year={2024}
}
```
This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.
Our code is built with reference to the following projects: InternVL-MMDetSeg, ViT-Adapter, DeiT, MMDetection, MMSegmentation, and timm. Thanks for their awesome work!