From Justin Lazarow (UCSD, now at Apple), Weijian Xu (UCSD, now at Microsoft), and Zhuowen Tu (UCSD).
This repository is an official implementation of the paper Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers presented at CVPR 2022.
BoundaryFormer aims to provide a simple baseline for regression-based instance segmentation. Notably, we use Transformers to regress a fixed number of points along a simple polygonal boundary. This process makes continuous predictions and is thus end-to-end differentiable. Our method differs from previous work in the field in two main ways: it is the first to match Mask R-CNN in mask AP, and it imposes no supervision or ground-truth requirements beyond those of Mask R-CNN. That is, our method achieves parity with mask-based baselines in both mask quality and supervision. We accomplish this by relying solely on a differentiable rasterization module (implemented in CUDA), which only requires access to ground-truth masks. We hope this can serve to drive further work in this area.
BoundaryFormer uses the same installation process as Detectron2. Please see installation instructions. This should generally require something like:
pip install -ve .
at the root of the source tree (as long as PyTorch, etc. are installed correctly).
BoundaryFormer also uses the deformable attention modules introduced in Deformable-DETR. If this is already installed on your system, no action is needed. Otherwise, please build their modules:
git clone https://github.com/fundamentalvision/Deformable-DETR
cd Deformable-DETR/models/ops
sh ./make.sh
# unit test (should see all checking is True)
python test.py
BoundaryFormer follows the general guidelines of Detectron2; however, it lives under projects/BoundaryFormer.
Please make sure to set two additional environmental variables on your system:
export DETECTRON2_DATASETS=/path/to/datasets
export DETECTRON2_OUTPUTS=/path/to/outputs
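As a quick sanity check before launching training, something like the following could confirm that both variables are visible to Python. This is only a sketch: the helper name `missing_env_vars` is hypothetical and is not part of BoundaryFormer or Detectron2.

```python
import os

# The two variables this README asks you to export.
REQUIRED_VARS = ("DETECTRON2_DATASETS", "DETECTRON2_OUTPUTS")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Please export:", ", ".join(missing))
    else:
        print("Both dataset and output paths are set.")
```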
For instance, to train on COCO using an R50 backbone at a 1x schedule:
python projects/BoundaryFormer/train_net.py --num-gpus 8 --config-file projects/BoundaryFormer/configs/COCO-InstanceSegmentation/boundaryformer_rcnn_R_50_FPN_1x.yaml COMMENT "hello model"
If you do not have 8 GPUs, adjust --num-gpus and your BATCH_SIZE accordingly. BoundaryFormer is trained with AdamW, and we find the square-root scaling law to work well: scale the LR by the square root of the batch-size ratio (i.e., a batch size of 8 should only induce a sqrt(2) change in LR).
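The square-root rule can be sketched as below. The reference batch size of 16 and the reference learning rate are assumptions made for illustration (a sqrt(2) change at batch size 8 implies a 2x batch-size ratio); check the config file you train with for the actual values.

```python
import math


def sqrt_scaled_lr(reference_lr, reference_batch_size, new_batch_size):
    """Scale a learning rate by the square root of the batch-size ratio."""
    return reference_lr * math.sqrt(new_batch_size / reference_batch_size)


# Hypothetical reference values, not taken from the released configs.
reference_lr = 1e-4
reference_batch_size = 16

# Halving the batch size divides the LR by sqrt(2), not by 2.
new_lr = sqrt_scaled_lr(reference_lr, reference_batch_size, 8)
print(f"LR for batch size 8: {new_lr:.3e}")
```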
BoundaryFormer has a few hyperparameter options. Generally, these are configured under cfg.MODEL.BOUNDARY_HEAD (see projects/BoundaryFormer/boundary_former/config.py). Please see the paper for ablations of these values.
cfg.MODEL.BOUNDARY_HEAD.NUM_DEC_LAYERS = 4
We generally find that 4 layers are sufficient for good performance. Reducing this to 3 loses a small amount of performance, while increasing it generally does not change performance.
NOTE: if upsampling is used, this value is generally ignored and the number of layers is instead computed from a combination of cfg.MODEL.BOUNDARY_HEAD.POLY_NUM_PTS and cfg.MODEL.BOUNDARY_HEAD.UPSAMPLING_BASE_NUM_PTS.
cfg.MODEL.BOUNDARY_HEAD.POLY_NUM_PTS = 64
This defines the number of points at the final output layer. If upsampling (see next section) is not used, this also constitutes the number of points at any intermediate layer. Generally, we find Cityscapes to benefit from more than 64 points (e.g. 128) but COCO less so.
Upsampling constitutes our coarse-to-fine strategy which can reduce memory and computation. Rather than using the same number of points at each layer, we start off with a small number of points and upsample (2x) the points in a naive manner (midpoints) at each subsequent layer. To enable:
cfg.MODEL.BOUNDARY_HEAD.UPSAMPLING = True
cfg.MODEL.BOUNDARY_HEAD.UPSAMPLING_BASE_NUM_PTS = 8
cfg.MODEL.BOUNDARY_HEAD.POLY_NUM_PTS = 64
This will create a 4-layer (8 * 2 ** 3 = 64) coarse-to-fine model.
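To make the arithmetic concrete, here is a minimal pure-Python sketch of the coarse-to-fine schedule and of midpoint upsampling on a closed polygon. Both function names are hypothetical, and the point ordering (original vertex followed by the midpoint of its outgoing edge) is an assumption; the repository's actual implementation may differ.

```python
def points_per_layer(base_num_pts, final_num_pts):
    """List the point count at each layer when doubling from base to final."""
    if base_num_pts <= 0 or final_num_pts < base_num_pts:
        raise ValueError("need 0 < base_num_pts <= final_num_pts")
    counts = [base_num_pts]
    while counts[-1] < final_num_pts:
        counts.append(counts[-1] * 2)
    if counts[-1] != final_num_pts:
        raise ValueError("final_num_pts must equal base_num_pts * 2 ** k")
    return counts


def upsample_midpoints(polygon):
    """Double a closed polygon by inserting the midpoint of every edge."""
    n = len(polygon)
    upsampled = []
    for i, (x0, y0) in enumerate(polygon):
        x1, y1 = polygon[(i + 1) % n]  # wrap around to close the polygon
        upsampled.append((x0, y0))
        upsampled.append(((x0 + x1) / 2.0, (y0 + y1) / 2.0))
    return upsampled


counts = points_per_layer(8, 64)
print(counts, "->", len(counts), "layers")

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
denser = upsample_midpoints(square)
print(len(square), "points ->", len(denser), "points")
```

With a base of 8 points and a final count of 64, the list has four entries, which matches the 4-layer model described above.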
BoundaryFormer uses differentiable rasterization to transform the predicted polygons into mask space for supervision. To control the resolution:
cfg.MODEL.DIFFRAS.RESOLUTIONS = [64, 64]
is a flattened (e.g. for X and Y resolutions) list. This can be modified per layer by expanding it. For a two-layer model:
cfg.MODEL.DIFFRAS.RESOLUTIONS = [32, 32, 64, 64]
would supervise the first layer at 32 x 32 and the second at 64 x 64.
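The flattened layout can be unpacked as in the sketch below. The helper `per_layer_resolutions` is hypothetical, and the rule that a single pair is reused for every layer is an assumption inferred from the description above rather than confirmed behavior of the code.

```python
def per_layer_resolutions(flat_resolutions, num_layers):
    """Group a flattened resolution list into one pair per layer."""
    if len(flat_resolutions) == 0 or len(flat_resolutions) % 2 != 0:
        raise ValueError("expected a non-empty list with an even number of entries")
    pairs = [
        tuple(flat_resolutions[i:i + 2])
        for i in range(0, len(flat_resolutions), 2)
    ]
    if len(pairs) == 1:
        # Assumption: a single pair applies to every layer.
        return pairs * num_layers
    if len(pairs) != num_layers:
        raise ValueError("expected one pair in total or one pair per layer")
    return pairs


print(per_layer_resolutions([64, 64], num_layers=4))
print(per_layer_resolutions([32, 32, 64, 64], num_layers=2))
```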
In the same way as SoftRas, we require some rasterization smoothness to differentiably rasterize the masks.
cfg.MODEL.DIFFRAS.INV_SMOOTHNESS_SCHED = (0.001,)
will produce quite sharp rasterization (larger values will be "blurrier") which seems to work well. This can also be made to be dependent on the current iteration:
cfg.MODEL.DIFFRAS.INV_SMOOTHNESS_SCHED = (0.15, 0.005)
cfg.MODEL.DIFFRAS.INV_SMOOTHNESS_STEPS = (50000,)
to start at 0.15 and drop to 0.005 at iteration 50000. This hyperparameter is not particularly sensitive in our experience; however, values that are too large will decrease performance.
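One way to read the schedule is as a piecewise-constant lookup keyed on the iteration, sketched below. The function `inv_smoothness_at` is hypothetical, and treating the step iteration itself as already belonging to the new value is an assumption about the boundary behavior.

```python
from bisect import bisect_right


def inv_smoothness_at(iteration, sched, steps):
    """Look up the scheduled value for `iteration` (piecewise constant)."""
    if len(sched) != len(steps) + 1:
        raise ValueError("sched needs exactly one more entry than steps")
    # Count how many step boundaries have been reached so far.
    return sched[bisect_right(steps, iteration)]


sched = (0.15, 0.005)
steps = (50000,)
for it in (0, 49999, 50000, 90000):
    print(it, inv_smoothness_at(it, sched, steps))
```

A constant schedule such as (0.001,) corresponds to an empty steps tuple under this reading.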
We release models for MS-COCO and Cityscapes.
| Mask head | Backbone | lr sched | Control points | mask AP | download |
|---|---|---|---|---|---|
| BoundaryFormer | R50-FPN | 1× | 64 | 36.1 | model |

| Mask head | Backbone | lr sched | Control points | initialization | mask AP | download |
|---|---|---|---|---|---|---|
| BoundaryFormer | R50-FPN | 1× | 64 | ImageNet | 34.7 | model |
| BoundaryFormer | R50-FPN | 1× | 64 | COCO | 38.3 | model |
BoundaryFormer uses Detectron2 and is further released under the Apache 2.0 license.
If you use BoundaryFormer in your research, please use the following BibTeX entry.
@InProceedings{Lazarow_2022_CVPR,
author = {Lazarow, Justin and Xu, Weijian and Tu, Zhuowen},
title = {Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {4382-4391}
}