This document covers frequently asked questions.
A: For training, ADE20K and PASCAL VOC 2012 with crop size 473*473 and batch size 16 require 4*12G GPUs (ResNet50/101 based), and Cityscapes with crop size 713*713 and batch size 16 requires 8*12G GPUs (ResNet50/101 based). A workstation with 8*12G GPUs can run all experiments efficiently. For testing, one GPU with 4GB of memory is enough.
A: Some choices: 1. Reduce the crop size. 2. Reduce the batch size. 3. Fix the BN parameters (scale and shift) of the pre-trained models and do not add new BN layers to the network (the same as MaskRCNN does). In this case you may need to modify some code, after which training on one GPU is fine. These solutions may harm performance to some degree.
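As a rough illustration of choice 3, freezing BN in PyTorch might look like the sketch below. The helper name `freeze_batchnorm` and the toy network are hypothetical and not part of this codebase. The sketch also freezes the running statistics by switching BN layers to eval mode, which is an assumption about what "fix" should cover.

```python
import torch.nn as nn


def freeze_batchnorm(model: nn.Module) -> nn.Module:
    """Put every BN layer in eval mode and stop gradient updates to its scale and shift.

    Hypothetical helper, shown only as a sketch.
    """
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.eval()  # use the stored running mean/var instead of batch statistics
            for param in module.parameters():  # weight (scale) and bias (shift)
                param.requires_grad = False
    return model


# Toy network used only to demonstrate the helper.
toy = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
toy.train()            # train() switches BN layers back to training mode ...
freeze_batchnorm(toy)  # ... so the helper has to be called again after each train()
```

With BN frozen this way, the statistics no longer depend on the per-GPU batch size, which is why a small batch on a single GPU becomes workable.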
A: Mainly because of interface differences in the CUDA extensions (`syncbn` for multithreading training in this codebase). PyTorch versions <=0.4.1 use FFI, which is not supported after 0.5.0, and JIT is now preferred. To adapt to a former version such as 0.4.1, you need to change the interface of the extensions under the folder `lib`.
A: Batch normalization synchronized across multiple GPUs is important for high-level vision tasks, especially when a single card's batch size is not large enough (an effective batch size of 16 is a good choice). The former PSPNet Caffe version uses an OpenMPI-based implementation. For multithreading training, this codebase uses the synchronized batch normalization from the EncNet repo, and for multiprocessing training, NVIDIA/apex is adopted. Another multithreading syncbn module is Synchronized-BatchNorm-PyTorch; other multiprocessing syncbn modules are inplace_abn and the newly released official implementation in PyTorch 1.1.0.
A: Two possible choices:
- Multiprocessing training is highly recommended over multithreading training.
- Use 1/8 scale ground truth as label guidance. This slightly speeds up training but slightly decreases performance (in most cases it is not as good as full scale label guidance).
A: The provided ResNet.py and its pre-trained models differ from the official implementation in the input stem, where the original 7 × 7 convolution is replaced by three consecutive 3 × 3 convolutions. This replacement is the same as in the models used by the original PSPNet Caffe version. The classification accuracy is slightly better than that of the official models. ResNet50/101/152 comparisons in terms of top-1 accuracy: ours vs official = 76.63/78.25/78.59 vs 76.15/77.37/78.31. The pre-trained models have a slight (positive) influence on the final segmentation models. You may have a glance at Sec 4.2 ResNet Tweaks in this paper. You can also use the officially released models as initialization, in which case you need to modify the files under the folder `model` accordingly.
A: Many details differ. Some of them are:
- Pre-trained models: the weights used differ between this PyTorch codebase and the former PSP/ANet Caffe version.
- Pre-processing of images: this PyTorch codebase follows the official PyTorch image pre-processing style (normalize to 0~1, then subtract `mean` as [0.485, 0.456, 0.406] and divide by `std` as [0.229, 0.224, 0.225]), while the former Caffe version normalizes simply by subtracting the image `mean` as [123.68, 116.779, 103.939].
- Training steps: we measure training in steps for the Caffe version and in epochs for PyTorch. The number of optimization steps after conversion is slightly different (e.g., on ADE20K, 150k steps with batch size 16 equal 150k*16/20210≈119 epochs).
- SGD optimization difference: see the note in the SGD implementation. This difference may influence `poly` style learning rate decay, especially in the last steps where learning rates are very small.
- Weight decay on the biases and on the scale and shift of BN in the two training settings, see technical reports 1, 2.
- Label guidance: the former Caffe version mainly uses 1/8 scale label guidance (the former `interp` layer in Caffe has only a CPU implementation, so we avoided using larger label guidance), while the released segmentation models in this repository mainly use full scale label guidance (the final logits are interpolated to the original crop size for loss calculation instead of using the 1/8 feature downsampling size).
- The performance variance of attention-based models (e.g., PSANet) is relatively high; this can also be observed in CCNet. Besides, some low-frequency classes (e.g., 'bus' in Cityscapes) may also affect the performance a lot.
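The pre-processing, step-to-epoch, and `poly` decay items above can be made concrete with a small sketch. The function names are hypothetical, the `power=0.9` default and the base learning rate in the example are assumptions (the text gives no values), and the channel order is assumed to match the order of the listed means.

```python
# Per-channel statistics quoted in the list above.
PYTORCH_MEAN = [0.485, 0.456, 0.406]
PYTORCH_STD = [0.229, 0.224, 0.225]
CAFFE_MEAN = [123.68, 116.779, 103.939]


def normalize_pytorch_style(pixel):
    """Scale an 8-bit pixel to 0~1, subtract the mean, then divide by the std."""
    return [(v / 255.0 - m) / s for v, m, s in zip(pixel, PYTORCH_MEAN, PYTORCH_STD)]


def normalize_caffe_style(pixel):
    """Only subtract the image mean from the 8-bit pixel; no rescaling."""
    return [v - m for v, m in zip(pixel, CAFFE_MEAN)]


def steps_to_epochs(steps, batch_size, num_train_images):
    """Convert a Caffe-style step count into the equivalent number of epochs."""
    return steps * batch_size / num_train_images


def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly decay: base_lr * (1 - cur_iter / max_iter) ** power (power is an assumed value)."""
    return base_lr * (1 - cur_iter / max_iter) ** power


# The ADE20K example from the text: 150k steps, batch size 16, 20210 training images.
ade20k_epochs = steps_to_epochs(150_000, 16, 20_210)
print(round(ade20k_epochs))  # 119

# Near the end of training the learning rate becomes very small, which is where
# small differences between SGD implementations matter most.
print(poly_lr(0.01, 149_000, 150_000))
```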
A: Let `C` be the number of classes in the semantic segmentation dataset (e.g., 150 for ADE20K, 21 for PASCAL VOC 2012 and 19 for Cityscapes). Valid label ids then range from `0` to `C-1`. We tend to set the ignore label to 255: loss calculation skips it, so no penalty is given on the related ground truth regions. If the original ground truth ids are not in the needed format, you may need to do label id mapping (e.g., the original ADE20K ids are 0-150, where 0 stands for void; the original Cityscapes labels also need mapping).
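A minimal sketch of such a mapping for the ADE20K-like case (original ids 0 to C, with 0 as void) follows. The function name is hypothetical, and mapping void to 255 and shifting every other id down by one is an assumed convention, not something this FAQ prescribes.

```python
IGNORE_LABEL = 255  # ignore label mentioned in the answer above


def remap_void_first_labels(label_ids, num_classes):
    """Map original ids 0..C (0 = void) to training ids 0..C-1, sending void to 255.

    Hypothetical helper operating on a flat list of integer label ids.
    """
    remapped = []
    for label in label_ids:
        if label == 0:
            remapped.append(IGNORE_LABEL)  # void pixels are skipped by the loss
        elif 1 <= label <= num_classes:
            remapped.append(label - 1)     # shift into the valid range 0..C-1
        else:
            raise ValueError(f"unexpected label id: {label}")
    return remapped


# ADE20K-like example with C = 150: void, first class, last class.
print(remap_void_first_labels([0, 1, 150], num_classes=150))  # [255, 0, 149]
```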
A: Prepare the `$DATASET$_colors.txt` and `$DATASET$_names.txt` files accordingly. Get the training/testing ground truths and lists ready.
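A quick consistency check on the two files can catch mismatches before training. The sketch below assumes, without confirmation from this FAQ, that the names file holds one class name per line and the colors file holds one whitespace-separated `R G B` triple per line. All function names and the toy dataset are hypothetical.

```python
def parse_names(text):
    """Return one class name per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_colors(text):
    """Return one (R, G, B) tuple per non-empty line of whitespace-separated integers."""
    colors = []
    for line in text.splitlines():
        if not line.strip():
            continue
        values = tuple(int(v) for v in line.split())
        if len(values) != 3 or any(not 0 <= v <= 255 for v in values):
            raise ValueError(f"bad color line: {line!r}")
        colors.append(values)
    return colors


def check_dataset_files(names_text, colors_text, num_classes):
    """Check that both files list exactly num_classes entries; return a name -> color map."""
    names = parse_names(names_text)
    colors = parse_colors(colors_text)
    if len(names) != num_classes or len(colors) != num_classes:
        raise ValueError(
            f"expected {num_classes} entries, got {len(names)} names and {len(colors)} colors"
        )
    return dict(zip(names, colors))


# Toy three-class dataset; in practice the text would be read from the two files.
toy_names = "background\ncat\ndog\n"
toy_colors = "0 0 0\n128 0 0\n0 128 0\n"
palette = check_dataset_files(toy_names, toy_colors, num_classes=3)
```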
A: Sorted by platforms, you are welcome to add some more.