Three models were experimented with:
- SETR: A pure transformer encoder paired with a variety of upsampling decoders for semantic segmentation. This model was adapted and implemented from the paper published in December 2020 by Sixiao Zheng et al., titled Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In particular, the SETR-PUP and SETR-MLA variants, i.e. the models with progressive-upsampling and multi-level feature aggregation decoders, were selected and implemented for their state-of-the-art performance on benchmark datasets.
- TransUNet: A UNet-transformer hybrid that uses a UNet-style encoder to extract high-resolution feature maps, a transformer to tokenize and encode the image, and a UNet-like decoder that upsamples while fusing the previously extracted feature maps through skip connections. This model was adapted and implemented from the paper published in February 2021 by Jieneng Chen et al., titled TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation.
- UNet: the well-known UNet model. This 4-layer-deep variant was adapted and implemented from the paper published in November 2018 by Ari Silburt et al., titled Lunar Crater Identification via Deep Learning (a minimal sketch of the architecture follows the figure note below).
*Architecture overviews of SETR, TransUNet, and UNet; figures are authored in and extracted from the respective original papers.*
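To make the UNet structure concrete, here is a minimal PyTorch sketch of a 4-layer-deep encoder-decoder with skip connections. The class name, channel widths, and 13-class output are illustrative assumptions, not the exact implementation in `unet_model.py`:

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU: the basic UNet building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNetSketch(nn.Module):
    def __init__(self, n_classes=13, base=64):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]      # 4 encoder levels
        self.encs = nn.ModuleList()
        in_ch = 3
        for ch in chs:
            self.encs.append(double_conv(in_ch, ch))
            in_ch = ch
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(chs[-1], chs[-1] * 2)
        self.ups, self.decs = nn.ModuleList(), nn.ModuleList()
        for ch in reversed(chs):
            self.ups.append(nn.ConvTranspose2d(ch * 2, ch, 2, stride=2))
            self.decs.append(double_conv(ch * 2, ch))   # concat doubles channels
        self.head = nn.Conv2d(chs[0], n_classes, 1)     # per-pixel class logits

    def forward(self, x):
        skips = []
        for enc in self.encs:                           # contracting path
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.ups, self.decs, reversed(skips)):
            x = up(x)                                   # expanding path
            x = dec(torch.cat([skip, x], dim=1))        # fuse skip connection
        return self.head(x)                             # (N, n_classes, H, W)
```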
Two loss functions were experimented with:
- Cross Entropy between the predicted and ground-truth class assignments of the pixels in a given frame, as implemented in `torch.nn.CrossEntropyLoss`.
- As suggested in the TransUNet paper, a combination of the Cross Entropy loss above and the Dice loss between the predicted and ground-truth class assignments of the pixels in a given frame. This loss function is implemented in `utils.py`; the final loss is a 1:1 weighted sum of the Cross Entropy loss and the Dice loss (a sketch follows below).
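For illustration, here is a minimal sketch of such a combined loss; the function names are hypothetical, and the actual implementation lives in `utils.py`:

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    # Soft Dice averaged over classes; target holds integer class IDs.
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                    # sum over batch and pixels
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    return 1.0 - ((2.0 * intersection + eps) / (cardinality + eps)).mean()

def combined_loss(logits, target):
    # 1:1 weighted sum of Cross Entropy and Dice.
    return F.cross_entropy(logits, target) + dice_loss(logits, target)
```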
The models were trained on the Cityscapes dataset. In particular, training used a processed subsample created for the Pix2Pix project and the paper published in Nov 2018 by Phillip Isola et al., titled Image-to-Image Translation with Conditional Adversarial Networks. The dataset contains 2975 training images and 500 validation images, where each 256 x 256 image is paired with an annotated class label map. The images were extracted from videos recorded in Germany.
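A minimal loading sketch, assuming the Pix2Pix convention of storing each photo and its colour-coded label map side by side in a single 256 x 512 file (the class name and directory layout are assumptions):

```python
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class CityscapesPairs(Dataset):
    def __init__(self, root):
        # One .jpg per sample: photo on the left, label map on the right (assumed).
        self.files = sorted(Path(root).glob("*.jpg"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        pair = np.array(Image.open(self.files[idx]))        # (256, 512, 3)
        photo, label_rgb = pair[:, :256], pair[:, 256:]
        x = torch.from_numpy(photo).permute(2, 0, 1).float() / 255.0
        # label_rgb still needs mapping to integer class IDs; see the
        # K-Means discussion further below.
        return x, torch.from_numpy(label_rgb.copy())
```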
Given the transformer models' size (> 150M parameters), the available computing power was insufficient to train SETR and TransUNet efficiently. Although the architectures and training code were implemented, training of these two models was omitted.
The UNet was trained on a Tesla T4 GPU on Google Colab, for 40 epochs with the Adam optimizer, a learning rate of 1e-3, and a batch size of 16. Across several training runs, it was observed that although the model trained with the combined Cross Entropy and Dice loss yields a higher IoU at test inference, the model trained with Cross Entropy alone produces visually better results.
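A condensed sketch of this training setup (the model, loss, and data are stand-ins from the earlier sketches; the actual loop is in `train.py`):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model = UNetSketch().to(device)                 # sketch class from above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()         # or combined_loss from above

# Synthetic stand-in data so the sketch runs end to end; swap in the real dataset.
images = torch.rand(32, 3, 256, 256)
labels = torch.randint(0, 13, (32, 256, 256))
loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

for epoch in range(40):                         # 40 epochs, as reported
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)           # y: (N, H, W) integer class IDs
        loss.backward()
        optimizer.step()
```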
*UNet training and validation loss curves.*
The best validation loss observed is 0.2391 (Cross Entropy) from UNet, and the model consistently achieves an IoU of > 0.995 during test inference. Yet the visualized results are flawed, since the ground-truth classification matrix is created by K-Means clustering of a 256000 x 3 colour matrix into 13 classes, a workaround for the instability of JPEG colour compression. Other possible contributing factors include intrinsic limitations of the model architecture (i.e. design and/or complexity), the small dataset cardinality, the small variance within the dataset, environmental variance during testing, etc.
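A minimal sketch of that label-recovery step, assuming scikit-learn's `KMeans` (note that the cluster-to-class assignment is arbitrary per fit, which is one source of the flakiness described above):

```python
import numpy as np
from sklearn.cluster import KMeans

def rgb_labels_to_ids(label_rgb, n_classes=13, seed=0):
    # label_rgb: (N, H, W, 3) colour-coded label maps whose colours were
    # perturbed by JPEG compression; cluster them back into n_classes IDs.
    pixels = label_rgb.reshape(-1, 3).astype(np.float32)    # e.g. 256000 x 3
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit(pixels)
    return km.labels_.reshape(label_rgb.shape[:-1])         # (N, H, W) class IDs
```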
Still, it can be observed that the model performs reasonable semantic segmentation when run on test images and videos.
The checkpoint for UNet (~415 MB) can be downloaded via this link. To use it, place it at `./checkpoints/UNet/U-Net2.pth`.
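For example, the checkpoint could be loaded along these lines (assuming it stores a plain state dict; `UNetSketch` is the stand-in class from the sketch above, not the repo's own UNet class):

```python
import torch

model = UNetSketch()
state = torch.load("./checkpoints/UNet/U-Net2.pth", map_location="cpu")
model.load_state_dict(state)
model.eval()                 # switch to inference mode
```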
- Set up all configurations in `config.py`.
- To train networks: `python train.py`.
- To run inference on test images or videos: `python inference.py`. The inference speed is 15-20 fps on a Tesla T4 GPU.
- To view training history in TensorBoard: `tensorboard --logdir runs`.
- Model architectures are defined in the `SETR_models` and `TransUNet_models` directories, as well as in `unet_model.py`.
- Sixiao Zheng et al.'s 2020 paper Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, and its affiliated repo by Abhay Gupta.
- Jieneng Chen et al.'s 2021 paper TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation, and its official affiliated repo.
- Ari Silburt et al.'s 2018 paper Lunar Crater Identification via Deep Learning.
- Aashu Prasad's Kaggle tutorial Video Segmentation Self Driving Cars OpenCV Pytorch.
- The subset of the Cityscapes dataset prepared for the Pix2Pix project.
- Maxime's Inside Machine Learning tutorial What is a Transformer.