The training data for this model comes from a subset of the ImageNet1k dataset, consisting of only 10 classes to ensure efficient training on a single GPU.
Convolutional neural networks (CNNs) dominated the field of computer vision from 2012 to 2020. In 2020, however, the paper An Image is Worth 16x16 Words showed that a Vision Transformer (ViT) could attain state-of-the-art (SOTA) results at a lower computational cost.
The architecture of the ViT is shown in Figure 1. A 2D image is split into a sequence of fixed-size patches, e.g., 9 patches. Each patch is flattened and mapped to an embedding with a learnable linear projection. The resulting patch embeddings are concatenated with an extra learnable class embedding, [cls]. The [cls] embedding is randomly initialized, but it accumulates information from the other tokens as it passes through the transformer, and its final state is used as the transformer's output.
Unlike a CNN, a ViT has no inherent way to recover the position of a patch within the input. Therefore a positional embedding is introduced. It could be concatenated with the embedded patches, but that comes with a computational cost, so the positional embedding is instead added to the embedded patches, which empirically works just as well (Dosovitskiy et al., 2020). After the positional embedding is added, the embedded patches are fed into the Transformer encoder.
Figure 1: Model overview [1].
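A minimal sketch of this embedding pipeline in PyTorch could look as follows; the class name, image size, patch size, and embedding dimension are illustrative assumptions rather than the exact values used here.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Patch splitting + linear projection + [cls] token + additive positional embedding."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A conv with kernel size == stride == patch_size splits the image into
        # non-overlapping patches and applies the linear projection in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Randomly initialized, learnable [cls] token and positional embedding.
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend the [cls] token
        return x + self.pos_embed             # add (not concatenate) the positions
```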
The ViT uses the encoder introduced in the famous Attention Is All You Need paper; see Figure 1.
The encoder consists of two blocks: a multi-headed self-attention block and a multilayer perceptron (MLP). A layer norm is applied before each block, and each block is wrapped in a residual connection. Residual connections are not needed in theory, but empirically they make a big difference: instead of having to learn a desired mapping H(x) directly, each block only needs to learn the residual H(x) − x, which is easier to optimize, especially when the desired mapping is close to the identity.
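Put together, a pre-norm encoder block could be sketched as follows. This is a minimal sketch, assuming PyTorch's built-in nn.MultiheadAttention and illustrative hyperparameters; the actual implementation may differ.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: LayerNorm -> MSA -> residual, then LayerNorm -> MLP -> residual."""

    def __init__(self, dim=768, num_heads=12, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mlp_ratio, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):                                   # x: (B, N, dim)
        y = self.norm1(x)                                   # layer norm before the block
        x = x + self.attn(y, y, y, need_weights=False)[0]   # residual around self-attention
        x = x + self.mlp(self.norm2(x))                     # residual around the MLP
        return x
```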
The self-attention used here is a simple function of three matrices $Q$, $K$ and $V$ (the queries, keys and values):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

where the scaling factor $\sqrt{d_k}$ prevents the dot products from growing so large that the softmax ends up in a region with extremely small gradients.
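The formula translates almost directly into code. A minimal sketch (the function name and tensor shapes are assumptions):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: tensors of shape (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
    weights = scores.softmax(dim=-1)                   # attention weights sum to 1 per row
    return weights @ v                                 # weighted sum of the values
```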
Instead of performing a single attention function with $d_{\text{model}}$-dimensional keys, values and queries, the queries, keys and values are linearly projected $h$ times with different learned projections, and the attention function is applied to each projection in parallel:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O
$$

where

$$
\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$.
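A minimal sketch of multi-headed self-attention, assuming the common trick of computing all three projections for every head with a single linear layer; the class name and default sizes are illustrative:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Project Q, K, V into h heads, attend per head, concatenate, and project back with W^O."""

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # W^Q, W^K, W^V for all heads in one layer
        self.out = nn.Linear(dim, dim)       # W^O

    def forward(self, x):                    # x: (B, N, dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each: (B, heads, N, head_dim)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = scores.softmax(dim=-1) @ v                    # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, -1)         # concatenate the heads
        return self.out(out)
```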
To regularize the model, dropout and the stochastic depth regularization technique are used. The latter is a training procedure that enables the seemingly contradictory setup of training short networks while using deep networks at test time. This is accomplished by randomly dropping a subset of layers and bypassing them with the identity function during training.
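A minimal sketch of stochastic depth applied to a residual branch; the class name and drop probability are assumptions:

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Drops an entire residual branch per sample with probability drop_prob during training."""

    def __init__(self, drop_prob=0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, branch_output):                  # branch_output: (B, N, dim)
        if not self.training or self.drop_prob == 0.0:
            return branch_output
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample; kept branches are rescaled to preserve the expectation.
        mask = (torch.rand(branch_output.shape[0], 1, 1,
                           device=branch_output.device) < keep_prob).float()
        return branch_output * mask / keep_prob
```

Each residual branch in an encoder block would then be wrapped in such a module, so that a dropped block reduces to the identity mapping.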
For data augmentation, two quite recent techniques are used, namely Mixup and RandAugment. Mixup constructs virtual training examples

$$
\tilde{x} = \lambda x_i + (1 - \lambda)x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda)y_j
$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two examples drawn at random from the training data and $\lambda \in [0, 1]$.
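A minimal sketch of this construction, assuming one-hot encoded labels and a $\lambda$ drawn from a Beta distribution (the usual choice in the Mixup paper); the function name and the value of `alpha` are illustrative:

```python
import numpy as np
import torch

def mixup(x, y_onehot, alpha=0.2):
    """Mix each example in the batch with a randomly chosen partner."""
    lam = np.random.beta(alpha, alpha)         # same lambda for inputs and targets
    perm = torch.randperm(x.size(0))           # random pairing within the batch
    mixed_x = lam * x + (1 - lam) * x[perm]
    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return mixed_x, mixed_y
```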
RandAugment transforms the training data with the following transformations: rotate, shear-x, shear-y, translate-x, translate-y, auto-contrast, sharpness, and identity. Which transformations are applied, and the magnitude of each, are selected at random.
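For reference, torchvision ships a built-in RandAugment that can be dropped into the input pipeline. Note that it draws from a larger pool of transformations than the subset listed above, so this is only a sketch of how the augmentation could be wired up:

```python
from torchvision import transforms

# Illustrative training pipeline; the image size and RandAugment settings are assumptions.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(num_ops=2, magnitude=9),  # 2 random ops per image, fixed magnitude
    transforms.ToTensor(),
])
```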
The ViT model reached an accuracy of 91.3% on the validation set after training; see Figure 2.
Figure 2: Result after training. The ViT model reached an accuracy of 91.3%.
Aside from the papers cited above, I found the following resources useful:
- pytorch-original-transformer - Aleksa Gordić
- Illustrated Transformer - Jay Alammar
- vit-pytorch - Phil Wang