contributors: @GitYCC
[paper]
- WHY??? The idea of masked autoencoders, a form of more general denoising autoencoders, is natural and applicable in computer vision as well. However, despite significant interest in this idea following the success of BERT, progress of autoencoding methods in vision lags behind NLP.
- We ask: what makes masked autoencoding different between vision and language? We attempt to answer this question from the following perspectives:
- Architectures were different.
- Convolutional networks (vision) vs. Transformers (language)
- This architectural gap has been addressed with the introduction of Vision Transformers (ViT) and should no longer present an obstacle.
- Information density is different between language and vision.
- Languages are human-generated signals that are highly semantic and information-dense.
- Images are natural signals with heavy spatial redundancy.
- e.g. a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes.
- To overcome this difference and encourage learning useful features, we show that a simple strategy works well in computer vision: masking a very high portion of random patches.
- The autoencoder’s decoder, which maps the latent representation back to the input, plays a different role between reconstructing text and images.
- Languages: predicts missing words that contain rich semantic information
- -> the decoder can be trivial (an MLP in BERT)
- Images: reconstructs pixels, which are at a lower semantic level
- -> use lightweight Transformer layers
- Key Results:
- Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data.
- Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
- Masking
- Random sampling with a high masking ratio largely eliminates redundancy, thus creating a task that cannot be easily solved by extrapolation from visible neighboring patches.
- MAE Encoder
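- Just as in a standard ViT, our encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a series of Transformer blocks. Unlike a standard ViT, it operates only on the visible (unmasked) patches: masked patches are removed and no mask tokens are fed to the encoder.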
- Win-win scenario:
- it optimizes accuracy
- it reduces memory consumption, because the encoder processes only a small portion of the patches => this enables us to easily scale our MAE to large models
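To make the saving concrete, here is a small arithmetic sketch. It assumes the encoder sees only the visible patches (the "small portion of patches" above). The image size (224), patch size (16), masking ratio (0.75), and the helper name `visible_token_count` are assumptions chosen for illustration; the class token is ignored, and self-attention cost is approximated as the square of the token count.

```python
def visible_token_count(image_size, patch_size, mask_ratio):
    """Return (total patches, patches the encoder actually processes)."""
    num_patches = (image_size // patch_size) ** 2
    num_visible = int(num_patches * (1 - mask_ratio))
    return num_patches, num_visible


# Assumed settings for illustration only.
num_patches, num_visible = visible_token_count(224, 16, 0.75)

# Self-attention compares every token with every other token,
# so its cost grows roughly with the square of the sequence length.
attention_pairs_full = num_patches ** 2
attention_pairs_visible = num_visible ** 2
attention_saving = attention_pairs_full / attention_pairs_visible

print(num_patches, num_visible)  # 196 49
print(attention_saving)          # 16.0
```

Under these assumptions the encoder handles a quarter of the tokens, and the pairwise attention work shrinks by about 16x, which is where the room to scale up the encoder comes from.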
- MAE Decoder
- The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches, and (ii) mask tokens.
- Mask token: a shared, learned vector (not a fixed zero vector) that indicates the presence of a missing patch to be predicted
- We add positional embeddings to all tokens in this full set; without this, mask tokens would have no information about their location in the image.
- The decoder has another series of Transformer blocks.
- The MAE decoder is only used during pre-training to perform the image reconstruction task
- drop it when fine-tuning
- We experiment with very small decoders, narrower and shallower than the encoder.
- Reconstruction Target
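- Loss: the mean squared error (MSE) between the reconstructed and original images in pixel space, computed only on masked patches
- Variant: normalized pixel values as the target (each patch is normalized by its own mean and standard deviation)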
- Using normalized pixels as the reconstruction target improves representation quality in our experiments.
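The reconstruction loss can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes the loss is averaged over masked patches only and that the normalized variant uses each patch's own mean and variance. The function name `mae_loss`, the `norm_pix` flag, and the `eps` value are hypothetical choices.

```python
import numpy as np


def mae_loss(pred, target, mask, norm_pix=True, eps=1e-6):
    """MSE between predicted and target patches, averaged over masked patches.

    pred, target: arrays of shape (num_patches, patch_dim)
    mask:         boolean array of shape (num_patches,), True = masked
    norm_pix:     if True, normalize each target patch by its own statistics
    eps:          small constant for numerical stability (assumed value)
    """
    if norm_pix:
        mean = target.mean(axis=-1, keepdims=True)
        var = target.var(axis=-1, keepdims=True)
        target = (target - mean) / np.sqrt(var + eps)

    per_patch = ((pred - target) ** 2).mean(axis=-1)  # one MSE value per patch
    return float(per_patch[mask].mean())              # visible patches are ignored


# Tiny usage example: 3 patches of 4 pixels each, patches 0 and 2 masked.
target = np.array([[1.0, 2.0, 3.0, 4.0],
                   [2.0, 2.0, 2.0, 2.0],
                   [0.0, 4.0, 0.0, 4.0]])
mask = np.array([True, False, True])
pred = target + 1.0
print(mae_loss(pred, target, mask, norm_pix=False))
```

Because only masked positions enter the average, errors on visible patches do not affect the loss.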
- Simple Implementation
- First we generate a token for every input patch (by linear projection with an added positional embedding).
- Next we randomly shuffle the list of tokens and remove the last portion of the list, based on the masking ratio.
- After encoding, we append a list of mask tokens to the list of encoded patches, and unshuffle this full list (inverting the random shuffle operation) to align all tokens with their targets.
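The shuffle, drop, append, and unshuffle steps above can be sketched in NumPy. This is an illustrative sketch under assumptions, not the authors' code: the function names `random_masking` and `restore_with_mask_tokens` are hypothetical, an identity function stands in for the encoder, and the masking ratio of 0.75 is an assumed value.

```python
import numpy as np


def random_masking(tokens, mask_ratio, rng):
    """Shuffle the token list and keep only the first portion.

    tokens: array of shape (num_patches, dim)
    Returns the visible tokens, a boolean mask (True = masked),
    and the indices needed to undo the shuffle.
    """
    num_patches = tokens.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))

    ids_shuffle = rng.permutation(num_patches)  # random order of patch indices
    ids_restore = np.argsort(ids_shuffle)       # inverse of that permutation
    ids_keep = ids_shuffle[:num_keep]           # the last portion is removed

    visible = tokens[ids_keep]
    mask = np.ones(num_patches, dtype=bool)
    mask[ids_keep] = False
    return visible, mask, ids_restore


def restore_with_mask_tokens(encoded, mask_token, ids_restore):
    """Append mask tokens, then unshuffle so tokens align with their targets."""
    num_masked = ids_restore.shape[0] - encoded.shape[0]
    mask_tokens = np.tile(mask_token, (num_masked, 1))
    full_shuffled = np.concatenate([encoded, mask_tokens], axis=0)
    return full_shuffled[ids_restore]


# Usage: 16 patches with 3-dim tokens; the "encoder" here is the identity.
rng = np.random.default_rng(0)
tokens = np.arange(16 * 3, dtype=float).reshape(16, 3)
visible, mask, ids_restore = random_masking(tokens, 0.75, rng)
encoded = visible                 # stand-in for the real encoder output
mask_token = np.full(3, -1.0)     # stand-in for the shared, learned vector
decoder_input = restore_with_mask_tokens(encoded, mask_token, ids_restore)
print(decoder_input.shape)
```

No sparse operations are needed: masking is just indexing into a shuffled list, and `np.argsort` of the permutation gives the indices that put every token back at its original position.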
- We do self-supervised pre-training on the ImageNet-1K (IN1K) training set. Then we do supervised training to evaluate the representations with (i) end-to-end fine-tuning or (ii) linear probing.
- What is "linear probing"? Freeze the weights of the pre-trained ViT and train only a linear classifier on top of its features.
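A toy NumPy sketch of linear probing is given below. Everything in it is a hypothetical stand-in: a fixed random projection (`W_enc`, `frozen_encoder`) plays the role of the pre-trained ViT, the data are synthetic, and the learning rate and step count are arbitrary. The point is only that gradients update the linear head while the encoder weights stay untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained encoder: a fixed random projection.
W_enc = rng.normal(size=(8, 16))


def frozen_encoder(x):
    """Compute features; W_enc is never updated during probing."""
    return np.tanh(x @ W_enc)


def train_linear_probe(features, labels, num_classes, lr=0.1, steps=300):
    """Train a softmax linear classifier on fixed features with gradient descent."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    losses = []
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        losses.append(float(-np.log(probs[np.arange(n), labels] + 1e-12).mean()))
        grad = (probs - onehot) / n                  # gradient w.r.t. logits
        W -= lr * features.T @ grad                  # only the head is updated
        b -= lr * grad.sum(axis=0)
    return W, b, losses


# Synthetic data: the label depends on the sign of the first input dimension.
x = rng.normal(size=(64, 8))
y = (x[:, 0] > 0).astype(int)

features = frozen_encoder(x)
W_head, b_head, losses = train_linear_probe(features, y, num_classes=2)
print(losses[0], losses[-1])
```

End-to-end fine-tuning differs in that the encoder weights would also receive gradient updates.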
- Baseline: ViT-Large
- Scratch: 82.5%
- MAE pretrain / finetune: 84.9%
- Properties:
- Masking ratio:
- Decoder design
- Observation: A sufficiently deep decoder is important for linear probing.
- Possible Reason: The last several layers in an autoencoder are more specialized for reconstruction, but are less relevant for recognition. A reasonably deep decoder can account for the reconstruction specialization, leaving the latent representations at a more abstract level.
- Interestingly, our MAE with a single-block decoder can perform strongly with fine-tuning (84.8%). Note that a single Transformer block is the minimal requirement to propagate information from visible tokens to mask tokens. Such a small decoder can further speed up training.
- Data Augmentation
- Surprisingly, our MAE behaves decently even if using no data augmentation (only center-crop, no flipping). This property is dramatically different from contrastive learning and related methods, which heavily rely on data augmentation.
- Comparisons with Previous Results