Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. Richard Zhang, Phillip Isola, Alexei A. Efros (CVPR 2017)
- Two disjoint sub-networks for unsupervised representation learning
- Each predicts one subset of the data channels from the complementary subset
- Together: features cover the entire input signal
- Forcing the network to solve cross-prediction tasks --> induces a representation that transfers well to unseen tasks (toy sketch after this list)
- Split: grayscale vs. color channels (L vs. ab in Lab space) or RGB vs. depth for RGB-D
- Regular autoencoder (AE): forced abstraction through bottleneck --> information loss
- Alternative to a bottleneck: withhold part of the input and predict it, e.g. colorization or context encoders that inpaint blocks of pixels
- The latter is difficult: image synthesis is hard + domain gap (images with missing blocks during pre-training vs. full images at test time) + inpainting might only require low-level reasoning --> poor representation learning
- However, colorization does work (grayscale and color channels are in spatial correspondence, and the grayscale input looks the same at pre-training and test time --> no domain gap)
- Cross-channel encoders (of which colorization is an example) treat the channels unequally: one subset is the input for feature extraction, the other is only a prediction target, so the learned features cover just part of the signal --> this is where our solution comes in
- No need for a bottleneck
- Abstraction is instead forced by withholding input channels from each sub-network (a structured form of input dropout)
- Combined, the sub-networks are pre-trained on the full input data --> no domain gap at test time (see the feature-extraction sketch below)
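
A minimal sketch of the split-brain setup in PyTorch, assuming the Lab split from the bullets above. `SplitBrainAE`, `make_branch`, and all layer sizes are illustrative stand-ins, not the paper's AlexNet-style architecture, and the plain L2 regression loss is a simplification (the paper quantizes the withheld channels and trains with a classification loss):

```python
import torch
import torch.nn as nn

def make_branch(in_ch: int, out_ch: int) -> nn.Sequential:
    # Small fully convolutional sub-network; only illustrates the data flow.
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, out_ch, 3, padding=1),
    )

class SplitBrainAE(nn.Module):
    """Two disjoint sub-networks, each predicting the channels it never sees."""
    def __init__(self) -> None:
        super().__init__()
        self.l_to_ab = make_branch(1, 2)  # grayscale (L) --> color (ab)
        self.ab_to_l = make_branch(2, 1)  # color (ab) --> grayscale (L)

    def forward(self, lab: torch.Tensor):
        L, ab = lab[:, :1], lab[:, 1:]    # disjoint split of the Lab channels
        return self.l_to_ab(L), self.ab_to_l(ab)

model = SplitBrainAE()
lab = torch.randn(8, 3, 32, 32)           # stand-in batch of Lab images
pred_ab, pred_l = model(lab)
# Each branch is supervised by the channel subset withheld from its input.
loss = nn.functional.mse_loss(pred_ab, lab[:, 1:]) \
     + nn.functional.mse_loss(pred_l, lab[:, :1])
loss.backward()
```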
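
After pre-training, the two branches act as a single feature extractor whose representation covers the full input signal, which is what the last three bullets claim (no bottleneck, abstraction via withheld channels, no train/test domain gap). The helper below is hypothetical, not from the paper's code; it simply concatenates the activations before each branch's output layer:

```python
# Hypothetical feature extractor over the full input (both channel subsets).
def extract_features(model: SplitBrainAE, lab: torch.Tensor) -> torch.Tensor:
    L, ab = lab[:, :1], lab[:, 1:]
    f_l = model.l_to_ab[:-1](L)    # every layer except the output conv
    f_ab = model.ab_to_l[:-1](ab)
    return torch.cat([f_l, f_ab], dim=1)

with torch.no_grad():
    feats = extract_features(model, lab)  # shape: (8, 128, 32, 32)
```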