https://arxiv.org/abs/2205.01917
CoCa: Contrastive Captioners are Image-Text Foundation Models (Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu)
A vision-language model that combines generative pretraining and contrastive pretraining into a single model.
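As a rough illustration of what "combining" the two objectives can look like, here is a minimal pure-Python sketch of a weighted sum of a symmetric image-text contrastive loss and a token-level captioning (cross-entropy) loss. Everything in it is an assumption for illustration, not the authors' implementation: the function names, the `temperature` default, the weights `w_con` and `w_cap`, and the toy list-based inputs are all hypothetical.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over a list of floats.
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def normalize(vec):
    # L2-normalize a vector so dot products become cosine similarities.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    # Symmetric contrastive loss: the i-th image and i-th text are the
    # positive pair, all other pairs in the batch act as negatives.
    # `temperature` is a hypothetical default, not a value from the paper.
    imgs = [normalize(v) for v in img_embs]
    txts = [normalize(v) for v in txt_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: softmax over each row.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: softmax over each column.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return i2t + t2i

def captioning_loss(token_logits, target_ids):
    # Mean negative log-likelihood of the target tokens.
    # token_logits[b][t] is a list of vocabulary logits; target_ids[b][t] is an int.
    total, count = 0.0, 0
    for seq_logits, seq_targets in zip(token_logits, target_ids):
        for step_logits, target in zip(seq_logits, seq_targets):
            total -= log_softmax(step_logits)[target]
            count += 1
    return total / count

def combined_loss(img_embs, txt_embs, token_logits, target_ids,
                  w_con=1.0, w_cap=1.0):
    # Weighted sum of the two objectives; the weights are placeholders.
    return (w_con * contrastive_loss(img_embs, txt_embs)
            + w_cap * captioning_loss(token_logits, target_ids))
```

In a real model the embeddings and token logits would come from the image encoder and text decoder; here they are plain lists so that the arithmetic of the two loss terms is easy to follow.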
#vision-language