\documentclass[]{article}
\usepackage{amsmath}
\usepackage{listings}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{indentfirst}
%\usepackage[colorinlistoftodos]{todonotes}
%opening
\title{Unpaired Text-to-Image-to-Text Translation using Cycle Consistent Adversarial Networks}
\author{Jeremy Ma, Satya Krishna Gorti}
\date{}
\begin{document}
\maketitle
\section{Introduction}
Text-to-image synthesis is a challenging problem in which current state-of-the-art results leave substantial room for improvement. Images synthesized by existing methods give a rough sketch of the described scene but fail to capture the true essence of what the text describes. The recent success of Generative Adversarial Networks (GANs) \cite{goodfellow2014generative} indicates that they are a good candidate architecture for approaching this problem.
\\
However, the very nature of this problem is such that a piece of text can map to multiple valid images. The lack of a direct one-to-one mapping means that traditional conditional GANs \cite{mirza2014conditional} cannot be applied directly. We draw our inspiration from recent work on image-to-image translation \cite{liu2017unsupervised}\cite{zhu2017unpaired}, where cycle-consistent GANs have been trained and achieved very impressive results.
\\
We believe that using a cycle-consistent GAN for text-to-image-to-text translation will generate better results than existing approaches and give more photo-realistic images. The added advantage of framing the problem in a cycle-consistent manner is that the architecture serves not only as a text-to-image synthesis network but also as an image captioning network.
\\
Therefore, we have two generators $G$ and $F$. We train a mapping $G: T_{emb} \mapsto Y$ and an inverse mapping $F: Y \mapsto T_{emb}$ in a cycle-consistent manner, where $T_{emb}$ is an embedding of the text that describes an image and $Y$ is the image domain. The generators $G$ and $F$ have corresponding discriminators $D_g$ and $D_f$.
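As a sketch of the objective we have in mind (the exact loss terms and weighting are still to be decided), adapting the formulation of \cite{zhu2017unpaired} to our setting gives a cycle-consistency term and a combined objective of the form
\begin{align*}
\mathcal{L}_{\mathrm{cyc}}(G, F) ={}& \mathbb{E}_{t \sim p(T_{emb})}\big[\|F(G(t)) - t\|_1\big] + \mathbb{E}_{y \sim p(Y)}\big[\|G(F(y)) - y\|_1\big],\\
\mathcal{L}(G, F, D_g, D_f) ={}& \mathcal{L}_{\mathrm{GAN}}(G, D_g, T_{emb}, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_f, Y, T_{emb}) + \lambda\,\mathcal{L}_{\mathrm{cyc}}(G, F),
\end{align*}
where $\mathcal{L}_{\mathrm{GAN}}$ denotes a standard adversarial loss and $\lambda$ weights the cycle-consistency term against the adversarial terms.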
\section{Related work}
Generative Adversarial Networks (GANs) have achieved impressive results on problems such as image generation \cite{DBLP:journals/corr/RadfordMC15}. Conditional GANs, introduced in \cite{mirza2014conditional}, build on GANs by learning to approximate the data distribution conditioned on an input.
\\
In the recent past there have been several attempts at text-to-image synthesis using conditional GANs, such as \cite{reed2016generative}\cite{dash2017tac}\cite{zhang2017stackgan}\cite{DBLP:journals/corr/abs-1710-10916}. \cite{reed2016generative} shows promising results by conditioning the GAN on text descriptions instead of class labels. \cite{zhang2017stackgan} uses a similar approach but breaks generation down into two stages: Stage-1 produces a low-resolution image from the text description, and Stage-2 takes the Stage-1 result as input and generates a high-resolution, photo-realistic image. \cite{dash2017tac} additionally conditions its generative process on both text and class information and has produced superior results compared to \cite{reed2016generative}. The text embeddings used in the aforementioned papers are Skip-Thought vectors \cite{kiros2015skip}.
\\
Cycle-consistent GANs have shown excellent results on multimodal learning problems, which lack a direct one-to-one correspondence between input and output, and they allow the network to learn many mappings at the same time, as shown in \cite{zhu2017unpaired}\cite{liu2017unsupervised}\cite{zhou2016learning}. To the best of our knowledge, however, cycle consistency has not yet been applied to text-to-image generation, which is a similar multimodal learning problem.
\section{Project plan}
Our minimum viable product (MVP) has the following goals:
\begin{itemize}
\item An MVP would use only unpaired data (texts and images) for training.
\item An MVP would demonstrate the feasibility of using cycle consistency for text to image translation.
\item An MVP would show image captioning capability with cycle consistency.
\end{itemize}
As a rough plan, our first milestone is to get a text-to-image synthesis GAN network working.\\
\\
As the second milestone (March 12th), we plan to put everything together and create the first bare-bones cycle-consistent text-to-image GAN. We should be able to train the GAN and generate results at least as good as \cite{reed2016generative}, since it is the first published work of its kind. Some datasets we plan to test on are (1) the Caltech-UCSD Birds dataset, (2) the VGG Flower dataset, (3) LSUN (Large-Scale Scene Understanding), and (4) the MS COCO dataset.\\
\\
For the remaining time, we are planning to choose a few of the nice-to-haves that are listed in the following section. This will allow us to further explore this topic and possibly come up with innovative results.\\
\\
For evaluation, we will use the popular Inception Score along with qualitative judgments.
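As an illustrative sketch only (not our final evaluation code), the Inception Score can be computed from the class posteriors $p(y \mid x)$ that a pretrained Inception network assigns to generated images; the snippet below assumes those posteriors have already been collected into a NumPy array.
\begin{lstlisting}[language=Python]
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from an (N, K) array of class posteriors p(y|x).

    probs[i] is assumed to be the softmax output of a pretrained
    Inception network for the i-th generated image.
    """
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal p(y), shape (1, K)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # elementwise KL terms
    return float(np.exp(kl.sum(axis=1).mean()))              # exp(E_x[KL(p(y|x) || p(y))])
\end{lstlisting}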
\section{Nice-to-haves}
\begin{itemize}
\item The first addition is to use multi-stage, coarse-to-fine image generation similar to StackGAN and StackGAN++ \cite{zhang2017stackgan}\cite{DBLP:journals/corr/abs-1710-10916}. Their work has demonstrated the possibility of generating high-resolution, photo-realistic images. By integrating their approach, we want to show the scalability of our cycle-consistent text-to-image GAN.
\item The second natural extension is to explore other consistency constraints or model structures that could yield better text-to-image results or bring more tasks into the mix. \cite{DBLP:journals/corr/ZhouKAHE16} used a slightly different cycle-consistent structure to generate CAD models from images. We want to explore the possibility of inserting more tasks into the cycle, for example real image $\to$ instance segmentation $\to$ caption $\to$ instance segmentation $\to$ real image. This would allow us to deepen our understanding of this topic.
\item The third possible addition is to use a network similar to CycleGAN, namely UNIT (Unsupervised Image-to-Image Translation Networks) \cite{liu2017unsupervised}, which uses VAEs to project images into a latent space while also enforcing cycle consistency. We believe that by exploring this direction we will be able to understand which features images and texts share in the latent space. We may also be able to reuse the shared latent space for multi-task learning.
\item The fourth topic is data visualization of our generated images using t-SNE. Similar to StackGAN++, we want to show that the generated images are indeed multimodal and that there is no mode collapse; a minimal sketch of this visualization appears after this list.
\item The last topic that we are really excited about is style transfer. The text-to-image generation process usually involves sampling a noise vector $z$ at the start. \cite{reed2016generative} have shown that the vector $z$ actually encodes the ``style'' of the generated image, i.e., features that are not present in the conditioning text. We want to explore this further. Similar to generative visual manipulation on the natural image manifold \cite{zhu2016generative}, we want to let users manipulate latent vectors and generate images that they like.
\end{itemize}
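As a minimal sketch of the t-SNE visualization mentioned in the fourth item above (we assume image features, e.g. activations of a pretrained network, have already been extracted elsewhere), we would project the features to two dimensions and plot them:
\begin{lstlisting}[language=Python]
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, out_path="tsne.png"):
    """Project an (N, D) array of generated-image features to 2-D and plot them."""
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], s=5)
    plt.title("t-SNE of generated image features")
    plt.savefig(out_path)
\end{lstlisting}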
\bibliographystyle{plain}
\bibliography{ref}
\end{document}