# Vision and Language {#sec-VLMs}
## Introduction
The last few decades have seen enormous improvements in both computer
vision and natural language processing. Naturally, there has been great
interest in putting these two branches of artificial intelligence (AI)
together.
At first it might seem like vision has little to do with language. This
is what you might conclude if you see the role of language as just being
about communication. However, another view on language is that it is a
representation of the world around us, and this representation is one
the brain uses extensively for higher-level
cognition @fodor1975language. Visual perception is also about
representing the world around us and forming mental models that act as a
substrate for reasoning and decision making. From this perspective, both
language and vision are trying to solve more or less the same problem.
It turns out this view is quite powerful, and indeed language and vision
can help each other in innumerable ways. This chapter is about some of
the ways language can improve vision systems, and also how vision
systems can support language systems.
## Background: Representing Text as Tokens
In order to understand **vision-language models** (**VLMs**) we first
need to understand a bit about **language models**. There are many ways
to model language and indeed there is a whole field that studies this,
called **natural language processing** (**NLP**). We will present just
one way, which follows almost exactly the same format as modern computer
vision systems: tokenize the text, then process it with a transformer.
In chapter
@sec-transformers we saw how to apply transformers to
images and now we will see how to apply them to text.
The amazing thing about transformers is that they work well for a wide
variety of data types: in fact, the same architectures that we saw in
chapter
@sec-transformers also work for text processing. To
apply transformers to a new data modality may require nothing more than
figuring out how to represent that data as numbers.
So our first question is, how can we represent text as numbers? One
option, which we have already seen, is to use a word-level encoding
scheme, mapping each word in our vocabulary to a unique one-hot code.
This requires $K$-dimensional one-hot codes for a vocabulary of $K$
different words.
A disadvantage of this approach is that $K$ has to be very large to
cover a language like English, which has a large vocabulary of over
100,000 possible words. This becomes even worse for computer languages,
like Python. Programmers routinely name variables using made up words,
which might never have been used before. We can't define a vocabulary
large enough to cover all the strange strings programmers might come up
with.
Instead, we could use a different strategy, where we assign a one-hot
code to each *character* in our vocabulary, rather than to each unique
word. If we represent all alphanumeric characters and common punctuation
marks in this way, this might only require a few dozen codes. The
disadvantage here is that representing a single word now requires a
sequence of one-hots rather than a single one-hot code, and we will have
to model that sequence.
A happy middle ground can be achieved using **byte-pair**
encodings @gage1994new. In this approach, we assign a one-hot code to
each small chunk of text. The chunks are common substrings found in
text, like "th" or "ing." This strategy is popular in language models
like the **generative pretrained transformer** (**GPT**)
series @brown2020language and the **contrastive language-image
pretraining** (**CLIP**) model @radford2021learning that we will learn
about in @sec-VLMs-CLIP.
Once we have converted text into a sequence of one-hot codes, the next
step is to convert these into tokens via a projection,
$\mathbf{W}_{\texttt{tokenize}}$, from the one-hot dimensionality ($K$)
to a target dimensionality for token code vectors ($d$).
@fig-VLMs-tokenizing_text shows these steps for converting an input
string to tokens. For simplicity, we show word-level one-hot encoding
rather than byte-pair encoding.
![Tokenizing text. Note that these steps can be implemented with a lookup table that simply stores the token code vector for each unique item in our vocabulary.](figures/vision_and_language/tokenizing_text.png){width="60%" #fig-VLMs-tokenizing_text}
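To make these steps concrete, here is a minimal sketch of word-level tokenization as a projection of one-hot codes. The toy vocabulary, dimensions, and random projection matrix are illustrative assumptions; a real system would use byte-pair chunks and learned weights, and would implement the projection as a lookup table, as the figure caption notes.

```python
# A minimal sketch of word-level tokenization (toy vocabulary and random
# projection weights are illustrative assumptions, not any particular model).
import numpy as np

vocab = ["a", "red", "circle", "on", "blue", "background"]  # toy vocabulary, K = 6
K, d = len(vocab), 4                                        # one-hot size K, token code size d

rng = np.random.default_rng(0)
W_tokenize = rng.standard_normal((d, K))  # projection from one-hot codes to token codes

def tokenize(text):
    """Map a string to a [num_words, d] array of token code vectors."""
    tokens = []
    for word in text.lower().split():
        one_hot = np.zeros(K)
        one_hot[vocab.index(word)] = 1.0
        # Multiplying by a one-hot code just selects one column of W_tokenize,
        # which is why this is equivalent to a lookup table.
        tokens.append(W_tokenize @ one_hot)
    return np.stack(tokens)

print(tokenize("a red circle").shape)  # (3, 4)
```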
These steps map text to tokens, which can be input into transformers. To
handle the output side, where we produce text as an output, we need to
convert tokens back to text. The most common way to do this is to use an
autoregressive model that predicts each next item in our vocabulary
(word/character/byte-pair) given the sequence generated so far. Each
prediction step can be modeled as a softmax over all $K$ possible items
in our vocabulary. We will see a specific architecture that does this in
@sec-VLMs-im2text.
## Learning Visual Representations from Language Supervision {#sec-VLMs-CLIP}
In chapter
@sec-representation_learning, we saw that one approach
to learning visual representations is to set up a pretext task that
requires arriving at a good representation of the scene. In this section
we will revisit this idea, using language association as a pretext task
for representation learning.
In fact this is not a new idea: historically, one of the most popular
pretext tasks has been image classification, where an image is
associated with one of $K$ nouns. In a sense, this is already a way of
supervising visual representations from language. Of course, human
languages are much more than just a set of $K$ nouns. This raises the
question, could we use richer linguistic structure to supervise vision?
The answer is yes! And it works really well. Researchers have tried
using all kinds of language structures to supervise vision systems,
including adjectives and verbs @StatesAndTransformations, words
indicating relationships between objects @visualgenome, and
question-answer pairs @antol2015vqa. What seems to work best, once you
reach a certain scale of data and compute, is to just use *all* the
structure in language, and the simplest way to do this is to train a
vision system that associates images with free-form natural language.
This is the idea of the highly popular CLIP
system @radford2021learning.[^1] CLIP is a contrastive method (see
@sec-representation_learning-contrastive_learning) in which
one data view is the image and the other is a caption describing the
image. Specifically, CLIP is a form of contrastive learning from
co-occurring visual and linguistic views that is formulated as follows:
![](figures/vision_and_language/clip_objective.png)
where $\boldsymbol\ell$ represents an image, $\mathbf{t}$ represents
text, $\mathbb{S}^{d_z}$ represents the space of unit vectors of
dimensionality $d_z$ (i.e., the surface of the $[d_z-1]$-dimensional
hypersphere), $f_{\ell}$ is the image encoder, $f_t$ is the text
encoder, image inputs are represented as $[3 \times N \times M]$ pixel
arrays, text inputs are represented as $d_t$ dimensional tokens, and
$d_z$ is the embedding dimensionality. Notice that the objective is a
symmetric version of the InfoNCE loss defined in
@eq-representation_learning-infonce. Also notice that $f_\ell$ and
$f_t$ both output unit vectors (which is achieved using
$L_2$-normalization before computing the objective); this ensures that
the denominator cannot dominate by pushing negative pairs infinitely far
apart.
@fig-vision_and_language-clip_training visually depicts how CLIP is
trained. First we sample a batch of $N$ language-image pairs,
$\{\mathbf{t}^{(i)}, \boldsymbol\ell^{(i)}\}_{i=1}^N$ ($N=6$ in the
figure). Next we embed all these text strings and images using a
language encoder $f_t$ and an image encoder
$f_\ell$, respectively. This produces a set of text embeddings
$\{\mathbf{z}^{(i)}_t\}_{i=1}^N$ and a set of image embeddings
$\{\mathbf{z}^{(i)}_\ell\}_{i=1}^N$. To compute the loss, we take the
dot product $\mathbf{z}^{(i)}_\ell\cdot \mathbf{z}^{(j)}_t$ for all
$i, j \in \{1,\ldots,6\}$. Terms for which $i \neq j$ are the negative
pairs and we seek to minimize these dot products (denominator of the
loss); terms for which $i = j$ are the positive pairs and we seek to
maximize these dot products (they appear in both the numerator and
denominator of the loss).
![CLIP training for one batch of six language-image training examples. Green boxes are positive pairings (we seek to increase their value); these are the six training pairs. Red boxes are negative pairings (we seek to decrease their value); these are all other possible pairings within the batch. Inspired by figure 1 of @radford2021learning](figures/vision_and_language/clip_training_fig.png){width="85%" #fig-vision_and_language-clip_training}
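A short sketch of this batch objective, roughly in the spirit of the pseudocode in @radford2021learning, is shown below. It assumes the encoders $f_t$ and $f_\ell$ are implemented elsewhere (random embeddings stand in for their outputs here), and the fixed temperature is an illustrative simplification.

```python
# A sketch of the symmetric contrastive objective for one batch. The
# temperature tau and the random stand-in embeddings are illustrative choices.
import torch
import torch.nn.functional as F

def clip_loss(z_l, z_t, tau=0.07):
    """z_l: [N, d_z] image embeddings, z_t: [N, d_z] text embeddings."""
    z_l = F.normalize(z_l, dim=-1)               # project onto the unit hypersphere
    z_t = F.normalize(z_t, dim=-1)
    logits = z_l @ z_t.T / tau                   # [N, N] matrix of scaled dot products
    labels = torch.arange(z_l.shape[0])          # positive pairs sit on the diagonal
    loss_l = F.cross_entropy(logits, labels)     # each image against all texts
    loss_t = F.cross_entropy(logits.T, labels)   # each text against all images
    return 0.5 * (loss_l + loss_t)

# usage with random stand-ins for f_ell(images) and f_t(texts), batch of N = 6:
z_l, z_t = torch.randn(6, 512), torch.randn(6, 512)
print(clip_loss(z_l, z_t))
```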
After training, we have a text encoder $f_t$ and an image encoder
$f_\ell$ that map to the same embedding space. In this space, the angle
between a text embedding and an image embedding will be small if the
text matches the image.
:::{.column-margin}
The dot product between
unit vectors is the cosine of the angle between the
vectors.
:::
@fig-vision_and_language-clip_mapping_diagram_two_branch shows how a
trained CLIP model maps data to embeddings. This figure is generated in
the same way as @fig-neural_nets-vit_mapping_plot: for each encoder we
reduce dimensionality for visualization using **t-distributed Stochastic
Neighbor Embedding** (**t-SNE**) @tsne. For the image encoder, we show
the visual content using icons to reflect the color and shape depicted
in the image (we previously used the same visualization in @sec-representation_learning-expt_designing_embeddings_with_contrastive_learning).
It's a bit hard to see, but one thing you can notice here is that for
the image encoder, the early layers group images by color and the later
layers group more by shape. The opposite is true for the text encoder.
Why do you think this is? One reason may be that color is a superficial
feature in pixel-space while shape requires processing to extract, but
in text-space, color words and shape words are equally superficial and
easy to group sentences by.
![CLIP @radford2021learning mapping diagram, using the same technique as described in @sec-neural_nets_as_distribution_transformers. To reduce dimensionality we apply t-SNE @tsne separately for the text encoder and the image encoder. Within each encoder, we run t-SNE jointly across all shown layers. ](figures/vision_and_language/clip_mapping_diagram_two_branch.png){width="100%" #fig-vision_and_language-clip_mapping_diagram_two_branch}
After passing the data through these two encoders, the final stage is to
normalize the outputs and compute the alignment between the image and
text embeddings. @fig-vision_and_language-clip_shared_embedding shows
this last step. After normalization, all the embeddings lie on the
surface of a hypersphere (the circle in the figure). Notice that the
text descriptions are embedded near the image embeddings (icons) that
match that description. It's not perfect but note that this is partially
due to limitations in the visualization, which projects the
$768$-dimensional CLIP embeddings into a 2D visualization. Here we use
kernel principal component analysis @scholkopf1998nonlinear on the
embeddings, with a cosine kernel, and remove the first principal
component as that component codes for a global offset between the image
embeddings and text embeddings.
![CLIP @radford2021learning joint embedding](figures/vision_and_language/clip_shared_embedding.png){width="80%" #fig-vision_and_language-clip_shared_embedding}
Using language as one of the views may seem like a minor variation on
the contrastive methods we saw previously, but it's a change that opens
up a lot of new possibilities. CLIP connects the domain of images to the
domain of language. This means that many of the powerful abilities of
language become applicable to imagery as well. One ability that the CLIP
paper showcased is making a novel image classifier on the fly. With
language, you can describe a new conceptual category with ease. Suppose
I want a classifier to distinguish striped red circles from polka dotted
green squares. These classes can be described in English just like that:
"striped red circles" versus "polka dotted green squares." CLIP can then
leverage this amazing ability of English to compose concepts and
construct an *image* classifier for these two concepts.
Here's how it works. Given a set of sentences
$\{\mathbf{t}^{a}, \mathbf{t}^{b}, \ldots\}$ that describe images of
classes $a$, $b$, and so on,
1. Embed each sentence into a $\mathbf{z}$-vector,
$\mathbf{z}_t^{a} = f_t(\mathbf{t}^{a}), \mathbf{z}_t^{b} = f_t(\mathbf{t}^{b}),$
and so on.
2. Embed your query image into a $\mathbf{z}$-vector,
$\mathbf{z}_{\ell}^{q} = f_{\ell}(\boldsymbol\ell^{q})$.
3. See which sentence embedding is closest to your image embedding;
that's the predicted class.
These steps are visualized in @fig-vision_and_language-clip_inference.
The green outlined dot product is the highest, indicating that the query
image will be classified as class $a$, which is defined as "striped red
circles."
![Making a ``striped red circles'' (class $a$) versus ``polka dotted green squares'' (class $b$) classifier using a trained CLIP. The CLIP model parameters are frozen and not updated. Instead the classifier is constructed by embedding the text describing the two classes and seeing which is closer to the embedding of the query image. Inspired by figure 1 of @radford2021learning](figures/vision_and_language/clip_inference_fig.png){width="75%" #fig-vision_and_language-clip_inference}
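The three steps can be written out in a few lines. The sketch below assumes trained CLIP encoders $f_t$ and $f_\ell$ are available; random embeddings stand in for their outputs so the snippet runs on its own.

```python
# A minimal sketch of zero-shot classification with CLIP-style embeddings.
# The random embeddings below are stand-ins for f_t(text) and f_ell(image).
import torch
import torch.nn.functional as F

class_texts = ["striped red circles", "polka dotted green squares"]

# Step 1: embed each class description (stand-in for f_t).
z_text = F.normalize(torch.randn(len(class_texts), 512), dim=-1)

# Step 2: embed the query image (stand-in for f_ell).
z_image = F.normalize(torch.randn(512), dim=-1)

# Step 3: the class whose text embedding is closest to the image embedding
# (largest cosine similarity) is the prediction.
scores = z_text @ z_image
print(class_texts[int(scores.argmax())])
```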
@fig-vision_and_language-clip_clocks gives more examples of creating
custom binary classifiers in this way. The two classes are described by
two text strings ("triangle" versus "circle"; "purple" versus "teal";
and "arrowhead" versus "ball"). The embeddings of the text descriptions
are the red and green vectors. The classifier simply checks whether an
image embedding is closer, in angular distance, to the red vector or the
green vector. This approach isn't limited to binary classification: you
can add more text vectors to partition the space in more ways.
:::{layout-ncol="4" #fig-vision_and_language-clip_clocks}
![](figures/vision_and_language/clip_clocks/joint_embedding.png)
![](figures/vision_and_language/clip_clocks/circle_triangle.png)
![](figures/vision_and_language/clip_clocks/purple_teal.png)
![](figures/vision_and_language/clip_clocks/arrowhead_ball.png)
Making custom classifiers using CLIP. (a) Image embeddings in joint embedding space, (b) triangle-circle classifier, (c) purple-teal classifier, and (d) arrowhead-ball classifier. It's a bit like a modern astrolabe where you turn dials to read off the prediction of interest.
:::
## Translating between Images and Text
In chapter
@sec-conditional_generative_models, we saw how to
translate between different image domains. Could we be even more
ambitious and translate between entirely different data *modalities*? It
turns out that this is indeed possible and is an area of growing
interest. For any type of data X, and any other type Y, there is,
increasingly, a method X2Y (and Y2X) that translates between these two
data types. Go ahead and pick your favorite data types---X,Y could be
images and sounds, music and lyrics, amino acid sequences and protein
geometry---then Google "X to Y with a deep net" and you will probably
find some interesting work that addresses this problem. X2Y systems are
powerful because they forge links between disparate domains of
knowledge, allowing tools and discoveries from one domain to become
applicable to other linked domains.
One approach to solving X-to-Y translation problems is to do it in a
modular fashion by hooking up an *encoder* for domain X to a decoder for
domain Y. This approach is popular because powerful encoder and decoder
architectures, and pretrained models, exist for many domains. If we have
an encoder $f: \mathcal{X} \rightarrow \mathcal{Z}$ and a decoder
$g: \mathcal{Z} \rightarrow \mathcal{Y}$, then we can create a
translator simply by function composition, $g \circ f$.
Of course, the output of our encoder might not be interpretable as a
proper input to the decoder we selected. For example, suppose $f$ is a
text encoder and $g$ is an image decoder and we have
$\boldsymbol\ell= g(f(\mathbf{t}))$, where $\boldsymbol\ell$ is a
generated image from the text input $\mathbf{t}$. Then nothing
guarantees that $\boldsymbol\ell$ semantically matches $\mathbf{t}$.
To *align* the encoder and decoder we can follow several strategies. One
is to simply finetune $g \circ f$ on paired examples of the target
translation problem,
$\{\mathbf{t}^{(i)}, \boldsymbol\ell^{(i)}\}_{i=1}^N$. Another option is
to freeze $f$ and $g$ and instead train a translation function in latent
space, $h: \mathcal{Z} \rightarrow \mathcal{Z}$, so that
$g \circ h \circ f$ correctly maps each training point
$\mathbf{t}^{(i)}$ to the target output $\boldsymbol\ell^{(i)}$. See
@rombach2022high for an example of this latter option.
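The second strategy can be sketched as follows. Everything here is a toy stand-in under illustrative assumptions: linear modules play the role of the pretrained encoder $f$ and decoder $g$, and only the latent translator $h$ is trained on paired examples.

```python
# A sketch of aligning a frozen encoder f and frozen decoder g with a small
# learned translator h in latent space, so that g(h(f(t))) maps text to images.
import torch
import torch.nn as nn

d_text, d_latent = 512, 256

f = nn.Linear(1000, d_text)                  # stand-in for a pretrained text encoder
g = nn.Linear(d_latent, 3 * 64 * 64)         # stand-in for a pretrained image decoder
for p in list(f.parameters()) + list(g.parameters()):
    p.requires_grad = False                  # freeze f and g

h = nn.Sequential(                           # the only trainable piece
    nn.Linear(d_text, 512), nn.ReLU(), nn.Linear(512, d_latent))
opt = torch.optim.Adam(h.parameters(), lr=1e-4)

def train_step(t, l):
    """t: [N, 1000] text features, l: [N, 3*64*64] target images (flattened)."""
    l_hat = g(h(f(t)))                       # the composition g . h . f
    loss = ((l_hat - l) ** 2).mean()         # reconstruction loss on paired examples
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```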
In the next sections, we will look at a few specific architectures for
solving this problem of text-to-image translation as well as the
reverse, image-to-text.
### Text-to-Image
One amazing ability of modern generative models is to synthesize an
image just from a text description, a problem called **text-to-image**.
This can be achieved using any of the conditional generative models we
saw in chapter
@sec-conditional_generative_models. We simply have to
find an effective way to use text as the conditioning information to
these models. @fig-vision_and_language-text2im_schematic_mapping shows
how this problem looks as a mapping from the space of text to the space
of images:
![Text-to-image.](figures/vision_and_language/text2im_schematic_mapping.png){width="100%" #fig-vision_and_language-text2im_schematic_mapping}
On the right is an example of such a system translating input text
$\mathbf{t}$ to an output image $\hat{\boldsymbol\ell}$ that matches
what the text describes. Recall the similar diagrams in the chapters on
representation learning and generative modeling and think about how they
relate (see
@fig-generative_modeling_and_representation_learning-genrep_schematic).
This new problem is not so different from what we have seen before!
Now let's build some concrete architectures to solve this mapping. We
will start with a conditional variational autoencoder (cVAE; @sec-conditional_generative_models-cVAE). This was the
architecture used by the DALL-E 1 model @dalle1, which was one of the
first text-to-image models that caught the public's imagination.
A cVAE for text-to-image can be trained just like a regular cVAE; the
special thing is just that the encoder is a text encoder, that is, a
neural net that maps from text to a vector embedding.
@fig-vision_and_language-text2im_VAE_training shows what the training
looks like; it's conceptually the same as
@fig-conditional_generative_models-cVAE_ball_bouncing_example_nets,
which we saw previously.
![Training a cVAE to perform text-to-image. The dotted lines indicate that the image encoder is only used during training; at test time, usage follows the solid path. In this cartoon we draw the output as slightly distorted to reflect that the autoencoder path might not be able to perfectly reconstruct the input.](./figures/vision_and_language/text2im_VAE_training.png){width="100%" #fig-vision_and_language-text2im_VAE_training}
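As a rough sketch of the training objective, the snippet below condenses the figure to its essentials. The flattened images, linear stand-ins for the image encoder and decoder, and precomputed text embeddings are simplifying assumptions; systems like DALL-E 1 are considerably more elaborate.

```python
# A compact sketch of one cVAE training loss for text-to-image, under the
# assumption that text embeddings z_t come from a separate text encoder.
import torch
import torch.nn as nn

d_t, d_z, d_im = 512, 64, 3 * 32 * 32

image_encoder = nn.Linear(d_im + d_t, 2 * d_z)  # predicts mean and log-variance
decoder = nn.Linear(d_z + d_t, d_im)            # maps (latent, text) -> image

def cvae_loss(l, z_t):
    """l: [N, d_im] flattened images, z_t: [N, d_t] text embeddings."""
    stats = image_encoder(torch.cat([l, z_t], dim=-1))
    mu, logvar = stats.chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
    l_hat = decoder(torch.cat([z, z_t], dim=-1))          # decoder conditions on the text
    recon = ((l_hat - l) ** 2).mean()                     # reconstruction term
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).mean()  # KL term (weighting omitted)
    return recon + kl
```

At sampling time, as described next, the image encoder is discarded and $z$ is drawn from a standard Gaussian instead.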
After training this model, we can sample from it by simply discarding
the image encoder and instead inputting a random Gaussian vector to the
decoder, in addition to providing the decoder with the text embedding.
@fig-vision_and_language-text2im_VAE_sampling shows how to sample an
image in this way.
This is how text-to-image generation can be done with a cVAE. We could
perform the same exercise for any conditional generative model. Let's do
one more, a conditional diffusion model. This is the approach followed
by the second version of the DALL-E system, DALL-E
2 @ramesh2022hierarchical, as well as other popular systems like the one
described in @rombach2022high. The difficulty we will focus on here is
as follows: how do you most effectively insert the text embeddings into
the decoder, so that it can make best use of this information?
@fig-vision_and_language-text2im_diffusion_model shows one way to do
this, which is roughly how it is done in @rombach2022high: we insert
text embedding vectors into each denoising U-Net, and even into multiple
layers of each U-Net. This makes it very easy for the diffusion model to
condition all of its decisions on the text information.
![Sampling from a trained cVAE.](./figures/vision_and_language/text2im_VAE_sampling.png){width="100%" #fig-vision_and_language-text2im_VAE_sampling}
![A text-to-image diffusion model.](./figures/vision_and_language/text2im_diffusion_model.png){width="100%" #fig-vision_and_language-text2im_diffusion_model}
Notice that @fig-vision_and_language-text2im_diffusion_model is
actually very similar to @fig-vision_and_language-text2im_VAE_sampling.
A conditional diffusion model is in fact very similar to a cVAE. At
sampling time, both take as input text and Gaussian noise. The VAE uses
a network that directly maps these inputs to an image, in a single
forward pass. The diffusion model, in contrast, maps to an image via a
chain of denoising steps, each of which is a forward pass through a
neural network. See @NEURIPS2021_b578f2a5 for a discussion of the
connection between diffusion models and VAEs.
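To make the contrast concrete, here is a schematic, DDPM-style sampling loop for a text-conditioned diffusion model. The linear "denoiser" is a stand-in for the conditional U-Net (which, in systems like @rombach2022high, receives the text embedding at multiple layers rather than by simple concatenation), and the noise schedule and dimensions are illustrative.

```python
# A schematic sampling loop: the text embedding conditions every denoising step.
import torch
import torch.nn as nn

d_t, d_im, T = 512, 3 * 32 * 32, 50
denoiser = nn.Linear(d_im + d_t + 1, d_im)  # stand-in for a text-conditioned U-Net

def sample(z_text, betas):
    """z_text: [d_t] text embedding, betas: [T] noise schedule."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(d_im)                                  # start from pure Gaussian noise
    for t in reversed(range(T)):
        step = torch.tensor([t / T])
        eps_hat = denoiser(torch.cat([x, z_text, step]))   # noise prediction, conditioned on text
        x = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn(d_im)    # re-inject noise between steps
    return x

image = sample(torch.randn(d_t), torch.linspace(1e-4, 0.02, T))
```

The cVAE, in contrast, would collapse this entire loop into a single forward pass of the decoder.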
### Image-to-Text {#sec-VLMs-im2text}
Naturally, we can also translate in the opposite direction. We just need
to use an *image encoder* (rather than a text encoder) and a *text
decoder* (rather than an image decoder). In this direction, the problem
can be called **image-to-text** or **image captioning**. This kind of
translation is very useful because text is a ubiquitous and
general-purpose interface between almost all domains of human knowledge
(we use language to describe pretty much everything we do and everything we
know). So, if we can translate images to text then we link imagery to
many other domains of knowledge.
As an example of an image-to-text system, we will use a transformer to
map the input image to a set of tokens, and we will then feed these
tokens as input to an autoregressive text decoder, which is also
implemented with a transformer. See @li2023blip2 for an example of a
system that uses this general approach. @fig-VLMs-im2text shows this
architecture.
![Image-to-text using a small transformer with cross-attention. The edges in gray are cross-attention edges. Many variations of this architecture are possible. For example, each layer of the text decoder could cross-attend to the output image tokens. Or, the image encoder could compress all its input tokens into a single output token that represents the entire photo; this output token would then get cross-attended by the text decoder. This approach could save memory and computation time.](figures/vision_and_language/im2text.png){width="100%" #fig-VLMs-im2text}
There is one special layer to point out, the **cross-attention** layer
(`cross-attn`). This layer addresses the question of how to insert image
tokens into a text model. The answer is quite simple: just let the text
tokens attend to the image tokens. The image tokens will emit their
value vectors which will be mixed with the text token value vectors
according to the attention weights to produce the next layer of tokens
in the text decoder.
Here are the equations for a cross-attention layer:
$$\begin{aligned}
\mathbf{Q}_{t} = \mathbf{T}_{t}\mathbf{W}_{t,q}^\mathsf{T}\\
\mathbf{K}_{\ell} = \mathbf{T}_{\ell}\mathbf{W}_{\ell,k}^\mathsf{T}\\
\mathbf{V}_{\ell} = \mathbf{T}_{\ell}\mathbf{W}_{\ell,v}^\mathsf{T}\\
\mathbf{A}_{\texttt{cross}} = \texttt{softmax}(\frac{\mathbf{Q}_{t}\mathbf{K}_{\ell}^\mathsf{T}}{\sqrt{m}})\\
\mathbf{T}_{\texttt{out}, \texttt{cross}} = \mathbf{A}_{\texttt{cross}}\mathbf{V}_{\ell}
\end{aligned}$$
where variables subscripted with $t$ are associated with
text tokens and variables subscripted with $\ell$ are associated with
image tokens.
Then, the tokens output by cross-attention can be combined with tokens
output by self-attention by weighted summation:
$$\begin{aligned}
\mathbf{T} = w_{\texttt{cross}}\mathbf{T}_{\texttt{out}, \texttt{cross}} + w_{\texttt{self}}\mathbf{T}_{\texttt{out}, \texttt{self}}
\end{aligned}
$$ where $\mathbf{T}_{\texttt{out}, \texttt{self}}$ is
regular self-attention on the input text token $\mathbf{T}_t$, and
$w_{\texttt{cross}}$ and $w_{\texttt{self}}$ are optional learnable
weights.
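Translated directly into code, a single-head cross-attention layer looks like the sketch below; the dimensions and random inputs are illustrative.

```python
# A direct translation of the cross-attention equations above (single head).
import torch
import torch.nn as nn

m, n_text, n_image = 64, 5, 9          # token dimension, #text tokens, #image tokens

W_tq = nn.Linear(m, m, bias=False)     # W_{t,q}
W_lk = nn.Linear(m, m, bias=False)     # W_{l,k}
W_lv = nn.Linear(m, m, bias=False)     # W_{l,v}

def cross_attention(T_t, T_l):
    """T_t: [n_text, m] text tokens, T_l: [n_image, m] image tokens."""
    Q = W_tq(T_t)                      # queries come from the text tokens
    K = W_lk(T_l)                      # keys and values come from the image tokens
    V = W_lv(T_l)
    A = torch.softmax(Q @ K.T / m ** 0.5, dim=-1)  # [n_text, n_image] attention weights
    return A @ V                       # one output token per text token

T_out_cross = cross_attention(torch.randn(n_text, m), torch.randn(n_image, m))
print(T_out_cross.shape)               # torch.Size([5, 64])
```

The output `T_out_cross` would then be combined with the self-attention output by the weighted sum above.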
The text decoder is an autoregressive model (@sec-generative_models-autoregressive) that uses causal
attention (@sec-transformers-masked_attention) to predict the next word
in the sentence given the previous words generated so far. In this
strategy, the input to the text decoder is the tokens representing the
previous words of the training caption that describes the image. At
sampling time, the model is run sequentially: each newly generated word
is appended to the input sequence of tokens, and the model is then run
again to predict the following word. The text tokens themselves are produced by
tokenizing the input words, for example, via byte-pair encodings or
word-level encodings (the latter is shown in the figure).
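A schematic greedy decoding loop for such a text decoder is sketched below. Here `decoder` is a hypothetical callable standing in for the causal transformer in the figure (it takes the image tokens and the tokens generated so far and returns logits over the $K$ vocabulary items), and `bos_id`/`eos_id` are assumed start- and end-of-sequence tokens.

```python
# A schematic greedy decoding loop for image captioning.
import torch

def generate_caption(decoder, image_tokens, bos_id, eos_id, max_len=20):
    tokens = [bos_id]                                          # start-of-sequence token
    for _ in range(max_len):
        logits = decoder(image_tokens, torch.tensor(tokens))   # [K] logits for the next item
        next_id = int(logits.argmax())                         # greedy choice (could also sample)
        if next_id == eos_id:
            break
        tokens.append(next_id)                                 # feed the new word back in
    return tokens[1:]

# usage with a random stand-in decoder over a vocabulary of K = 100 items:
K = 100
dummy_decoder = lambda image_tokens, text_tokens: torch.randn(K)
print(generate_caption(dummy_decoder, torch.randn(9, 64), bos_id=0, eos_id=1))
```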
## Text as a Visual Representation
One way of thinking about image-to-text is as yet another visual
encoder, where text is the visual representation. Just like we can
represent an image in terms of its frequencies (Fourier transform), its
semantics and geometry (scene understanding), or its feature embeddings
(DINO @caron2021emerging, CLIP @radford2021learning, and others), we can
also represent it in terms of a textual description of its content.
In @fig-vision_and_language-different_visual_representations we show an
image represented in several ways (the depth map is produced by
MiDaS @Ranftl2022 and the text is produced by BLIP2 @li2023blip2):
![Four representations computed from the input on the left. The input image itself (its pixel values) is in fact a fifth representation of the scene.](figures/vision_and_language/different_visual_representations.png){width="100%" #fig-vision_and_language-different_visual_representations}
Think about the relative merits of each of these representations, and
what advantages or disadvantages text might have over these other kinds
of visual representation.
Text is a powerful visual representation because it interfaces with
language models. This means that after translating an image into text we
can use language models to reason about the image contents.
## Visual Question Answering
As AI systems become more general purpose, one goal is to make a vision
algorithm that is fully general, so that you can ask it any question
about an image. This is akin to a Turing test @turing2009computing for
visual intelligence. **Visual question answering** (**VQA**) is one
formulation of this goal @antol2015vqa. The problem statement in VQA is
to make a system that maps an image and a textual question to a textual
answer, as shown in @fig-vision_and_language-VQA_examples.
![An example VQA system.](figures/vision_and_language/VQA_examples.png){width="100%" #fig-vision_and_language-VQA_examples}
:::{.column-margin}
By now you have seen several ways to build image and text encoders as well as decoders, and how to put them together. As an exercise, think about how you might implement the VQA system shown here, using the tools from elsewhere in this chapter and book. See if you can modify the transformer architecture in @fig-VLMs-im2text to solve this problem.
:::
In this chapter, we have already seen all the tools necessary to create
a VQA system; you just need a text encoder (for the question), an image
encoder (for the image you are asking the question about), and a text
decoder (for the answer). These can be hooked together using the same
kinds of tools we have seen earlier in this chapter. The critical piece
is just to acquire training data of VQA examples, each of which takes
the form of a tuple {\[`question`, `image`\], `answer`}. For a
representative example of how to implement a VQA model using
transformers, see @li2023blip2.
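One possible shape for such a system is sketched below, purely as a toy composition under illustrative assumptions: linear modules stand in for the question encoder, image encoder, and answer decoder, and the pooled-context decoding is a simplification of the cross-attention architecture discussed above.

```python
# A toy sketch of composing a question encoder, an image encoder, and an
# answer decoder into a VQA model (all modules are illustrative stand-ins).
import torch
import torch.nn as nn

d, K = 64, 1000
question_encoder = nn.Linear(d, d)   # stand-in for a text-encoder transformer
image_encoder = nn.Linear(d, d)      # stand-in for an image-encoder transformer
answer_decoder = nn.Linear(d, K)     # stand-in for an autoregressive text decoder

def vqa_forward(question_tokens, image_tokens):
    """question_tokens: [n_q, d], image_tokens: [n_i, d] -> logits over the answer vocabulary."""
    context = torch.cat([question_encoder(question_tokens),
                         image_encoder(image_tokens)], dim=0)
    return answer_decoder(context.mean(dim=0))  # pooled context -> answer-token logits
```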
## Concluding Remarks
From the perspective of vision, language is a perceptual representation.
It is a way of describing "what is where"[^2] and much more in a
low-dimensional and abstracted format that interfaces well with other
domains of human knowledge. There are many theories of how human
language evolved, and what it is useful for: communication, reasoning,
abstract thought. In this chapter, we have suggested another: language
is, perhaps, the result of perceptual representation learning over the
course of human evolution and cultural development. It's one of the main
formats brains have arrived at as the best way to represent the
"blooming, buzzing"[^3] sensory array around us.
[^1]: The idea of visual learning by image captioning actually has a
long history before CLIP. One important early paper on this topic is
@ordonez2011im2text.
[^2]: "To know what is where by looking" is how David Marr defined
vision in his seminal book on the subject @Marr82.
[^3]: This phrasing is from a quote by William James
(https://en.wikiquote.org/wiki/William_James).