# Object Recognition {#sec-object_recognition}
## Introduction
There are many tasks that researchers use to address the problem of
recognition. Although the ultimate goal is to tell what something is
by looking at it, the way that you will answer the question will change
the approach that you will use. For instance, you could say, "That is a
chair" and just point at it, or we could ask you to indicate all of the
visible parts of the chair, which might be a lot harder if the chair is
partially occluded by other chairs producing a sea of legs. It can be
hard to precisely delineate the chair if we cannot tell which legs
belong to it. In other cases, identifying what a chair is might be
difficult, requiring understanding of its context and social conventions
(@fig-what_is_a_chair_with_text).
![What is a chair? The definition is based on affordances (i.e., something you can sit on) and on social conventions (i.e., some surfaces you can sit on are meant to be used to support other objects than yourself). Affordances are likely to be directly accessible through vision, while social conventions might not be.](figures/object_recognition/what_is_a_chair_with_text.jpg){width="50%" #fig-what_is_a_chair_with_text}
In this chapter we will study three tasks related to the recognition of
objects in images (classification, localization, and segmentation). We
will introduce the problem definitions and the formulation that lays the
foundations for most existing approaches.
## A Few Notes About Object Recognition in Humans
Object perception is a very rich and diverse area of interdisciplinary
research. Many of the approaches in computer vision are rooted in
hypotheses formulated while trying to model the mechanisms used by the human
visual system. It is useful to be familiar with the literature in
cognitive science and neuroscience to understand the origin of existing
approaches and to gain insights on how to build new models. Here we will
review only a few findings, and we leave the reader with the task of
going deeper into the field.
### Recognition by Components
One important theory of how humans represent and recognize objects from
images is the recognition-by-components theory proposed by Irving
Biederman in 1987 @Biederman1987. He started motivating his theory with
the experiment illustrated in @fig-do_it_yourself. Can you recognize
the object depicted in the drawing? Does it look familiar? The first
observation one can make when looking at the drawing in
@fig-do_it_yourself is that it does not seem to be any object we know;
instead, it seems to be a made-up object that does not correspond to
anything in the world. The second observation is that the object seems
to be decomposable into parts and that different observers will probably
make the same decomposition. The figure can be separated into parts by
breaking the object in "regions of sharp concavity". And the third
observation is that the object resembles other known objects. Does it
look like an ice cream or hot-dog cart to you?
![What is this? This drawing from Biederman @Biederman1987, shows a novel object. After looking at this drawing we can conclude the following: (1) we have never seen this object before, (2) it can be decomposed into parts that most people will agree with, and (3) it resembles some familiar objects. Maybe it looks like an ice cream or hot-dog cart.](figures/object_recognition/do_it_yourself.png){width="50%" #fig-do_it_yourself}
Biederman postulated that objects are represented by decomposing the
object into a small set of geometric components (e.g., blocks,
cylinders, cones, and so on.) that he called **Geons**. He derived 36
different Geons that we can discriminate visually. Geons are defined by
non-accidental attributes such as symmetry, colinearity, cotermination,
parallelism, and curvature. An object is defined by a particular
arrangement of a small set of Geons. Object recognition consists in a
sequence of stages: (1) edge detection, (2) detection of non-accidental
properties, (3) detection of Geons, and (4) matching the detected Geons
to stored object representations.
One important aspect of his theory is that objects are formed by
compositing simpler elements (Geons), which are shared across many
different object classes. **Compositionality** in the visual world is
not as strong as the one found in language, where a fixed set of words is
composed to form sentences with different meanings. In the visual world,
components shared across different object classes will have different
visual appearances (e.g., legs of tables and chairs share many
properties but are not identical). Geons are well suited to representing
artificial objects (e.g., tables, cabinets, and phones) but fail when
applied to represent stuff (e.g., grass, clouds, and water) or highly
textured objects such as trees or food.
Decomposing an object into parts has been an important ingredient of
many computer vision object recognition systems
@Felzenszwalb2010, @Fischler1973, @Weber2000.
### Invariances
Object recognition in humans seems to be quite robust to changes in
viewpoint, illumination, occlusion, deformations, styles, and so on.
However, perception is not completely invariant to those variables.
One well studied property of the human visual system is its invariance
to image and 3D rotations. As we discussed in chapter
@sec-bias_and_shift, studies in human visual perception
have shown that objects are recognized faster when they appear in
canonical poses. In a landmark study, Roger N. Shepard and Jacqueline
Metzler @Shepard1971aa showed participants pairs of drawings of
three-dimensional objects (@fig-mental_rotation). The pair of drawings
could show a mirrored version of the same object, or the same object under an
arbitrary 3D rotation. The task consisted in deciding if the two objects
were identical (up to a 3D rotation) or mirror images of each other.
During the experiment, they recorded the success rate and the reaction
time (how long participants took to answer). The results showed that
participants had a reaction time proportional to the rotation angle
between the two views. This supported the idea that participants were
performing a **mental rotation** in order to compare both views. Whether
humans actually perform mental rotations or not still remains
controversial.
![Two figures related by a 3D rotation. Modified from @Shepard1971aa.](figures/object_recognition/mental_rotation.png){width="60%" #fig-mental_rotation}
Michael Tarr and Steven Pinker @Tarr1989MentalRA showed that similar
results were obtained in a learning task. When observers learn to
recognize a novel object from a single viewpoint they are able to
generalize to new viewpoints but they do that at a cost: recognition
time increases with the angle of rotation as if they had to mentally
rotate the object in order to compare it with the view seen during
training. When trained with multiple viewpoints, participants recognized
equally quickly all the familiar orientations.
The field of cognitive science has formulated a number of hypotheses about
the mechanisms used for recognizing objects: geometry-based models,
view-based templates, prototypes, and so on. Many of those hypotheses
have been the inspiration for computer vision approaches for object
recognition.
### Principles of Categorization
When working on object recognition, one typical task is to train a
classifier to classify images of objects into categories. A category
represents a collection of equivalent objects. Categorization offers a
number of advantages. As described in @Rosch1976BasicOI "It is to the
organism's advantage not to differentiate one stimulus from others when
that differentiation is irrelevant for the purposes at hand." Eleanor
Rosch and her collaborators proposed that categorization occurred at
three different levels of abstraction: superordinate level, basic-level,
and subordinate level. The following example, from @Rosch1978,
illustrates the three different levels of categorization:
:::{#tbl-levels_of_categorization}
| Superordinate | Basic Level | Subordinate |
|---------------|--------------|--------------------|
| Furniture | Chair | Kitchen chair |
| | | Living-room chair |
| | Table | Kitchen table |
| | | Dining-room table |
| | Lamp | Floor lamp |
| | | Desk lamp |
| Tree | Oak | White oak |
| | | Red oak |
| | Maple | Silver maple |
| | | Sugar maple |
| | Birch | River birch |
| | | White birch |
:::
Superordinate categories are very general and provide a high degree
of abstraction. Basic level categories correspond to the most common
level of abstraction. The subordinate level of categorization is the
most specific one. Two objects that belong to the same superordinate
categories share fewer attributes than two objects that belong to the
same subordinate category.
Object categorization into discrete classes has been a dominant approach
in computer vision. One example is the ImageNet dataset that organizes
object categories into the taxonomy provided by WordNet @Fellbaum1998.
However, as argued by E. Rosch @Rosch1978, "natural categories tend to
be fuzzy at their boundaries and inconsistent in the status of their
constituent members." Rosch suggested that categories can be represented
by **prototypes**. Prototypes are the clearest members of a category.
Rosch hypothesized that humans recognize categories by measuring the
similarity to prototypes. By using prototypes, Rosch @Rosch1978
suggested that it is easier to work with continuous categories, as the
categories are defined by the "clear cases rather than its boundaries."
But object recognition is not as straightforward as a categorization
task and using categories can result in a number of issues. One example
is dealing with the different affordances that an object might have as a
function of context. In language, a word or a sentence can have a
standing meaning and an occasion meaning. A standing meaning
corresponds to the conventional meaning of an expression. The occasion
meaning is the particular meaning that the same expression might have
when used in a specific context. Therefore, when an expression is being
used, it has an occasion meaning that differs from its conventional
meaning. The same thing happens with visual objects as illustrated in
@fig-box_table.
:::{layout-ncol="2" #fig-box_table}
![](figures/object_recognition/box_box2.jpg){width="47.5%"}
![](figures/object_recognition/box_table.jpg){width="47.5%"}
The same object can have different occasion meanings depending on the context. (a) A box. (b) A box used as a table.
:::
Avoiding categorization altogether has also been the focus of some
computer vision approaches to image understanding, like **the Visual
Memex** by T. Malisiewicz and A. Efros @malisiewicz-nips09.
We will now study a sequence of object recognition models with outputs
of increasing descriptive power.
## Image Classification {#sec-image_classification}
:::{.column-margin}
In @sec-image_classification we will repeat material already presented in the book. Read this section as if it were written by Pierre Menard, from the short story *Pierre Menard, Author of the Quixote* by Jorge Luis Borges @borges1939pierre. This character rewrote *Don Quixote* word for word, but the same words did not mean the same thing, because the context of the writing (who wrote it, why they wrote it, and when they wrote it) was entirely different.
:::
Image classification is one of the simplest tasks in object recognition.
The goal is to answer the following, seemingly simple, question: Is
object class $c$ present anywhere in the image ${\bf x}$? We also
covered image classification as a case study of machine learning in
section
@sec-intro_to_learning-image_classification. Even though we
will use the same words now, and even the same sentences, you should
read them in a different way. The same words will mean something
different. Before, we emphasized the problem of learning; now we are
focusing on the problem of object understanding. So, when we say
$\mathbf{y} = f(\mathbf{x})$, before the focus of our attention was $f$,
now the focus is $\mathbf{y}$.
Image classification typically assumes that there is a closed vocabulary
of object classes with a finite number of predefined classes. If our
vocabulary contains $K$ classes, we can loop over the classes and ask
the same question for each class. This task does not try to localize the
object in the image, or to count how many instances of the object are
present.
### Formulation
We can answer the previous question in a mathematical form by building a
function that maps the input image ${\bf x}$ into an output vector
$\hat {\bf y}$: $$\hat {\bf y} = f({\bf x})$$ The output, $\hat {\bf y}$, of the function will be a binary response: $1=yes$, the
object is present; and $0=no$, the object is not present.
![Image classification: Is there a car in this image?](figures/object_recognition/image_classification.png){width="50%" #fig-image_classification}
In general we want to classify an image according to multiple classes.
We can do this by having as output a vector $\hat {\bf y}$ of length
$K$, where $K$ is the number of classes. Component $c$ of the vector
$\hat {\bf y}$ will indicate whether class $c$ is present or absent in
the image. For instance, the training set will be composed of examples
of images and the corresponding class indicator vectors as shown in
@fig-classification_training_set (in this case, $K=3$).
![Training set for image classification with $K=3$ classes.](figures/object_recognition/classification_training_set_v2.png){width="100%" #fig-classification_training_set}
We want to enrich the output so that it also represents the uncertainty
in the presence of the object. This uncertainty can be the result of a
function $f_{\theta}$ that does not work very well, or from a noisy or
blurry input image $\mathbf{x}$ where the object is difficult to see.
One can formulate this as a function $f_{\theta}$ that takes as input
the image ${\bf x}$ and outputs the probability of the presence of each
of the $K$ classes. Letting $Y_c \in \{0,1\}$ be a binary random
variable indicating the presence of class $c$, we model uncertainty as,
$$\begin{aligned}
p(Y_c=1 \bigm | \mathbf{x}) &= \hat{y}_c
\end{aligned}$${#eq-object_recognition-classification_probability_model}
#### Exclusive classes
One common assumption is that only one class, among the $K$ possible
classes, is present in the image. This is typical in settings where most
of the images contain a single large object. Multiclass classification
with exclusive classes results in the following constraint:
$$\sum_{c=1}^K \hat{y}_c=1
$${#eq-object_recognition-sum_to_one_constraint}
where $0 \leq \hat{y}_c \leq 1$. In the toy example with three classes shown
previously, the valid solutions for
$\hat{\mathbf{y}}=[\hat{y}_1, \hat{y}_2, \hat{y}_3]^\mathsf{T}$ are
constrained to lie within a **simplex**.
![](figures/object_recognition/simplex_v2.png){width="30%"}
Under this formulation, the function $f$ is a mapping
$f:\mathbb{R}^{N \times M \times 3} \rightarrow \vartriangle^{K-1}$ from
the set of red-green-blue (RGB) images to the $(K-1)$-dimensional
simplex, $\vartriangle^{K-1}$.
The function $f$ is constrained to belong to a family of possible
functions. For instance, $f$ might belong to the space of all the
functions that can be built with a neural network. In such a case the
function is specified by the network parameters $\theta: f_\theta$. When
using a neural net, the vector $\hat{\mathbf{y}}$ is usually computed as
the output of a softmax layer.
This yields the softmax regression model we saw previously in section
@sec-intro_to_learning-image_classification. Given the
constraint of @eq-object_recognition-sum_to_one_constraint, the output
of $f_{\theta}$ can be interpreted as a probability mass function over
$K$ classes:
$$\begin{aligned}
p_{\theta}(Y \bigm | \mathbf{x}) &= \hat{\mathbf{y}} = f_{\theta}(\mathbf{x})
\end{aligned}$$
where $Y$ is the random variable indicating the single
class per image. This relates to our previous model in
@eq-object_recognition-classification_probability_model as:
$$\begin{aligned}
p(Y_c=1 \bigm | \mathbf{x}) &= p_{\theta}(Y=c \bigm | \mathbf{x})
\end{aligned}$$
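To make this concrete, here is a minimal PyTorch sketch of a classifier whose output lies on the simplex. The tiny network and the 32×32 input size are arbitrary placeholders; any backbone ending in a $K$-way linear layer followed by a softmax has the same property.

```python
# A minimal sketch of an exclusive-class image classifier in PyTorch.
import torch
import torch.nn as nn

K = 3  # number of classes in the toy example

f_theta = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global average pooling
    nn.Flatten(),
    nn.Linear(16, K),          # K unnormalized scores (logits)
    nn.Softmax(dim=-1),        # maps the scores onto the (K-1)-simplex
)

x = torch.rand(1, 3, 32, 32)          # a random RGB image
y_hat = f_theta(x)                    # shape (1, K)
print(y_hat, y_hat.sum(dim=-1))       # entries in [0, 1], summing to 1
```

In practice the softmax is often folded into the loss for numerical stability; the explicit layer here just makes the simplex constraint visible.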
#### Multilabel classification
When the image can have multiple labels simultaneously, the
classification problem can be formulated as $K$ binary classification
problems (where $K$ is the number of classes).
In such a case, the function $f_\theta$ can be a neural network shared
across all classes (@fig-class_architecture), and the output vector
$\hat{\mathbf{y}}$, of length $K$, is computed using a sigmoid
nonlinearity for each output class. In this case the output can still be
interpreted as the probability
$\hat{y}_c = p_\theta(Y_c=1 \bigm | \mathbf{x})$, but without the
constraint that the sum across classes is 1.
![Multilabel image classification system.](figures/object_recognition/class_architecture.png){width="70%" #fig-class_architecture}
### Classification Loss
In order to learn the model parameters, we need to define a loss
function that will capture the task that we want to solve. One natural
measure of classification error is the **misclassification error**. The
errors that the function $f$ makes are the number of misplaced ones
(i.e., the misclassification error) over the dataset, which we can write
as:
$$\begin{aligned}
\mathcal{L}(\hat{\mathbf{y}},\mathbf{y}) =
\sum_{t=1}^T \sum_{c=1}^{K}
\mathbb{1}(\hat{y}^{(t)}_c \neq y^{(t)}_c ).
\end{aligned}$$ where $T$ is the number of images in the dataset.
However, it is hard to learn the function parameters using gradient
descent with this loss as it is not differentiable. Finding the optimal
parameters of function $f_\theta$ requires defining a loss function that
will be tractable to optimize during the training stage.
#### Exclusive classes
If we interpret the function $f$ as estimating the probability
$p(Y_c=1 \bigm | {\bf x}) = \hat{y}_c$, with $\sum_c \hat{y}_c =1$, then the
likelihood of the ground truth data is:
$$
\prod_{t=1}^T \prod_{c=1}^K \left( \hat{y}^{(t)}_c \right) ^{y^{(t)}_c}
$${#eq-object_recognition-classification_likelihood}
where $y^{(t)}_c$ is the ground truth value of $Y_c$ for image $t$, and
the first product loops over all $T$ training examples while the second
product loops over all the classes $K$. We want to find the parameters
$\theta$ that maximize the likelihood of the ground truth labels over the
whole training set. As we have seen in chapter
@sec-convolutional_neural_nets, maximizing the
likelihood corresponds to minimizing the **cross-entropy loss** (which
we get by taking the negative log of
@eq-object_recognition-classification_likelihood), resulting in the total classification loss:
$$
\mathcal{L}_{cls}(\hat{\mathbf{y}},\mathbf{y})
= -\sum_{t=1}^{T} \sum_{c=1}^{K} y^{(t)}_c \log(\hat{y}^{(t)}_c)
$$
The cross-entropy loss is differentiable and it is commonly used in
image classification tasks.
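The sketch below, assuming PyTorch and working with unnormalized logits as is customary for numerical stability, shows that a direct translation of the formula matches the built-in cross-entropy loss:

```python
# Cross-entropy loss for exclusive classes, written two ways: directly from
# the definition above, and with PyTorch's built-in loss (which expects
# unnormalized logits and applies the softmax internally).
import torch
import torch.nn.functional as F

T, K = 4, 3                                  # 4 images, 3 classes
logits = torch.randn(T, K)                   # unnormalized scores from f_theta
y = torch.tensor([0, 2, 1, 2])               # ground truth class per image

# Direct translation of the formula: -sum_t sum_c y_c log(y_hat_c)
y_hat = F.softmax(logits, dim=1)
y_onehot = F.one_hot(y, K).float()
loss_manual = -(y_onehot * torch.log(y_hat)).sum()

# Built-in version (summed over the batch to match the formula).
loss_builtin = F.cross_entropy(logits, y, reduction='sum')

print(loss_manual.item(), loss_builtin.item())  # the two values agree
```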
#### Multilabel classification
If classes are not exclusive, the likelihood of the ground truth data
is:
$$\prod_{t=1}^T \prod_{c=1}^K \left( \hat{y}^{(t)}_c \right) ^{y^{(t)}_c}
\left( 1-\hat{y}^{(t)}_c \right) ^{1-y^{(t)}_c}$$ where the outputs
$\hat{y}_c$ are computed using a sigmoid layer. Taking the negative log
of the likelihood gives the **multiclass binary cross-entropy loss**:
$$\mathcal{L}_{cls}(\hat{\mathbf{y}},\mathbf{y})
= -\sum_{t=1}^{T} \sum_{c=1}^{K} y^{(t)}_c \log(\hat{y}^{(t)}_c)
+
(1-y^{(t)}_c) \log(1-\hat{y}^{(t)}_c)$$
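A minimal sketch of this multilabel loss, again assuming PyTorch and toy placeholder data:

```python
# Multilabel (non-exclusive) classification loss: K independent binary
# cross-entropies per image, with one sigmoid output per class.
import torch
import torch.nn.functional as F

T, K = 4, 3
logits = torch.randn(T, K)                       # unnormalized scores
y = torch.tensor([[1., 0., 1.],                  # each image can have
                  [0., 0., 0.],                  # any subset of the K labels
                  [1., 1., 0.],
                  [0., 1., 1.]])

y_hat = torch.sigmoid(logits)
loss_manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()

# Numerically stable built-in version operating directly on the logits.
loss_builtin = F.binary_cross_entropy_with_logits(logits, y, reduction='sum')

print(loss_manual.item(), loss_builtin.item())
```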
### Evaluation
Once the function $f_\theta$ has been trained, we have to evaluate its
performance over a holdout test set. There are several popular measures
of performance.
Classification performance uses the output of the classifier, $f$, as a
continuous value, the score $\hat{y}_c$, for each class. We can rank the
predicted classes according to that value. The **top-1** performance
measures the percentage of times that the true class-label matches the
highest scored prediction,
$\hat{y} = \underset{c}{\mathrm{argmax}} ( \hat{y}_c )$, in the test
set:
$$\text{TOP-1} = \frac{100}{T}
\sum_{t=1}^T \mathbb{1}( \hat{y}^{(t)} = y^{(t)})
$$
where $y^{(t)}$ is the ground truth class of image $t$. This evaluation metric only works for the exclusive class case since we assume a single ground truth class per image. For the multilabel case, we may instead check whether each label is predicted correctly.
In some benchmarks (such as ImageNet), researchers also measure the
**top-5** performance, which is the percentage of times that the true
label is within the set of five highest scoring predictions. Top-5 is
useful in settings where the labels might be ambiguous, or where
multiple labels for one image might be possible. In the case of multiple
labels, ideally, the test set should specify which labels are possible
and evaluate only using those. The top-5 measure is less precise and it
is used when the test set only contains one label despite that multiple
labels might be correct.
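A small sketch of how top-1 and top-5 performance can be computed from a matrix of class scores (the scores and labels below are random placeholders):

```python
# Top-1 and top-k accuracy from a matrix of class scores. A prediction is
# counted as correct under top-k if the true class is among the k highest
# scoring classes for that image.
import torch

def topk_accuracy(scores, labels, k=1):
    # scores: (T, K) class scores, labels: (T,) ground truth class indices
    topk = scores.topk(k, dim=1).indices            # (T, k) best classes
    correct = (topk == labels[:, None]).any(dim=1)  # true class in the top k?
    return 100.0 * correct.float().mean().item()

scores = torch.randn(1000, 10)                      # toy test set, K = 10
labels = torch.randint(0, 10, (1000,))
print(topk_accuracy(scores, labels, k=1))           # TOP-1 (percent)
print(topk_accuracy(scores, labels, k=5))           # TOP-5 (percent)
```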
For multiclass prediction problems it is often useful to look at the
structure of the mistakes made by the model. The **confusion matrix**
summarizes both the overall performance of the classifier and also the
percentage of times that two classes are confused by the classifier. For
instance, the following matrix shows the evaluation of a three-way
classifier. The diagonal elements are the same as the top-1 performance
for each class. The off-diagonal elements show the confusions. In this
case, it seems that the classifier confuses cats as dogs the most.
![](figures/object_recognition/confusion_matrix.png){width="50%" #tbl-confusion_matrix}
The elements of the confusion matrix are:
$$C_{i,j} = 100 \frac{
\sum_{t=1}^T \mathbb{1}( \hat{y}^{(t)} = j) \mathbb{1} (y^{(t)} = i)
}{
\sum_{t=1}^T \mathbb{1}(y^{(t)} = i)
}
$$ where $C_{i,j}$ measures the percentage of times that true class
$i$ is classified as class $j$.
In this toy example (
@tbl-confusion_matrix), we can see that 80 percent of the
cat images are correctly classified, while 5 percent of cat images are
classified as car images, and 15 percent are classified as pictures of
dogs.
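A minimal sketch of how such a confusion matrix can be computed from predicted and ground truth labels (the toy labels below are placeholders, and every class is assumed to appear at least once in the test set):

```python
# Confusion matrix for a K-way classifier: C[i, j] is the percentage of test
# images of true class i that were predicted as class j.
import numpy as np

def confusion_matrix(y_true, y_pred, K):
    C = np.zeros((K, K))
    for i in range(K):
        in_class_i = (y_true == i)          # assumes class i occurs in y_true
        for j in range(K):
            C[i, j] = 100.0 * np.sum(in_class_i & (y_pred == j)) / np.sum(in_class_i)
    return C

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])   # toy ground truth labels
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 2])   # toy predictions
print(confusion_matrix(y_true, y_pred, K=3))         # rows sum to 100
```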
### Shortcomings
The task of image classification is plagued with issues. Although it is
a useful task to measure progress in computer vision and machine
learning, one has to be aware of its limitations when trying to use it
to produce a meaningful description of a scene, or when developing a
classifier for consumption in the real world.
One very important shortcoming of this task is that it assumes that we
can actually answer unambiguously the question *does the image contain
object $c$?* But what happens if class definitions are ambiguous or class
boundaries are soft? Even in cases where we believe that the boundaries
might be well defined, it is easy to find images that will challenge our
assumption, as shown in @fig-whatisacar.
:::{layout-ncol="3" #fig-whatisacar}
![](figures/object_recognition/IMG_1260.jpeg){width="32%"}
![](figures/object_recognition/IMG_7095.jpeg){width="32%"}
![](figures/object_recognition/IMG_9933.jpeg){width="32%"}
![](figures/object_recognition/tesla.jpg){width="32%"}
![](figures/object_recognition/IMG_4467.jpeg){width="32%"}
![](figures/object_recognition/IMG_9325.jpeg){width="32%"}
Which of these images contains a car? This is a simple question that does not have a simple answer.
:::
Which of these images in @fig-whatisacar contains a car? A simple
question that does not have a simple answer. The top three images
contain cars, although with increasing difficulty. However, for the
three images on the second row, it is not clear what the desired answer
is. We can clearly recognize the content of those images, but it
feels as if using a single word for describing those images leaves too
much ambiguity. Is a car under construction already a car? Is the toy
car a car? And if we play with food and make a car out of watermelon,
would that be a car? Probably not. But if a vision system classifies
that shape as a car, would that be a mistake like any other?
Although it is not clearly stated in the problem definition, language
plays a big role in object classification. The objects in
@fig-object_recognition_pepper are easily recognizable to us, but if we
ask the question "Is there a fruit in this picture?" the answer is a
bit more complex than it might appear at first glance.
![A pepper is a fruit according to botany, and a vegetable according to the culinary classification.](figures/object_recognition/pepper.jpg){width="32%" #fig-object_recognition_pepper}
What if the object is present in the scene but invisible in the image?
What if there are infinite classes? In our formulation, $\hat {\bf y}$
is a vector of a fixed length. Therefore, we can only answer the
question "is object class $c$ present in the image?" for a finite set of
classes. What if we want to be able to answer an infinite set of
questions? The next sections introduce more sophisticated formulations
of object recognition that address some of these questions.
Is it possible to classify an image without localizing the object? How
can we answer the question "Is object class $c$ present in the image?"
without localizing the object? One possibility is that the function $f$
has learned to localize the object internally, but the information about
the location is not being recorded in the output. Another possibility is
that the function has learned to use other cues present in the image
(biases in the dataset). As it is not trying to localize an object, the
function $f$ will not get penalized if it uses shortcuts such as only
recognizing one part of the object (e.g., the head of a dog, or
car wheels), or if it makes use of contextual biases in the dataset
(e.g., all images with grass and blue skies have horses, or all streets
have cars) or detects unintended correlations between low-level image
properties and the image content (e.g., all images of insects have a
blurry background). As a consequence, image classification performance
could mislead us into believing that the classifier works well and that
it has learned a good representation of the object class it classifies.
## Object Localization
For many applications, saying that an object is present in the image is
not enough. Suppose that you are building a visual system for an
autonomous vehicle. If the visual system only tells the navigation
system that there is a person in the image, that will be insufficient for
deciding what to do. Object localization consists of localizing where
the object is in the image. There are many ways one can specify where
the object is. The most traditional way of representing the object
location is using a **bounding box** (@fig-object_detection_bb), that
is, finding the image-coordinates of a tight box around each of the
instances of class $c$ in the image $\bf x$.
### Formulation
How you look at an object does not change the object itself (whether you
see it or feel it, etc.). This induces translation and scale invariance.
Let's build a system that has this property.
We will formulate object detection as a function that maps the input
image ${\bf x}$ into a list of bounding boxes $\hat {\bf b}_i$ and the
associated classes to each bounding box encoded by a vector
$\hat {\bf y}_i$, as illustrated in @fig-object_detection_bb.
![Car localization using bounding boxes. Each instance is shown with a different color.](figures/object_recognition/object_detection_bb.png){width="40%" #fig-object_detection_bb}
The class vector, $\hat {\bf y}_i$, has the same structure as in the
image classification task but now it is applied to describe each
bounding box. Bounding boxes are usually represented as a vector of
length 4 using the coordinates of the two corners,
${\bf b} = \left[ x_1, y_1, x_2, y_2 \right]$, or with the center
coordinates, width, and height,
${\bf b} = \left[ x_c, y_c, w, h \right]$. Our goal is a function $f$
that outputs a set of bounding boxes, $\mathbf{b}$, and their classes
$\mathbf{y}$: $$\{\hat {\bf y}_i, \hat {\bf b}_i\} = f({\bf x})$$
Most approaches for object detection have three main steps. In the first
step, a set of candidate bounding boxes are proposed. In the second
step, we loop over all proposed bounding boxes, and for each one, we
apply a classifier to the image patch inside the bounding box. In the
third and final step, the selected bounding boxes are postprocessed to
remove any redundant detections.
#### Window scanning approach
In its simplest form, the problem of object localization is posed as a
binary classification task, namely distinguishing between a single
object class and a background class. Such a classification task can be
turned into a detector by sliding it across the image (or image
pyramid), and classifying each local window as shown in
@fig-scanningWindow2.
![Window scanning approach using a multiscale image pyramid.](figures/object_recognition/scanningWindow2.png){width="100%" #fig-scanningWindow2}
The window scanning algorithm is the following:
![Window scanning algorithm.](figures/object_recognition/window_scanning.png){width="100%" #fig-window_scanning}
In this approach, translation and scale invariance are achieved by
the bounding box proposal mechanism. We can add another invariance, such
as rotation invariance, by proposing rotated bounding boxes.
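The following sketch illustrates the window scanning loop over an image pyramid. The scoring function is a random placeholder (in a real detector it would be a trained classifier), and the window size, stride, and scales are arbitrary choices:

```python
# A minimal sketch of the window scanning detector over an image pyramid.
import numpy as np

def classify(patch):
    # Placeholder scoring function: returns p(object | patch) in [0, 1].
    return np.random.rand()

def scanning_window_detector(image, window=64, stride=16,
                             scales=(1.0, 0.75, 0.5), threshold=0.99):
    detections = []                                    # (score, [x1, y1, x2, y2])
    h, w = image.shape[:2]
    for s in scales:                                   # image pyramid
        H, W = int(h * s), int(w * s)
        iy = (np.arange(H) / s).astype(int)            # nearest-neighbor resize
        ix = (np.arange(W) / s).astype(int)
        resized = image[iy[:, None], ix[None, :]]
        for y in range(0, H - window + 1, stride):     # scan all locations
            for x in range(0, W - window + 1, stride):
                score = classify(resized[y:y + window, x:x + window])
                if score > threshold:
                    # map the box back to original image coordinates
                    detections.append((score, [x / s, y / s,
                                               (x + window) / s,
                                               (y + window) / s]))
    return detections

image = np.random.rand(256, 256, 3)
print(len(scanning_window_detector(image)))
```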
#### Selective search
The window scanning approach can be slow as the same classifier needs to
be applied to tens of thousands of image patches. Selective search makes
the process more efficient by proposing an initial set of bounding boxes
that are good candidates to contain an object
(@fig-selective_search_pipeline). This proposal mechanism is performed
by a selection mechanism simpler than the object classifier. This
approach was motivated by the strategy used by the visual system in
which attention is first directed toward image regions likely to contain
the target @wolfe2007, @TreismanGelade1980, @Koch_Ullman_1985. This
first attentional mechanism is very fast but might be wrong. Therefore,
a second, more accurate but also more expensive, processing stage is
required in order to make a reliable decision. The advantage of this
**cascade** of decisions is that the most expensive classifier is only
applied to a sparse set of locations (the ones selected by the cheap
attentional mechanism) dramatically reducing the overall computational
cost.
The algorithm for selective search only differs from the window scanning
approach in how the list of candidate bounding boxes is generated. Some
approaches also add a bounding box refinement step. The overall approach
is described in algorithm
@fig-selective_search.
![Selective search algorithm.](figures/object_recognition/selective_search_algorithm.png){width="100%" #fig-selective_search}
This algorithm has four main steps. In the first step, the algorithm
uses an efficient window scanning approach to produce a set of candidate
bounding boxes. In the second step, we loop over all the candidate
bounding boxes, we crop out the box and resize it to a canonical size,
and then we apply an object classifier to classify the cropped image as
containing the object we are looking for or not. In the third step, we
refine the bounding box for the crops that are classified as containing
the object. And finally, in the fourth step, we remove low-scoring boxes
and we use **nonmaximum suppression** (NMS) to discard overlapping
detections likely to correspond to the same object, so as to output only
one bounding box for each instance present in the image.
:::{.column-margin}
The nonmaximum suppression algorithm takes as input a set of object bounding boxes and confidences, $S=\{\hat{\mathbf{b}}_i, \hat{\mathbf{y}}_i\}_{i=1}^B$, and outputs a smaller set removing overlapping bounding boxes. It is an iterative algorithm as follows (a code sketch is given after this note):
(1) Take the highest confidence bounding box from the set $S$ and add it to the final set $S^*$.
(2) Remove from $S$ the selected bounding box and all the bounding boxes with an IoU larger than a threshold.
(3) Go to step 1 until $S$ is empty.
:::
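A minimal implementation of the greedy NMS procedure described in the margin note might look as follows. Boxes are given as corner coordinates, and the 0.5 IoU threshold is just a common choice:

```python
# Greedy nonmaximum suppression (NMS) over boxes [x1, y1, x2, y2] with scores.
import numpy as np

def iou(a, b):
    # Intersection over union of two boxes [x1, y1, x2, y2].
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    order = list(np.argsort(scores)[::-1])     # highest confidence first
    keep = []
    while order:
        best = int(order.pop(0))               # (1) take the best remaining box
        keep.append(best)
        # (2) discard boxes overlapping too much with the selected one
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep                                # (3) repeat until the set is empty

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 160]])
scores = np.array([0.9, 0.8, 0.75])
print(nms(boxes, scores))                      # -> [0, 2]
```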
Each step can be implemented in several different ways, giving rise to
different approaches. The whole pipeline is summarized in
@fig-selective_search_pipeline.
![A first classifier selects candidate bounding boxes. A second, stronger, classifier makes the final detections (e.g., cars). Nonmaximum suppression (NMS) removes overlapping detections.](figures/object_recognition/selective_search.png){width="100%" #fig-selective_search_pipeline}
Bounding box proposal (represented as $f_0$ in
@fig-selective_search_architecture) can be implemented in several ways
(e.g., using image segmentation @Uijlings2013, a neural network
@Ren2015, or a window scanning approach with a low-cost classifier). The
classification and bounding box refinement, $f_1$, can be implemented by
a classifier and a regression function.
![Sketch of the selective search architecture. The image is first broken into candidate regions. Then each region is processed individually using a classifier and a regression.](figures/object_recognition/selective_search_architecture.png){width="40%" #fig-selective_search_architecture}
#### Cascade of classifiers
Trying to localize an object in an image is like finding a needle in a
haystack: the object is usually small and might be surrounded by a
complex background. Selective search reduced the complexity of the
search by dividing it into two steps: a first, fast, and cheap
classification function that detects good candidate locations; and a
second, slow, and expensive classification function capable of
accurately classifying the object and that only needs to be applied in a
subset of all possible locations and scales. A cascade of classifiers
pushes this idea to the limit by dividing the search into a sequence of
classifiers of increasing computational complexity and accuracy.
The algorithm for the cascade of classifiers is:
![Cascade of classifiers algorithm.](figures/object_recognition/cascade_algorithm.png){width="100%" #fig-cascade_algorithm}
Cascades of classifiers became popular in computer vision when Paul
Viola and Michael Jones @Viola01 introduced it in 2001 with a
ground-breaking real-time face detector based on a cascade of boosted
classifiers. In parallel, Fleuret and Geman @Fleuret2001 also proposed a
system that performs a sequence of binary tests at each location. Each
binary test checks for the presence of a particular image feature. Early
versions of this strategy were also inspired by the game "Twenty
Questions" @Geman1994.
In a cascade, computational power is allocated to the image regions
that are more likely to contain the target object, while regions that are
flat or contain few features are rejected quickly and almost no
computation is allocated to them. The following figure, from
@Fleuret2001, shows a beautiful illustration of how a cascaded
classifier allocates computing power in the image when trained to detect
faces. The intensity shown in the heat map is proportional to the number
of levels in the cascade applied to each location.
![(left) Input image with the output of a face detector from 1999. (right) The heat map reports the computational cost at each location. Most of the computation was allocated to the wrong detections in the top left corner. Figure from @Fleuret2001.](figures/object_recognition/cascade.png){width="80%" #fig-cascade}
It is interesting to point out that the scanning window approach
($Levels = 1$) and the selective search procedure ($Levels = 2$) are
special cases of this algorithm. The cascade of classifiers usually does
not have a bounding box refinement stage as it already started with a
full list of all bounding boxes, but it could be added if we wanted to
add new transformations not available in the initial set (e.g.,
rotations).
#### Other approaches
Object localization is an active area of research and there are a number
of different formulations that share some elements with the approaches
we described previously. One example is YOLO @Redmon2016, which makes
predictions by looking at the image globally. It is not as accurate as
some of the scanning methods but it can be computationally more
efficient. There are many other approaches that we will not summarize
here as the list will be obsolete shortly. Instead, we will continue
focusing on general concepts that should help the reader understand
other approaches.
### Object Localization Loss
The object localization loss has to take into account two complementary
tasks: classification and localization.
- Classification loss ($\mathcal{L}_{\text{cls}}$): For each detected
bounding box, does the predicted label match the ground truth at
that location?
- Localization loss ($\mathcal{L}_{\text{loc}}$): How close is the
detected location to the ground truth object location?
For each image, the output of our object detector is a set of bounding
boxes $\{\hat{\mathbf{b}}_i, \hat{\mathbf{y}}_i\}_{i=1}^B$, where $B$ is
likely to be larger than the number of ground truth bounding boxes in
the training set. The first step in the evaluation is to associate each
detection with a ground truth label so that we have a set
$\{\mathbf{b}_i, \mathbf{y}_i\}_{i=1}^B$. For each of the predicted
bounding boxes that overlaps with the ground truth instances, we want to
optimize the system parameters to improve the predicted locations and
labels. The remaining predicted bounding boxes that do not overlap with
ground truth instances are assigned to the background class,
$\mathbf{y}_i=0$. For those bounding boxes, we want to optimize the
model parameters in order to reduce their predicted class score,
$\hat{\mathbf{y}}_i$. This process is illustrated in the following
figure. The detector produces a set of bounding boxes for candidate car
locations. Those are compared with the ground truth data. Each detected
bounding box is assigned one ground truth label (indicated by the color)
or assigned to the background class (indicated in white). Note that
several detections can be assigned to the same ground truth bounding
box.
![(left) Detector output (using a low threshold to generate many detections). (middle) Ground truth bounding boxes (one color per instance). (right) Detector outputs that overlap with ground truth annotations, which are color coded](figures/object_recognition/detections.png){width="100%"}
Now that we have assigned detections to ground truth annotations, we can
compare them to measure the loss. For the classification loss, as each
bounding box can only have one class, we can use the cross-entropy loss:
$$\mathcal{L}_{\text{cls}}(\hat{\mathbf{y}}_i, \mathbf{y}_i)
= - \sum_{c=1}^{K} y_{c,i} \log(\hat{y}_{c,i})
$$ where $K$ is the number of classes.
Let's now focus on the second part; how do we measure the localization
loss $\mathcal{L}_{\text{loc}} (\hat{\mathbf{b}}_i, \mathbf{b}_i)$? One
typical measure of similarity between two bounding boxes is the
**Intersection over Union** (IoU) as shown in the following drawing:
![Sketch of the computation of intersection over union for two bounding boxes.](figures/object_recognition/iou.png){width="55%"}
The IoU is a quantity between 0 and 1, with 0 meaning no overlap and 1
meaning that both bounding boxes are identical. As the IoU is a
similarity measure, the loss is defined as:
$$\mathcal{L}_{\text{loc}} (\hat{\mathbf{b}}, \mathbf{b})= 1-IoU (\hat{\mathbf{b}}, \mathbf{b}).$$
The IoU is translation and scale invariant, and it is frequently used to
evaluate object detectors. A simpler loss to optimize is the L2
regression loss:
$$\mathcal{L}_{\text{loc}} (\hat{\mathbf{b}}, \mathbf{b}) = (\hat{x}_1 - x_1)^2 + (\hat{x}_2 - x_2)^2 + (\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2$$
The L2 regression loss is translation invariant, but it is not scale
invariant. The L2 loss is larger for big bounding boxes. The next graph
(@fig-iou_l2_comparison) compares the IoU loss, and the L2 regression
loss for two square bounding boxes, with an area equal to 1, as a
function of the relative $x$-displacement.
![Comparison of the IoU loss and the L2 regression loss for two square bounding boxes as a function of the relative $x$-displacement.](figures/object_recognition/iou_l2_comparison.png){width="90%" #fig-iou_l2_comparison}
The IoU loss becomes 1 when the two bounding boxes do not overlap, which
means that there will be no gradient information when running
backpropagation.
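The small numerical sketch below works through this comparison for two unit squares shifted along $x$: the IoU loss saturates at 1 once the boxes no longer overlap (and so provides no gradient), while the L2 loss keeps growing with the displacement.

```python
# A worked comparison of the IoU loss and the L2 regression loss for two unit
# squares as a function of their relative x-displacement.
import numpy as np

def iou_loss(b_hat, b):
    # Boxes are [x1, y1, x2, y2]; loss = 1 - IoU.
    ix1, iy1 = max(b_hat[0], b[0]), max(b_hat[1], b[1])
    ix2, iy2 = min(b_hat[2], b[2]), min(b_hat[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_hat = (b_hat[2] - b_hat[0]) * (b_hat[3] - b_hat[1])
    area = (b[2] - b[0]) * (b[3] - b[1])
    return 1.0 - inter / (area_hat + area - inter)

def l2_loss(b_hat, b):
    return float(np.sum((np.array(b_hat) - np.array(b)) ** 2))

b = [0.0, 0.0, 1.0, 1.0]                         # ground truth unit square
for dx in [0.0, 0.25, 0.5, 1.0, 2.0]:
    b_hat = [dx, 0.0, 1.0 + dx, 1.0]             # same square shifted by dx
    print(dx, iou_loss(b_hat, b), l2_loss(b_hat, b))
# The IoU loss saturates at 1 once the boxes stop overlapping (dx >= 1),
# while the L2 loss keeps growing with the displacement.
```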
Now we can put together the classification and the localization loss to
compute the overall loss:
$$
\mathcal{L}( \{\hat{\mathbf{b}}_i, \hat{\mathbf{y}}_i\}, \{\mathbf{b}_i, \mathbf{y}_i\})
= \mathcal{L}_{\text{cls}}(\hat{\mathbf{y}}_i, \mathbf{y}_i) +
\lambda \mathbb{1} (\mathbf{y}_i \neq 0)
\mathcal{L}_{\text{loc}}
(\hat{\mathbf{b}}_i, \mathbf{b}_i)
$$ where the indicator function
$\mathbb{1} (\mathbf{y}_i \neq 0)$ sets to zero the location loss for
the bounding boxes that do not overlap with any of the ground truth
bounding boxes. The parameter $\lambda$ can be used to balance the
relative strength of both losses. This loss makes it possible to train
the whole detection algorithm end-to-end. It is possible to train the
localization and the classification stages independently and some
approaches follow that strategy.
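As an illustration, here is a sketch of this combined loss for a single assigned bounding box, using the L2 form of the localization loss for simplicity. The background class is placed at index 0, and all inputs are hypothetical placeholders; in a real detector this is summed over all assigned boxes.

```python
# A sketch of the combined detection loss for one assigned bounding box:
# classification loss plus (for non-background boxes) a weighted localization loss.
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_target, box_pred, box_target, lam=1.0):
    # cls_logits: (K+1,) scores including a background class at index 0
    # cls_target: scalar class index (0 = background)
    # box_pred, box_target: (4,) boxes [x1, y1, x2, y2]
    loss_cls = F.cross_entropy(cls_logits[None], cls_target[None])
    is_object = (cls_target != 0).float()          # indicator 1(y != 0)
    loss_loc = F.mse_loss(box_pred, box_target, reduction='sum')  # L2 box loss
    return loss_cls + lam * is_object * loss_loc

cls_logits = torch.randn(4)                        # K = 3 classes + background
cls_target = torch.tensor(2)
box_pred = torch.tensor([10., 10., 50., 60.])
box_target = torch.tensor([12., 11., 49., 62.])
print(detection_loss(cls_logits, cls_target, box_pred, box_target))
```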
### Evaluation
There are several ways of evaluating object localization approaches. The
most common approach is measuring the average precision-recall.
Just as we did when defining the localization loss, we need to assign
detection outputs to ground truth labels. We can do this in several
ways, and they can result in different measures of performance. The
methodology introduced in the PASCAL challenge used the following
procedure.
**Assign detections to ground truth labels**: For each image, sort all
the detections by their score, $\hat{\mathbf{y}}_i$. Then, loop over the
sorted list in decreasing order. For each bounding box, compute the IoU
with all the ground truth bounding boxes. Select the ground truth
bounding box with the highest IoU. If the IoU is larger than a
predefined threshold (a typical value is 0.5), mark the detection as
correct and remove the ground truth bounding box from the list to avoid
double counting the same object. If the IoU is below the threshold, mark
the detection as incorrect and do not remove the ground truth label.
Repeat this operation until there are no more detections to evaluate.
Mark remaining ground-truth bounding boxes as missed detections.
**Precision-recall curve** measures the performance of the detection as
a function of the decision threshold. As each bounding box comes with a
confidence score $\hat{\mathbf{y}}_i$, we need to use a threshold,
$\beta$, to decide if an object is present at the bounding box location
$\hat{\mathbf{b}}_i$ or not. Given a threshold $\beta$, the number of
detections is $\sum_i \mathbb{1} (\hat{\mathbf{y}}_i > \beta)$ and the
number of correct detections is
$\sum_i \mathbb{1} (\hat{\mathbf{y}}_i > \beta) \times \mathbf{y}_i$.
From these two quantities we compute the **precision** as the percentage
of correct detections:
$$
Precision(\beta) = \frac{\sum_i \mathbb{1} (\hat{\mathbf{y}}_i > \beta) \times \mathbf{y}_i} {\sum_i \mathbb{1} (\hat{\mathbf{y}}_i > \beta)}
$$
The precision only gives a partial view of the detector performance as
it does not account for the number of ground truth instances that are
not detected (misdetections). The **recall** measures the proportion of
ground truth instances that are detected for a given decision threshold
$\beta$:
$$Recall (\beta) = \frac{\sum_i \mathbb{1} (\hat{\mathbf{y}}_i > \beta) \times \mathbf{y}_i} {\sum_i \mathbf{y}_i}$$
Both the precision and the recall are quantities between 0 and 1. High
values of precision and recall correspond to high performance. The next
graph (@fig-example_precision_recall) shows the precision-recall curve
as a function of $\beta$ (decision threshold).
![Example of a precision-recall curve. This is a standard plot to evaluate detection algorithms. It is usually summarized by the area under the curve. In this example, note that recall stops around 0.5. This is because this algorithm fails to detect the other 50 percent of the test examples for all thresholds $\beta$.](figures/object_recognition/example_precision_recall.png){width="45%" #fig-example_precision_recall}
The precision-recall curve is non-monotonic. The average precision (AP)
summarizes the entire curve with one number. The AP is the area under
the precision-recall curve, and it is a number between 0 and 1.
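A minimal sketch of how precision, recall, and AP can be computed by sweeping the threshold $\beta$ over the detection scores. This uses a simple rectangle-rule approximation of the area under the curve, not the exact interpolation used by benchmarks such as PASCAL:

```python
# Average precision (AP) from detection scores. correct[i] is 1 if detection i
# was matched to a ground truth box; n_gt is the number of ground truth instances.
import numpy as np

def average_precision(scores, correct, n_gt):
    order = np.argsort(scores)[::-1]              # sort detections by score
    tp = np.cumsum(correct[order])                # true positives so far
    fp = np.cumsum(1 - correct[order])            # false positives so far
    precision = tp / (tp + fp)
    recall = tp / n_gt
    # area under the precision-recall curve (simple rectangle rule)
    return np.sum(precision * np.diff(np.concatenate(([0.0], recall))))

scores = np.array([0.9, 0.85, 0.7, 0.6, 0.3])     # toy detector outputs
correct = np.array([1, 0, 1, 1, 0])               # matched to ground truth?
print(average_precision(scores, correct, n_gt=4))
```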
### Shortcomings
Bounding boxes can be very powerful in many applications. For instance,
digital cameras detect faces and use bounding boxes to encode their
location. The camera uses the pixels inside the bounding box to
automatically set the focus and exposure time.
But using bounding boxes to represent the location of objects will not be
appropriate for objects with long and thin structures or for regions
that do not have well-defined boundaries (@fig-segmenting_trees). It is
useful to differentiate between **stuff** and **objects**. Stuff refers
to things that do not have well-defined boundaries such as grass, water,
and sky (we already discussed this in chapter
@sec-textures). But the distinction between stuff and
objects is not very sharp. In some images, object instances might be
easy to separate like the two trees in the left image below, or become a
texture where instances cannot be detected individually (as shown in the
right image). In cases with lots of instances, bounding boxes might not be
appropriate, and it is better to represent the region as "trees" than to
try to detect each instance.
![Bounding boxes are not an appropriate representation of the location of trees. This is true for many object classes with extended shapes that are hard to approximate with a box.](figures/object_recognition/segmenting_trees.png){width="100%" #fig-segmenting_trees}
Even when there are few instances, bounding boxes can provide an
ambiguous localization when two objects overlap because it might not be
clear which pixels belong to each object.
Bounding boxes are also an insufficient object description if the task
is robot manipulation. A robot will need a more detailed description of
the pose and shape of an object to interact with it.
Let's face it, localizing objects with bounding boxes does not address
most of the shortcomings present in the image classification
formulation. In fact, it adds a few more.
## Class Segmentation
An object is something localized in space. There are other things that
are not localized such as fog, light, and so on. Not everything is
well-described by a bounding box (stuff, wiry objects, etc.). Instead we
can try to classify each pixel in an image with an object class.
Per-pixel classification of object labels, illustrated in
@fig-semantic_class_segmentation, is referred to as **semantic
segmentation**.
![Semantic segmentation. Every pixel is annotated with a semantic tag (red = car, blue = building, green = road, yellow = sidewalk, and cyan = sign).](figures/object_recognition/semantic.png){width="60%" #fig-semantic_class_segmentation}
In the most generic formulation, semantic segmentation takes an image as
input, $\mathbf{x}\left[n,m \right]$, and it outputs a classification vector at each
location, $\hat{\mathbf{y}}\left[n,m \right]$:
$$\hat{\mathbf{y}} \left[n,m \right] = f(\mathbf{x} \left[n,m \right])$$
There are many ways in which such a function can be implemented.
Generally the function $f$ is a neural network with an encoder-decoder
structure (@fig-segmentation_architecture), first introduced in
@Badrinarayanan2015.
![Sketch of the encoder-decoder architecture for semantic segmentation.](figures/object_recognition/segmentation_architecture.png){width="30%" #fig-segmentation_architecture}
Another formulation consists of implementing a window-scanning approach
where the function $f$ scans the image, taking as input an image patch
and outputting the class of the central pixel of each patch. This is
like a convolutional extension of the image classifier that we discussed
in @sec-image_classification and several architectures are
variations over this same theme.
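A minimal encoder-decoder sketch in PyTorch gives a sense of the shapes involved; real segmentation networks are much deeper and typically add skip connections:

```python
# A minimal encoder-decoder sketch for semantic segmentation: the encoder
# reduces spatial resolution, the decoder upsamples back to the input size
# and outputs K class scores per pixel.
import torch
import torch.nn as nn

K = 5  # number of classes

f = nn.Sequential(
    # encoder: downsample by a factor of 4
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    # decoder: upsample back to the input resolution
    nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, K, kernel_size=4, stride=2, padding=1),
)

x = torch.rand(1, 3, 64, 64)     # input image
y_hat = f(x)                     # per-pixel class scores
print(y_hat.shape)               # torch.Size([1, 5, 64, 64])
```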
Training this function requires access to a dataset of segmented images.
Each image in the training set will be associated with a ground truth
segmentation, $y^{(t)}_c \left[n,m \right]$ for each class $c$. The
multiclass semantic segmentation loss is the cross-entropy loss applied
to each pixel:
$$\mathcal{L}_{seg}(\hat{\mathbf{y}},\mathbf{y})
= -\sum_{t=1}^{T} \sum_{c=1}^{K} \sum_{n,m} y^{(t)}_c \left[n,m \right] \log(\hat{y}^{(t)}_c \left[n,m \right])$$
where the sum is over all the training examples, all classes, and all
pixel locations.
This loss might focus too much on the large objects and ignore small
objects. Therefore, it is useful to add a weight to each class that
normalizes each class according to its average area in the training set.
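In PyTorch, this per-pixel loss (with optional per-class weights) can be sketched as follows, with random placeholder data:

```python
# Per-pixel cross-entropy loss for semantic segmentation. The prediction has
# one score per class at every pixel; the ground truth is a map of class indices.
import torch
import torch.nn.functional as F

T, K, H, W = 2, 5, 64, 64
logits = torch.randn(T, K, H, W)            # per-pixel class scores
target = torch.randint(0, K, (T, H, W))     # ground truth label map

# Sum over all images, classes, and pixel locations, as in the formula above.
loss = F.cross_entropy(logits, target, reduction='sum')

# Optional per-class weights, e.g., to prevent large classes from dominating.
weights = torch.ones(K)                      # placeholder weights
loss_weighted = F.cross_entropy(logits, target, weight=weights, reduction='sum')
print(loss.item(), loss_weighted.item())
```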
In order to evaluate the quality of the segmentation, we can measure the
percentage of correctly labeled pixels. However, this measure will be
dominated by the large objects (sky, road, etc.). A more informative
measure is the average IoU across classes. The IoU is a measure that is
invariant to scale and will provide a better characterization of the
overall segmentation quality across all classes.
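A sketch of the mean IoU computation over label maps (the random maps below are placeholders):

```python
# Mean intersection over union (IoU) across classes for evaluating a semantic
# segmentation. Both inputs are label maps of class indices with the same shape.
import numpy as np

def mean_iou(pred, target, K):
    ious = []
    for c in range(K):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:                      # ignore classes absent from both maps
            ious.append(inter / union)
    return np.mean(ious)

pred = np.random.randint(0, 3, (64, 64))   # toy predicted label map, K = 3
target = np.random.randint(0, 3, (64, 64)) # toy ground truth label map
print(mean_iou(pred, target, K=3))
```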
### Shortcomings
In this representation we have lost something that we had with the
bounding boxes: this representation cannot count instances. The semantic
segmentation representation, as formulated in this section, cannot
separate two instances of the same class that are in contact; these will
appear as a single segment with the same label.
Another important limitation is that each pixel is labeled as belonging
to a single class. This means it cannot deal with transparencies. In
such cases, we would like to be able to provide multiple labels to
each pixel and also provide a sense of depth ordering.
## Instance Segmentation
We can combine the ideas of object detection and semantic segmentation
into the problem of instance segmentation. Here the idea is to output a
set of localized objects, but rather than representing them as bounding
boxes we represent them as pixel-level masks as shown in
@fig-instance_segmentation.
![Instance segmentation. Each instance of the class car is assigned to a different color.](figures/object_recognition/instance_segmentation.png){width="60%" #fig-instance_segmentation}
The difference from semantic segmentation is that if there are $K$
objects of the same type, we will output $K$ masks; conversely, semantic
segmentation has no way of telling us how many objects of the same type
there are, nor can it delineate two objects of the same type
that overlap; it just gives us an aggregate region of pixels that are
all the pixels of the same object type.
One approach to instance segmentation is to follow the object
localization pipeline but, after the bounding boxes have been proposed
and labeled, feed the cropped image within each bounding box to a binary
semantic segmentation function that simply predicts, for each pixel in
box $\hat{\mathbf{b}}_i$, whether it belongs to the object or not. This
way we get an instance mask for each box. Such an approach was
introduced in @he2017. The architecture is illustrated in
@fig-instance_segmentation_architecture.
![Architecture for instance segmentation. This architecture combines several of the components we have seen previously. The image is first broken into overlapping regions, and then class segmentation is applied to each region individually.](figures/object_recognition/instance_segmentation_architecture.png){width="50%" #fig-instance_segmentation_architecture}
As before, $f_0$ is a low-cost detector that proposes regions likely to
contain instances of the target object. The function $f_1$ performs
classification and segmentation of the instance:
$$\left[\hat{\mathbf{y}}_i, \hat{\mathbf{s}}_i \right]= f_1(\mathbf{x}, \hat{\mathbf{b}}_i)$$
where $\hat{\mathbf{y}}_i$ is the binary label confidence, and
$\hat{\mathbf{s}}_i \left[ n,m \right]$ is the instance segmentation mask.
The loss can be written as a combination of the classification,
localization, and segmentation losses that we discussed before:
$$
\mathcal{L}( \{\hat{\mathbf{b}}_i, \hat{\mathbf{y}}_i\}, \{\mathbf{b}_i, \mathbf{y}_i\})
= \mathcal{L}_{\text{cls}}(\hat{\mathbf{y}}_i, \mathbf{y}_i) +
\mathbb{1} (\mathbf{y}_i \neq 0)
\left(
\lambda_1 \mathcal{L}_{\text{loc}} (\hat{\mathbf{b}}_i, \mathbf{b}_i)
+
\lambda_2 \mathcal{L}_{\text{seg}} (\hat{\mathbf{s}}_i, \mathbf{s}_i)
\right)
$$
The first classification loss corresponds to the
classification of the crop as containing the object or not; if the
object is present, the second term measures the accuracy of the bounding
box, and the third term measures whether the segmentation,
$\hat{\mathbf{s}}_i$, inside each bounding box matches the ground truth
instance segmentation, $\mathbf{s}_i$. The parameters $\lambda_1$ and
$\lambda_2$ control how much weight each loss has on the final loss.
What is interesting about this formulation is that it can be extended to
provide a rich representation of the detected objects. For instance, in
addition to the segmentation mask, we could also regress the $u-v$
coordinates for each pixel using an instance-centered coordinate frame,