\chapter{Classification}
\label{chap-classification}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Classification}
Classification is a machine learning problem seeking to map from
inputs in $\R^d$ to outputs in an unordered set.\note{This is in contrast to a
continuous real-valued output, as we saw for linear regression.} \index{classification}
Examples of classification output sets could be
$\{\text{apples}, \text{oranges}, \text{pears}\}$ if we're
trying to figure out what type of fruit we have, or
$\{\text{heart attack}, \text{no heart attack}\}$ if we're working in an emergency
room and trying to give the best medical care to a new patient.
We focus on an essential, simple case, {\em binary classification},
where we aim to find a mapping from $\R^d$ to
two outputs. \index{classification!binary classification}
While we should think of the outputs as not having an order, it's
often convenient to encode them as $\{-1, +1\}$. As before, let the letter $h$ (for
hypothesis) represent a classifier, so the classification process
looks like:
$$ x \rightarrow \boxed{h} \rightarrow y \;\;.$$
%
% As with regression, real life rarely gives us vectors of real numbers; the $x$ we really
% want to classify is often some rich visual, audio, or other physical data.
% In that case, we'll have to define a function $\varphi(x)$, whose
% domain is $\R^d$, where $\varphi$ represents
% {\em features} of $x$, like a person's height or the amount of bass in
% a song, and then let the $h: \varphi(x) \rightarrow \{-1, +1\}$.
% In much of the following, we'll omit explicit mention of $\varphi$ and
% assume that the $\ex{x}{i}$ are in $\R^d$, but you should always have
% in mind that some additional process was almost surely required to go
% from the actual input examples to their feature representation.
%
Like regression, classification is a {\em{supervised learning}} problem, in which we
are given a training data set of the form
\[ \data_n = \left\{\left(\ex{x}{1}, \ex{y}{1}\right), \dots, \left(\ex{x}{n},
\ex{y}{n}\right)\right\} \;\;.\]
We will assume that each $\ex{x}{i}$ is a $d \times 1$ {\em column
vector}. The intended use of this data is that, when given an input
$\ex{x}{i}$, the learned hypothesis should generate output
$\ex{y}{i}$.
What makes a classifier useful? As in regression, we want it to work well on
% \anchorednote{{\em new}}{My favorite analogy is to problem sets. We evaluate a
% student's ability to {\em generalize} by putting questions on
% the exam (test set) that were not on the homework (training set).}
new data, making good predictions on examples it hasn't seen.
But we don't know exactly what data this classifier might be tested on
when we use it in the real world. So, we have to {\em{assume}} a
connection between the training data and testing data; typically, they
are drawn independently from the same probability distribution.
In classification, we will often use 0-1 loss for evaluation
(as discussed in Section~\ref{sec-evaluation}). For that choice, we can
write the training error and the testing error.
In particular, given a training set $\data_n$ and a classifier $h$, we define the
{\em{training error}} of $h$ to be
\begin{eqnarray*}
\trainerr(h) = \frac{1}{n}\sum_{i = 1}^{n}\begin{cases} 1 &
h(\ex{x}{i}) \ne \ex{y}{i} \\ 0 & \text{otherwise}\end{cases}
\;\;.
\end{eqnarray*}
For now, we will try to find a classifier with small training error
(later, with some added criteria) and hope it {\em{generalizes well}}
to new data, and has a small {\em test error}
\begin{eqnarray*}
\mathcal{E}(h) = \frac{1}{n'}\sum_{i = n + 1}^{n + n'}\begin{cases}
1 & h(\ex{x}{i}) \ne \ex{y}{i} \\ 0 & \text{otherwise}\end{cases}
\end{eqnarray*}
on $n'$ new examples that were not used in the process of finding the
classifier.
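To make the 0-1 error concrete, here is a minimal numpy sketch of the
computation above; the function name \texttt{error\_01} and its calling
convention are our own choices for illustration, not part of any library:
\begin{verbatim}
import numpy as np

def error_01(h, X, y):
    # h: function mapping a d x 1 column vector to a label in {-1, +1}
    # X: list of n column vectors; y: length-n array of labels
    guesses = np.array([h(x) for x in X])
    return np.mean(guesses != y)   # fraction of misclassified examples
\end{verbatim}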
% {\bf What is the role of hypothesis classes for classifiers?}
% A {\em hypothesis class} $\hclass$ is a set (finite or infinite) of
% possible classifiers, each of which represents a mapping from
% $\R^d \rightarrow \{-1, +1\}$.
% A {\em learning algorithm} is a
% procedure that takes a data set $\data_n$ as input and returns an
% element $h$ of $\hclass$; it looks like
% \begin{equation*}
% \data_n \longrightarrow \boxed{\text{learning alg ($\hclass$)}} \longrightarrow h
% \end{equation*}
% We will find that the choice of $\hclass$ can have a big impact on the
% test error of the $h$ that results from this process. One way to get
% $h$ that generalizes well is to restrict the size, or
% ``expressiveness'' of $\hclass$.
We begin by introducing
the hypothesis class of {\em linear classifiers}\index{linear classifier}
(Section~\ref{sec-linear_classifiers}) and then define
an optimization framework to learn {\em linear logistic
classifiers} (Section~\ref{sec-lin_log_classifiers}).
% We then
% return to feature representation, in the context of classifiers
% (Section~\ref{sec-features_classifiers}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Linear classifiers}
\label{sec-linear_classifiers}
We start with the hypothesis class of {\em linear classifiers}. They
are (relatively) easy to understand, simple in a mathematical sense,
powerful on their own, and the basis for many other more sophisticated
methods. Following their definition, we present a simple
learning algorithm for classifiers.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Linear classifiers: definition}
A linear classifier in $d$ dimensions is
defined by a vector of parameters $\theta \in \R^d$ and scalar
$\theta_0 \in \R$. So, the hypothesis class $\hclass$ of linear
classifiers in $d$ dimensions is parameterized by the {\em set} of all vectors in
$\R^{d+1}$. We'll assume that $\theta$ is a $d \times 1$ column
vector.
Given particular values for $\theta$ and $\theta_0$, the classifier is
defined by
% {Let's be careful about dimensions. We have assumed that $x$ and
% $\theta$ are both $d \times 1$ column vectors. So $\theta^T x$ is $1
% \times 1$, which in math (but not necessarily numpy) is the same as a
% scalar.}
\begin{eqnarray*}
h(x; \theta, \theta_0) = \text{sign}(\theta^T x + \theta_0)
= \begin{cases} +1 & \text{if $\theta^Tx + \theta_0 > 0$} \\ -1 &
\text{otherwise}\end{cases} \;\;.
\end{eqnarray*}
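In code, this hypothesis is essentially a one-liner. Here is a minimal
numpy sketch (the name \texttt{linear\_classify} is ours, chosen for
illustration):
\begin{verbatim}
import numpy as np

def linear_classify(x, theta, theta_0):
    # x and theta are d x 1 column vectors; theta_0 is a scalar.
    # .item() turns the 1 x 1 matrix product into a plain float.
    return +1 if (theta.T @ x).item() + theta_0 > 0 else -1
\end{verbatim}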
\index{linear classifier!hypothesis class}\index{sign function}Remember that we can think of $\theta, \theta_0$ as specifying a
$d$-dimensional hyperplane (compare the above with
Eq.~\ref{eq:linear_reg_hypothesis}). But this time, rather than
being interested in that hyperplane's values at particular points $x$,
we will focus on the {\em separator} that it induces. The separator
is the set of $x$ values such that $\theta^T x + \theta_0 = 0$. This
is also a hyperplane, but in $d-1$ dimensions!\index{separator}
We can interpret $\theta$ as a
vector that is perpendicular to the separator. (We will also say
that $\theta$ is {\em normal to} the separator.)
For example, in two dimensions ($d=2$) the separator has dimension 1,
which means it is a line, and
the two components of $\theta = [\theta_1, \theta_2]^T$ give the orientation
of the separator, as illustrated in the following example.
\subsection{Linear classifiers: examples}
%%%%%%%%%%%%%%%%%%%%
\begin{examplebox}{\bf Example:}
Let $h$ be the linear classifier defined by $\theta = \protect\twodcol{1}{-1}, \theta_0 = 1$.
\bigskip
\noindent
The diagram below shows the $\theta$ vector (in green) and the separator
it defines:
\begin{center}
\begin{tikzpicture}
\draw [thin, gray!40] (-3,-3) grid (4,4);
\draw [<->] (-3,0) -- (4,0);
\draw [<->] (0,-3) -- (0,4);
\draw [ultra thick] (-3, -2) -- (3, 4)
node [above] {\large$\theta^Tx + \theta_0 = 0$};
\node [xshift=0em] at (4.3,0) {\large$x_1$};
\node [xshift=0em] at (0,4.2) {\large$x_2$};
% \draw [thick,teal,-latex] (0, 1) -- (0.9, 0.1)
\draw [thick,teal,-latex] (0, 1) -- (1.0, 0.0)
% node [black,below right] {$\theta$};
node [black,midway, xshift=0.2cm,yshift=0.1cm] {$\theta$};
\draw [decorate,decoration={brace,amplitude=3pt, mirror},xshift=2pt,yshift=0pt]
(1,0) -- (1.0,1.0) node [black,midway,xshift=0.3cm]
{\footnotesize $\theta_2$};
\draw [decorate,decoration={brace,amplitude=3pt, mirror},xshift=0pt,yshift=-2pt]
(0,0) -- (1.0,0.0) node [black,midway,yshift=-0.3cm]
{\footnotesize $\theta_1$};
\end{tikzpicture}
\end{center}
What is $\theta_0$? We can solve for it by plugging a point
on the line into the equation for the line. It is often convenient to choose a point on one of the axes, e.g., in this case,
$x=[0, 1]^T$, for which $\theta^T \left[ \begin{array}{c} 0 \\ 1 \end{array} \right] + \theta_0 = 0$, giving $\theta_0 = 1$.
\end{examplebox}
%%%%%%%%%%%%%%%%%%%%
In this example, the separator divides $\R^d$, the space our $\ex{x}{i}$ points live
in, into two half-spaces. The one that is on the same side as the
normal vector is the {\em positive} half-space, and we classify all
points in that space as positive. The half-space on the other side is
{\em negative} and all points in it are classified as negative.
Note that we will call a separator a {\em linear separator}\index{separator!linear separator} of a data set
if all of the data with one label falls on one side of the separator and all of the
data with the other label falls on the other side of the separator. For instance, the
separator in the next example is a linear separator for the illustrated data. If there exists a linear separator for a dataset, we call the dataset {\em linearly separable}\index{linearly separable}.
\begin{examplebox}{\bf Example:}
Let $h$ be the linear classifier defined by
$\theta = \protect\twodcol{-1}{1.5}, \theta_0 = 3$.
\noindent The diagram below shows several points classified by $h$.
In particular, let $\ex{x}{1} = \twodcol{3}{2}$ and
$\ex{x}{2} = \twodcol{4}{-1}$.
\begin{align*}
h(\ex{x}{1}; \theta, \theta_0) & = \sign\left(\twodrow{-1}{1.5}\twodcol{3}{2} + 3\right) = \sign(3) = +1 \\
h(\ex{x}{2}; \theta, \theta_0) & = \sign\left(\twodrow{-1}{1.5}\twodcol{4}{-1} + 3\right) = \sign(-2.5) = -1
\end{align*}
Thus, $\ex{x}{1}$ and $\ex{x}{2}$ are given positive and negative classifications,
respectively.
\begin{tikzpicture}
\draw [thin, gray!40] (-2,-3) grid (5,4);
\draw [<->] (-2,0) -- (5,0);
\draw [<->] (0,-3) -- (0,4);
\draw [ultra thick] (-1.5,-3) -- (5,1.3333)
node [above right] {\large$\theta^Tx + \theta_0 = 0$};
\draw [thick,teal,-latex] (2,-0.6667) -- (1,0.8883)
node [black,below left] {$\theta$};
\node [above right,xshift=.3em] at (3,2) {\large$\ex{x}{1}$};
\node [above right,xshift=.3em] at (4,-1) {\large$\ex{x}{2}$};
\foreach \Point in {(3,2), (1.8,3.5), (-1,3), (.8,-.6), (-1.4,-.9),
(1.5, 1.5)}{
\pic at \Point {plus};
}
\foreach \Point in {(4,-1), (2,-2.7), (4.4,-2.1)}{
\pic at \Point {minus};
}
\end{tikzpicture}
\end{examplebox}
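As a quick numerical check of this example, the sketch below reuses the
hypothetical \texttt{linear\_classify} function from above to reproduce the
two classifications:
\begin{verbatim}
import numpy as np

theta = np.array([[-1.0], [1.5]])
theta_0 = 3.0
x1 = np.array([[3.0], [2.0]])
x2 = np.array([[4.0], [-1.0]])

print(linear_classify(x1, theta, theta_0))   # -3 + 3 + 3 = 3 > 0  -> +1
print(linear_classify(x2, theta, theta_0))   # -4 - 1.5 + 3 = -2.5 -> -1
\end{verbatim}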
\question{What is the green vector normal to the separator? Specify it
as a column vector.}
\question{
What change would you have to make to $\theta, \theta_0$ if you wanted
to have the separating hyperplane in the same place, but to classify
all the points labeled '+' in the diagram as negative and all the
points labeled '-' in the diagram as positive?}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subsection{Learning linear classifiers}
%% Now, given a data set and the hypothesis class of linear classifiers,
%% our goal will be to find the linear classifier that optimizes an
%% objective function relating its predictions to the training data.
%% {\em This is a well-formed optimization problem. But it's not
%% computationally easy!}
% We'll start by considering a very simple learning algorithm. \note{It's
% a good idea to think of the ``stupidest possible'' solution to a
% problem, before trying to get clever. Here's a fairly (but not
% completely) stupid algorithm.} The idea is to generate $k$ possible
% hypotheses by generating their parameter vectors at random. Then, we
% can evaluate the training-set error on each of the hypotheses and
% return the hypothesis that has the lowest training error (breaking
% ties arbitrarily).
% \begin{codebox}
% \Procname{$\proc{Random-Linear-Classifier}(\dataTrain, k)$}
% \li \For $j \gets 1$ \To $k$
% \li \Do
% randomly sample $\left(\ex{\theta}{j},
% \ex{\theta_0}{j}\right)$ from $(\R^d, \R)$
% \End
% \li $j^* \gets \argmin{j \in \{1, \ldots, k\}} \mathcal{E}_n \left(\ex{\theta}{j}, \ex{\theta_0}{j}\right)$
% \li \Return $\left(\ex{\theta}{j^*}, \ex{\theta_0}{j^*}\right)$
% \end{codebox}
% A note about terminology and
% \anchorednote{notation.}
% {The notation within the algorithm above might be new to you:
% $\argmin{x} f(x)$
% means the value of $x$ for which $f(x)$ is the smallest. Sometimes
% we write $\argmin{x \in {\cal X}} f(x)$ when we want to explicitly
% specify the set ${\cal X}$ of values of $x$ over which we want to
% minimize.}
% The training data $\dataTrain$ is an input to the learning algorithm,
% and the output of this learning algorithm will be a hypothesis $h$, where
% $h$ has parameters $\theta$ and $\theta_{0}.$
% In contrast to $\theta$ and $\theta_{0}$, the input $k$ is a
% parameter of the learning algorithm itself; such
% parameters are often called {\em hyperparameters}.
% \question{
% What do you think will be observed for $\trainerr(h)$, where $h$ is the
% hypothesis returned by \proc{Random-Linear-Classifier}, if this learning
% algorithm is run multiple times? And as $k$ is increased? }
% \question{
% What properties of $\dataTrain$ do you think will have an
% effect on $\trainerr(h)$?
% }
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Evaluating a learning algorithm -- moved to regression chapter 2021-08-04
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \section{Evaluating a learning algorithm}
% How should we evaluate the performance of a {\em classifier} $h$? The best
% method is to measure {\em test error} on data that was not used to
% train it.
%
% How should we evaluate the performance of a {\em learning algorithm}?
% This is trickier. There are many potential sources of variability in
% the possible result of computing test error on a learned hypothesis $h$:
% \begin{itemize}
% \item Which particular {\em training examples} occurred in $\dataTrain$
% \item Which particular {\em testing examples} occurred in $\dataTest$
% \item Randomization inside the learning {\em algorithm} itself
% \end{itemize}
% Generally, we would like to execute the following process multiple
% times:
% \begin{itemize}
% \item Train on a new training set
% \item Evaluate resulting $h$ on a testing set {\em that does not
% overlap the training set}
% \end{itemize}
% Doing this multiple times controls for possible poor choices of
% training set or unfortunate randomization inside the algorithm itself.
%
% One concern is that we might need a lot of data to do this, and in
% many applications data is expensive or difficult to acquire. We can
% re-use data with {\em{cross validation}} (but it's harder to do theoretical
% analysis). \\
% \begin{codebox}
% \Procname{$\proc{Cross-Validate}(\data, k)$}
% \li divide $\data$ into $k$ chunks $\data_1, \data_2, \ldots \data_k$ (of roughly equal size)
% \li \For $i \gets 1$ \To $k$
% \li \Do
% train $h_i$ on $\data \setminus \data_i$ (withholding chunk $\data_i$)
% \li compute ``test'' error $\mathcal{E}_i (h_i)$ on withheld data $\data_i$
% \End
% \li \Return $\frac{1}{k} \sum_{i=1}^k \mathcal{E}_i (h_i)$
% \end{codebox}
%
% It's very important to understand that cross-validation neither
% delivers nor evaluates a single particular hypothesis $h$. It
% evaluates the {\em algorithm} that produces hypotheses.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% LPK: Try to keep \mathcal{L}(g, a) as much as possible
% LPK: Use cap Theta for all params
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Linear logistic classifiers}
\label{sec-lin_log_classifiers}
Given a data set and the hypothesis class of linear classifiers,
our goal will be to find the linear classifier that optimizes an
objective function relating its predictions to the training data.
To make this problem computationally reasonable, we will need to
take care in how we formulate the optimization problem to achieve
this goal.
For classification, it is natural to make predictions in $\{+1, -1\}$
and use the 0-1 loss function, $\mathcal{L}_{01}$, as introduced in Chapter~\ref{chap-intro}:
\[\mathcal{L}_{01}(g, a) = \begin{cases}
0 & \text{if $g = a$} \\
1 & \text{otherwise}
\end{cases} \; .
\]
However, even for simple linear
classifiers, it is very difficult to
find values for $\theta, \theta_0$ that minimize simple 0-1 training error
\[J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}_{01}(\text{sign}(\theta^T\ex{x}{i} + \theta_0),
\ex{y}{i})\;\;.\]
This problem is NP-hard, which probably%
\note{The ``probably'' here is not because we're too lazy to look it
up, but actually because of a fundamental unsolved problem in
computer-science theory, known as ``P vs. NP.''}
implies
that solving the most difficult instances of this problem would
require computation time {\em exponential} in the number of training
examples, $n$.
What makes this a difficult optimization problem is its lack of
``smoothness'':
\begin{itemize}
\item There can be two hypotheses, $(\theta, \theta_0)$ and
$(\theta', \theta_0')$, where
one is closer in parameter space to the optimal parameter values
$(\theta^*, \theta_0^*)$, but they make the same number of
misclassifications so they have the same $J$ value.
\item All predictions are categorical: the classifier can't express a
degree of certainty about whether a particular input $x$ should have
an associated value $y$.
\end{itemize}
For these reasons, if we are considering a hypothesis $\theta,\theta_0$
that makes five incorrect predictions, it is difficult to see how we
might change $\theta,\theta_0$ so that it will perform better, which
makes it difficult to design an algorithm that searches in a sensible
way through the
space of hypotheses for a good one.
For these reasons, we investigate another hypothesis class: {\em
linear logistic classifiers}, providing their definition, then an
approach for learning such classifiers using optimization.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Linear logistic classifiers: definition}
The hypotheses in a linear logistic classifier (LLC) are
parameterized by a $d$-dimensional vector $\theta$ and a scalar
$\theta_0$, just as is the case for linear classifiers. However,
instead of making predictions in $\{+1, -1\}$, LLC hypotheses
generate real-valued outputs in the interval $(0, 1)$. An LLC has the form
\[h(x; \theta, \theta_0) = \sigma(\theta^T x + \theta_0)\;\;.\]
\index{linear logistic classifier}This looks familiar! What's new?
The {\em logistic} function, also known as the {\em sigmoid} function,
is defined as
\[\sigma(z) = \frac{1}{1+e^{-z}}\;\;,\] and is plotted below, as a
function of its input $z$.
Its output can be interpreted as a probability, because for any value of
$z$ the output is in $(0, 1)$.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
axis lines=middle,
% axis equal image,
scale only axis,
width=0.6\textwidth,
height=0.3\textwidth,
xmin=-4, xmax=4,
ymin=-0.25, ymax=1.25,
xlabel={$z$}, ylabel={$\sigma(z)$},
]
\addplot [domain=-4:4, samples=100, ultra thick] {1/(1+exp(-x))};
\addplot [domain=-4:4, samples=2, dashed] {1};
\addplot [domain=-4:4, samples=2, dashed] {0};
\end{axis}
\end{tikzpicture}
\end{center}
\question{Convince yourself the output of $\sigma$ is always in the
interval $(0, 1)$. Why can't it equal 0 or equal 1? For what value
of $z$ does $\sigma(z) = 0.5$?}
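A minimal numpy sketch of the sigmoid, sufficient for these notes though
not hardened for production use, is:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    # Elementwise logistic function. For very negative z, np.exp(-z)
    # overflows to inf (with a warning), and numpy returns 0.0, which
    # is still the correct limiting value for this sketch.
    return 1.0 / (1.0 + np.exp(-z))
\end{verbatim}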
\subsection{Linear logistic classifier: examples}
What does an LLC look like? Let's consider the
simple case where $d = 1$, so our input points simply lie along the
$x$ axis. Separators in this case have dimension $0$, meaning that
they are points.
The plot below shows LLCs for three different parameter
settings: $\sigma(10x + 1)$, $\sigma(-2x + 1)$, and $\sigma(2x - 3).$
\begin{center}
\begin{tikzpicture}
\begin{axis}[
axis lines=middle,
% axis equal image,
scale only axis,
width=0.6\textwidth,
height=0.3\textwidth,
xmin=-4, xmax=4,
ymin=-0.25, ymax=1.25,
xlabel={$x$}, ylabel={$\sigma(\theta^T x + \theta_0)$},
]
\addplot [domain=-4:4, samples=100, ultra thick, red] {1/(1+exp(-(10*x + 1)))};
\addplot [domain=-4:4, samples=100, ultra thick, blue] {1/(1+exp(-(-2*x
+ 1)))};
\addplot [domain=-4:4, samples=100, ultra thick, green] {1/(1+exp(-(2*x
- 3)))};
\addplot [domain=-4:4, samples=2, dashed] {1};
\addplot [domain=-4:4, samples=2, dashed] {0};
\end{axis}
\end{tikzpicture}
\end{center}
\question{Which plot is which? What governs the steepness of the
curve? What governs the $x$ value where the output is equal to
0.5?}
But wait! Remember that the definition of a classifier is that it's a mapping from $\R^d
\rightarrow \{-1, +1\}$ or to some other discrete set. So, then, it
seems like an LLC is actually not a classifier!
Given an LLC, with an output value in $(0, 1)$, what should we do if
we are forced to make a prediction in $\{+1, -1\}$? A default answer
is to predict $+1$ if $\sigma(\theta^T x + \theta_0) > 0.5$ and $-1$
otherwise. The value $0.5$ is sometimes called a {\em prediction
threshold}.\index{linear logistic classifier!prediction threshold}
In fact, for different problem settings, we might prefer to pick a
different prediction threshold. The field of {\em decision theory}
considers how to make this choice.
For example, if the consequences of predicting $+1$ when
the answer should be $-1$ are much worse than the consequences of
predicting $-1$ when the answer should be $+1$, then we might set the
prediction threshold to be greater than $0.5$.
\question{Using a prediction threshold of 0.5, for what values of $x$
do each of the LLCs shown in the figure above predict $+1$?}
When $d = 2$, then our inputs $x$ lie in a two-dimensional space with
axes $x_1$ and $x_2$, and the output of the LLC is a surface, as shown
below, for $\theta = (1, 1), \theta_0 = 2$.
\includegraphics[width=0.7\textwidth]{figures/logreg3d}
\question{Convince yourself that the set of points for which
$\sigma(\theta^T x + \theta_0) = 0.5$, that is, the ``boundary''
between positive and negative predictions with prediction threshold
$0.5$, is a line in $(x_1, x_2)$ space. What particular line is it
for the case in the figure above?
How would the plot change for $\theta = (1, 1)$, but now
with $\theta_0 = -2$? For $\theta = (-1, -1), \theta_0 = 2$?}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Learning linear logistic classifiers}
\label{logistic}
Optimization is a key approach to solving machine learning
problems; it applies to learning linear logistic classifiers (LLCs) as
well, once we define an appropriate loss function to optimize.
A first attempt might be to use the simple 0-1 loss function $\mathcal{L}_{01}$
that gives a value of 0 for a correct prediction, and a 1 for an
incorrect prediction. As noted earlier, however, this gives rise
to an objective function that is very difficult to optimize, and so
we pursue another strategy for defining our objective.
For learning LLCs, we'd have a class of hypotheses
whose outputs are in $(0, 1)$, but for which we have training data with $y$
values in $\{+1, -1\}$. How can we define an appropriate loss
function? We start by changing our interpretation of the output to be
{\em the probability that the input should map to output value 1} (we
might also say that this is the probability that the input is in class 1 or
that the input is `positive.')
\question{If $h(x)$ is the probability that $x$ belongs to class $+1$,
what is the probability that $x$ belongs to the class $-1$, assuming there are only these two classes?}
Intuitively, we would like to have
{\em low loss if we assign a high probability to the correct class.}
We'll define a loss function, called {\em negative log-likelihood} (NLL)\index{negative log-likelihood},
that does just this. In addition, it has the cool property that it
extends nicely to the case where we would like to classify our inputs
into more than two classes.
In order to simplify the description, we assume that (or transform our
data so that) the labels
in the training data are $y \in \{0, 1\}$.%
\note{\bf Remember to be sure your $y$
values have this form if you try to learn an LLC using NLL!}
We would like to pick the parameters of our classifier to maximize the
probability assigned by the LLC to the correct $y$ values, as
specified in the training set. Letting the guess be $\ex{g}{i} =
\sigma(\theta^T\ex{x}{i} + \theta_0)$,
that probability is\note{That crazy huge $\Pi$ represents taking the
product over a bunch of factors just as huge $\Sigma$ represents
taking the sum over a bunch of terms.}
\begin{equation*}
\prod_{i = 1}^n \begin{cases} \ex{g}{i} & \text{if $\ex{y}{i} =
1$} \\ 1 - \ex{g}{i} & \text{otherwise}
\end{cases}\;\;,
\end{equation*}
under the assumption that our predictions are independent. This can
be cleverly rewritten, when $\ex{y}{i} \in \{0, 1\}$, as
\begin{equation*}
\prod_{i = 1}^n {\ex{g}{i}}^{\ex{y}{i}}(1 - \ex{g}{i})^{1 - \ex{y}{i}}\;\;.
\end{equation*}
\question{Be sure you can see why these two expressions are the
same.}
The big product above is kind of hard to deal with in practice, though.
So what can we do?
Because the
log function is monotonic, the $\theta, \theta_0$ that maximize the
quantity above will be the same as the $\theta, \theta_0$ that
maximize its log, which is the following:
\begin{equation*}
\sum_{i = 1}^n \left( {\ex{y}{i}}\log {\ex{g}{i}} +
(1 - \ex{y}{i})\log(1 - \ex{g}{i})\right)\;\;.
\end{equation*}
Finally, we can turn the maximization problem above into a minimization problem by taking the negative
of the above expression, and writing it in terms of minimizing a loss
\begin{equation*}
\sum_{i = 1}^n \mathcal{L}_\text{nll}(\ex{g}{i}, \ex{y}{i})
\end{equation*}
where $\mathcal{L}_\text{nll}$ is the {\em negative log-likelihood}
loss function:
\begin{equation*}
\mathcal{L}_\text{nll}(\text{guess},\text{actual}) =
-\left(\text{actual}\cdot \log (\text{guess}) + (1 - \text{actual})\cdot\log (1 -
\text{guess})\right) \;\;.
\end{equation*}
This loss function is also sometimes referred to as the {\em log loss}
or {\em cross entropy}\index{negative log-likelihood!loss function}.\note{You
can use any base for the logarithm and it won't make any real
difference. If we ask you for numbers, use log base $e$.}

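A minimal numpy sketch of this loss follows; the clipping constant
\texttt{eps} is our own guard against computing $\log 0$ in floating
point, not part of the definition:
\begin{verbatim}
import numpy as np

def nll_loss(guess, actual):
    # actual is 0 or 1; guess is in (0, 1), applied elementwise.
    eps = 1e-12
    g = np.clip(guess, eps, 1.0 - eps)
    return -(actual * np.log(g) + (1.0 - actual) * np.log(1.0 - g))
\end{verbatim}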
{\bf What is the objective function for linear logistic classification?}
We can finally put all these pieces together and develop an objective
function for optimizing regularized negative log-likelihood for a
linear logistic classifier.\note{That's a lot of fancy words!} In
fact, this process is usually called ``logistic regression,'' so
we'll call our objective $J_\text{lr}$, and define it as
\begin{equation}
J_\text{lr}(\theta, \theta_0; {\cal D}) =
\left(\frac{1}{n} \sum_{i=1}^n
\mathcal{L}_\text{nll}(\sigma(\theta^T \ex{x}{i} + \theta_0), \ex{y}{i})\right) +
\lambda \norm{\theta}^2\;\;.
\label{eq:lr_with_reg}
\end{equation}
\question{Consider the case of linearly separable data. What will
the $\theta$ values that optimize this objective be like if
$\lambda = 0$? What will they be like if $\lambda$ is very big?
Try to work out an example in one dimension with two data points.}
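Here is a minimal numpy sketch of $J_\text{lr}$, reusing the hypothetical
\texttt{sigmoid} and \texttt{nll\_loss} sketches from above; the data
layout (inputs stacked as columns of a $d \times n$ array) is our own
convention:
\begin{verbatim}
import numpy as np

def j_lr(theta, theta_0, X, y, lam):
    # theta: d x 1 column vector; theta_0: scalar
    # X: d x n array whose columns are the inputs x^(i)
    # y: length-n array of labels in {0, 1}
    g = sigmoid(theta.T @ X + theta_0).reshape(-1)   # n guesses
    return np.mean(nll_loss(g, y)) + lam * np.sum(theta ** 2)
\end{verbatim}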
{\bf What role does regularization play for classifiers?}
This objective function has the same structure as the one we used for
regression, Eq.~\ref{eq:ml_objective_loss}, where the first term (in
parentheses) is the average loss, and the second term is for regularization.
Regularization is needed for building classifiers that can generalize
well (just as was the case for regression). The parameter $\lambda$ governs
the trade-off between the two terms as illustrated in the following example.
Suppose we wish to obtain a linear logistic classifier for this one-dimensional dataset:
\centerline{\includegraphics[width=0.45\textwidth]{figures/linear_logistic_classifier_and_regularization_data}}
%
\noindent
Clearly, this can be fit very nicely by a hypothesis
$h(x)=\sigma(\theta x)$, but what is the best value for $\theta$?
Evidently, when there is no regularization ($\lambda=0$), the objective function $J_\text{lr}(\theta)$ will approach zero for large
values of $\theta$, as shown in the plot on the left, below.
However, would the best hypothesis really have an infinite (or very
large) value for $\theta$? Such a hypothesis would suggest that the
data indicate strong certainty that a sharp transition between $y=0$
and $y=1$ occurs exactly at $x=0$, despite the actual data having a wide
gap around $x=0$.
%
\centerline{\includegraphics[width=0.45\textwidth]{figures/linear_logistic_classifier_and_regularization_objective_noreg}
\includegraphics[width=0.45\textwidth]{figures/linear_logistic_classifier_and_regularization_objective_withreg}
}
%
\question{Be sure this makes sense. When the $\theta$ values are
very large, what does the sigmoid curve look like? Why do we say
that it has a strong certainty in that case?
}
In the absence of other beliefs about the solution, we might prefer that
our linear logistic classifier not be overly certain about its
predictions, and so we might prefer a smaller $\theta$ over a large
$\theta.$ By not being overconfident, we might expect a somewhat
smaller $\theta$ to perform better on future examples drawn from this
same distribution.\note{To refresh on some vocabulary, we say that in
this example, a very large $\theta$ would be {\em overfit} to the
training data.} This preference can be realized using a nonzero
value of the regularization trade-off parameter, as illustrated in the
plot on the right, above, with $\lambda=0.2$.
Another nice way of thinking about regularization is that we
would like to prevent our hypothesis from being too dependent on the
particular training data that we were given: we would like for it to
be the case that if the training data were changed slightly, the
hypothesis would not change by much.
% Hypothesis $h_1$ has zero training loss, but is
% very complicated. Hypothesis $h_2$ mis-classifies two points, but is
% very simple. In absence of other beliefs about the solution, it is
% often better to prefer that the solution be ``simpler,'' and so we
% might prefer $h_2$ over $h_1$, expecting it to perform better on
% future examples drawn from this same distribution. \note{To establish
% some vocabulary, we say that $h_1$ is {\em overfit} to the training
% data.}
%
% \begin{center}
% \begin{tikzpicture}
% \draw [->] (-4,0) -- (4,0);
% \draw [->] (0,-4) -- (0,4);
% \draw[red,thick] (-2.5,-3) .. controls (-4,1) and (-3.8,1.2) .. (-2.4,-.5)
% .. controls (-2,-1) .. (-1,-1) .. controls (-.5,-1) .. (.5,-2)
% .. controls (.65, -2.1) .. (.7,-1.8) .. controls (.7,-1.5)
% .. (.6,-1.2) .. controls (.2,-.4) .. (1.2, -.3)
% .. controls (1.6,-.2) .. (1,4);
%
% \foreach \Point in {(-3.4, .3), (2.5, .2), (-1.5,-2), (-1,-3),(.7,-.7),
% (3, -2), (2.5, -3.2), (-.7,-1.2), (0.1,-1.8), (1.5, -1.5)}{
% \draw \Point circle[radius=2pt];
% }
% \foreach \Point in {(.5,-1.7),(-.2,.3),(-2.8,.2),(-2.4,-.3),(-1,2.7),
% (-1.5, 2), (-1.6, .9), (-3.8,-1), (-3.5,1.2),(.5, 1.2)}{
% \fill \Point circle[radius=2pt];
% }
% \draw[blue, thick] (-3.5,-1.8) -- (3.5,1.3);
% \node[right,blue] at (3.5,1.3) {$h_2$};
% \node[below,right,red] at (1,4) {$h_1$};
%
% \end{tikzpicture}
% \end{center}
\section{Gradient descent for logistic regression}
\label{sec-gd_lrr}
% Recall that choosing a loss function is the
% first step in formulating a machine-learning problem as an
% optimization problem, and for regression we studied the mean square
% loss, Eq.~\ref{eq:reg_mse}, which captures loss as
% $({\rm guess} - {\rm actual})^2$. The derivative of this,
% Eq.~\ref{eq:reg_gd_deriv}, was used previously to obtain an
% analytical solution to the linear regression problem. Gradient
% descent could also be applied to numerically compute a solution, using
% the update rule
% \begin{equation}
% \ex{\Theta}{t} = \ex{\Theta}{t-1} - \eta \frac{2}{n} \sum_{i=1}^{n} \left( \left[ \ex{\Theta}{t-1}\right]^T x^{(i)} - y^{(i)} \right) x^{(i)}
% \,.
% \end{equation}
% {~\hfill ~\note{Beware double superscripts! $\left[ \Theta \right]^T$ is the transpose of the vector $\Theta$}}
% But what happens when the models are nonlinear, and the loss function
% is more complicated than being merely mean squared error? Analytical
% solutions rarely exist for more complex situations, but gradient
% descent can still be applied, especially when the derivatives of the
% loss function can be computed analytically. We will see this by
% applying gradient descent to $J_\text{lr}$, the logistic regression
% objective.
% For classification, we
% are interested in the {\em negative log-likelihood} loss function
% (written more compactly as):
% \begin{equation}
% \mathcal{L}_\text{nll}(\ex{g}{i}, \ex{y}{i}) = - {\ex{y}{i}}\log {\ex{g}{i}} - (1 - \ex{y}{i})\log(1 - \ex{g}{i})
% \;\;
% \end{equation}
% where the correct values $\ex{y}{i}$ are now assumed to be either $0$
% or $1$, and $\ex{g}{i}$ are our guesses. These guesses are computed
% using the model parameters $\Theta = (\theta, \theta_0)$ as
% \begin{equation}
% \ex{g}{i} = \sigma(\theta^T \ex{x}{i} + \theta_0)
% \,,
% \end{equation}
% where $\sigma$ is some {\em nonlinear} function which maps $\R$ to
% real numbers in the range $0$ to $1$ (inclusive of the endpoints).
% In what follows, we use the function $\sigma(z) = \frac{1}{1+e^{-z}}$. (See the next chapter for
% more intuition about why.)
% Combining this with a ridge regularization term, we obtain this
% ``logistic regression'' objective function:
% \begin{equation}
% J_\text{lr}(\theta,\theta_0) = \frac{1}{n}\sum_{i=1}^n
% \mathcal{L}_\text{nll}
% (\ex{g}{i}, \ex{y}{i}) + \frac{\lambda}{2}\norm{\theta}^2
% \end{equation}
Now that we have a hypothesis class (LLC) and a loss function (NLL),
we need to take some data and find parameters!
Sadly, there is no lovely analytical solution like the one we obtained
for regression, in Section~\ref{sec-ridge_regression}. Good thing we
studied gradient descent! We can perform gradient descent on the
$J_\text{lr}$ objective, as we'll see next. We can also apply stochastic
gradient descent to this problem.
Luckily, $J_\text{lr}$ has enough nice properties that gradient descent and
stochastic gradient descent should generally ``work''. We'll soon see
some more challenging optimization problems though -- in the
context of neural networks, in Section~\ref{sec-make_nn_work}.
First we need derivatives with respect to both $\theta_0$ (the scalar
component) and $\theta$ (the vector component) of $\Theta$.
Explicitly, they are:\note{Some passing familiarity with matrix
derivatives is helpful here. A foolproof way of computing them is
to compute partial derivative of $J$ with respect to each component
$\theta_i$ of $\theta$.}
\begin{align*}
\nabla_\theta J_\text{lr}(\theta, \theta_0) & = \frac{1}{n}\sum_{i=1}^n
\left(\ex{g}{i} -
\ex{y}{i}\right) \ex{x}{i}
+ 2\lambda\theta \\
\frac{\partial J_\text{lr}(\theta, \theta_0)}{\partial \theta_0} & =
\frac{1}{n}\sum_{i=1}^n
\left(\ex{g}{i} -
\ex{y}{i} \right) \;\;.
\end{align*}
Note that $\nabla_\theta J_\text{lr}$ will be of shape $d \times 1$ and
$\frac{\partial J_\text{lr}}{\partial \theta_0}$ will be a scalar since
we have separated $\theta_0$ from $\theta$ here.
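In numpy, these two gradients can be computed together; the following
sketch reuses the earlier hypothetical \texttt{sigmoid} and assumes the
inputs are stacked as columns of a $d \times n$ array \texttt{X}:
\begin{verbatim}
import numpy as np

def grad_j_lr(theta, theta_0, X, y, lam):
    # Returns (d x 1 gradient w.r.t. theta, scalar gradient w.r.t. theta_0).
    n = X.shape[1]
    g = sigmoid(theta.T @ X + theta_0).reshape(1, n)   # guesses, 1 x n
    err = g - y.reshape(1, n)                          # g^(i) - y^(i)
    grad_theta = X @ err.T / n + 2.0 * lam * theta     # d x 1
    grad_theta_0 = np.mean(err)                        # scalar
    return grad_theta, grad_theta_0
\end{verbatim}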
\question{Convince yourself that the dimensions of all these
quantities are correct, under the assumption that $\theta$ is $d
\times 1$.
}
\question{Compute $\nabla_\theta \norm{\theta}^2$ by finding the
vector of partial derivatives $(\partial \norm{\theta}^2 / \partial
\theta_1, \ldots, \partial \norm{\theta}^2 / \partial
\theta_d)$. What is the shape of $\nabla_\theta \norm{\theta}^2$?
}
\question{Compute $\nabla_\theta \mathcal{L}_\text{nll}(
\sigma(\theta^T x + \theta_0), y)$ by finding the
vector of partial derivatives $(\partial \mathcal{L}_\text{nll}(
\sigma(\theta^T x + \theta_0), y)/ \partial \theta_1, \ldots,
\partial \mathcal{L}_\text{nll}(
\sigma(\theta^T x + \theta_0), y) / \partial
\theta_d)$.
}
\question{Use these last two results to verify our derivation above.}
Putting everything together, our gradient descent algorithm for logistic
regression becomes:\index{gradient descent!applied to logistic regression}
\begin{codebox}
\Procname{$\proc{LR-Gradient-Descent}(\theta_{\it init}, \theta_{0
{\it init}},\eta,\epsilon)$}
\li $\theta^{(0)} \gets \theta_{\it init}$
\li $\theta_0^{(0)} \gets \theta_{0 {\it init}}$
\li $t \gets 0$
\li \Repeat
\li $t \gets t+1$
\li $\theta^{(t)} = \theta^{(t-1)} - \eta\left(\frac{1}{n}\sum_{i=1}^n
\left(\sigma\left({\ex{\theta}{t-1}}^T \ex{x}{i} + \ex{\theta_0}{t-1}\right) -
\ex{y}{i}\right) \ex{x}{i}
+ 2\lambda\ex{\theta}{t-1}
\right)$
\li $\theta_0^{(t)} = \theta_0^{(t-1)} - \eta\left(\frac{1}{n}\sum_{i=1}^n
\left(\sigma\left({\ex{\theta}{t-1}}^T \ex{x}{i} + \ex{\theta_0}{t-1}\right) -
\ex{y}{i} \right)
\right)$
\li \Until $\left| J_{\text{lr}}(\theta^{(t)},\theta_0^{(t)}) - J_{\text{lr}}(\theta^{(t-1)},
\theta_0^{(t-1)}) \right| <\epsilon$
\li \Return $\theta^{(t)},\theta_0^{(t)}$
\end{codebox}
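A direct Python transcription of this pseudocode might look as follows;
it reuses the hypothetical \texttt{j\_lr} and \texttt{grad\_j\_lr} sketches
from earlier, and all names here are our own:
\begin{verbatim}
def lr_gradient_descent(theta_init, theta_0_init, X, y, lam, eta, eps):
    # Batch gradient descent on J_lr, stopping when the objective
    # changes by less than eps between iterations.
    theta, theta_0 = theta_init, theta_0_init
    prev = j_lr(theta, theta_0, X, y, lam)
    while True:
        grad_theta, grad_theta_0 = grad_j_lr(theta, theta_0, X, y, lam)
        theta = theta - eta * grad_theta
        theta_0 = theta_0 - eta * grad_theta_0
        curr = j_lr(theta, theta_0, X, y, lam)
        if abs(curr - prev) < eps:
            return theta, theta_0
        prev = curr
\end{verbatim}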
Logistic regression, implemented using batch or stochastic gradient
descent, is a useful and fundamental machine learning technique. We
will also see later that it corresponds to a one-layer neural network
with a sigmoidal activation function, and so is an important step
toward understanding neural networks.
\subsection{Convexity of the NLL Loss Function}
Much like the squared-error loss function that we saw for linear regression,
the NLL loss function for linear logistic regression is a convex function.
This means that running gradient descent with a reasonable set of hyperparameters
will converge arbitrarily close to the minimum of the objective function.
We will use the following facts to demonstrate that the NLL loss function is a convex function:
\begin{itemize}
\item if the derivative of a function of a scalar argument is monotonically increasing, then it is a convex function,
\item the sum of convex functions is also convex,
\item a convex function of an affine function is a convex function.
\end{itemize}
Let $z=\theta^T x + \theta_0$; $z$ is an affine function of $\theta$ and $\theta_0$.
It therefore suffices to show that the functions $f_1(z) = - \log (\sigma(z))$ and
$f_2 (z) = -\log (1 - \sigma(z))$ are convex with respect to $z$.
First, we can see that, since
\begin{align*}
\frac{d}{dz} f_1(z) & = \frac{d}{dz} \left[ -\log (1/(1 + \exp (-z))) \right], \\
& = \frac{d}{dz} \left[ \log (1 + \exp (-z)) \right], \\
& = - \exp (-z) / (1 + \exp (-z)), \\
& = -1 + \sigma(z),
\end{align*}
the derivative of the function $f_1 (z)$ is a monotonically increasing
function (because $\sigma$ itself is monotonically increasing), and
therefore $f_1$ is a convex function.
Second, we can see that, since
\begin{align*}
\frac{d}{dz} f_2(z) & = \frac{d}{dz} \left[ -\log (\exp(-z)/(1 + \exp (-z))) \right], \\
& = \frac{d}{dz} \left[ \log (1 + \exp (-z)) + z \right], \\
& = \sigma(z),
\end{align*}
the derivative of the function $f_2(z)$ is also monotonically increasing and
therefore $f_2$ is a convex function.
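These monotonicity claims are easy to spot-check numerically (a sanity
check on a grid, not a proof), reusing the \texttt{sigmoid} sketch from
earlier:
\begin{verbatim}
import numpy as np

# The derivatives sigma(z) - 1 (for f_1) and sigma(z) (for f_2)
# should both be increasing in z.
z = np.linspace(-10.0, 10.0, 1001)
assert np.all(np.diff(sigmoid(z) - 1.0) > 0)
assert np.all(np.diff(sigmoid(z)) > 0)
\end{verbatim}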
\section{Handling multiple classes}
So far, we have focused on the {\em binary} classification case, with
only two possible classes. But what can we do if we have multiple
possible classes (e.g., we want to predict the genre of a movie)?
There are two basic strategies:
\begin{itemize}
\item Train multiple binary classifiers using different subsets of
our data and combine their outputs to make a class prediction.
\item Directly train a multi-class classifier using a hypothesis
class that is a generalization of logistic regression, using a
{\em one-hot} output encoding and NLL loss.
\end{itemize}
%% We will explore both methods below, but the method based on NLL is
%% in wider use, especially in the context of neural networks.
%% \subsection{Reducing multi-class to binary}
%% There are two standard strategies for reducing multi-class
%% classification to binary classification.\index{classification!multi-class classification}
%% In the {\em one versus all} (OVA) strategy\index{classification!multi-class classification!one versus all}, we will train $K$
%% different binary classifiers. To train classifier $k$, we
%% \begin{itemize}
%% \item Create a training set $\data_k$ in which all $\ex{x}{i} \in D$
%% for which $\ex{y}{i} = k$ are assigned label $+1$ and all other
%% $\ex{x}{i}$ are assigned label $-1$.
%% \item Use logistic regression (or any other classifier-learning
%% method) to obtain a hypothesis $h_k$ by training on $\data_k$.
%% \end{itemize}
%% Then, we create a new multi-class hypothesis
%% \[h(x) = \argmax{k} h_k(x)\;\;,\]
%% that predicts the class $k$ to which this example was
%% assigned with highest confidence by the individual hypotheses.
%% In the {\em one versus one} (OVO) strategy\index{classification!multi-class classification!one versus one}, we will train
%% $K(K-1)/2$ classifiers, one for each unique pair $j, k$ (where $j \not= k$) of classes. To
%% train classifier $j, k$, we
%% \begin{itemize}
%% \item Create a training set $\data_{jk}$ in which all $\ex{x}{i} \in D$
%% for which $\ex{y}{i} = j$ are assigned label $+1$ and
%% all $\ex{x}{i} \in D$ for which $\ex{y}{i} = k$ are assigned label $-1$.
%% \item Use logistic regression (or any other classifier-learning
%% method) to obtain hypotheses $h_{jk}$ by training on $\data_{jk}$.
%% \end{itemize}
%% Then, given a new input $x$, we feed it into each of the classifiers $h_{jk}$
%% and obtain a {\em binary} decision about whether it is more likely to
%% be in class $j$ or class $k$. We generate as output the class that is
%% selected as the ``winner'' in the majority of the cases.
%% \subsection{Multi-class classification and log likelihood}
The method based on NLL is in wider use, especially in the context of
neural networks, and is explored here.
In the following, we will assume that we have a data set $\data$ in
which the inputs
$\ex{x}{i} \in \R^d$ but the outputs $\ex{y}{i}$ are drawn from a set
of $K$ classes $\{c_1, \ldots, c_K\}$.
Next, we extend the idea of NLL directly to multi-class
classification\index{negative log-likelihood!multi-class
classification} with $K$ classes, where the training label is
represented with what is called a {\em one-hot} vector
$y=\begin{bmatrix} y_1, \ldots, y_K \end{bmatrix}^T$, where $y_k=1$ if
the example is of class $k$ and $y_k = 0$ otherwise.
Now, we have a problem of mapping an input $\ex{x}{i}$ that is in
$\R^d$ into a $K$-dimensional output. Furthermore, we would like this
output to be interpretable as a discrete probability distribution over
the possible classes, which means the elements of the output vector have to
be {\em non-negative} (greater than or equal to 0) and sum to 1.
We will do this in two steps. First, we will map our input
$\ex{x}{i}$ into a vector value $\ex{z}{i} \in \R^K$ by letting $\theta$ be a
whole $d \times K$ {\em matrix} of parameters, and $\theta_0$ be a $K
\times 1$ vector, so that
\[z = \theta^T x + \theta_0\;\;.\]%
\note{Let's check dimensions! $\theta^T$ is $K \times d$ and $x$ is
$d \times 1$, and $\theta_0$ is $K \times 1$, so $z$ is $K \times
1$ and we're good!}%
Next, we have to extend our use of the sigmoid function to the
multi-dimensional {\em softmax} function, which takes
a whole vector $z \in \R^K$ and generates
\[g = \textit{softmax}(z) =
\begin{bmatrix}
\exp(z_1) / \sum_{i} \exp(z_i) \\
\vdots \\
\exp(z_K) / \sum_{i} \exp(z_i)
\end{bmatrix}\;\;,\]
which can be interpreted as a probability distribution over $K$ items. To make the final prediction of the class label, we can then find the largest entry of $g$ (the most likely class) and return the corresponding index as the element equal to $1$ in our ``one-hot'' prediction.
\question{Convince yourself that the vector of $g$ values will be
non-negative and sum to 1.}
Putting these steps together, our hypotheses will be
\[h(x; \theta, \theta_0) = \textit{softmax}(\theta^T x + \theta_0)\;\;.\]
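A minimal numpy sketch of this hypothesis follows; the max-subtraction
inside \texttt{softmax} is a standard numerical-stability trick, not
part of the formula above, and the function names are ours:
\begin{verbatim}
import numpy as np

def softmax(z):
    # Softmax of a K x 1 vector; subtracting max(z) before
    # exponentiating avoids overflow without changing the result.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def multiclass_predict(x, theta, theta_0):
    # theta: d x K matrix, theta_0: K x 1 vector, x: d x 1 vector.
    g = softmax(theta.T @ x + theta_0)   # K x 1 distribution over classes
    return int(np.argmax(g))             # index of the predicted class
\end{verbatim}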
Now, we retain the goal of maximizing the probability that our
hypothesis assigns to the correct output $y_k$ for each input $x$. We
can write this probability, letting $g$ stand for our ``guess'', $h(x)$, for a
single example $(x, y)$ as $\prod_{k = 1}^K g_k^{y_k}$.
\question{How many elements that are not equal to 1 will there be in this product?}
The negative log of the probability that we are making a correct guess is, then, for {\em one-hot} vector $y$ and {\em probability distribution}
vector $g$,