\chapter{The Perceptron}
\label{chap-perceptron}
First of all, the coolest algorithm name! \note{Well, maybe
``neocognitron,'' also the name of a real ML algorithm, is cooler.}
It is based on the 1943 model of neurons by McCulloch and Pitts and on
Hebb's model of learning. It was developed by Rosenblatt in 1962. At
the time, it was not interpreted as attempting to optimize any
particular criterion; it was presented directly as an algorithm.
Since then, there has been a huge amount of study and analysis of its
convergence properties and other aspects of its behavior.
\section{Algorithm}
Recall that we have a training dataset $\dataTrain$ with $x \in \R^d$
and $y \in \{-1, +1\}$. The Perceptron algorithm trains a binary
classifier $h(x; \theta, \theta_0)$ using the following algorithm to
find $\theta$ and $\theta_0$ using (at most) $\tau$ \anchorednote{iterative
steps:}{We use Greek letter $\tau$ here instead of $T$ so we don't
confuse it with transpose!}
\begin{codebox}
\Procname{$\proc{Perceptron}(\tau, \dataTrain)$}
\li $\theta \gets
\begin{bmatrix}
0 & 0 & \cdots & 0
\end{bmatrix}^T$
\li $\theta_0 \gets 0$
\li \For $t \gets 1$ \To $\tau$
\li \Do
changed = False
\li \For $i \gets 1$ \To $n$
\li \Do
\If $\ex{y}{i}\left(\theta^T\ex{x}{i} + \theta_0\right) \le 0$
\li \Then
$\theta \gets \theta + \ex{y}{i}\ex{x}{i}$
\li $\theta_0 \gets \theta_0 + \ex{y}{i}$
\li changed = True
\End
\End
\li \If {\sc not} changed
\li \Then
break
\End
\End
\li \Return $\theta, \theta_0$
\end{codebox}
\note{Let's check dimensions. Remember that $\theta$ is $d \times 1$,
$\ex{x}{i}$ is $d \times 1$, and $\ex{y}{i}$ is a scalar. Does
everything match?}
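The pseudocode above translates almost line for line into NumPy; here
is a minimal runnable sketch (the function and variable names are ours,
not part of the notes):
\begin{verbatim}
import numpy as np

def perceptron(X, y, tau):
    # X: (n, d) array of examples; y: (n,) array of +1/-1 labels
    n, d = X.shape
    th = np.zeros(d)            # theta, initialized to the zero vector
    th0 = 0.0                   # theta_0
    for _ in range(tau):
        changed = False
        for i in range(n):
            if y[i] * (th @ X[i] + th0) <= 0:   # mistake on example i
                th = th + y[i] * X[i]
                th0 = th0 + y[i]
                changed = True
        if not changed:         # a full pass with no mistakes: done
            break
    return th, th0
\end{verbatim}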
Intuitively, on each step, if the current hypothesis
$\theta, \theta_0$ classifies example $\ex{x}{i}$ correctly, then no
change is made. If it classifies $\ex{x}{i}$ incorrectly, then it
moves $\theta, \theta_0$ so that it is ``closer'' to classifying
$\ex{x}{i}, \ex{y}{i}$ correctly.
Note that if the algorithm ever makes a full pass through every value
of $i$ without making an update, it will never make any further
updates (this would be true even if the ``break'' statement were
removed; verify that you believe this!), and so it should just
terminate at that point.
\question {\em What is true about $\trainerr$ if that happens?}
\begin{examplebox}{\bf Example:}
Let $h$ be the linear classifier defined by $\ex{\theta}{0} =
\protect\twodcol{1}{-1}, \ex{\theta_0}{0} = 1$.
\noindent
The diagram below shows several points classified by $h$. However, in
this case, $h$ (represented by the bold line and normal vector $\ex{\theta}{0}$) misclassifies the point
$\ex{x}{1} = \twodcol{1}{3}$ which has label $\ex{y}{1} = 1$. Indeed,
$$ \ex{y}{1}\left({\ex{\theta}{0}}^T \ex{x}{1} + \ex{\theta_0}{0}\right) = 1 \cdot \left(\twodrow{1}{-1}
\twodcol{1}{3} + 1\right) = -1 < 0 \;\;.$$
By running an iteration of the Perceptron algorithm, we update
$$ \ex{\theta}{1} = \ex{\theta}{0} + \ex{y}{1}\ex{x}{1} = \twodcol{2}{2} $$
$$ \ex{\theta_0}{1} = \ex{\theta_0}{0} + \ex{y}{1} = 2 $$
The new classifier (represented by the dashed line and normal vector $\ex{\theta}{1}$) now correctly
classifies that point, but now makes a mistake on the negatively
labeled point.
\begin{center}
\begin{tikzpicture}
\draw [thin, gray!40] (-3,-3) grid (4,4);
\draw [<->] (-3,0) -- (4,0);
\draw [<->] (0,-3) -- (0,4);
\draw [ultra thick] (-3, -2) -- (3, 4)
node [above right] {\large${\ex{\theta}{0}}^Tx + \ex{\theta_0}{0} = 0$};
\draw [thick,teal,-latex] (0.5, 1.5) -- (1.4, 0.6)
node [black,above right] {$\ex{\theta}{0}$};
\draw [ultra thick, dashed] (-3, 2) -- (2, -3)
node [black,above right] {\large${\ex{\theta}{1}}^Tx + \ex{\theta_0}{1} = 0$};
\draw [thick,teal,-latex] (0.0, -1.0) -- (0.9, -0.1)
node [black,below right] {$\ex{\theta}{1}$};
\node [above right,xshift=.3em] at (1,3) {\large$\ex{x}{1}$};
\foreach \Point in {(1, 3), (2.5, 1.5)}{
\pic at \Point {plus};
}
\foreach \Point in {(-1.5, 1.5)}{
\pic at \Point {minus};
}
\end{tikzpicture}
\end{center}
\end{examplebox}
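We can verify this update numerically; a quick sketch, in the same
conventions as the code sketch above:
\begin{verbatim}
import numpy as np

th, th0 = np.array([1., -1.]), 1.0   # theta^(0), theta_0^(0)
x1, y1 = np.array([1., 3.]), 1
print(y1 * (th @ x1 + th0))          # -1.0 <= 0: a mistake, so update
th, th0 = th + y1 * x1, th0 + y1     # theta^(1) = [2, 2], theta_0^(1) = 2
print(y1 * (th @ x1 + th0))          # 10.0 > 0: x^(1) is now correct
\end{verbatim}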
A really important fact about the perceptron algorithm is that, if
there is a linear classifier with 0 training error, then for large
enough $\tau$ this algorithm will (eventually) find one! We'll look
at a proof of this in detail, next.
\section{Offset}
\label{sec-offset}
Sometimes, it can be easier to implement or analyze classifiers of the
form
\begin{eqnarray*}
h(x; \theta) =
\begin{cases}
+1 & \text{if } \theta^Tx > 0 \\
-1 & \text{otherwise.}
\end{cases}
\end{eqnarray*}
Without an explicit offset term ($\theta_0$), this separator must pass
through the origin, which may appear to be limiting. However, we can
convert any problem involving a linear separator \emph{with} offset
into one with \emph{no} offset (but of higher dimension)!
Consider the $d$-dimensional linear separator defined by
$\theta = \begin{bmatrix} \theta_1 & \theta_2 & \cdots
& \theta_d\end{bmatrix}^T$ and offset $\theta_0$.
\begin{itemize}
\item for each data point
$(x, y) \in \dataTrain$, append a coordinate with value +1 to $x$, yielding
$$x_{\rm new} = \begin{bmatrix} x_1 & \cdots & x_d & +1 \end{bmatrix}^T$$
\item define $$\theta_{\rm new} = \begin{bmatrix} \theta_1 & \cdots &
\theta_d & \theta_0\end{bmatrix}^T$$
\end{itemize}
Then,
\begin{align*}
\theta_{\rm new}^T x_{\rm new} & = \theta_1x_1 + \cdots + \theta_dx_d
+ \theta_0 \cdot 1 \\
& = \theta^Tx + \theta_0
\end{align*}
Thus, $\theta_{\rm new}$ is an equivalent ($(d+1)$-dimensional) separator to
our original, but with no offset.
Consider the dataset:
\begin{align*}
X & = [[1], [2], [3], [4]] \\
Y & = [[+1], [+1], [-1], [-1]]
\end{align*}
It is linearly separable in $d = 1$ with $\theta = [-1]$ and $\theta_0
= 2.5$. But it is not linearly separable through the origin! Now,
let
\[X_{\rm new} = \begin{bmatrix}
\begin{bmatrix} 1 \\ 1 \end{bmatrix} &
\begin{bmatrix} 2 \\ 1 \end{bmatrix} &
\begin{bmatrix} 3 \\ 1 \end{bmatrix} &
\begin{bmatrix} 4 \\ 1 \end{bmatrix}
\end{bmatrix}\]
This new dataset is separable through the origin, with $\theta_{\rm
new} = [-1, 2.5]^T$.
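In code, the transformation is just appending a column of ones; a
sketch for this dataset:
\begin{verbatim}
import numpy as np

X = np.array([[1.], [2.], [3.], [4.]])
y = np.array([1, 1, -1, -1])
X_new = np.hstack([X, np.ones((len(X), 1))])  # append the +1 coordinate
th_new = np.array([-1., 2.5])
print(y * (X_new @ th_new))  # [1.5 0.5 0.5 1.5]: all positive, separated
\end{verbatim}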
We can make a simplified version of the perceptron algorithm if we
restrict ourselves to separators through the origin: \note{We list it
here because this is the version of the algorithm we'll study in
more detail.}
\begin{codebox}
\Procname{$\proc{Perceptron-Through-Origin}(\tau, \dataTrain)$}
\li $\theta \gets
\begin{bmatrix}
0 & 0 & \cdots & 0
\end{bmatrix}^T$
\li \For $t \gets 1$ \To $\tau$
\li \Do
changed = False
\li \For $i \gets 1$ \To $n$
\li \Do
\If $\ex{y}{i}\left(\theta^T\ex{x}{i}\right) \le 0$
\li \Then
$\theta \gets \theta + \ex{y}{i}\ex{x}{i}$
\li changed = True
\End
\End
\li \If {\sc not} changed
\li \Then
break
\End
\End
\li \Return $\theta$
\end{codebox}
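In NumPy, the through-origin variant is the earlier sketch with
$\theta_0$ dropped; running it on the augmented dataset $X_{\rm new}$
from above recovers a separator through the origin (again, the names
here are ours):
\begin{verbatim}
import numpy as np

def perceptron_through_origin(X, y, tau):
    th = np.zeros(X.shape[1])
    for _ in range(tau):
        changed = False
        for i in range(X.shape[0]):
            if y[i] * (th @ X[i]) <= 0:   # margin not positive: update
                th = th + y[i] * X[i]
                changed = True
        if not changed:
            break
    return th

X_new = np.array([[1., 1.], [2., 1.], [3., 1.], [4., 1.]])
y = np.array([1, 1, -1, -1])
th = perceptron_through_origin(X_new, y, 1000)
print(np.all(y * (X_new @ th) > 0))   # True: all points separated
\end{verbatim}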
\section{Theory of the perceptron}
Now, we'll say something formal about how well the perceptron
algorithm really works. We start by characterizing the set of
problems that can be solved perfectly by the perceptron algorithm, and
then prove that, in fact, it can solve these problems. In addition,
we provide a notion of what makes a problem difficult for the
perceptron and link that notion of difficulty to the number of updates
the algorithm will make.
\subsection{Linear separability}
A training set $\dataTrain$ is {\em{linearly separable}} if there exist
$\theta, \theta_0$ such that, for all $i = 1, 2, \dots, n$:
$$ \ex{y}{i}\left(\theta^T\ex{x}{i} + \theta_0\right) > 0 \;\;.$$
Another way to say that $\dataTrain$ is linearly separable
is that there exists an $h$ such that all predictions on the training set
are correct,
$$ h(\ex{x}{i}; \theta, \theta_0) = \text{sign}\left(\theta^T\ex{x}{i} + \theta_0\right) =
\ex{y}{i} \;\;,$$
in which case we call $h$ a {\em linear separator}.
And, another way to say that $\dataTrain$ is linearly separable
is that there exists an $h$ of the form above such that
the training error is zero,
$$\trainerr(h) = 0 \;\;.$$
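Given a candidate $\theta, \theta_0$, the separability condition can
be checked with one vectorized test; a minimal sketch:
\begin{verbatim}
import numpy as np

def separates(th, th0, X, y):
    # True iff y^(i) (th . x^(i) + th0) > 0 for every training point
    return bool(np.all(y * (X @ th + th0) > 0))
\end{verbatim}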
\subsection{Convergence theorem}
The basic result about the perceptron is that, if the training data
$\dataTrain$ is linearly separable, then the perceptron algorithm is
guaranteed to find a linear separator, for large
enough $\tau.$ \note{If the training data is
{\em not} linearly separable, the algorithm will not be able to tell
you for sure, in finite time, that it is not linearly separable.
There are other algorithms that can test for linear separability
with run-times $O(n^{d/2})$ or $O(d^{2n})$ or $O(n^{d-1}\log{n})$.}
We will more specifically characterize the linear separability of the
dataset by the \emph{margin} of the separator. We'll start by
defining the margin of a point with respect to a hyperplane.
First, recall that the signed distance from the
hyperplane $\theta, \theta_0$ to a point $x$ is
\[ \frac{\theta^Tx + \theta_0}{\norm{\theta}}\;\;. \]
Then, we'll define the {\em margin} of a {\em labeled point} $(x, y)$
with respect to hyperplane $\theta, \theta_0$ to be
\[ y \cdot \frac{\theta^Tx + \theta_0}{\norm{\theta}}\;\;. \]
This quantity will be positive if and only if the point $x$ is
classified as $y$ by the linear classifier represented by this
hyperplane.
\question{What sign does the margin have if the point is incorrectly
classified? Be sure you can explain why.}
Now, the {\em margin} of a {\em dataset} $\dataTrain$ with respect to the hyperplane
$\theta, \theta_0$ is the {\em minimum} margin of any point with respect
to $\theta, \theta_0$:
\[\min_i \left(\ex{y}{i} \cdot \frac{\theta^T \ex{x}{i} + \theta_0}{\norm{\theta}}\right)\;\;.\]
The margin is positive if and only if all of the points in the
dataset are classified correctly. In that case (only!) it represents
the distance from the hyperplane to the closest point.
\begin{examplebox}{\bf Example:}
Let $h$ be the linear classifier defined by $\theta = \protect\twodcol{1}{-1}, \theta_0 = 1$.
\bigskip
\noindent
The diagram below shows several points classified by $h$, one of which is misclassified. We compute the margin for each point:
\begin{center}
\begin{tikzpicture}
\draw [thin, gray!40] (-3,-3) grid (4,4);
\draw [<->] (-3,0) -- (4,0);
\draw [<->] (0,-3) -- (0,4);
\draw [ultra thick] (-3, -2) -- (3, 4)
node [above] {\large$\theta^Tx + \theta_0 = 0$};
\draw [thick,teal,-latex] (0, 1) -- (0.9, 0.1)
node [black,below right] {$\theta$};
\node [above right,xshift=.3em] at (1,3) {\large$\ex{x}{1}$};
\node [above right,xshift=.3em] at (2.5, 1.5) {\large$\ex{x}{2}$};
\node [above right,xshift=.3em] at (-1.5, 1.5) {\large$\ex{x}{3}$};
\foreach \Point in {(1, 3), (2.5, 1.5)}{
\pic at \Point {plus};
}
\foreach \Point in {(-1.5, 1.5)}{
\pic at \Point {minus};
}
\end{tikzpicture}
$$ \ex{y}{1} \cdot \frac{\theta^T \ex{x}{1} + \theta_0}{\norm{\theta}} = 1 \cdot \frac{-2 + 1}{\sqrt{2}} = -\frac{\sqrt{2}}{2} $$
$$ \ex{y}{2} \cdot \frac{\theta^T \ex{x}{2} + \theta_0}{\norm{\theta}} = 1 \cdot \frac{1 + 1}{\sqrt{2}} = \sqrt{2} $$
$$ \ex{y}{3} \cdot \frac{\theta^T \ex{x}{3} + \theta_0}{\norm{\theta}} = -1 \cdot \frac{-3 + 1}{\sqrt{2}} = \sqrt{2} $$
\end{center}
Note that since point $\ex{x}{1}$ is misclassified, its margin is negative. Thus the margin for the whole data set is given by $-\frac{\sqrt{2}}{2}$.
\end{examplebox}
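A sketch that reproduces these margins numerically:
\begin{verbatim}
import numpy as np

th, th0 = np.array([1., -1.]), 1.0
X = np.array([[1., 3.], [2.5, 1.5], [-1.5, 1.5]])
y = np.array([1, 1, -1])
margins = y * (X @ th + th0) / np.linalg.norm(th)
print(margins)        # approx [-0.707  1.414  1.414]
print(margins.min())  # approx -0.707, i.e., -sqrt(2)/2
\end{verbatim}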
\begin{theorem}[Perceptron Convergence]
For simplicity, we consider the case where the linear separator must
pass through the origin, and $\tau$ is infinite. If the following conditions hold:
\begin{enumerate}[(a)]
\item there exist $\theta^*$ and $\gamma$ such that $\gamma > 0$, and
$\ex{y}{i} \frac{\theta^{*T}\ex{x}{i}}{\norm{\theta^*}}
\geq \gamma$
for all $i = 1, \ldots, n$,
\item all the examples have bounded magnitude: $\norm{\ex{x}{i}} \leq R$
for all $i = 1, \ldots, n$,
\end{enumerate}
then the perceptron algorithm will make at most $\left(\frac{R}{\gamma}
\right)^2$ updates to its starting hypothesis. At this point, its hypothesis
will be a linear separator of the data.
%\note{this convergence guarantee is independent of $d$ and $n$!!}
\end{theorem}
\begin{proof}
We initialize $\ex{\theta}{0} = 0$, and let $\ex{\theta}{k}$ define our
hyperplane after the perceptron algorithm has made $k$ updates to its
starting hypothesis.
We are going to think about the angle between the hypothesis we have
now, $\ex{\theta}{k}$, and the assumed good separator $\theta^*$.
Since they both go through the origin, if we can show that the angle
between them decreases usefully with each update, then we will get
close to that separator.
So, let's think about the $\cos$ of the angle between them, and
recall, by the definition of dot product:
\[\cos\left(\ex{\theta}{k}, \theta^*\right)
= \frac{\ex{\theta}{k} \cdot \theta^*}{
\norm{\theta^*}\norm{\ex{\theta}{k}}}\]
We'll divide this up into two factors,
\begin{equation}
\cos\left(\ex{\theta}{k}, \theta^*\right)
= \left(\frac{\ex{\theta}{k} \cdot
\theta^*}{\norm{\theta^*}}\right)\cdot
\left(\frac{1}{\norm{\ex{\theta}{k}}}\right)\;\;,
\label{eq:cos}
\end{equation}
and start by focusing on the first factor.
Without loss of generality, assume that the $k^{\rm th}$
update occurs on the $i^{\rm th}$ example $\left(\ex{x}{i}, \ex{y}{i}\right)$.
\begin{align*}
\frac{\ex{\theta}{k} \cdot \theta^*}{\norm{\theta^*}}
& = \frac{\left(\ex{\theta}{k-1} + \ex{y}{i}\ex{x}{i}\right)\cdot \theta^*}
{\norm{\theta^*}} \\
& = \frac{\ex{\theta}{k-1}\cdot \theta^*}{\norm{\theta^*}} + \frac{
\ex{y}{i}\ex{x}{i}\cdot\theta^*}{\norm{\theta^*}} \\
& \geq \frac{\ex{\theta}{k-1}\cdot \theta^*}{\norm{\theta^*}} + \gamma \\
& \geq k\gamma
\end{align*}
where we have first applied the margin condition from $(a)$ and then
applied simple induction.
Now, we'll look at the second factor in Eq.~\ref{eq:cos}.
We note that since $\left(\ex{x}{i},\ex{y}{i}\right)$ is classified
incorrectly, $\ex{y}{i}\left({\ex{\theta}{k-1}}^T\ex{x}{i}\right) \leq 0$.
Thus,
\begin{align*}
\norm{\ex{\theta}{k}}^2 & = \norm{\ex{\theta}{k-1} + \ex{y}{i}\ex{x}{i}}^2 \\
& = \norm{\ex{\theta}{k-1}}^2 + 2\ex{y}{i}
{\ex{\theta}{k-1}}^T\ex{x}{i} + \norm{\ex{x}{i}}^2 \\
& \leq \norm{\ex{\theta}{k-1}}^2 + R^2 \\
& \leq kR^2
\end{align*}
where we have additionally applied the assumption from $(b)$ and then
again used simple induction.
Returning to the definition of the dot product, we have
\[\cos\left(\ex{\theta}{k}, \theta^*\right)
= \frac{\ex{\theta}{k} \cdot \theta^*}{\norm{\ex{\theta}{k}}
\norm{\theta^*}}
= \left(\frac{\ex{\theta}{k} \cdot \theta^*}{\norm{\theta^*}}\right)
\frac{1}{\norm{\ex{\theta}{k}}}
\geq (k\gamma)\cdot \frac{1}{\sqrt{k}R} = \sqrt{k}\cdot\frac{\gamma}{R}\]
Since the value of the cosine is at most 1, we have
\begin{align*}
1 & \geq \sqrt{k} \cdot \frac{\gamma}{R} \\
k & \leq \left(\frac{R}{\gamma}\right)^2.
\end{align*}
\end{proof}
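A sketch of an empirical check of this bound, on the one-dimensional
dataset from Section~\ref{sec-offset} augmented to pass through the
origin; $\gamma$ is computed from the known separator
$\theta^* = [-1, 2.5]^T$, and the helper name is ours:
\begin{verbatim}
import numpy as np

def perceptron_count_updates(X, y, tau=1000):
    # through-origin perceptron that also counts its updates
    th, k = np.zeros(X.shape[1]), 0
    for _ in range(tau):
        changed = False
        for i in range(X.shape[0]):
            if y[i] * (th @ X[i]) <= 0:
                th, k, changed = th + y[i] * X[i], k + 1, True
        if not changed:
            break
    return th, k

X = np.array([[1., 1.], [2., 1.], [3., 1.], [4., 1.]])
y = np.array([1, 1, -1, -1])
th, k = perceptron_count_updates(X, y)
th_star = np.array([-1., 2.5])
gamma = np.min(y * (X @ th_star) / np.linalg.norm(th_star))
R = np.max(np.linalg.norm(X, axis=1))
print(k, (R / gamma) ** 2)   # observed updates <= theoretical bound
\end{verbatim}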
This result endows the margin $\gamma$ of $\dataTrain$ with an
operational meaning: when using the Perceptron algorithm for
classification, at most $(R/\gamma)^2$ updates will be
made, where $R$ is an upper bound on the magnitude of the training
vectors.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: